EP3113180B1 - Method and apparatus for performing audio inpainting on a speech signal - Google Patents

Method and apparatus for performing audio inpainting on a speech signal

Info

Publication number
EP3113180B1
Authority
EP
European Patent Office
Prior art keywords
speech
gap
speech signal
transcript
voice
Prior art date
Legal status
Active
Application number
EP15306085.0A
Other languages
German (de)
English (en)
Other versions
EP3113180A1 (fr)
Inventor
Pierre Prablanc
Quang Khanh Ngoc DUONG
Alexey Ozerov
Patrick Perez
Current Assignee
InterDigital CE Patent Holdings SAS
Original Assignee
InterDigital CE Patent Holdings SAS
Priority date
Filing date
Publication date
Application filed by InterDigital CE Patent Holdings SAS filed Critical InterDigital CE Patent Holdings SAS
Priority to PL15306085T (PL3113180T3)
Priority to EP15306085.0A (EP3113180B1)
Publication of EP3113180A1
Application granted
Publication of EP3113180B1
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm

Definitions

  • the present principles relate to a method for performing audio inpainting on a speech signal, and an apparatus for performing audio inpainting on a speech signal.
  • Audio inpainting is the problem of recovering audio samples that are missing or distorted due to, e.g., lost IP packets during a Voice over IP (VoIP) transmission or any other kind of deterioration. Audio inpainting algorithms have various applications, ranging from IP packet loss recovery (especially in VoIP or mobile telephony) and voice censorship cancelling to various types of damaged-audio repair, including declipping and declicking. Moreover, inpainting may be used for speech modification, e.g., to replace a word or a sequence of words in a speech sequence by some other words. While signal completion has been thoroughly investigated for image and video inpainting, this is much less the case for audio data in general and speech in particular.
  • Adler et al. [1] introduced an audio inpainting algorithm for the specific purpose of audio declipping, i.e., intended to recover missing audio samples in the time domain that were clipped due to, e.g., the limited range of the acquisition device. Techniques for filling missing segments in the time-frequency domain have also been developed [2,3]. However, these methods are not suitable for large spectral holes, especially when all frequency bins are missing in certain time frames.
  • Drori et al. [4] proposed another approach to audio inpainting in the spectral domain, relying on exemplar spectral patches taken from the known part of the spectrogram.
  • Bahat et al. [7] proposed a method for filling moderate gaps.
  • a novel method to fill gaps in speech data while preserving speech meaning and voice characteristics is disclosed in claim 1.
  • the disclosed speech audio inpainting technique plausibly recovers speech parts that are lost due to, e.g., specific audio editing or lossy transmission with the help of synthetic speech generated from the text transcript of the missing part.
  • the synthesized speech is modified based on conventional voice conversion (e.g., as in [5]) to fit with the original speaker's voice.
  • a text transcript of the missing speech part is generated or given; e.g., it can be provided by a user, inferred by natural language processing techniques based on the known phrases before and/or after the gap, or available from any other source.
  • the text transcript of the missing speech part is used to complete an obfuscated speech signal. It allows leveraging recent progress of text-to-speech (TTS) synthesizers at generating very natural and high quality speech data.
  • a method for speech inpainting comprises synthesizing speech for a gap that occurs in a speech signal using a transcript of the speech signal, converting the synthesized speech by voice conversion according to the original speech signal, and blending the synthesized converted speech into the original speech signal to fill the gap.
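The following sketch illustrates how these three stages compose. It is a minimal structural sketch, not the claimed implementation: the tts and convert callables are hypothetical stand-ins for a real synthesizer and voice conversion model, and the blending here is a crude cut-to-length fill where a real system would align and crossfade (see the post-processing discussion further below).

```python
import numpy as np

def inpaint_speech(signal, gap, transcript, tts, convert):
    """Structural sketch of the inpainting pipeline (hypothetical helpers).

    signal     -- 1-D NumPy array containing a damaged region
    gap        -- (start, end) sample indices of the missing part
    transcript -- text covering at least the missing part
    tts        -- callable: text -> synthesized waveform
    convert    -- callable: (synth, reference) -> voice-converted waveform
    """
    start, end = gap
    reference = np.concatenate([signal[:start], signal[end:]])  # reliable speech

    synth = tts(transcript)                # synthesize speech for the gap
    converted = convert(synth, reference)  # adapt it to the original voice

    out = signal.copy()
    out[start:end] = np.resize(converted, end - start)  # naive fill
    return out
```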
  • the apparatus comprises a speech analyzer that is adapted for detecting a gap in the speech signal, a speech synthesizer that is adapted for performing automatic speech synthesis from text transcript at least for the gap, a voice converter that is adapted for performing voice conversion to adapt the synthesized speech to an original speaker's voice, and a mixer that is adapted for blending of the converted synthesized speech into the original speech audio track.
  • in one embodiment, temporal and/or phase mismatches are removed during the blending.
  • Voice conversion is a process that transforms a speech signal so that the voice of one person, called the source speaker, sounds as if the speech had been uttered by another person, called the target speaker.
  • in voice conversion, two steps have to be considered: a learning step and a conversion step.
  • in the learning step, a mapping function is learned that maps voice parameters of the source speaker to voice parameters of the target speaker.
  • for this, some training data from both speakers are needed.
  • typically, this is parallel training data, i.e. a set of sentences uttered by both the source and the target speaker.
  • here, the target speaker is the one whose data are missing, whereas the "source speaker" is the synthesized speech.
  • target data can be extracted from the surrounding region of the gap or, in case of a famous speaker, in one embodiment it can be retrieved from a database, e.g. on the Internet.
  • training data for the target speaker can be recorded by e.g. asking the target speaker to say some words, utterances or sentences.
  • in one embodiment, source data are synthesized with a text-to-speech synthesizer using the transcript of the source speech.
  • text is extracted from the available speech signal by means of automatic speech recognition (ASR). Then, it is determined that one or more words or sounds are missing due to a gap in the speech signal, a context of the remainder of the speech signal is analyzed, and, according to the context and the remainder, one or more words, sounds or syllables are determined that are omitted by the gap. This can be done by estimating or guessing (e.g., in one embodiment by using a dictionary), or by obtaining from any source a complete transcript of the speech signal that covers at least the gap. It is easier to locate the gap if the complete transcript covers some more speech before and/or after the gap.
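When a complete transcript can be obtained from another source, the words omitted by the gap can be found by diffing it against the partial transcript produced by ASR. Below is a minimal sketch using Python's difflib; the example sentences are invented.

```python
import difflib

def missing_words(partial_transcript, complete_transcript):
    """Return the word runs that appear in the complete transcript
    but not in the partial (gapped) one, via an LCS-based diff."""
    partial = partial_transcript.split()
    complete = complete_transcript.split()
    sm = difflib.SequenceMatcher(a=partial, b=complete)
    gaps = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("insert", "replace"):  # present only in the complete version
            gaps.append(" ".join(complete[j1:j2]))
    return gaps

# Invented example: the gap swallowed the words "over the lazy".
print(missing_words("the quick fox jumps dog",
                    "the quick fox jumps over the lazy dog"))
# -> ['over the lazy']
```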
  • a computer readable medium has stored thereon executable instructions that, when executed on a processor, cause the processor to perform a method as disclosed above.
  • a method for performing speech inpainting on a speech signal comprises automatically generating a transcript of an input speech signal, determining voice characteristics of the input speech signal, processing the input speech signal, whereby a processed speech signal is obtained, detecting a gap in the processed speech signal, automatically synthesizing speech from the transcript at least for the gap, voice converting the synthesized speech according to the determined voice characteristics, and inpainting the processed speech signal, wherein the voice converted synthesized speech is filled into the gap.
  • an apparatus for performing speech inpainting on a speech signal comprises at least one hardware component, such as a hardware processor, and a non-transitory, tangible, computer-readable storage medium tangibly embodying at least one software component, and when executing on the at least one hardware processor, the software component causes the hardware processor to automatically perform the steps of claim 1.
  • Fig.1 shows a general workflow of a speech inpainting system.
  • An input speech signal has a missing part 10, i.e. a gap.
  • a textual transcript of the missing part is available, for example it can be generated from the original speech signal.
  • a speech utterance corresponding to the missing part 10 is synthesized 51 from the known text transcription through text-to-speech synthesis in a TTS synthesis block 11.
  • TTS synthesis systems may be able to synthesize speech only phoneme by phoneme; thus, if a gap begins or ends in the middle of a phoneme, it is unlikely that exactly the utterance corresponding to the missing part can be recovered.
  • the generated speech is used for the gap filling.
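To make the TTS step (block 11) concrete, the sketch below uses pyttsx3, an off-the-shelf offline synthesizer, purely as a stand-in for whatever TTS engine an implementation would employ; the transcript text and output file name are invented examples.

```python
import pyttsx3  # example offline TTS engine; any synthesizer fits here

def synthesize_utterance(text, wav_path="synth_gap.wav"):
    """Synthesize the transcript of the missing part to a WAV file
    (stand-in for TTS synthesis block 11)."""
    engine = pyttsx3.init()
    engine.save_to_file(text, wav_path)
    engine.runAndWait()
    return wav_path

synthesize_utterance("words of the missing part")  # invented transcript
```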
  • the synthesized speech generally bears little similarity to the original speaker's voice in terms of timbre and prosody. Therefore, its spectral features and fundamental frequency (F0) trajectory are adapted via voice conversion 12 to be similar to those of the target speech.
  • the gap is filled by the voice converted synthesized speech signal, which results in an inpainted output signal 13.
  • a conventional speech analysis-synthesis system, e.g. [6], decomposes the speech into a STRAIGHT smooth spectrogram representing the evolution of the vocal tract without time and frequency interference, an F0 trajectory with a voiced/unvoiced detector, and an aperiodic component.
  • the first two parameters are manipulated by voice conversion to modify the speech.
  • a STRAIGHT smooth spectrogram is known e.g. from [6].
  • STRAIGHT is a speech tool for speech analysis and synthesis. It allows flexible manipulation of speech because it decomposes speech, following the source-filter model, into three parts: a smooth spectrum representing the spectral envelope, a fundamental frequency (F0) measurement, and an aperiodic component.
  • the fundamental frequency F0 measurement and the aperiodic component correspond to the source of the source-filter model, while the smooth spectrum representing a spectral envelope corresponds to the filter.
  • the smooth STRAIGHT spectrum is a good representation of the envelope, because STRAIGHT reconstructs the envelope as if it were sampled by the source. Manipulating this spectrum allows effective modification of the timbre of the voice.
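STRAIGHT itself is not freely redistributable, but WORLD is an analysis/synthesis vocoder of the same family, and its pyworld Python binding exposes the analogous three-part decomposition. The sketch below is written under that assumption; the file names are placeholders and the input is assumed mono.

```python
import numpy as np
import pyworld as pw   # WORLD vocoder, a STRAIGHT-family analysis/synthesis tool
import soundfile as sf

x, fs = sf.read("speech.wav")                  # placeholder input file (mono)
x = np.ascontiguousarray(x, dtype=np.float64)  # pyworld expects float64

f0, t = pw.dio(x, fs)             # raw F0 trajectory (zero where unvoiced)
f0 = pw.stonemask(x, f0, t, fs)   # refined F0
sp = pw.cheaptrick(x, f0, t, fs)  # smooth spectral envelope (the "filter")
ap = pw.d4c(x, f0, t, fs)         # aperiodic component (part of the "source")

# Voice conversion manipulates sp (timbre) and f0 (prosody), then resynthesizes:
y = pw.synthesize(f0, sp, ap, fs)
sf.write("resynth.wav", y, fs)
```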
  • the voice conversion system 12 comprises two steps: first, a mapping function is learned on training data; then it is used to convert new utterances. In order to obtain the mapping function, the parameters to convert are extracted (e.g. with the STRAIGHT system) and aligned with dynamic time warping (DTW [8]). Then the learning phase is performed, e.g. with a Gaussian mixture model (GMM [9]) or non-negative matrix factorization (NMF [10]), to obtain the mapping function, as sketched below.
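A sketch of this learn-then-convert scheme follows, assuming MFCCs as the voice parameters, librosa's DTW for alignment, and scikit-learn's GaussianMixture as the joint density model; the conversion step applies the classical conditional-mean mapping F(x) = E[y|x]. This is an illustrative reduction, not the patented system; in a full system the converted features would drive resynthesis (e.g. through the WORLD parameters above).

```python
import numpy as np
import librosa
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def align_features(src_wav, tgt_wav, sr=16000, n_mfcc=13):
    """Extract MFCC frames and align source to target with DTW."""
    xs, _ = librosa.load(src_wav, sr=sr)
    xt, _ = librosa.load(tgt_wav, sr=sr)
    S = librosa.feature.mfcc(y=xs, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, Ts)
    T = librosa.feature.mfcc(y=xt, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, Tt)
    _, wp = librosa.sequence.dtw(X=S, Y=T)                # path, end to start
    wp = wp[::-1]
    return S[:, wp[:, 0]].T, T[:, wp[:, 1]].T             # aligned (T, n_mfcc)

def learn_mapping(X, Y, n_components=8):
    """Learning step: fit a joint GMM on stacked (source, target) frames."""
    return GaussianMixture(n_components, covariance_type="full",
                           random_state=0).fit(np.hstack([X, Y]))

def convert_frames(gmm, X):
    """Conversion step: mapping function F(x) = E[y | x] under the joint GMM."""
    d = X.shape[1]
    # Responsibilities p(k | x) use only the source marginal of each component.
    resp = np.column_stack([
        w * multivariate_normal(m[:d], C[:d, :d]).pdf(X)
        for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)])
    resp /= resp.sum(axis=1, keepdims=True) + 1e-12
    Y = np.zeros_like(X)
    for k, (m, C) in enumerate(zip(gmm.means_, gmm.covariances_)):
        A = C[d:, :d] @ np.linalg.inv(C[:d, :d])         # Sigma_yx Sigma_xx^-1
        Y += resp[:, [k]] * (m[d:] + (X - m[:d]) @ A.T)  # per-component E[y|x,k]
    return Y
```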
  • Fig.2 shows different embodiments of a voice conversion system, using a speech database. It is important to note that the original speech samples from the database do not necessarily need to cover the words or context of the current speech signal on which the inpainting is performed.
  • the mapping function that performs the prediction comprises two kinds of parameters: general parameters that need to be calculated only once, and utterance-specific parameters that must be calculated for each utterance to be converted.
  • the general parameters may comprise e.g. Gaussian Mixture Model (GMM) parameters for GMM-based voice conversion and/or a phoneme dictionary for Non-negative Matrix Factorization (NMF)-based voice conversion.
  • the specific parameters may comprise posterior probabilities for GMM-based voice conversion and/or temporal activation matrices for NMF-based voice conversion.
  • the user is asked to enter, for a partly available speech signal 22, the speaker's identity in a query 21.
  • the query results in voice characteristics of the speaker, or in original speech samples of the speaker from which the voice characteristics are extracted.
  • the synthesized or original speech samples 23 obtained from a database or from the Internet 24 may be used to fill the gap. This approach may use standard voice conversion 25.
  • voice characteristics of the speaker are retrieved upon a query 26 or automatically from the remaining part of the speech signal 27, which serve as a small set of training data.
  • the synthesized speech for the gap 28 is voice converted 29 using the retrieved voice characteristics from around the gap.
  • two options may be considered to obtain training data, depending on whether the target speaker is a famous person or not. If the target speaker is well-known, it is generally possible to retrieve characteristic voice data from the Internet via the speaker's identity, or to try guessing the identity with automatic speaker recognition. Otherwise, only local data, i.e. data around the gap or some additional data, are available, and the voice conversion system is adapted to the amount of data.
  • Fig.3 shows two embodiments 30,35 for learning voice conversion.
  • a mapping function is learned on training data, and then it is used to convert new utterances.
  • speech is generated from the training data by a text-to-speech block 31,38 (e.g. a speech synthesizer) and voice conversion parameters are extracted (e.g. with the STRAIGHT system) and aligned 32 to the synthesized speech with dynamic time warping (DTW).
  • the learning phase is performed 33,39, e.g. with a Gaussian mixture model (GMM [9]) or Non-negative Matrix Factorization (NMF [10]), to get the mapping function 34.
  • only a small amount of training data is available, since only the speech surrounding the gap can be used as reliable speech to extract voice parameters.
  • a large amount of training data can be obtained from a database 36 such as the Internet, and automatic speech recognition 37 is used.
  • a waveform signal is resynthesized, e.g. by a conventional STRAIGHT synthesizer with the new voice parameters.
  • one or more additional steps may need to be performed, since, once conversion is performed, the resulting speech may still not perfectly fill the gap, for the following reasons.
  • edge mismatches such as spectral, fundamental frequency and phase discontinuities may need to be counteracted.
  • spectral trajectories of the formants are naturally smooth due to the slow variation of the vocal tract shape.
  • Fundamental frequency and temporal phase are not as smooth as the spectral trajectories, but still need continuity to sound natural.
  • when the speech signal is converted, it is unlikely that the parameters of the spectral envelope trajectory, fundamental frequency and temporal phase are temporally continuous at the border of the gap.
  • the parameters of the spectral envelope trajectory, fundamental frequency and temporal phase are adapted to the ones nearby in the non-missing part of the speech, so that any discontinuity at the border is reduced.
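One simple way to reduce such a discontinuity is sketched below for the F0 track at the left border of the gap: measure the offset between the adjacent reliable F0 and the converted F0, then fade the correction out over a few frames. The frame counts are arbitrary illustration values; unvoiced frames are assumed to be marked 0, as in WORLD/STRAIGHT-style F0 tracks.

```python
import numpy as np

def smooth_f0_boundary(f0_converted, f0_left_context, fade_frames=20):
    """Reduce the F0 jump at the left border of the gap: estimate the offset
    between the last voiced context frames and the first voiced converted
    frames, then fade the correction out over `fade_frames` frames."""
    ctx = f0_left_context[f0_left_context > 0][-5:]  # last voiced context frames
    conv = f0_converted[f0_converted > 0][:5]        # first voiced converted frames
    if len(ctx) == 0 or len(conv) == 0:
        return f0_converted                          # nothing to anchor on
    offset = ctx.mean() - conv.mean()
    out = f0_converted.copy()
    n = min(fade_frames, len(out))
    ramp = np.linspace(1.0, 0.0, n)                  # full correction -> none
    voiced = out[:n] > 0
    out[:n][voiced] += offset * ramp[voiced]
    return out
```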
  • the duration of the converted utterance may be longer or shorter than that of the true missing utterance. Therefore, in one embodiment, the speaking rate is converted to match the available part of the speech signal. If the speaking rate cannot be converted, at least a temporal adjustment may be made to the global time scaling of the converted utterance.
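A global time-scale adjustment without pitch modification can be sketched with librosa's phase-vocoder time stretch; librosa's rate convention applies (rate > 1 shortens the signal), and the gap duration is whatever the detected gap dictates.

```python
import librosa

def match_gap_duration(converted, sr, gap_duration_s):
    """Time-scale the converted utterance to the gap length without
    altering pitch (phase-vocoder stretch)."""
    rate = (len(converted) / sr) / gap_duration_s  # >1 compresses, <1 stretches
    return librosa.effects.time_stretch(converted, rate=rate)
```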
  • Fig.4 shows an overview of post-processing of the converted utterance.
  • the converted set of frames may not properly fill the gap. This can be seen as spectral discontinuities (4a).
  • the gap may be properly filled by finding 41, in the spectral domain, the best frames at the end of the converted spectrogram and merging them with the reliable spectrogram of the available portion of the speech signal. This can be done with the known dynamic time warping (DTW) algorithm; aligning converted and reliable spectra is a way to find which data are used to fill the gap. Then, in a similar adjustment that handles phase discontinuities (4b), the best samples to merge are found 42 on the signal waveform.
  • the converted signal with F0 modification is time-scaled 44 (without pitch modification, in one embodiment) according to the indices found in the phase adjustment step.
  • the length-adjusted (i.e. "stretched" or "compressed") signal is overlap-added 45 at the edges to minimize fuzzy artefacts that could still remain.
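A simplified stand-in for the merge and overlap-add steps 42 and 45 is sketched below: the merge point is picked by cross-correlating the start of the fill against the edge of the reliable context, and the junctions are blended with short linear crossfades. The fade and search windows are illustrative values.

```python
import numpy as np

def crossfade_merge(left, fill, right, sr, fade_ms=10, search_ms=30):
    """Blend `fill` between the reliable `left` and `right` context:
    choose the fill offset that best matches the left edge (a rough
    phase adjustment), then overlap-add with linear crossfades."""
    fade = int(sr * fade_ms / 1000)
    search = int(sr * search_ms / 1000)

    # Slide the start of `fill` to best correlate with the end of `left`.
    ref = left[-fade:]
    n_off = max(1, min(search, len(fill) - 2 * fade))
    scores = [np.dot(ref, fill[o:o + fade]) for o in range(n_off)]
    fill = fill[int(np.argmax(scores)):]

    up, down = np.linspace(0.0, 1.0, fade), np.linspace(1.0, 0.0, fade)
    head = left[-fade:] * down + fill[:fade] * up   # fade into the fill
    tail = fill[-fade:] * down + right[:fade] * up  # fade back out
    return np.concatenate([left[:-fade], head, fill[fade:-fade], tail, right[fade:]])
```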
  • One advantage of the disclosed audio inpainting technique is that even long gaps can be inpainted. It is also robust when only a small amount of data is available in voice conversion.
  • Fig.5 shows a flow-chart of a method for performing speech inpainting on a speech signal, according to one embodiment.
  • the method 50 comprises determining 51 voice characteristics of the speech signal, detecting 52 a gap in the speech signal, automatically synthesizing 53, from a transcript, speech at least for the gap, voice converting 54 the synthesized speech according to the determined voice characteristics, and inpainting 55 the speech signal, wherein the voice converted synthesized speech is filled into the gap.
  • the method further comprises a step of automatically generating 56 said transcript from an input speech signal.
  • the method further comprises a step of processing 57 the speech signal, wherein the gap is generated during the processing, and wherein the transcript is generated before the processing.
  • the step of automatically synthesizing 53 speech at least for the gap comprises retrieving from a database recorded speech data from a natural speaker. This may support, enhance, replace or control the synthesis.
  • the method further comprises steps of detecting 581 that the transcript does not cover the gap, determining 582 one or more words or sounds omitted by the gap, and adding 583 the estimated word or sound to the transcript before synthesizing speech from the transcript.
  • the determining 582 is done by estimating or guessing the one or more words or sounds (e.g. from a dictionary).
  • the determining 582 is done by retrieving a complete transcript of the speech through other channels (e.g. the Internet).
  • the determined voice characteristics comprise parameters for a spectral envelope and a fundamental frequency F0 (in other words, the timbre and prosody of the speech).
  • the method further comprises adapting parameters for a spectral envelope trajectory, a fundamental frequency and temporal phase at one or both boundaries of the gap to match the corresponding parameters of the available adjacent speech signal before and/or after the gap. This is in order for the parameters to be temporally continuous before and/or after the gap.
  • the method further comprises a step of time-scaling the voice-converted speech signal before it is filled into the gap.
  • Fig.6 shows a block diagram of an apparatus 60 for performing speech inpainting on a speech signal, according to one embodiment.
  • the apparatus comprises a speech analyser 61 for detecting a gap G in the speech signal SI, a speech synthesizer 62 for automatically synthesizing from a transcript T speech SS at least for the gap, a voice converter 63 for converting the synthesized speech SS according to the determined voice characteristics VC, and a mixer 64 for inpainting the speech signal, wherein the voice converted synthesized speech VCS is filled into the gap G of the speech signal to obtain an inpainted speech output signal SO.
  • the apparatus further comprises a voice analyzer 65 for determining voice characteristics of the speech signal.
  • the apparatus further comprises a speech-to-text converter 66 for automatically generating a transcript of the speech signal.
  • the apparatus further comprises a database having stored speech data of example phonemes or words of natural speech, and the speech synthesizer 62 retrieves speech data from the database for automatically synthesizing the speech at least for the gap.
  • the apparatus further comprises an interface 67 for receiving a complete transcript of the speech signal, the transcript covering at least text that is omitted by the gap.
  • the apparatus further comprises a time-scaler for time-scaling the voice-converted speech signal before it is filled into the gap.
  • an apparatus for performing speech inpainting on a speech signal comprises a processor and a memory storing instructions that, when executed by the processor, cause the apparatus to perform the method steps of any of the methods disclosed above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Claims (15)

  1. Method (50) comprising:
    - obtaining (51) voice characteristics of a speech signal;
    - detecting (52) a missing part in the speech signal;
    - automatically synthesizing (53), from a transcript, speech at least for the missing part in the speech signal;
    - voice converting (54) the synthesized speech according to the obtained voice characteristics of the speech signal; and
    - inpainting (55) the speech signal, wherein the voice-converted synthesized speech is inserted into the missing part.
  2. Method according to claim 1, comprising automatically generating (56) said transcript from the speech signal.
  3. Method according to claim 1 or 2, comprising processing (57) the speech signal, wherein the missing part arises during the processing and wherein the transcript is generated before the processing.
  4. Method according to any one of claims 1 to 3, wherein the automatic synthesis (53), from a transcript, of speech at least for the missing part comprises retrieving, from a database, speech data recorded by a human voice.
  5. Method according to any one of claims 1 to 4, comprising the steps of:
    - detecting (581) that the transcript does not cover the missing part;
    - determining (582) one or more words or sounds omitted in the missing part; and
    - adding (583) the determined word or sound to the transcript before synthesizing speech from the transcript.
  6. Method according to claim 5, wherein the determining (582) is performed by estimating or guessing the one or more words or sounds.
  7. Method according to claim 5, wherein the determining (582) is performed by retrieving a complete transcript of the speech through other channels.
  8. Method according to any one of claims 1 to 7, wherein the voice characteristics comprise parameters for a spectral envelope and a fundamental frequency.
  9. Method according to any one of claims 1 to 8, comprising adapting parameters for a spectral envelope trajectory, a fundamental frequency and a temporal phase at one or both boundaries of the missing part so as to match the corresponding parameters of the available adjacent speech signal before and/or after the missing part.
  10. Method according to any one of claims 1 to 9, comprising time-scaling the voice-converted speech signal before it is inserted into the missing part.
  11. Apparatus (60) comprising:
    - a speech analyzer (61) for detecting a missing part in a speech signal;
    - a speech synthesizer (62) for automatically synthesizing, from a transcript, speech at least for a missing part in the speech signal;
    - means for obtaining voice characteristics of the speech signal;
    - a voice converter (63) for converting the synthesized speech according to the obtained voice characteristics of the speech signal; and
    - a mixer (64) for inpainting the speech signal, wherein the voice-converted synthesized speech is inserted into the missing part of the speech signal.
  12. Apparatus according to claim 11, wherein said obtaining means comprises a voice analyzer (65) for obtaining the voice characteristics of the speech signal.
  13. Apparatus according to claim 11 or 12, comprising a speech-to-text converter (66) for automatically generating a transcript of the speech signal.
  14. Apparatus according to any one of claims 11 to 13, comprising a database storing speech data of example phonemes or words of a human voice, wherein the speech synthesizer (62) retrieves speech data from the database to automatically synthesize the speech at least for the missing part.
  15. Apparatus according to any one of claims 11 to 14, comprising an interface (67) for receiving a complete transcript of the speech signal, the transcript covering at least the text omitted by the missing part.
EP15306085.0A 2015-07-02 2015-07-02 Method and apparatus for performing audio inpainting on a speech signal Active EP3113180B1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PL15306085T PL3113180T3 (pl) 2015-07-02 2015-07-02 Method for performing reconstruction of a speech signal by audio inpainting and apparatus for performing reconstruction of a speech signal by audio inpainting
EP15306085.0A EP3113180B1 (fr) 2015-07-02 2015-07-02 Method and apparatus for performing audio inpainting on a speech signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP15306085.0A EP3113180B1 (fr) 2015-07-02 2015-07-02 Method and apparatus for performing audio inpainting on a speech signal

Publications (2)

Publication Number Publication Date
EP3113180A1 EP3113180A1 (fr) 2017-01-04
EP3113180B1 true EP3113180B1 (fr) 2020-01-22

Family

ID=53610835

Family Applications (1)

Application Number Title Priority Date Filing Date
EP15306085.0A Active EP3113180B1 (fr) 2015-07-02 2015-07-02 Procédé et appareil permettant d'effectuer des retouches audio sur un signal vocal

Country Status (2)

Country Link
EP (1) EP3113180B1 (fr)
PL (1) PL3113180T3 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016162384A1 (fr) * 2015-04-10 2016-10-13 Dolby International Ab Method for performing audio restoration, and apparatus for performing audio restoration
JP6452061B1 (ja) * 2018-08-10 2019-01-16 クリスタルメソッド株式会社 Learning data generation method, learning method, and evaluation device
US11356492B2 (en) * 2020-09-16 2022-06-07 Kyndryl, Inc. Preventing audio dropout

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102117614B (zh) * 2010-01-05 2013-01-02 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized voice feature extraction
US9583111B2 (en) * 2013-07-17 2017-02-28 Technion Research & Development Foundation Ltd. Example-based audio inpainting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
EP3113180A1 (fr) 2017-01-04
PL3113180T3 (pl) 2020-06-01

Similar Documents

Publication Publication Date Title
EP3855340B1 (fr) System and method for multilingual voice conversion
US10733974B2 (en) System and method for synthesis of speech from provided text
US10255903B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
AU2020227065B2 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
EP4018439B1 (fr) Systems and methods for adapting human speaker embeddings in speech synthesis
US10706867B1 (en) Global frequency-warping transformation estimation for voice timbre approximation
EP3113180B1 (fr) Method and apparatus for performing audio inpainting on a speech signal
CN116994553A Training method for a speech synthesis model, speech synthesis method, apparatus and device
CA3004700C (fr) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
US10446133B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
JP7040258B2 (ja) Pronunciation conversion device, method therefor, and program
KR102051235B1 (ko) Outlier identification system and method for removing poor alignments in speech synthesis
JP5245962B2 (ja) Speech synthesis device, speech synthesis method, program, and recording medium
KR20100111544A (ko) Pronunciation correction system using speech recognition and method thereof
US11302300B2 (en) Method and apparatus for forced duration in neural speech synthesis
JP6468518B2 (ja) Fundamental frequency pattern prediction device, method, and program
CN116884385A Speech synthesis method, apparatus, and computer-readable storage medium
CN114299912A Speech synthesis method and related apparatus, device, and storage medium
Chomwihoke et al. Comparative study of text-to-speech synthesis techniques for mobile linguistic translation process
Qian et al. A unified trajectory tiling approach to high quality TTS and cross-lingual voice transformation

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20171005

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20180115

GRAJ Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted

Free format text: ORIGINAL CODE: EPIDOSDIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

INTC Intention to grant announced (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20180618

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: INTERDIGITAL CE PATENT HOLDINGS

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20190926

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1227425

Country of ref document: AT

Kind code of ref document: T

Effective date: 20200215

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602015045941

Country of ref document: DE

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

REG Reference to a national code

Ref country code: NO

Ref legal event code: T2

Effective date: 20200122

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200614

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200522

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200423

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200422

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602015045941

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1227425

Country of ref document: AT

Kind code of ref document: T

Effective date: 20200122

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20201023

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

REG Reference to a national code

Ref country code: BE

Ref legal event code: MM

Effective date: 20200731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200731

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200702

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200702

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200731

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: MT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20200122

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230511

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NO

Payment date: 20230719

Year of fee payment: 9

Ref country code: GB

Payment date: 20230725

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20230725

Year of fee payment: 9

Ref country code: DE

Payment date: 20230726

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: PL

Payment date: 20240621

Year of fee payment: 10

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 20240725

Year of fee payment: 10