WO2021101665A1 - Singing voice synthesis - Google Patents

Singing voice synthesis

Info

Publication number
WO2021101665A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
music score
duration
fundamental frequency
vector representation
Prior art date
Application number
PCT/US2020/057268
Other languages
English (en)
Inventor
Peiling LU
Jian Luan
Jie Wu
Original Assignee
Microsoft Technology Licensing, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc
Publication of WO2021101665A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H7/08 Instruments in which the tones are synthesised from a data store, e.g. computer organs by calculating functions or polynomial approximations to evaluate amplitudes at successive sample points of a tone waveform
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/02 Means for controlling the tone frequencies, e.g. attack or decay; Means for producing special musical effects, e.g. vibratos or glissandos
    • G10H1/06 Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/325 Musical pitch modification
    • G10H2210/331 Note pitch correction, i.e. modifying a note pitch or replacing it by the closest one in a given scale
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/325 Synchronizing two or more audio tracks or files according to musical features or musical timings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315 Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455 Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/471 General musical sound synthesis principles, i.e. sound category-independent synthesis methods
    • G10H2250/481 Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech

Definitions

  • FIG. 9 illustrates an exemplary training process of an acoustic feature predictor according to an embodiment of the present invention.
  • an adversarially-trained end-to-end SVS system which is based on an auto-regressive model is proposed.
  • the auto-regressive model has forward dependency.
  • a strategy for post-processing the predicted fundamental frequency based on note pitch is proposed to ensure that the fundamental frequency is in tune.
  • a singing voice synthesis model leading to synthesized singing voices with high naturalness, fast processing speed and good audio quality.
  • the fundamental frequency residual may be set to be no higher than a semitone, so as to avoid an out-of-tune issue in the synthesized singing voice.
  • An exemplary system architecture of the spectrum decoder 329 will be described in detail later in conjunction with FIG. 6, and an exemplary training process of the spectrum decoder 329 will be described in detail in conjunction with FIG. 9.
  • the user's own corpus may be obtained in advance, and the corpus may be used for training the style encoder and/or the voice encoder in order to obtain a style vector representation and/or a voice vector representation associated with the user.
  • the user may provide a "style ID" corresponding to himself, so that the singing voice synthesizer may obtain the style vector representation of the user, and further synthesize singing voices in the singing style of the user.
  • although the process 900 involves the phoneme loss function 992, the note loss function 994, the pitch loss function 996, and the spectrum loss function 998, the process 900 may adopt more or fewer loss functions according to specific application requirements.
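
The last excerpt above says that the training process may use more or fewer loss functions. A minimal sketch of one way to make the set of loss terms configurable follows. The excerpts do not say how the individual losses are combined, so the weighted sum, the term names, and the `total_loss` helper are all assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical labels for the four loss terms named for process 900.
DEFAULT_TERMS = ("phoneme", "note", "pitch", "spectrum")


def total_loss(
    losses: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
    terms: Sequence[str] = DEFAULT_TERMS,
) -> float:
    """Weighted sum over the selected loss terms (assumed combination rule).

    Terms without an explicit weight get weight 1.0.
    """
    weights = weights or {}
    missing = [name for name in terms if name not in losses]
    if missing:
        raise KeyError(f"missing loss terms: {missing}")
    return sum(weights.get(name, 1.0) * losses[name] for name in terms)


# Illustrative per-step loss values, not taken from the source.
step_losses = {"phoneme": 0.5, "note": 0.25, "pitch": 1.0, "spectrum": 2.0}
all_terms = total_loss(step_losses)
fewer_terms = total_loss(step_losses, terms=("pitch", "spectrum"))
```

Passing a shorter `terms` tuple drops loss terms without changing the training loop, which is one simple reading of "more or fewer loss functions according to specific application requirements".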

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present invention relates to methods and apparatuses for singing voice synthesis. First music score phoneme information extracted from a music score may be received, the first music score phoneme information comprising a first phoneme and a pitch and a beat of a note corresponding to the first phoneme. A fundamental frequency residual and spectral parameters corresponding to the first phoneme may be generated based on the first music score phoneme information. A fundamental frequency corresponding to the first phoneme may be obtained by regulating the pitch of the note with the fundamental frequency residual. An acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.
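
The regulation step in the abstract can be sketched as follows: the note pitch from the score is shifted by a predicted residual that is limited to one semitone, as the definitions above suggest. The representation is an assumption. The note pitch is taken to be a MIDI note number, the residual is taken to be expressed in semitones, and the function names are hypothetical; the source does not specify any of these.

```python
def midi_to_hz(midi_note: float) -> float:
    """Convert a MIDI note number to Hz (equal temperament, A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69.0) / 12.0)


def regulate_f0(
    midi_note: float,
    residual_semitones: float,
    max_residual: float = 1.0,
) -> float:
    """Shift the score pitch by a residual clipped to +/- max_residual semitones.

    Clipping is an assumed way to keep the residual within one semitone so
    the resulting fundamental frequency stays close to the written note.
    """
    clipped = max(-max_residual, min(max_residual, residual_semitones))
    return midi_to_hz(midi_note + clipped)


# A zero residual reproduces the written note; a large residual is clipped.
f0_in_tune = regulate_f0(69, 0.0)
f0_clipped = regulate_f0(69, 5.0)
```

Working in semitones rather than Hz keeps the bound perceptually uniform across the vocal range, since one semitone is a fixed frequency ratio and not a fixed number of Hz.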
PCT/US2020/057268 2019-11-22 2020-10-26 Singing voice synthesis WO2021101665A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911156831.7 2019-11-22
CN201911156831.7A CN112951198A (zh) 2019-11-22 2019-11-22 Singing voice synthesis

Publications (1)

Publication Number Publication Date
WO2021101665A1 (fr) 2021-05-27

Family

ID=73476243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/057268 WO2021101665A1 (fr) 2019-11-22 2020-10-26 Singing voice synthesis

Country Status (2)

Country Link
CN (1) CN112951198A (fr)
WO (1) WO2021101665A1 (fr)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362801A (zh) * 2021-06-10 2021-09-07 携程旅游信息技术(上海)有限公司 Audio synthesis method, system, device and storage medium based on Mel spectrum alignment
CN113409747A (zh) * 2021-05-28 2021-09-17 北京达佳互联信息技术有限公司 Song generation method and apparatus, electronic device and storage medium
CN113593520A (zh) * 2021-09-08 2021-11-02 广州虎牙科技有限公司 Singing voice synthesis method and apparatus, electronic device and storage medium
CN113963723A (zh) * 2021-09-16 2022-01-21 秦慈军 Music presentation method, apparatus, device and storage medium
CN114267375A (zh) * 2021-11-24 2022-04-01 北京百度网讯科技有限公司 Phoneme detection method and apparatus, training method and apparatus, device and medium
CN114360492A (zh) * 2021-10-26 2022-04-15 腾讯科技(深圳)有限公司 Audio synthesis method and apparatus, computer device and storage medium
US20220122582A1 (en) * 2020-10-21 2022-04-21 Google Llc Parallel Tacotron Non-Autoregressive and Controllable TTS
CN115457923A (zh) * 2022-10-26 2022-12-09 北京红棉小冰科技有限公司 Singing voice synthesis method, apparatus, device and storage medium
WO2023063880A3 (fr) * 2021-10-15 2023-07-13 Lemon Inc. System and method for training a transformer-in-transformer-based neural network model for audio data

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574624B1 (en) * 2021-03-31 2023-02-07 Amazon Technologies, Inc. Synthetic speech processing
TWI836255B (zh) * 2021-08-17 2024-03-21 國立清華大學 Method and apparatus for designing a personalized virtual singer through singing voice conversion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2276019A1 (fr) * 2009-07-02 2011-01-19 YAMAHA Corporation Apparatus and method for creating singing synthesizing database, and pitch curve generation apparatus and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7977562B2 (en) * 2008-06-20 2011-07-12 Microsoft Corporation Synthesized singing voice waveform generator
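CN103035235A (zh) * 2011-09-30 2013-04-10 西门子公司 Method and apparatus for converting speech into melody
CN103915093B (zh) * 2012-12-31 2019-07-30 科大讯飞股份有限公司 Method and apparatus for converting speech into singing
CN109147757B (zh) * 2018-09-11 2021-07-02 广州酷狗计算机科技有限公司 Singing voice synthesis method and apparatus
CN110148394B (zh) * 2019-04-26 2024-03-01 平安科技(深圳)有限公司 Singing voice synthesis method, apparatus, computer device and storage medium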

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2276019A1 (fr) * 2009-07-02 2011-01-19 YAMAHA Corporation Apparatus and method for creating singing synthesizing database, and pitch curve generation apparatus and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KAZUHIRO NAKAMURA ET AL: "Singing voice synthesis based on convolutional neural networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 April 2019 (2019-04-15), XP081169221 *
MERLIJN BLAAUW ET AL: "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs", APPLIED SCIENCES, vol. 7, no. 12, 18 December 2017 (2017-12-18), pages 1313, XP055627719, DOI: 10.3390/app7121313 *
TAKESHI SAITOU ET AL: "Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2007 IEEE WORKSHOP ON, IEEE, PI, 1 October 2007 (2007-10-01), pages 215 - 218, XP031167096, ISBN: 978-1-4244-1618-9 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11908448B2 (en) * 2020-10-21 2024-02-20 Google Llc Parallel tacotron non-autoregressive and controllable TTS
US20220122582A1 (en) * 2020-10-21 2022-04-21 Google Llc Parallel Tacotron Non-Autoregressive and Controllable TTS
CN113409747B (zh) * 2021-05-28 2023-08-29 北京达佳互联信息技术有限公司 Song generation method and apparatus, electronic device and storage medium
CN113409747A (zh) * 2021-05-28 2021-09-17 北京达佳互联信息技术有限公司 Song generation method and apparatus, electronic device and storage medium
CN113362801A (zh) * 2021-06-10 2021-09-07 携程旅游信息技术(上海)有限公司 Audio synthesis method, system, device and storage medium based on Mel spectrum alignment
CN113593520A (zh) * 2021-09-08 2021-11-02 广州虎牙科技有限公司 Singing voice synthesis method and apparatus, electronic device and storage medium
CN113593520B (zh) * 2021-09-08 2024-05-17 广州虎牙科技有限公司 Singing voice synthesis method and apparatus, electronic device and storage medium
CN113963723A (zh) * 2021-09-16 2022-01-21 秦慈军 Music presentation method, apparatus, device and storage medium
WO2023063880A3 (fr) * 2021-10-15 2023-07-13 Lemon Inc. System and method for training a transformer-in-transformer-based neural network model for audio data
US11854558B2 (en) 2021-10-15 2023-12-26 Lemon Inc. System and method for training a transformer-in-transformer-based neural network model for audio data
CN114360492A (zh) * 2021-10-26 2022-04-15 腾讯科技(深圳)有限公司 Audio synthesis method and apparatus, computer device and storage medium
CN114267375B (zh) * 2021-11-24 2022-10-28 北京百度网讯科技有限公司 Phoneme detection method and apparatus, training method and apparatus, device and medium
CN114267375A (zh) * 2021-11-24 2022-04-01 北京百度网讯科技有限公司 Phoneme detection method and apparatus, training method and apparatus, device and medium
CN115457923A (zh) * 2022-10-26 2022-12-09 北京红棉小冰科技有限公司 Singing voice synthesis method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN112951198A (zh) 2021-06-11

Similar Documents

Publication Publication Date Title
WO2021101665A1 (fr) Singing voice synthesis
EP3588484B1 (fr) Electronic musical instrument, electronic musical instrument control method, and storage medium
US11763797B2 (en) Text-to-speech (TTS) processing
Kuligowska et al. Speech synthesis systems: disadvantages and limitations
Umbert et al. Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges
Bonada et al. Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016
Umbert et al. Generating singing voice expression contours based on unit selection
US20230169953A1 (en) Phrase-based end-to-end text-to-speech (tts) synthesis
Gupta et al. Deep learning approaches in topics of singing information processing
Bonada et al. Hybrid neural-parametric f0 model for singing synthesis
KR102168529B1 (ko) Method and apparatus for synthesizing singing voice using an artificial neural network
Wada et al. Sequential generation of singing f0 contours from musical note sequences based on wavenet
JP5930738B2 (ja) Speech synthesis apparatus and speech synthesis method
Lee et al. A comparative study of spectral transformation techniques for singing voice synthesis.
Liu et al. Controllable accented text-to-speech synthesis
Delalez et al. Vokinesis: syllabic control points for performative singing synthesis.
JP5874639B2 (ja) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
Saeed et al. A novel multi-speakers Urdu singing voices synthesizer using Wasserstein Generative Adversarial Network
Yin An overview of speech synthesis technology
Bonada et al. Spectral approach to the modeling of the singing voice
Yang et al. Mandarin singing voice synthesis with a phonology-based duration model
Li et al. A lyrics to singing voice synthesis system with variable timbre
JP2020204755A (ja) Speech processing device and speech processing method
JP2020204651A (ja) Speech processing device and speech processing method
Pucher et al. Development of a statistical parametric synthesis system for operatic singing in German

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 20808570; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 20808570; Country of ref document: EP; Kind code of ref document: A1