WO2021101665A1 - Singing voice synthesis - Google Patents
Singing voice synthesis
- Publication number
- WO2021101665A1 (PCT/US2020/057268)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phoneme
- music score
- duration
- fundamental frequency
- vector representation
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H7/00—Instruments in which the tones are synthesised from a data store, e.g. computer organs
- G10H7/08—Instruments in which the tones are synthesised from a data store, e.g. computer organs by calculating functions or polynomial approximations to evaluate amplitudes at successive sample points of a tone waveform
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
- G10H1/0025—Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/02—Means for controlling the tone frequencies, e.g. attack or decay; Means for producing special musical effects, e.g. vibratos or glissandos
- G10H1/06—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H7/00—Instruments in which the tones are synthesised from a data store, e.g. computer organs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/325—Musical pitch modification
- G10H2210/331—Note pitch correction, i.e. modifying a note pitch or replacing it by the closest one in a given scale
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/325—Synchronizing two or more audio tracks or files according to musical features or musical timings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/311—Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/315—Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
- G10H2250/455—Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/471—General musical sound synthesis principles, i.e. sound category-independent synthesis methods
- G10H2250/481—Formant synthesis, i.e. simulating the human speech production mechanism by exciting formant resonators, e.g. mimicking vocal tract filtering as in LPC synthesis vocoders, wherein musical instruments may be used as excitation signal to the time-varying filter estimated from a singer's speech
Definitions
- FIG. 9 illustrates an exemplary training process of an acoustic feature predictor according to an embodiment of the present invention.
- an adversarially-trained end-to-end SVS system which is based on an auto-regressive model is proposed.
- the auto-regressive model has forward dependency.
- a strategy for post-processing the predicted fundamental frequency based on note pitch is proposed to ensure that the fundamental frequency is in tune.
- a singing voice synthesis model leading to synthesized singing voices with high naturalness, fast processing speed and good audio quality.
- the fundamental frequency residual may be set to be no higher than a semitone, so as to avoid an out-of-tune issue in the synthesized singing voice.
- An exemplary system architecture of the spectrum decoder 329 will be described in detail later in conjunction with FIG. 6, and an exemplary training process of the spectrum decoder 329 will be described in detail in conjunction with FIG. 9.
- the user's own corpus may be obtained in advance, and the corpus may be used for training the style encoder and/or the voice encoder in order to obtain a style vector representation and/or a voice vector representation associated with the user.
- the user may provide a "style ID" corresponding to himself, so that the singing voice synthesizer may obtain the style vector representation of the user, and further synthesize singing voices in the singing style of the user.
- although the process 900 involves the phoneme loss function 992, the note loss function 994, the pitch loss function 996, and the spectrum loss function 998, the process 900 may adopt more or fewer loss functions according to specific application requirements.
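One way to picture a training objective that can adopt more or fewer loss terms is a sum over whichever terms are supplied. The sketch below is illustrative only: the function name `total_loss`, the dictionary keys, and the default weight of 1.0 are assumptions, and the record does not state how process 900 actually combines its loss functions.

```python
from typing import Dict, Optional


def total_loss(losses: Dict[str, float],
               weights: Optional[Dict[str, float]] = None) -> float:
    """Combine any subset of named loss values into one scalar.

    Hypothetical helper: each term is multiplied by its weight
    (1.0 when no weight is given) and the results are summed.
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value
               for name, value in losses.items())


# All four terms named for process 900, equally weighted (assumed).
full = total_loss({"phoneme": 1.0, "note": 2.0, "pitch": 3.0, "spectrum": 4.0})

# A reduced configuration that keeps only two of the terms.
reduced = total_loss({"pitch": 3.0, "spectrum": 4.0})
```

Adding or dropping a term then only changes the dictionary that is passed in, not the combining code.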
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Algebra (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The present disclosure provides methods and apparatuses for singing voice synthesis. First music score phoneme information extracted from a music score may be received, the first music score phoneme information comprising a first phoneme and a pitch and a beat of a note corresponding to the first phoneme. A fundamental frequency residual and spectral parameters corresponding to the first phoneme may be generated based on the first music score phoneme information. A fundamental frequency corresponding to the first phoneme may be obtained by regulating the pitch of the note with the fundamental frequency residual. An acoustic waveform corresponding to the first phoneme may be generated based at least in part on the fundamental frequency and the spectral parameters.
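The pitch-regulation step in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed method: it assumes the note pitch is given as a MIDI note number, that the residual is expressed in semitones and added in the log-frequency domain, and that the residual is clamped to one semitone in line with the bound mentioned in the Definitions. The names `midi_to_hz` and `regulate_pitch` are hypothetical.

```python
A4_HZ = 440.0   # reference tuning for A4 (assumed)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(midi_note: float) -> float:
    """Convert a (possibly fractional) MIDI note number to Hz, equal temperament."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def regulate_pitch(note_midi: float,
                   residual_semitones: float,
                   max_residual: float = 1.0) -> float:
    """Return a fundamental frequency in Hz for one phoneme.

    The predicted residual is clamped to +/- max_residual semitones
    (an assumed reading of the "no higher than a semitone" bound) and
    then added to the note pitch before conversion to Hz.
    """
    clamped = max(-max_residual, min(max_residual, residual_semitones))
    return midi_to_hz(note_midi + clamped)


# A small residual nudges the pitch; an oversized one is clamped.
f0_small = regulate_pitch(69, 0.25)
f0_clamped = regulate_pitch(69, 5.0)
```

Clamping before the conversion keeps the output within one semitone of the score pitch regardless of what the predictor emits, which is the in-tune property the residual bound is meant to protect.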
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911156831.7 | 2019-11-22 | | |
CN201911156831.7A (CN112951198A, zh) | 2019-11-22 | 2019-11-22 | Singing voice synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021101665A1 | 2021-05-27 |
Family
ID=73476243
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2020/057268 (WO2021101665A1) | 2019-11-22 | 2020-10-26 | Singing voice synthesis |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112951198A (fr) |
WO (1) | WO2021101665A1 (fr) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113362801A (zh) * | 2021-06-10 | 2021-09-07 | 携程旅游信息技术(上海)有限公司 | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment |
CN113409747A (zh) * | 2021-05-28 | 2021-09-17 | 北京达佳互联信息技术有限公司 | Song generation method and apparatus, electronic device and storage medium |
CN113593520A (zh) * | 2021-09-08 | 2021-11-02 | 广州虎牙科技有限公司 | Singing voice synthesis method and apparatus, electronic device and storage medium |
CN113963723A (zh) * | 2021-09-16 | 2022-01-21 | 秦慈军 | Music presentation method, apparatus, device and storage medium |
CN114267375A (zh) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | Phoneme detection method and apparatus, training method and apparatus, device and medium |
CN114360492A (zh) * | 2021-10-26 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Audio synthesis method and apparatus, computer device and storage medium |
US20220122582A1 (en) * | 2020-10-21 | 2022-04-21 | Google Llc | Parallel Tacotron Non-Autoregressive and Controllable TTS |
CN115457923A (zh) * | 2022-10-26 | 2022-12-09 | 北京红棉小冰科技有限公司 | Singing voice synthesis method, apparatus, device and storage medium |
WO2023063880A3 (fr) * | 2021-10-15 | 2023-07-13 | Lemon Inc. | System and method for training a transformer-in-transformer-based neural network model for audio data |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11574624B1 (en) * | 2021-03-31 | 2023-02-07 | Amazon Technologies, Inc. | Synthetic speech processing |
TWI836255B (zh) * | 2021-08-17 | 2024-03-21 | 國立清華大學 | Method and apparatus for designing a personalized virtual singer through singing voice conversion |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2276019A1 (fr) * | 2009-07-02 | 2011-01-19 | YAMAHA Corporation | Apparatus and method for creating a singing synthesizing database, and pitch curve generation apparatus and method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7977562B2 (en) * | 2008-06-20 | 2011-07-12 | Microsoft Corporation | Synthesized singing voice waveform generator |
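CN103035235A (zh) * | 2011-09-30 | 2013-04-10 | 西门子公司 | Method and apparatus for converting speech into melody |
CN103915093B (zh) * | 2012-12-31 | 2019-07-30 | 科大讯飞股份有限公司 | Method and apparatus for converting speech into singing |
CN109147757B (zh) * | 2018-09-11 | 2021-07-02 | 广州酷狗计算机科技有限公司 | Singing voice synthesis method and apparatus |
CN110148394B (zh) * | 2019-04-26 | 2024-03-01 | 平安科技(深圳)有限公司 | Singing voice synthesis method, apparatus, computer device and storage medium |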
- 2019-11-22: CN application CN201911156831.7A (CN112951198A, zh), active, status: Pending
- 2020-10-26: WO application PCT/US2020/057268 (WO2021101665A1, fr), active, status: Application Filing
Non-Patent Citations (3)
Title |
---|
KAZUHIRO NAKAMURA ET AL: "Singing voice synthesis based on convolutional neural networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 April 2019 (2019-04-15), XP081169221 * |
MERLIJN BLAAUW ET AL: "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs", APPLIED SCIENCES, vol. 7, no. 12, 18 December 2017 (2017-12-18), pages 1313, XP055627719, DOI: 10.3390/app7121313 * |
TAKESHI SAITOU ET AL: "Speech-to-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 2007 IEEE WORKSHOP ON, IEEE, PI, 1 October 2007 (2007-10-01), pages 215 - 218, XP031167096, ISBN: 978-1-4244-1618-9 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11908448B2 (en) * | 2020-10-21 | 2024-02-20 | Google Llc | Parallel tacotron non-autoregressive and controllable TTS |
US20220122582A1 (en) * | 2020-10-21 | 2022-04-21 | Google Llc | Parallel Tacotron Non-Autoregressive and Controllable TTS |
CN113409747B (zh) * | 2021-05-28 | 2023-08-29 | 北京达佳互联信息技术有限公司 | Song generation method and apparatus, electronic device and storage medium |
CN113409747A (zh) * | 2021-05-28 | 2021-09-17 | 北京达佳互联信息技术有限公司 | Song generation method and apparatus, electronic device and storage medium |
CN113362801A (zh) * | 2021-06-10 | 2021-09-07 | 携程旅游信息技术(上海)有限公司 | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment |
CN113593520A (zh) * | 2021-09-08 | 2021-11-02 | 广州虎牙科技有限公司 | Singing voice synthesis method and apparatus, electronic device and storage medium |
CN113593520B (zh) * | 2021-09-08 | 2024-05-17 | 广州虎牙科技有限公司 | Singing voice synthesis method and apparatus, electronic device and storage medium |
CN113963723A (zh) * | 2021-09-16 | 2022-01-21 | 秦慈军 | Music presentation method, apparatus, device and storage medium |
WO2023063880A3 (fr) * | 2021-10-15 | 2023-07-13 | Lemon Inc. | System and method for training a transformer-in-transformer-based neural network model for audio data |
US11854558B2 (en) | 2021-10-15 | 2023-12-26 | Lemon Inc. | System and method for training a transformer-in-transformer-based neural network model for audio data |
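CN114360492A (zh) * | 2021-10-26 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Audio synthesis method and apparatus, computer device and storage medium |
CN114267375B (zh) * | 2021-11-24 | 2022-10-28 | 北京百度网讯科技有限公司 | Phoneme detection method and apparatus, training method and apparatus, device and medium |
CN114267375A (zh) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | Phoneme detection method and apparatus, training method and apparatus, device and medium |
CN115457923A (zh) * | 2022-10-26 | 2022-12-09 | 北京红棉小冰科技有限公司 | Singing voice synthesis method, apparatus, device and storage medium |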
Also Published As
Publication number | Publication date |
---|---|
CN112951198A (zh) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021101665A1 (fr) | Singing voice synthesis | |
EP3588484B1 (fr) | Electronic musical instrument, electronic musical instrument control method, and storage medium | |
US11763797B2 (en) | Text-to-speech (TTS) processing | |
Kuligowska et al. | Speech synthesis systems: disadvantages and limitations | |
Umbert et al. | Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges | |
Bonada et al. | Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016 | |
Umbert et al. | Generating singing voice expression contours based on unit selection | |
US20230169953A1 (en) | Phrase-based end-to-end text-to-speech (tts) synthesis | |
Gupta et al. | Deep learning approaches in topics of singing information processing | |
Bonada et al. | Hybrid neural-parametric f0 model for singing synthesis | |
KR102168529B1 (ko) | Method and apparatus for synthesizing singing voice using an artificial neural network | |
Wada et al. | Sequential generation of singing f0 contours from musical note sequences based on wavenet | |
JP5930738B2 (ja) | Speech synthesis device and speech synthesis method | |
Lee et al. | A comparative study of spectral transformation techniques for singing voice synthesis. | |
Liu et al. | Controllable accented text-to-speech synthesis | |
Delalez et al. | Vokinesis: syllabic control points for performative singing synthesis. | |
JP5874639B2 (ja) | Speech synthesis device, speech synthesis method and speech synthesis program | |
Saeed et al. | A novel multi-speakers Urdu singing voices synthesizer using Wasserstein Generative Adversarial Network | |
Yin | An overview of speech synthesis technology | |
Bonada et al. | Spectral approach to the modeling of the singing voice | |
Yang et al. | Mandarin singing voice synthesis with a phonology-based duration model | |
Li et al. | A lyrics to singing voice synthesis system with variable timbre | |
JP2020204755A (ja) | Voice processing device and voice processing method | |
JP2020204651A (ja) | Voice processing device and voice processing method | |
Pucher et al. | Development of a statistical parametric synthesis system for operatic singing in German |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20808570; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 20808570; Country of ref document: EP; Kind code of ref document: A1 |