CN112397043B - Method and system for converting voice into song - Google Patents

Method and system for converting voice into song

Info

Publication number
CN112397043B
Authority
CN
China
Prior art keywords
spectrogram
mel
song
voice
contour
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011207626.1A
Other languages
Chinese (zh)
Other versions
CN112397043A (en)
Inventor
Inventor name not published
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202011207626.1A
Publication of CN112397043A
Application granted
Publication of CN112397043B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor

Abstract

The invention discloses a method for converting speech into a song, comprising the following steps: processing a speech signal and converting it into a mel spectrogram; extracting F0 contours from different sound sources with a melody extractor; time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders; associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram; and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song. The invention also discloses a system for converting speech into a song. The invention can effectively improve the sound quality of the generated songs and thereby the user experience.

Description

Method and system for converting voice into song
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a method and a system for converting voice into songs.
Background
At present, there is demand for song synthesis in entertainment, karaoke, music production, and similar fields. Song synthesis creates a natural-sounding song from given conditions such as lyrics, pitch labels, or reference audio. The reference audio may be a passage sung by one person, in which case the task is to convert the timbre of that passage into the timbre of another person. The reference audio may also be a passage of a person's speech, in which case the task is to convert it into a sung passage with the same timbre identity and linguistic content, without reference to its underlying phoneme sequence.
However, songs generated by prior-art song synthesis methods sound distorted and unnatural, which greatly degrades the user experience.
Disclosure of Invention
The present invention aims to provide a method and a system for converting speech into a song, so as to solve the above technical problem.
To achieve this aim, the invention adopts the following technical solution:
A method for converting speech into a song is provided, comprising:
processing a speech signal and converting it into a mel spectrogram;
extracting F0 contours from different sound sources with a melody extractor;
time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders;
associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram;
and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
The present invention also provides a system for converting speech into a song, comprising:
a speech processing module for processing a speech signal and converting it into a mel spectrogram;
a sound source processing module for extracting F0 contours from different sound sources with a melody extractor;
an encoding module for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders;
a decoding module for associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram;
and an output module for processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
The invention has the following beneficial effects:
1. By providing the encoders and the decoder, the invention converts speech into a song and processes the converted song with a MelGAN vocoder, thereby avoiding the distorted, unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.
2. The invention processes the song spectrogram with a MelGAN vocoder; because MelGAN is notably efficient and broadly applicable, the sound quality is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a diagram of the steps of a method for converting speech into a song according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for converting speech into a song according to an embodiment of the present invention.
Detailed Description
The technical solution of the invention is further explained below through specific embodiments in combination with the accompanying drawings.
The drawings are for illustration only; they are schematic rather than depictions of an actual product and are not to be construed as limiting this patent. To better illustrate the embodiments of the invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product. Those skilled in the art will understand that certain well-known structures in the drawings, and their descriptions, may be omitted.
The method for converting speech into a song, as shown in FIG. 1, includes the following steps:
processing a speech signal and converting it into a mel spectrogram;
extracting F0 contours from different sound sources with a melody extractor;
time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders;
associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram;
and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
The encoders and the decoder convert the speech into a song, and the converted song is processed with the MelGAN vocoder, which overcomes the distorted, unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.
In addition, because MelGAN is notably efficient and broadly applicable, processing the song spectrogram with the MelGAN vocoder effectively improves the sound quality.
In one embodiment, the speech signal is processed and converted into a mel spectrogram, wherein processing the speech signal comprises:
resampling the speech signal multiple times and changing its rhythm, so as to effectively separate the content of the speech signal from its rhythm.
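The patent does not state how the mel spectrogram is computed. As a hedged illustration only, the sketch below shows the widely used HTK-style mel mapping and how it spaces filter centre frequencies. The function names, the 80-band count and the 0 Hz to 8 kHz range are assumptions chosen for the example, not values taken from the patent.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Map a frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Return n_mels filter centre frequencies in Hz, equally spaced in mel."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]


# Assumed example settings: 80 mel bands between 0 Hz and 8 kHz.
centers = mel_center_frequencies(n_mels=80, f_min=0.0, f_max=8000.0)
```

The centre frequencies are dense at low frequencies and sparse at high frequencies, which is why a mel spectrogram resolves the pitch and formant region of speech more finely than a linear spectrogram with the same number of bins.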
In one embodiment, the F0 contour is extracted from different sound sources by a melody extractor, wherein the sound sources include humming and singing.
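The melody extractor itself is not specified in the patent. The following is a minimal stand-in, assuming a plain autocorrelation pitch estimator applied frame by frame. The names `estimate_f0` and `f0_contour`, the 80 to 500 Hz search range, and the frame and hop sizes are all hypothetical choices for illustration.

```python
import math
from typing import List


def estimate_f0(frame: List[float], sample_rate: int,
                f_min: float = 80.0, f_max: float = 500.0) -> float:
    """Estimate the F0 of one frame in Hz by autocorrelation; 0.0 if silent."""
    energy = sum(x * x for x in frame)
    if energy < 1e-9:
        return 0.0  # treat near-silent frames as unvoiced
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(len(frame) - 1, int(sample_rate / f_min))
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def f0_contour(signal: List[float], sample_rate: int,
               frame_len: int = 400, hop: int = 200) -> List[float]:
    """Slide a window over the signal and return one F0 value per frame."""
    return [estimate_f0(signal[start:start + frame_len], sample_rate)
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Synthetic "humming": a pure 200 Hz tone sampled at 8 kHz for 0.25 s.
sr = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sr) for n in range(2000)]
contour = f0_contour(tone, sr)
```

A practical extractor for humming or singing would also need voicing decisions and octave-error handling, which this sketch leaves out.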
In one embodiment, the mel spectrogram is time-stretched to the same length as the F0 contour, and the mel spectrogram and the F0 contour are encoded separately by two encoders, wherein time-stretching the mel spectrogram to the same length as the F0 contour comprises:
randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis. In this embodiment, the mel spectrogram is divided into segments of random length between 16 and 32 frames, and each segment is stretched or squeezed in time by a factor of 0.5 to 2 so that the result has the same length as the F0 contour.
The mel spectrogram and the F0 contour usually differ greatly in length, and eliminating this difference can effectively improve the sound quality of the song; this is the reason for the scheme above.
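The random segmentation and stretching described above can be sketched as follows. The segment lengths (16 to 32 frames) and stretch factors (0.5 to 2) come from the embodiment; everything else is an assumption. In particular, the patent does not say how the randomly stretched result is made to match the F0 contour length exactly, so this sketch adds a final global linear resample as a hypothetical way to guarantee it. The helper names `resample_time` and `random_segment_stretch` are illustrative, and the input is assumed to be non-empty.

```python
import random
from typing import List

Frame = List[float]  # one mel frame: a list of mel-bin values


def resample_time(frames: List[Frame], new_len: int) -> List[Frame]:
    """Linearly interpolate a non-empty frame sequence to new_len frames."""
    if new_len <= 1 or len(frames) == 1:
        return [list(frames[0]) for _ in range(max(new_len, 1))]
    out = []
    scale = (len(frames) - 1) / (new_len - 1)
    for i in range(new_len):
        pos = i * scale
        lo = min(int(pos), len(frames) - 2)
        w = pos - lo
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


def random_segment_stretch(mel: List[Frame], target_len: int,
                           rng: random.Random) -> List[Frame]:
    """Stretch or squeeze random 16-32 frame segments, then match target_len."""
    stretched: List[Frame] = []
    start = 0
    while start < len(mel):
        seg_len = rng.randint(16, 32)          # segment length from the embodiment
        segment = mel[start:start + seg_len]   # the last segment may be shorter
        factor = rng.uniform(0.5, 2.0)         # stretch factor from the embodiment
        new_len = max(1, round(len(segment) * factor))
        stretched.extend(resample_time(segment, new_len))
        start += seg_len
    # Assumption: a final global resample guarantees the exact target length.
    return resample_time(stretched, target_len)


# Toy input: 100 frames with 4 mel bins each; target F0 contour of 150 frames.
mel = [[float(t)] * 4 for t in range(100)]
aligned = random_segment_stretch(mel, target_len=150, rng=random.Random(0))
```

Because each segment receives its own factor, the local rhythm of the speech is perturbed while the order of the frames is preserved.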
Based on the same inventive concept, an embodiment of the present invention further provides a system for converting speech into a song, as shown in FIG. 2, including:
a speech processing module 1 for processing a speech signal and converting it into a mel spectrogram;
a sound source processing module 2 for extracting F0 contours from different sound sources with a melody extractor;
an encoding module 3 for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders;
a decoding module 4 for associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram;
and an output module 5 for processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
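To make the data flow between the five modules concrete, here is a small sketch that only wires them together. Every callable is a hypothetical placeholder; the patent does not define these interfaces, and the string-building stubs in the example stand in for real signal processing and neural networks.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechToSongSystem:
    """Chains the five modules; each callable is a hypothetical stub."""
    speech_to_mel: Callable[[Any], Any]            # module 1: speech -> mel spectrogram
    extract_f0: Callable[[Any], Any]               # module 2: sound source -> F0 contour
    stretch_and_encode: Callable[[Any, Any], Any]  # module 3: align lengths, encode both
    decode: Callable[[Any], Any]                   # module 4: codes -> song spectrogram
    vocode: Callable[[Any], Any]                   # module 5: spectrogram -> waveform

    def convert(self, speech: Any, melody_source: Any) -> Any:
        mel = self.speech_to_mel(speech)
        f0 = self.extract_f0(melody_source)
        codes = self.stretch_and_encode(mel, f0)
        song_spectrogram = self.decode(codes)
        return self.vocode(song_spectrogram)


# Stub modules that record the order in which they are applied.
system = SpeechToSongSystem(
    speech_to_mel=lambda speech: f"mel({speech})",
    extract_f0=lambda source: f"f0({source})",
    stretch_and_encode=lambda mel, f0: f"enc({mel},{f0})",
    decode=lambda codes: f"dec({codes})",
    vocode=lambda spec: f"wav({spec})",
)
result = system.convert("speech", "humming")
```

The nesting of the resulting string mirrors the module order: the speech and the melody source are processed independently and meet only in the encoding module.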
The encoders and the decoder convert the speech into a song, and the converted song is processed with the MelGAN vocoder, which overcomes the distorted, unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.
In addition, because MelGAN is notably efficient and broadly applicable, processing the song spectrogram with the MelGAN vocoder effectively improves the sound quality.
In one embodiment, the speech processing module 1 comprises:
a speech sampling module for resampling the speech signal multiple times and changing its rhythm, so as to effectively separate the content of the speech signal from its rhythm.
In one embodiment, the encoding module 3 includes:
a stretching module for randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis. In this embodiment, the mel spectrogram is divided into segments of random length between 16 and 32 frames, and each segment is stretched or squeezed in time by a factor of 0.5 to 2 so that the result has the same length as the F0 contour.
The mel spectrogram and the F0 contour usually differ greatly in length, and eliminating this difference can effectively improve the sound quality of the song; this is the reason for the scheme above.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims (5)

1. A method for converting speech into a song, comprising:
processing a speech signal and converting it into a mel spectrogram;
extracting F0 contours from different sound sources with a melody extractor;
time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders, which comprises:
randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis so as to obtain the same length as the F0 contour, specifically:
dividing the mel spectrogram into segments of random length between 16 and 32 frames, and stretching or squeezing each segment in time by a factor of 0.5 to 2 so as to obtain the same length as the F0 contour;
associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram;
and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
2. The method of claim 1, wherein processing the speech signal and converting it into a mel spectrogram comprises:
resampling the speech signal multiple times and changing its rhythm, so as to effectively separate the content of the speech signal from its rhythm.
3. The method of claim 1, wherein the F0 contour is extracted from different sound sources by a melody extractor, wherein the sound sources include humming and singing.
4. A system for converting speech into a song, comprising:
a speech processing module for processing a speech signal and converting it into a mel spectrogram;
a sound source processing module for extracting F0 contours from different sound sources with a melody extractor;
an encoding module for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour separately with two encoders;
wherein the encoding module comprises:
a stretching module for randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis so as to obtain the same length as the F0 contour, specifically:
dividing the mel spectrogram into segments of random length between 16 and 32 frames, and stretching or squeezing each segment in time by a factor of 0.5 to 2 so as to obtain the same length as the F0 contour;
a decoding module for associating the encoded mel spectrogram with the F0 contour through a decoder and generating a song spectrogram;
and an output module for processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
5. The system for converting speech into a song of claim 4, wherein the speech processing module comprises:
a speech sampling module for resampling the speech signal multiple times and changing its rhythm, so as to effectively separate the content of the speech signal from its rhythm.
CN202011207626.1A 2020-11-03 2020-11-03 Method and system for converting voice into song Active CN112397043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011207626.1A CN112397043B (en) 2020-11-03 2020-11-03 Method and system for converting voice into song

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011207626.1A CN112397043B (en) 2020-11-03 2020-11-03 Method and system for converting voice into song

Publications (2)

Publication Number Publication Date
CN112397043A CN112397043A (en) 2021-02-23
CN112397043B CN112397043B (en) 2021-11-16

Family

ID=74597852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011207626.1A Active CN112397043B (en) 2020-11-03 2020-11-03 Method and system for converting voice into song

Country Status (1)

Country Link
CN (1) CN112397043B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104272382A (en) * 2012-03-06 2015-01-07 新加坡科技研究局 Method and system for template-based personalized singing synthesis
CN111091800A (en) * 2019-12-25 2020-05-01 北京百度网讯科技有限公司 Song generation method and device
CN111157099A (en) * 2020-01-02 2020-05-15 河海大学常州校区 Distributed optical fiber sensor vibration signal classification method and identification classification system
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111816148A (en) * 2020-06-24 2020-10-23 厦门大学 Virtual human voice and video singing method and system based on generative adversarial network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103632672B (en) * 2012-08-28 2017-03-22 腾讯科技(深圳)有限公司 Voice-changing system, voice-changing method, man-machine interaction system and man-machine interaction method
US9837101B2 (en) * 2014-11-25 2017-12-05 Facebook, Inc. Indexing based on time-variant transforms of an audio signal's spectrogram
JP6631199B2 (en) * 2015-11-27 2020-01-15 ヤマハ株式会社 Technique determination device
CA3019506C (en) * 2016-04-12 2021-01-19 Markus Multrus Audio encoder for encoding an audio signal, method for encoding an audio signal and computer program under consideration of a detected peak spectral region in an upper frequency band
JP6788560B2 (en) * 2017-09-05 2020-11-25 株式会社エクシング Singing evaluation device, singing evaluation program, singing evaluation method and karaoke device
CN108074557B (en) * 2017-12-11 2021-11-23 深圳Tcl新技术有限公司 Tone adjusting method, device and storage medium
CN111402858B (en) * 2020-02-27 2024-05-03 平安科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN111445892B (en) * 2020-03-23 2023-04-14 北京字节跳动网络技术有限公司 Song generation method and device, readable medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104272382A (en) * 2012-03-06 2015-01-07 新加坡科技研究局 Method and system for template-based personalized singing synthesis
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111091800A (en) * 2019-12-25 2020-05-01 北京百度网讯科技有限公司 Song generation method and device
CN111157099A (en) * 2020-01-02 2020-05-15 河海大学常州校区 Distributed optical fiber sensor vibration signal classification method and identification classification system
CN111816148A (en) * 2020-06-24 2020-10-23 厦门大学 Virtual human voice and video singing method and system based on generative adversarial network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Emotional Voice Conversion using a Hybrid Framework With Speaker-adaptive DNN and Particle-Swarm Neural Network"; Susmitha Vekkot; IEEE Access; 2020-04-20; vol. 8; full text *
"Emotional voice conversion using DNN with MCC and F0 features"; Zhaojie Luo; 2016 ACIS; 2016-12-31; full text *

Also Published As

Publication number Publication date
CN112397043A (en) 2021-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant