CN112397043B - Method and system for converting voice into song - Google Patents
Method and system for converting voice into song
- Publication number
- CN112397043B (application CN202011207626.1A)
- Authority
- CN
- China
- Prior art keywords
- spectrogram
- mel
- song
- voice
- contour
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
- G10H1/0025—Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/101—Music Composition or musical creation; Tools or processes therefor
Abstract
The invention discloses a method for converting speech into songs, comprising the following steps: processing a speech signal and converting it into a mel spectrogram; extracting F0 (fundamental frequency) contours from different sound sources with a melody extractor; time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders; associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram; and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song. The invention also discloses a system for converting speech into songs. The invention can effectively improve the sound quality of the generated songs and thereby the user experience.
Description
Technical Field
The invention relates to the technical field of voice signal processing, in particular to a method and a system for converting voice into songs.
Background
At present, song synthesis is in demand in fields such as entertainment, karaoke, and music production. Song synthesis creates a natural-sounding song under given conditions, such as lyrics, pitch labels, or reference audio. The reference audio may be a sung passage by one person, in which case the task is to convert the timbre of that passage into the timbre of another person. The reference audio may also be a passage of a person's speech, in which case the task is to convert it into a sung passage with the same timbre identity and linguistic content, without reference to its underlying phoneme sequence.
However, songs generated by prior-art song synthesis methods sound distorted and unnatural, which greatly degrades the user experience.
Disclosure of Invention
The present invention aims to provide a method and a system for converting speech into songs, so as to solve the above technical problem.
To achieve this purpose, the invention adopts the following technical scheme:
A method for converting speech into songs is provided, comprising:
processing a speech signal and converting the speech signal into a mel spectrogram;
extracting F0 contours from different sound sources with a melody extractor;
time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;
associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram;
and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
The present invention also provides a system for converting speech into songs, comprising:
a voice processing module for processing a speech signal and converting the speech signal into a mel spectrogram;
a sound source processing module for extracting F0 contours from different sound sources with a melody extractor;
an encoding module for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;
a decoding module for associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram;
and an output module for processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
The beneficial effects of the invention are as follows:
1. The invention converts speech into a song by means of the encoders and the decoder and processes the converted song with a MelGAN vocoder, thereby avoiding the distorted and unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.
2. The invention processes the song spectrogram with a MelGAN vocoder; because MelGAN is efficient and generalizes well, the sound quality is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a diagram of the steps of a method for converting speech into songs according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for converting speech into songs according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments with reference to the accompanying drawings.
The drawings are for illustration only; they are schematic rather than depictions of actual form and are not to be construed as limiting this patent. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The method for converting speech into songs, as shown in FIG. 1, includes the following steps:
processing a speech signal and converting the speech signal into a mel spectrogram;
extracting F0 contours from different sound sources with a melody extractor;
time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;
associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram;
and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
The encoders and the decoder convert the speech into a song, and the converted song is processed with the MelGAN vocoder, which overcomes the distorted and unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.
In addition, because MelGAN is efficient and generalizes well, processing the song spectrogram with the MelGAN vocoder effectively improves the sound quality.
In one embodiment, the speech signal is processed and converted into a mel spectrogram, wherein the processing of the speech signal comprises:
resampling the speech signal multiple times so as to change its rhythm, thereby effectively separating the content of the speech signal from its rhythm.
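The conversion of a speech signal into a mel spectrogram can be illustrated with a minimal Python sketch. It frames the signal, applies a Hann window and a naive DFT, and pools the power spectrum with triangular filters spaced evenly on the mel scale. The function names, the 256-sample frame, the 128-sample hop, and the 20 mel bands are assumptions chosen for illustration; the patent does not specify them, and a practical system would use an FFT library rather than the slow DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz):
    # Common (HTK-style) mel-scale formula.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, one row per band."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bins = [int((n_fft + 1) * hz / sample_rate) for hz in edges_hz]
    bank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_mels)]
    for m in range(n_mels):
        left, centre, right = bins[m], bins[m + 1], bins[m + 2]
        for k in range(left, centre):      # rising slope
            bank[m][k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling slope, 1.0 at the centre
            bank[m][k] = (right - k) / (right - centre)
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (illustration only)."""
    n = len(frame)
    xs = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
          for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(xs[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_mel_spectrogram(signal, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return a list of frames, each a list of n_mels log mel energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        frames.append([math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                       for row in bank])
    return frames
```

Calling `log_mel_spectrogram(samples, 8000)` on a list of float samples yields one 20-value frame per 128-sample hop; higher-pitched input moves the energy towards higher band indices.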
In one embodiment, the F0 contour is extracted from different sound sources by a melody extractor, wherein the sound sources include humming and singing.
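The patent does not say how the melody extractor works. As a stand-in, the following Python sketch estimates an F0 contour frame by frame with a plain autocorrelation pitch detector, which is a simple substitute technique and not the extractor of the invention. The function names, the 80-500 Hz search range, the frame and hop sizes, and the energy threshold for treating a frame as unvoiced are all assumptions made for illustration.

```python
import math


def frame_f0(frame, sample_rate, f_min=80.0, f_max=500.0, energy_floor=1e-6):
    """Estimate F0 of one frame by picking the strongest autocorrelation lag.

    Returns 0.0 for frames treated as unvoiced (assumed convention).
    """
    if sum(x * x for x in frame) < energy_floor:
        return 0.0
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def f0_contour(signal, sample_rate, frame_len=400, hop=200):
    """Return one F0 estimate in Hz per analysis frame."""
    return [frame_f0(signal[start:start + frame_len], sample_rate)
            for start in range(0, len(signal) - frame_len + 1, hop)]
```

The length of the returned list is the number of F0 frames, which is the target length to which the mel spectrogram is later stretched. Real humming or singing would need a more robust tracker with voicing detection and octave-error handling.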
In one embodiment, the mel spectrogram is time-stretched to the same length as the F0 contour, and the mel spectrogram and the F0 contour are encoded by two separate encoders, wherein the time-stretching of the mel spectrogram to the same length as the F0 contour comprises the following:
the mel spectrogram is randomly segmented, and each segment is stretched or squeezed along the time axis. In this embodiment, the mel spectrogram is divided into segments of random length between 16 and 32 frames, and each segment is time-stretched or squeezed by a factor between 0.5 and 2 so that the result has the same length as the F0 contour.
The lengths of the mel spectrogram and the F0 contour usually differ greatly, and eliminating this difference effectively improves the sound quality of the song, which is why this scheme is provided.
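The random segment stretching described in this embodiment can be sketched in Python as follows. The mel spectrogram is represented as a list of frames, segments of 16-32 frames are cut at random, and each is resampled by a random factor between 0.5 and 2 using linear interpolation. The patent does not state how independently chosen random factors end up matching the F0 length exactly; this sketch assumes a final uniform resampling to the target length, and the function names and the choice of linear interpolation are likewise illustrative assumptions.

```python
import random


def resample_frames(frames, new_len):
    """Linearly interpolate a list of frames to new_len frames."""
    n = len(frames)
    if n == 1:
        return [list(frames[0]) for _ in range(new_len)]
    if new_len == 1:
        return [list(frames[0])]
    out = []
    for i in range(new_len):
        pos = i * (n - 1) / (new_len - 1)   # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def random_stretch_to_length(mel, target_len, rng,
                             seg_range=(16, 32), factor_range=(0.5, 2.0)):
    """Randomly stretch or squeeze segments, then match target_len frames."""
    stretched = []
    start = 0
    while start < len(mel):
        seg_len = rng.randint(*seg_range)        # 16-32 frames, inclusive
        segment = mel[start:start + seg_len]     # the tail segment may be shorter
        factor = rng.uniform(*factor_range)      # 0.5x to 2x
        new_len = max(1, round(len(segment) * factor))
        stretched.extend(resample_frames(segment, new_len))
        start += seg_len
    # Assumption: a final uniform resample makes the length equal to the F0 contour.
    return resample_frames(stretched, target_len)
```

For example, `random_stretch_to_length(mel, len(f0), random.Random(0))` returns a spectrogram with exactly `len(f0)` frames and the same number of mel bands per frame as the input.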
Based on the same inventive concept, an embodiment of the present invention further provides a system for converting speech into songs, as shown in FIG. 2, including:
a voice processing module 1 for processing a speech signal and converting the speech signal into a mel spectrogram;
a sound source processing module 2 for extracting F0 contours from different sound sources with a melody extractor;
an encoding module 3 for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;
a decoding module 4 for associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram;
and an output module 5 for processing the song spectrogram with the MelGAN vocoder to improve the sound quality of the output song.
The encoders and the decoder convert the speech into a song, and the converted song is processed with the MelGAN vocoder, which overcomes the distorted and unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.
In addition, because MelGAN is efficient and generalizes well, processing the song spectrogram with the MelGAN vocoder effectively improves the sound quality.
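How the five modules fit together can be sketched as a small Python data-flow skeleton. Each module is modelled as a plain callable so that the order of operations is explicit. The class name, field names, and type aliases are hypothetical, and the neural encoders, the decoder, and the MelGAN vocoder are not implemented here; they would be supplied as trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frames = List[List[float]]  # one inner list per time frame


@dataclass
class SpeechToSongSystem:
    speech_to_mel: Callable[[Sequence[float]], Frames]    # voice processing module 1
    extract_f0: Callable[[Sequence[float]], List[float]]  # sound source processing module 2
    stretch: Callable[[Frames, int], Frames]              # stretching step of encoding module 3
    encode_mel: Callable[[Frames], Frames]                # first encoder of module 3
    encode_f0: Callable[[List[float]], Frames]            # second encoder of module 3
    decode: Callable[[Frames, Frames], Frames]            # decoding module 4
    vocode: Callable[[Frames], List[float]]               # output module 5 (vocoder)

    def convert(self, speech, melody_source):
        mel = self.speech_to_mel(speech)
        f0 = self.extract_f0(melody_source)
        mel = self.stretch(mel, len(f0))
        if len(mel) != len(f0):
            raise ValueError("mel spectrogram and F0 contour must have equal length")
        song_mel = self.decode(self.encode_mel(mel), self.encode_f0(f0))
        return self.vocode(song_mel)
```

Keeping the modules as injected callables means that, for instance, the melody extractor or the vocoder can be replaced without touching the rest of the data flow.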
In one embodiment, the voice processing module 1 comprises:
a voice sampling module for resampling the speech signal multiple times so as to change its rhythm, thereby effectively separating the content of the speech signal from its rhythm.
In one embodiment, the encoding module 3 includes:
a stretching module for randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis. In this embodiment, the mel spectrogram is divided into segments of random length between 16 and 32 frames, and each segment is time-stretched or squeezed by a factor between 0.5 and 2 so that the result has the same length as the F0 contour.
The lengths of the mel spectrogram and the F0 contour usually differ greatly, and eliminating this difference effectively improves the sound quality of the song, which is why this scheme is provided.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.
Claims (5)
1. A method for converting speech into songs, comprising:
processing a speech signal and converting the speech signal into a mel spectrogram;
extracting F0 contours from different sound sources with a melody extractor;
time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders, which comprises:
randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis so as to obtain the same length as the F0 contour, specifically:
dividing the mel spectrogram into segments with random lengths of 16-32 frames, and time-stretching or squeezing each segment by a factor of 0.5-2 so as to obtain the same length as the F0 contour;
associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram;
and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
2. The method of claim 1, wherein processing the speech signal and converting the speech signal into a mel spectrogram comprises:
resampling the speech signal multiple times so as to change its rhythm, thereby effectively separating the content of the speech signal from its rhythm.
3. The method of claim 1, wherein the F0 contours are extracted from different sound sources with a melody extractor, and the sound sources include humming and singing.
4. A system for converting speech into songs, comprising:
a voice processing module for processing a speech signal and converting the speech signal into a mel spectrogram;
a sound source processing module for extracting F0 contours from different sound sources with a melody extractor;
an encoding module for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;
the encoding module comprising:
a stretching module for randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis so as to obtain the same length as the F0 contour, specifically:
dividing the mel spectrogram into segments with random lengths of 16-32 frames, and time-stretching or squeezing each segment by a factor of 0.5-2 so as to obtain the same length as the F0 contour;
a decoding module for associating the encoded mel spectrogram with the F0 contour in a decoder to generate a song spectrogram;
and an output module for processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
5. The system for converting speech into songs of claim 4, wherein the voice processing module comprises:
a voice sampling module for resampling the speech signal multiple times so as to change its rhythm, thereby effectively separating the content of the speech signal from its rhythm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011207626.1A CN112397043B (en) | 2020-11-03 | 2020-11-03 | Method and system for converting voice into song |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112397043A (en) | 2021-02-23 |
CN112397043B (en) | 2021-11-16 |
Family
ID=74597852
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011207626.1A Active CN112397043B (en) | 2020-11-03 | 2020-11-03 | Method and system for converting voice into song |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112397043B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104272382A (en) * | 2012-03-06 | 2015-01-07 | 新加坡科技研究局 | Method and system for template-based personalized singing synthesis |
CN111091800A (en) * | 2019-12-25 | 2020-05-01 | 北京百度网讯科技有限公司 | Song generation method and device |
CN111157099A (en) * | 2020-01-02 | 2020-05-15 | 河海大学常州校区 | Distributed optical fiber sensor vibration signal classification method and identification classification system |
CN111179972A (en) * | 2019-12-12 | 2020-05-19 | 中山大学 | Human voice detection algorithm based on deep learning |
CN111816148A (en) * | 2020-06-24 | 2020-10-23 | 厦门大学 | Virtual human voice and video singing method and system based on generation countermeasure network |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103632672B (en) * | 2012-08-28 | 2017-03-22 | 腾讯科技(深圳)有限公司 | Voice-changing system, voice-changing method, man-machine interaction system and man-machine interaction method |
US9837101B2 (en) * | 2014-11-25 | 2017-12-05 | Facebook, Inc. | Indexing based on time-variant transforms of an audio signal's spectrogram |
JP6631199B2 (en) * | 2015-11-27 | 2020-01-15 | ヤマハ株式会社 | Technique determination device |
CA3019506C (en) * | 2016-04-12 | 2021-01-19 | Markus Multrus | Audio encoder for encoding an audio signal, method for encoding an audio signal and computer program under consideration of a detected peak spectral region in an upper frequency band |
JP6788560B2 (en) * | 2017-09-05 | 2020-11-25 | 株式会社エクシング | Singing evaluation device, singing evaluation program, singing evaluation method and karaoke device |
CN108074557B (en) * | 2017-12-11 | 2021-11-23 | 深圳Tcl新技术有限公司 | Tone adjusting method, device and storage medium |
CN111402858B (en) * | 2020-02-27 | 2024-05-03 | 平安科技(深圳)有限公司 | Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium |
CN111445892B (en) * | 2020-03-23 | 2023-04-14 | 北京字节跳动网络技术有限公司 | Song generation method and device, readable medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
"Emotional Voice Conversion using a Hybrid Framework With Speaker-adaptive DNN and Particle-Swarm Neural Network"; Susmitha Vekkot; IEEE Access; 2020-04-20; Vol. 8; entire document * |
"Emotional voice conversion using DNN with MCC and F0 features"; Zhaojie Luo; 2016 ACIS; 2016-12-31; entire document * |
Also Published As
Publication number | Publication date |
---|---|
CN112397043A (en) | 2021-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10789290B2 (en) | Audio data processing method and apparatus, and computer storage medium | |
CN108899009B (en) | Chinese speech synthesis system based on phoneme | |
JP6290858B2 (en) | Computer processing method, apparatus, and computer program product for automatically converting input audio encoding of speech into output rhythmically harmonizing with target song | |
US9135923B1 (en) | Pitch synchronous speech coding based on timbre vectors | |
EP0140777A1 (en) | Process for encoding speech and an apparatus for carrying out the process | |
Kim et al. | Korean singing voice synthesis system based on an LSTM recurrent neural network | |
KR101565633B1 (en) | APPARATUS AND METHOD FOR ENCODING AND DECODING OF INTEGRATed VOICE AND MUSIC | |
CN112466313B (en) | Method and device for synthesizing singing voices of multiple singers | |
JP2005018097A (en) | Singing synthesizer | |
CN109616131B (en) | Digital real-time voice sound changing method | |
TW201312547A (en) | Music generator | |
CN111681641A (en) | Phrase-based end-to-end text-to-speech (TTS) synthesis | |
Zhang et al. | Susing: Su-net for singing voice synthesis | |
CN112397043B (en) | Method and system for converting voice into song | |
Cooper et al. | Text-to-speech synthesis techniques for MIDI-to-audio synthesis | |
JP5360489B2 (en) | Phoneme code converter and speech synthesizer | |
Patel | An Empirical Method for Comparing Pitch Patterns in Spoken and Musical Melodies: A Comment on JGS Pearl's" Eavesdropping with a Master: Leos Janáček and the Music of Speech." | |
JP2006030609A (en) | Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program | |
CN110610721B (en) | Detection system and method based on lyric singing accuracy | |
Nordstrom et al. | Transforming perceived vocal effort and breathiness using adaptive pre-emphasis linear prediction | |
CN1085367C (en) | Chinese spoken language distinguishing and synthesis type vocoder | |
JP5560769B2 (en) | Phoneme code converter and speech synthesizer | |
JP5471138B2 (en) | Phoneme code converter and speech synthesizer | |
JPH11352997A (en) | Voice synthesizing device and control method thereof | |
KR920005509B1 (en) | Natural sound synthesizer by adding noise |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||