CN112397043B - Method and system for converting voice into song

Info

Publication number: CN112397043B
Application number: CN202011207626.1A
Authority: CN
Inventors: Not disclosed (inventor name withheld from publication)
Original assignee: Beijing Zhongke Shenzhi Technology Co ltd
Current assignee: Beijing Zhongke Shenzhi Technology Co ltd
Priority date: 2020-11-03
Filing date: 2020-11-03
Publication date: 2021-11-16
Anticipated expiration: 2040-11-03
Also published as: CN112397043A

Abstract

The invention discloses a method for converting speech into songs, comprising the following steps: processing a speech signal and converting it into a mel spectrogram; extracting F0 contours from different sound sources with a melody extractor; time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders; associating the encoded mel spectrogram with the F0 contour through a decoder to generate a song spectrogram; and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song. The invention also discloses a system for converting speech into songs. The invention can effectively improve the sound quality of the generated songs and improve the user experience.

Description

Method and system for converting voice into song

Technical Field

The invention relates to the technical field of voice signal processing, in particular to a method and a system for converting voice into songs.

Background

At present, there is demand for song synthesis in fields such as entertainment, karaoke, and music production. Song synthesis generates a natural-sounding song conditioned on inputs such as lyrics, pitch labels, or reference audio. When the reference audio is a singing passage by one person, the task is to convert the timbre of that passage to the timbre of another person. When the reference audio is a passage of a person's speech, the task is to convert it into a singing passage with the same timbre identity and linguistic content, without reference to its underlying phoneme sequence.

However, songs generated by prior-art song synthesis methods sound distorted and unnatural, which greatly degrades the user experience.

Disclosure of Invention

The present invention aims to provide a method and a system for converting speech into songs, so as to solve the above technical problem.

To achieve this purpose, the invention adopts the following technical solution:

A method for converting speech into songs is provided, comprising:

processing a speech signal and converting it into a mel spectrogram;

extracting F0 contours from different sound sources with a melody extractor;

time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;

associating the encoded mel spectrogram with the F0 contour through a decoder to generate a song spectrogram;

and processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.
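The first step above does not say how the mel conversion is carried out. As a minimal sketch, assuming the common HTK-style mel formula and triangular band filters (neither is stated in this document), one frame of a power spectrum could be projected onto mel bands as follows. The function names and the parameters `n_mels`, `f_min`, and `f_max` are illustrative assumptions, not part of the patent.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale (an assumed variant): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, f_min: float, f_max: float) -> list:
    """Return n_mels + 2 band-edge frequencies in Hz, evenly spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


def mel_frame(power_spectrum: list, sample_rate: float, n_mels: int = 20) -> list:
    """Project one frame's power spectrum (bins 0..Nyquist) onto triangular mel bands."""
    n_bins = len(power_spectrum)
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)  # frequency spacing between bins
    edges = mel_band_edges(n_mels, 0.0, sample_rate / 2.0)
    bands = []
    for m in range(1, n_mels + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, power in enumerate(power_spectrum):
            f = k * bin_hz
            if left < f <= center:      # rising slope of the triangle
                total += power * (f - left) / (center - left)
            elif center < f < right:    # falling slope of the triangle
                total += power * (right - f) / (right - center)
        bands.append(total)
    return bands
```

Applying `mel_frame` to every short-time frame of the speech signal yields a frames-by-bands matrix, which is the mel spectrogram shape assumed by the later steps.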

The present invention also provides a system for converting speech into songs, comprising:

a voice processing module, used for processing the speech signal and converting it into a mel spectrogram;

a sound source processing module, used for extracting F0 contours from different sound sources with a melody extractor;

an encoding module, used for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;

a decoding module, used for associating the encoded mel spectrogram with the F0 contour through a decoder to generate a song spectrogram;

and an output module, used for processing the song spectrogram with a MelGAN vocoder to improve the sound quality of the output song.

1. The invention converts speech into a song by means of the encoders and the decoder and processes the converted song with the MelGAN vocoder, thereby avoiding the distorted and unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.

2. The invention processes the song spectrogram with the MelGAN vocoder; because MelGAN is efficient and broadly applicable, the sound quality is effectively improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a diagram of the steps of a method for converting speech into songs according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a system for converting speech into songs according to an embodiment of the present invention.

Detailed Description

The technical solution of the invention is further described below through specific embodiments with reference to the accompanying drawings.

The drawings are for illustration only; they are schematic rather than depictions of an actual product and are not to be construed as limiting this patent. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced and do not represent the size of an actual product. Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The method for converting voice into songs, as shown in FIG. 1, includes the following steps:

The encoders and the decoder convert the speech into a song, and the converted song is processed with the MelGAN vocoder, which overcomes the distorted and unnatural singing voice produced by prior-art speech-to-song conversion based on reference audio.

In addition, because the song spectrogram is processed with the MelGAN vocoder, and MelGAN is efficient and broadly applicable, the sound quality can be effectively improved.

In one embodiment, a speech signal is processed and converted into a mel spectrogram, wherein the processing of the speech signal comprises:

resampling the speech signal multiple times and changing its rhythm, so as to effectively separate the content of the speech signal from its rhythm.
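The document does not specify the resampling method. A minimal sketch, assuming plain linear-interpolation resampling as a stand-in, shows how several copies of the same utterance with different durations could be produced. The function names and the default factors are hypothetical. Note that resampling of this kind also shifts pitch when the result is played back at the original sample rate, so a real system may well use a different technique.

```python
def resample_linear(samples: list, factor: float) -> list:
    """Resample to round(len(samples) * factor) points by linear interpolation."""
    if not samples or factor <= 0:
        raise ValueError("need a non-empty signal and a positive factor")
    n_out = max(1, round(len(samples) * factor))
    if n_out == 1 or len(samples) == 1:
        return [samples[0]] * n_out
    # Map output index i onto a fractional position in the input.
    scale = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def rhythm_variants(samples: list, factors=(0.8, 1.0, 1.25)) -> list:
    """Return one resampled copy of the signal per factor (hypothetical defaults)."""
    return [resample_linear(samples, f) for f in factors]
```

Each variant carries the same content at a different tempo, which is one plausible reading of how repeated resampling helps separate content from rhythm.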

In one embodiment, the F0 contour is extracted from different sound sources by a melody extractor, wherein the sound sources include humming and singing.
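The melody extractor itself is not described here. As a stand-in, a minimal sketch of a frame-wise autocorrelation pitch estimator illustrates what an F0 contour is: one fundamental-frequency value per frame, with 0.0 marking frames judged unvoiced. The search range, threshold, frame length, and hop size are assumptions chosen for illustration, and this simple estimator is not claimed to be the extractor used by the invention.

```python
def estimate_f0(frame: list, sample_rate: float,
                f_min: float = 80.0, f_max: float = 800.0,
                threshold: float = 0.3) -> float:
    """Estimate F0 (Hz) of one frame by autocorrelation; 0.0 means unvoiced."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Normalized autocorrelation at this lag.
        corr = sum(frame[i] * frame[i + lag]
                   for i in range(len(frame) - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag == 0 or best_corr < threshold:
        return 0.0
    return sample_rate / best_lag


def f0_contour(samples: list, sample_rate: float,
               frame_len: int = 1024, hop: int = 256) -> list:
    """Slide a window over the signal and return one F0 value per frame."""
    return [estimate_f0(samples[s:s + frame_len], sample_rate)
            for s in range(0, len(samples) - frame_len + 1, hop)]
```

The number of values in the contour depends on the length of the humming or singing source, which is why it generally differs from the number of frames in the speech mel spectrogram.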

In one embodiment, the mel spectrogram is time-stretched to the same length as the F0 contour, and the mel spectrogram and the F0 contour are encoded by two separate encoders, wherein the time-stretching of the mel spectrogram to the same length as the F0 contour comprises the following:

The mel spectrogram is randomly segmented, and each segment is stretched or squeezed along the time axis. In this embodiment, the mel spectrogram is divided into segments of random length between 16 and 32 frames, and each segment is time-stretched or squeezed by a factor of 0.5 to 2 so that the result has the same length as the F0 contour.

The lengths of the mel spectrogram and the F0 contour usually differ greatly, and eliminating this difference can effectively improve the sound quality of the song, which is why this scheme is provided.
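The segment-wise stretching described above can be sketched as follows, using the stated ranges of 16 to 32 frames per segment and a factor of 0.5 to 2 per segment. Random factors alone do not guarantee an exact length match, and the document does not say how the match is enforced, so this sketch adds a final linear resampling to the F0 length as an explicit assumption. All function names are hypothetical, and the mel spectrogram is represented as a list of frames.

```python
import random


def resample_frames(frames: list, n_out: int) -> list:
    """Linearly interpolate a list of equal-length frames to n_out frames."""
    if not frames or n_out <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if len(frames) == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    scale = (len(frames) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out


def random_segment_stretch(mel: list, rng: random.Random,
                           min_len: int = 16, max_len: int = 32,
                           min_rate: float = 0.5, max_rate: float = 2.0) -> list:
    """Cut the spectrogram into random-length segments and warp each one in time."""
    out, start = [], 0
    while start < len(mel):
        seg_len = rng.randint(min_len, max_len)   # 16-32 frames per segment
        segment = mel[start:start + seg_len]      # last segment may be shorter
        rate = rng.uniform(min_rate, max_rate)    # stretch (>1) or squeeze (<1)
        out.extend(resample_frames(segment, max(1, round(len(segment) * rate))))
        start += seg_len
    return out


def stretch_to_f0_length(mel: list, f0: list, rng: random.Random) -> list:
    """Warp the spectrogram, then force its length to len(f0) (assumed final step)."""
    return resample_frames(random_segment_stretch(mel, rng), len(f0))
```

Passing a seeded `random.Random` instance keeps the warping reproducible, which is convenient when comparing outputs across runs.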

Based on the same inventive concept, an embodiment of the present invention further provides a system for converting voice into songs, as shown in FIG. 2, including:

a voice processing module 1, used for processing the speech signal and converting it into a mel spectrogram;

a sound source processing module 2, used for extracting F0 contours from different sound sources with a melody extractor;

an encoding module 3, used for time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders;

a decoding module 4, used for associating the encoded mel spectrogram with the F0 contour through a decoder to generate a song spectrogram;

and an output module 5, used for processing the song spectrogram with the MelGAN vocoder to improve the sound quality of the output song.
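The data flow between the five modules listed above can be sketched as a simple composition. This is only a structural illustration: the class name, field names, and type aliases are hypothetical, and each module is represented by a caller-supplied function rather than by any real encoder, decoder, or MelGAN implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Waveform = List[float]
Contour = List[float]
Spectrogram = List[List[float]]


@dataclass
class SpeechToSongSystem:
    """Wires modules 1-5 together; every field is a placeholder callable."""
    speech_processing: Callable[[Waveform], Spectrogram]            # module 1
    source_processing: Callable[[Waveform], Contour]                # module 2
    encoding: Callable[[Spectrogram, Contour], Tuple[Any, Any]]     # module 3
    decoding: Callable[[Any, Any], Spectrogram]                     # module 4
    output: Callable[[Spectrogram], Waveform]                       # module 5

    def convert(self, speech: Waveform, melody_source: Waveform) -> Waveform:
        mel = self.speech_processing(speech)          # speech -> mel spectrogram
        f0 = self.source_processing(melody_source)    # humming/singing -> F0 contour
        mel_code, f0_code = self.encoding(mel, f0)    # stretch, then two encoders
        song_mel = self.decoding(mel_code, f0_code)   # decoder -> song spectrogram
        return self.output(song_mel)                  # vocoder -> song waveform
```

Keeping each module behind a plain callable makes it possible to swap one stage, for example the vocoder, without touching the others.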

In one embodiment, the voice processing module 1 comprises:

a voice sampling module, used for resampling the speech signal multiple times and changing its rhythm, so as to effectively separate the content of the speech signal from its rhythm.

In one embodiment, the encoding module 3 includes:

a stretching module, used for randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis, wherein, in this embodiment, the mel spectrogram is divided into segments of random length between 16 and 32 frames, and each segment is time-stretched or squeezed by a factor of 0.5 to 2 so that the result has the same length as the F0 contour.

It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims

1. A method for converting speech into songs, comprising:

processing a voice signal and converting the voice signal into a mel spectrogram;

time-stretching the mel spectrogram to the same length as the F0 contour, and encoding the mel spectrogram and the F0 contour with two separate encoders, which comprises:

randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis so as to obtain the same length as the F0 contour, specifically:

dividing the mel spectrogram into segments of random length between 16 and 32 frames, and time-stretching or squeezing each segment by a factor of 0.5 to 2 so as to obtain the same length as the F0 contour;

2. The method of claim 1, wherein the processing the speech signal and converting the speech signal into mel spectrogram comprises:

3. The method of claim 1, wherein the F0 contour is extracted from different sound sources by a melody extractor, wherein the sound sources include humming and singing.

4. A system for converting speech into songs, comprising:

an encoding module, comprising:

a stretching module, used for randomly segmenting the mel spectrogram and stretching or squeezing each segment along the time axis so as to obtain the same length as the F0 contour, specifically:

5. The system for converting speech into songs of claim 4, wherein the speech processing module comprises: