CN112750420A - Singing voice synthesis method, device and equipment - Google Patents
Singing voice synthesis method, device and equipment
- Publication number
- CN112750420A (application number CN202011537625.3A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- fundamental frequency
- duration
- determining
- voice
- Prior art date
- Legal status (assumed; not a legal conclusion): Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
- G10L13/0335—Pitch control
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Abstract
The invention discloses a singing voice synthesis method, device and equipment. The method includes: acquiring a target song, a music score of the target song and a broadcast voice of the lyrics of the target song, wherein the music score includes the lyrics and a first duration corresponding to each character in the lyrics; determining a first fundamental frequency of the target song and a first spectral feature of the broadcast voice; determining the first phonemes corresponding to a character according to the initial consonant and the final corresponding to the character; determining a second duration of each first phoneme according to the first phonemes corresponding to the character, a first preset duration-proportion threshold of each first phoneme and the first duration corresponding to the character; determining a third duration of each first phoneme according to the broadcast voice and the lyrics; scaling the first spectral feature according to the second and third durations of the first phonemes to obtain a second spectral feature; and synthesizing the second spectral feature with the first fundamental frequency to obtain the synthesized singing voice. The invention can synthesize a song without collecting a large amount of recording data and can therefore reduce the cost of song synthesis.
Description
Technical Field
The present application relates to the field of singing voice synthesis technology, and in particular, to a singing voice synthesis method, apparatus, and device.
Background
In recent years, singing voice synthesis technology has attracted attention from many communities, its chief appeal being that it allows a computer to sing a song with any melody. One of the mainstream techniques in the prior art is waveform splicing, whose core idea is to prerecord the singing voice of every pronunciation at different pitches in a given language to build a voice synthesis database. Synthesizing singing voice from the recordings stored in such a database therefore relies on a very large amount of recording data, and collecting that data requires a great deal of time and labor, resulting in a high cost of singing voice synthesis.
Summary of the application
Embodiments of the present application provide a singing voice synthesis method, device and equipment, in order to solve the prior-art problem that singing voice synthesis depends on huge amounts of recording data, and that collecting this recording data consumes a large amount of time and manpower, making singing voice synthesis costly.
To solve the above problem, in a first aspect, an embodiment of the present invention provides a singing voice synthesis method, including: acquiring a target song, a music score of the target song and a broadcast voice of the lyrics of the target song, wherein the music score includes the lyrics and a first duration corresponding to each character in the lyrics; determining a first fundamental frequency of the target song and a first spectral feature of the broadcast voice; determining the first phonemes corresponding to a character according to the initial consonant and the final corresponding to the character; determining a second duration of each first phoneme according to the first phonemes corresponding to the character, a first preset duration-proportion threshold of each first phoneme and the first duration corresponding to the character; determining a third duration of each first phoneme according to the broadcast voice and the lyrics; scaling the first spectral feature according to the second duration and the third duration of the first phonemes to obtain a second spectral feature; and synthesizing the second spectral feature with the first fundamental frequency to obtain the synthesized singing voice.
Optionally, determining the first fundamental frequency of the target song comprises: carrying out audio track separation on the target song to obtain dry sound; and extracting the third fundamental frequency of the dry sound to obtain the first fundamental frequency of the target song.
Optionally, determining the third duration of the first phoneme according to the broadcast voice and the lyrics includes: analyzing the music score to obtain lyrics; inputting the broadcast voice and the lyrics into a preset voice recognition model; and labeling the first phoneme corresponding to the characters in the lyrics of the broadcast voice through a voice recognition model to obtain a third duration of the first phoneme.
Optionally, the scaling the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature, including: labeling the first spectral feature according to a third duration of the first phoneme to obtain a third spectral feature of the first phoneme; calculating a scaling ratio according to the second duration of the first phoneme and the third duration of the first phoneme; and carrying out scaling processing on the third spectral characteristics according to the scaling ratio to obtain second spectral characteristics.
Optionally, after determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice, the singing voice synthesizing method further includes: labeling the first fundamental frequency according to the second duration of the first phoneme to obtain a second fundamental frequency of the first phoneme; determining a second phoneme which does not contain fundamental frequency information in the first phoneme according to the pronunciation rule of the first phoneme; adjusting a second fundamental frequency corresponding to the second phoneme to be zero; and re-determining the first fundamental frequency of the target song according to the adjusted second fundamental frequency.
Optionally, the musical score further includes pitches corresponding to characters in the lyrics, and after determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice, the singing voice synthesizing method further includes: determining a fundamental frequency mean value and a fundamental frequency variance of the character according to a first phoneme corresponding to the character and a second fundamental frequency of the first phoneme; calculating the fundamental frequency ratio of the character according to the fundamental frequency mean value of the character and the pitch corresponding to the character; when the fundamental frequency ratio and/or the fundamental frequency variance are not within the preset threshold range, smoothing the second fundamental frequency of the first phoneme; and re-determining the first fundamental frequency of the target song according to the second fundamental frequency of the first phoneme after the smoothing processing.
Optionally, before synthesizing the second spectral feature and the first fundamental frequency to obtain the synthesized singing voice, the singing voice synthesizing method further includes: determining a third fundamental frequency of the broadcast voice; determining the zero values in the third fundamental frequency; interpolating the zero values in the first fundamental frequency to non-zero values; and adjusting the non-zero values in the first fundamental frequency according to the zero values in the third fundamental frequency.
In a second aspect, an embodiment of the present invention provides a singing voice synthesizing apparatus, including: the acquisition unit is used for acquiring a target song, a music score of the target song and broadcast voice of the target song, wherein the music score comprises lyrics and first duration corresponding to characters in the lyrics; the first determining unit is used for determining a first fundamental frequency of a target song and a first spectrum characteristic of broadcast voice; the second determining unit is used for determining a first phoneme corresponding to the character according to the initial consonant and the final consonant corresponding to the character; a third determining unit, configured to determine a second duration of the first phoneme according to the first phoneme corresponding to the character, the first preset duration proportion threshold of each first phoneme, and the first duration corresponding to the character; a fourth determining unit, configured to determine a third duration of the first phoneme according to the broadcast voice and the lyrics; the processing unit is used for carrying out scaling processing on the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature; and the synthesis unit is used for synthesizing the second spectrum characteristic and the first fundamental frequency to obtain the synthesized singing voice.
In a third aspect, an embodiment of the present invention provides a singing voice synthesizing apparatus, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a singing voice synthesis method as in the first aspect or any embodiment of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are used to cause a computer to execute the singing voice synthesizing method according to the first aspect or any implementation manner of the first aspect.
According to the method, device and equipment for synthesizing singing voice provided by the embodiments of the invention, the target song, the music score of the target song and the broadcast voice of the target song are obtained, the music score including the lyrics and a first duration corresponding to each character in the lyrics; a first fundamental frequency of the target song and a first spectral feature of the broadcast voice are determined; the first phonemes corresponding to a character are determined according to the initial consonant and the final corresponding to the character; a second duration of each first phoneme is determined according to the first phonemes corresponding to the character, a first preset duration-proportion threshold of each first phoneme and the first duration corresponding to the character; a third duration of each first phoneme is determined according to the broadcast voice and the lyrics; the first spectral feature is scaled according to the second duration and the third duration of the first phonemes to obtain a second spectral feature; and the second spectral feature is synthesized with the first fundamental frequency to obtain the synthesized singing voice. Because the first spectral feature is obtained from the broadcast voice and scaled according to the second duration of each first phoneme in the target song and the third duration of that phoneme in the broadcast voice, the resulting second spectral feature carries the rhythm of the target song and matches the way a person sings, and synthesizing it with the first fundamental frequency of the target song yields the singing voice. Song synthesis can therefore be achieved without collecting a large amount of recording data, which reduces the cost of song synthesis. Moreover, because the singing voice is synthesized with the first fundamental frequency of the target song itself, the synthesized singing voice is more natural.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, so that they can be implemented according to the content of this description, and to make the above and other objects, features and advantages of the present application more readily understandable, detailed embodiments of the present application are described below.
Drawings
FIG. 1 is a schematic flow chart of a singing voice synthesizing method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic hardware configuration diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a singing voice synthesis method, as shown in fig. 1, including:
s101, obtaining a target song, a music score of the target song and broadcast voice of lyrics of the target song, wherein the music score comprises the lyrics and first time length corresponding to characters in the lyrics; specifically, the executing subject of the present invention may be a singing voice synthesizing device, and may also be a terminal or a server, which is not specifically limited herein, and in the embodiment of the present invention, the singing voice synthesizing device is taken as an example for explanation.
The singing voice synthesizing device may receive a user's singing request over a wired or wireless connection, then obtain the target song and its music score, parse the music score to obtain the lyrics, and feed the lyrics into a preset voice broadcast model to obtain the broadcast voice of the lyrics of the target song. Because the broadcast voice is generated by feeding the lyrics into the preset voice broadcast model, no manual work is needed when the singing voice is synthesized, so singing voice synthesis is fully automated. The voice broadcast model can be tuned with several parameters, for example male or female voice, speaking rate, tone, volume and audio bit rate. The singing voice synthesizing device can adjust these parameters according to the user's requirements so that the timbre of the synthesized singing voice is close to the timbre the user prefers. When a singing request is received, the invention also supports entering, on the singing voice synthesizing device, the indexes of a start character and an end character of the lyrics (using the position index of each character in the lyrics as the input); the target song and its music score are then obtained according to these indexes, and a broadcast voice is generated from the lyrics between the start character and the end character, so that the song corresponding to that portion of the lyrics is synthesized.
The target song may be a song designated by the user, a song randomly selected by the singing voice synthesizing device from a preset song library when it receives the singing request, or a song the device selects from the preset song library according to the user's behavior and usage habits. The music score may be a MusicXML file of the target song or any file that carries the lyrics and the first duration corresponding to each character in the lyrics; its format is not specifically limited here. The first duration includes the singing duration of a character and the pause duration between characters. When calculating the pause duration between characters, unnecessary pauses can be removed by applying a preset threshold: a pause shorter than the preset threshold is set to zero.
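The pause-threshold step can be illustrated with a minimal sketch. It assumes the score has already been parsed into (character, singing duration, pause after the character) tuples in seconds; the function name and the 0.05 s threshold are illustrative assumptions, not values taken from the patent.

```python
def clean_pauses(score_entries, pause_threshold=0.05):
    """Zero out negligible pauses between characters; keep the rest unchanged."""
    cleaned = []
    for char, sing_dur, pause_after in score_entries:
        if pause_after < pause_threshold:
            pause_after = 0.0               # a tiny gap is treated as no pause
        cleaned.append((char, sing_dur, pause_after))
    return cleaned

# example: the 0.02 s gap is dropped, the 0.30 s gap is kept
print(clean_pauses([("你", 0.40, 0.02), ("好", 0.80, 0.30)]))
```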
S102, determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice. Specifically, a fundamental frequency extraction tool, such as YIN or the WORLD vocoder, may be used to extract the first fundamental frequency of the target song. The WORLD vocoder may likewise be used to extract the first spectral feature of the broadcast voice. The first spectral feature may include a mel-spectrum feature and an aperiodic-component feature.
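As a rough sketch of this step, the extraction can be done with the pyworld binding of the WORLD vocoder; the patent names WORLD but no particular library, so the package choice and the use of soundfile for reading audio are assumptions. The extracted fundamental frequency plays the role of the first fundamental frequency, and the spectral envelope together with the aperiodicity plays the role of the spectral features.

```python
import numpy as np
import pyworld
import soundfile as sf  # assumed here only for reading the audio files

def extract_features(wav_path):
    x, fs = sf.read(wav_path)
    if x.ndim > 1:
        x = x.mean(axis=1)                     # mix down to mono
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, t = pyworld.harvest(x, fs)             # frame-wise fundamental frequency
    sp = pyworld.cheaptrick(x, f0, t, fs)      # spectral envelope per frame
    ap = pyworld.d4c(x, f0, t, fs)             # aperiodic component per frame
    return f0, sp, ap, fs

# song_f0, _, _, _ = extract_features("target_song_vocals.wav")
# _, speech_sp, speech_ap, fs = extract_features("broadcast_speech.wav")
```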
S103, determining the first phonemes corresponding to a character according to the initial consonant and the final corresponding to the character. Specifically, the characters in the lyrics may be converted into initials and finals with the pypinyin tool or a speech synthesis tool. The initial of a character corresponds to one first phoneme; the final of a character corresponds to at least one first phoneme, the number depending on the composition of the final. For example, for a compound final such as iang, the final corresponds to two first phonemes, i and ang. For a non-compound final such as ei, the final corresponds to a single first phoneme.
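A small sketch of this conversion with the pypinyin tool mentioned above follows. The compound-final split table only covers a few finals for illustration; the patent does not publish the full table, so treat it as an assumption.

```python
from pypinyin import Style, lazy_pinyin

COMPOUND_FINALS = {            # illustrative subset of the split rules
    "iang": ["i", "ang"],
    "uang": ["u", "ang"],
    "iong": ["i", "ong"],
    "iao":  ["i", "ao"],
}

def char_to_phonemes(char):
    initial = lazy_pinyin(char, style=Style.INITIALS, strict=False)[0]
    final = lazy_pinyin(char, style=Style.FINALS, strict=False)[0]
    phonemes = [initial] if initial else []    # some characters have no initial
    phonemes += COMPOUND_FINALS.get(final, [final])
    return phonemes

print(char_to_phonemes("良"))   # ['l', 'i', 'ang']  (compound final)
print(char_to_phonemes("类"))   # ['l', 'ei']        (non-compound final)
```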
S104, determining a second duration of each first phoneme according to the first phonemes corresponding to the character, the first preset duration-proportion threshold of each first phoneme, and the first duration corresponding to the character. Specifically, when determining the second duration of a first phoneme, the initial part and the final part may be handled separately. For the initial part, for example: for some initials, the duration proportion between the initial and the final may be determined from a fixed proportion together with a maximum frame-count limit for the initial; for the initials z, c, s, zh, ch, sh, x, f, q, p, h and j, the proportion may be determined according to the length of the initial; and for y, r, m, w, l and n, the proportion may be determined according to a fixed ratio. From this initial/final duration proportion and the first duration of the character, the second duration of the first phoneme corresponding to the initial and the second duration(s) of the first phoneme(s) corresponding to the final are obtained. For the final part, for example with the compound final iang, the second durations of i and ang may be determined according to a preset duration proportion between i and ang. A sketch of this split is given below.
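The split of a character's first duration across its phonemes might look like the following sketch. The ratio table and the cap on a phoneme's duration are illustrative assumptions only; the patent leaves the concrete thresholds to the implementation.

```python
DEFAULT_RATIOS = {"l": 0.25, "n": 0.2, "i": 0.3, "ang": 0.7, "ei": 0.75}  # illustrative

def split_character_duration(phonemes, char_duration,
                             ratios=DEFAULT_RATIOS, max_leading=0.2):
    """Return the second duration of every first phoneme of one character."""
    durations, remaining = [], char_duration
    for ph in phonemes[:-1]:                       # all but the last phoneme
        d = min(char_duration * ratios.get(ph, 0.25), max_leading)
        durations.append(d)
        remaining -= d
    durations.append(max(remaining, 0.0))          # the last phoneme absorbs the rest
    return durations

print(split_character_duration(["l", "i", "ang"], 1.2))
```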
S105, determining a third duration of each first phoneme according to the broadcast voice and the lyrics. Specifically, a forced-alignment method from speech recognition may be used to align the lyrics with the broadcast voice and determine the broadcast duration of each character in the lyrics; each character is then analyzed into its first phonemes, and a third duration is determined for each first phoneme.
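The third durations can then be read directly off the alignment result. The sketch below assumes the forced aligner's output has already been converted into (phoneme, start, end) tuples in seconds; the patent does not fix a particular aligner or output format.

```python
def third_durations(alignment):
    """alignment: list of (phoneme, start_sec, end_sec) for the broadcast voice."""
    return [(ph, end - start) for ph, start, end in alignment]

aligned = [("n", 0.00, 0.08), ("i", 0.08, 0.31), ("h", 0.31, 0.40), ("ao", 0.40, 0.72)]
# approximately [('n', 0.08), ('i', 0.23), ('h', 0.09), ('ao', 0.32)]
print(third_durations(aligned))
```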
S106, scaling the first spectral feature according to the second duration and the third duration of the first phonemes to obtain a second spectral feature. Specifically, the first spectral feature may be scaled according to the ratio of the second duration of a first phoneme to its third duration. Because the first spectral feature is scaled phoneme by phoneme, and because the phonemes of a character are stretched by different amounts when a person holds a long note in a song, the resulting second spectral feature carries the rhythm of the target song and matches the way a person sings. Scaling the first spectral feature per phoneme therefore makes the synthesized song more accurate.
S107, synthesizing the second spectral feature with the first fundamental frequency to obtain the synthesized singing voice. Specifically, post-processing may also be applied to the synthesized singing voice. For example, voice-change processing may be applied with an open-source audio tool, and low-pass filtering may be used to suppress hissing in the synthesized singing voice. Background music may also be added; it may be up-sampled or down-sampled to match the sampling rate used for synthesis (16 kHz, 22.05 kHz, 24 kHz, 44.1 kHz, 48 kHz and other rates are supported, but the rates are not limited to these). A reverberation operation may also be applied to the synthesized song to improve how the synthesized singing voice sounds.
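A synthesis sketch matching this step is given below, again assuming the pyworld binding of the WORLD vocoder; the optional low-pass filter stands in for the hiss suppression mentioned above, and the filter order and cutoff are illustrative assumptions.

```python
import pyworld
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def synthesize_song(f0, sp, ap, fs, frame_period=5.0,
                    lowpass_hz=None, out_path="synthesized_song.wav"):
    # f0, sp and ap must share the same number of frames (float64 arrays)
    y = pyworld.synthesize(f0, sp, ap, fs, frame_period)
    if lowpass_hz is not None:                      # optional hiss suppression
        sos = butter(4, lowpass_hz, btype="low", fs=fs, output="sos")
        y = sosfiltfilt(sos, y)
    sf.write(out_path, y, fs)
    return y
```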
According to the singing voice synthesis method provided by the embodiment of the invention, the target song, the music score of the target song and the broadcast voice of the target song are obtained, the music score including the lyrics and a first duration corresponding to each character in the lyrics; a first fundamental frequency of the target song and a first spectral feature of the broadcast voice are determined; the first phonemes corresponding to a character are determined according to the initial consonant and the final corresponding to the character; a second duration of each first phoneme is determined according to the first phonemes corresponding to the character, a first preset duration-proportion threshold of each first phoneme and the first duration corresponding to the character; a third duration of each first phoneme is determined according to the broadcast voice and the lyrics; the first spectral feature is scaled according to the second duration and the third duration of the first phonemes to obtain a second spectral feature; and the second spectral feature is synthesized with the first fundamental frequency to obtain the synthesized singing voice. Because the first spectral feature is obtained from the broadcast voice and scaled according to the second duration of each first phoneme in the target song and the third duration of that phoneme in the broadcast voice, the resulting second spectral feature carries the rhythm of the target song and matches the way a person sings, and synthesizing it with the first fundamental frequency of the target song yields the singing voice. Song synthesis can therefore be achieved without collecting a large amount of recording data, which reduces the cost of song synthesis. Moreover, because the singing voice is synthesized with the first fundamental frequency of the target song itself, the synthesized singing voice is more natural.
In an alternative embodiment, in step S102, determining the first fundamental frequency of the target song includes: carrying out audio track separation on the target song to obtain dry sound; and extracting the third fundamental frequency of the dry sound to obtain the first fundamental frequency of the target song.
Specifically, if the target song is mixed with background music, then when determining the first fundamental frequency of the target song, an audio-track separation tool such as Spleeter may be used to separate the target song into the dry sound (the isolated vocal track) and the background music. The third fundamental frequency of the dry sound is then extracted to obtain the first fundamental frequency of the target song.
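One way to run this separation step with Spleeter is sketched below; the 2-stems model splits the mix into a vocals track (the dry sound) and an accompaniment track. The file and directory names are placeholders.

```python
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")
# writes output_dir/target_song/vocals.wav and output_dir/target_song/accompaniment.wav
separator.separate_to_file("target_song.wav", "output_dir")
```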
By separating the audio tracks of the target song to obtain the dry sound, and extracting the third fundamental frequency of the dry sound to obtain the first fundamental frequency of the target song, the fundamental frequency corresponding to the lyrics can be extracted and interference from the background music can be eliminated during subsequent song synthesis.
In an alternative embodiment, the step S105 of determining the third duration of the first phoneme according to the broadcast voice and the lyrics may include: analyzing the music score to obtain the lyrics; inputting the broadcast voice and the lyrics into a preset voice recognition model; and labeling the first phoneme corresponding to the characters in the lyrics of the broadcast voice through the voice recognition model to obtain a third duration of the first phoneme.
Specifically, the singing voice synthesizing device parses the music score to obtain the lyrics. The preset speech recognition model analyzes the broadcast voice and labels it according to the first phonemes corresponding to the characters in the lyrics, that is, performs duration labeling of the broadcast voice at the phoneme level, yielding a timestamp and a duration for each first phoneme; the third duration of each first phoneme can then be determined from its timestamp and duration.
Obtaining the third duration of the first phoneme by duration-labeling the broadcast voice with the speech recognition model is fast and accurate.
In an optional embodiment, in step S106, the scaling processing is performed on the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature, which specifically includes: labeling the first spectral feature according to a third duration of the first phoneme to obtain a third spectral feature of the first phoneme; calculating a scaling ratio according to the second duration of the first phoneme and the third duration of the first phoneme; and carrying out scaling processing on the third spectral characteristics according to the scaling ratio to obtain second spectral characteristics.
Specifically, the first spectral feature may be scaled by phoneme, by syllable, or at another granularity; this embodiment describes scaling by phoneme. Because the first spectral feature is extracted frame by frame, it must first be labeled according to the third duration of each first phoneme so that it can be divided per phoneme, giving the third spectral feature of each first phoneme. The scaling ratio may be determined as the ratio of the second duration of the first phoneme to its third duration, and the third spectral feature of the first phoneme is scaled by this ratio to obtain the second spectral feature. Further, when the final of a character contains a single first phoneme, the third spectral feature of that phoneme is scaled by the scaling ratio to obtain the second spectral feature. When the final contains several first phonemes, only the third spectral feature of the last first phoneme of the final is scaled by the scaling ratio, and the third spectral feature of the first phoneme corresponding to the initial is scaled by the scaling ratio, to obtain the second spectral feature; the third spectral features of the other first phonemes of the final are not scaled.
In the scaling process, a linear interpolation method may be adopted to linearly interpolate the third spectral feature according to the scaling.
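A minimal sketch of this per-phoneme stretch follows: the frames of one phoneme's spectral feature are linearly interpolated along the time axis so that the phoneme's broadcast-voice length is rescaled by the scaling ratio (second duration divided by third duration). A (frames, dimensions) layout is assumed.

```python
import numpy as np

def stretch_phoneme(frames, scale):
    """frames: (n_frames, n_dims) spectral feature of one phoneme; scale = d2 / d3."""
    n_in = frames.shape[0]
    n_out = max(int(round(n_in * scale)), 1)
    src = np.linspace(0.0, n_in - 1, num=n_out)        # fractional source frame indices
    out = np.empty((n_out, frames.shape[1]))
    for d in range(frames.shape[1]):                   # interpolate each dimension
        out[:, d] = np.interp(src, np.arange(n_in), frames[:, d])
    return out

phoneme_sp = np.random.rand(40, 513)      # e.g. 40 frames of a 513-bin spectral envelope
print(stretch_phoneme(phoneme_sp, scale=2.5).shape)    # (100, 513)
```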
By labeling the first spectral feature according to the third duration of each first phoneme to obtain its third spectral feature, computing the scaling ratio from the second and third durations, and scaling the third spectral feature by this ratio to obtain the second spectral feature, the first spectral feature is scaled per phoneme. Because the phonemes of a character are stretched by different amounts when a person holds a long note in a song, the second spectral feature carries the rhythm of the target song and matches the way a person sings, which makes the synthesized song more accurate.
In an alternative embodiment, after determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice, the first fundamental frequency may be further adjusted, and therefore, the singing voice synthesizing method further includes: labeling the first fundamental frequency according to the second duration of the first phoneme to obtain a second fundamental frequency of the first phoneme; determining a second phoneme which does not contain fundamental frequency information in the first phoneme according to the pronunciation rule of the first phoneme; adjusting a second fundamental frequency corresponding to the second phoneme to be zero; and re-determining the first fundamental frequency of the target song according to the adjusted second fundamental frequency.
Specifically, when adjusting the first fundamental frequency, the adjustment may be made per syllable or per phoneme; this embodiment takes per-phoneme adjustment as an example. Because the first fundamental frequency is extracted frame by frame, it first has to be divided among the first phonemes: the first fundamental frequency may be labeled according to the second duration of each first phoneme to obtain the second fundamental frequency of that phoneme. Since some initials, such as b and sh, carry no fundamental frequency information, the second phonemes that contain no fundamental frequency information can be identified among the first phonemes, and the second fundamental frequency corresponding to each second phoneme is adjusted to zero (for example by linear interpolation), thereby adjusting the first fundamental frequency.
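The unvoiced-phoneme adjustment can be sketched as below, assuming a per-phoneme segmentation of the fundamental-frequency contour into frame ranges; the set of initials treated as carrying no fundamental frequency is illustrative rather than the patent's exact list.

```python
import numpy as np

UNVOICED_INITIALS = {"b", "p", "d", "t", "g", "k", "z", "c", "s",
                     "zh", "ch", "sh", "j", "q", "x", "f", "h"}   # illustrative

def zero_unvoiced(f0, segments):
    """segments: list of (phoneme, start_frame, end_frame) covering the song."""
    f0 = f0.copy()
    for ph, start, end in segments:
        if ph in UNVOICED_INITIALS:       # second phonemes: no fundamental frequency
            f0[start:end] = 0.0
    return f0
```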
Setting to zero the second fundamental frequency of the second phonemes that carry no fundamental frequency information eliminates fundamental-frequency values that may have been extracted spuriously for those unvoiced segments.
In an alternative embodiment, the musical score further includes pitches corresponding to characters in the lyrics, and the first fundamental frequency may be further adjusted after determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice, so that the singing voice synthesizing method further includes: determining a fundamental frequency mean value and a fundamental frequency variance of the character according to a first phoneme corresponding to the character and a second fundamental frequency of the first phoneme; calculating the fundamental frequency ratio of the character according to the fundamental frequency mean value of the character and the pitch corresponding to the character; when the fundamental frequency ratio and/or the fundamental frequency variance are not within the preset threshold range, smoothing the second fundamental frequency of the first phoneme; and re-determining the first fundamental frequency of the target song according to the second fundamental frequency of the first phoneme after the smoothing processing.
Specifically, the fundamental frequency mean and variance of a character can be computed by numerical statistics. When the fundamental frequency ratio and/or the fundamental frequency variance is not within the preset threshold range, the second fundamental frequency may suffer from inaccurate boundaries and/or pitch halving or doubling (octave) errors, so the second fundamental frequency of the first phoneme may be smoothed.
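A sketch of this check-and-smooth step is given below. The score pitch is assumed to be a MIDI note number, the ratio range and variance limit are illustrative thresholds, and a median filter stands in for whatever smoothing the implementation actually uses.

```python
import numpy as np
from scipy.signal import medfilt

def midi_to_hz(midi_note):
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def check_and_smooth(char_f0, score_midi, ratio_range=(0.8, 1.25), max_var=900.0):
    voiced = char_f0[char_f0 > 0]
    if voiced.size == 0:
        return char_f0
    ratio = voiced.mean() / midi_to_hz(score_midi)
    bad_ratio = not (ratio_range[0] <= ratio <= ratio_range[1])
    if bad_ratio or voiced.var() > max_var:
        return medfilt(char_f0, kernel_size=5)     # stand-in smoothing
    return char_f0
```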
Smoothing the second fundamental frequency of the first phoneme repairs the inaccurate boundaries and/or pitch halving or doubling errors in the second fundamental frequency, so that the first fundamental frequency of the target song, re-determined from the smoothed second fundamental frequency, no longer has these problems and the synthesized song is more natural.
In an optional embodiment, before synthesizing the second spectral feature and the first fundamental frequency to obtain the synthesized singing voice, the singing voice synthesizing method further includes: determining a third fundamental frequency of the broadcast voice; determining the zero values in the third fundamental frequency; interpolating the zero values in the first fundamental frequency to non-zero values; and adjusting the non-zero values in the first fundamental frequency according to the zero values in the third fundamental frequency.
Specifically, interpolating the zero values in the first fundamental frequency to non-zero values mainly ensures that the first fundamental frequency does not decay toward zero at the beginning and end of each lyric line. Adjusting the non-zero values of the first fundamental frequency according to the zero values of the third fundamental frequency reduces the noise caused by inaccurate extraction of the first fundamental frequency.
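The two adjustments can be sketched as follows, assuming the song's fundamental-frequency contour and the broadcast voice's contour have already been aligned frame to frame: the first function fills zero frames by interpolating between neighbouring voiced frames (so line boundaries do not decay to zero), and the second zeroes the song contour wherever the broadcast voice itself is unvoiced, which is one plausible reading of the adjustment described above.

```python
import numpy as np

def fill_zero_frames(f0):
    """Interpolate zero frames from the surrounding voiced frames (edges are held)."""
    f0 = f0.astype(float).copy()
    voiced = np.flatnonzero(f0 > 0)
    if voiced.size == 0:
        return f0
    zeros = np.flatnonzero(f0 == 0)
    f0[zeros] = np.interp(zeros, voiced, f0[voiced])
    return f0

def mask_with_broadcast(song_f0, speech_f0):
    """Force the song contour to zero wherever the broadcast voice is unvoiced."""
    song_f0 = song_f0.copy()
    song_f0[speech_f0 == 0] = 0.0
    return song_f0
```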
An embodiment of the present invention further provides a singing voice synthesizing apparatus, as shown in fig. 2, including: an acquiring unit 201, configured to acquire a target song, a music score of the target song, and a broadcast voice of the target song, where the music score includes the lyrics and a first duration corresponding to each character in the lyrics (see step S101 of the above embodiment for the specific implementation, which is not repeated here); a first determining unit 202, configured to determine a first fundamental frequency of the target song and a first spectral feature of the broadcast voice (see step S102 above); a second determining unit 203, configured to determine the first phonemes corresponding to a character according to the initial consonant and the final corresponding to the character (see step S103 above); a third determining unit 204, configured to determine a second duration of each first phoneme according to the first phonemes corresponding to the character, the first preset duration-proportion threshold of each first phoneme, and the first duration corresponding to the character (see step S104 above); a fourth determining unit 205, configured to determine a third duration of each first phoneme according to the broadcast voice and the lyrics (see step S105 above); a processing unit 206, configured to scale the first spectral feature according to the second duration and the third duration of the first phonemes to obtain a second spectral feature (see step S106 above); and a synthesizing unit 207, configured to synthesize the second spectral feature with the first fundamental frequency to obtain the synthesized singing voice (see step S107 above).
According to the singing voice synthesizing device provided by the embodiment of the invention, the target song, the music score of the target song and the broadcast voice of the target song are obtained, the music score including the lyrics and a first duration corresponding to each character in the lyrics; a first fundamental frequency of the target song and a first spectral feature of the broadcast voice are determined; the first phonemes corresponding to a character are determined according to the initial consonant and the final corresponding to the character; a second duration of each first phoneme is determined according to the first phonemes corresponding to the character, a first preset duration-proportion threshold of each first phoneme and the first duration corresponding to the character; a third duration of each first phoneme is determined according to the broadcast voice and the lyrics; the first spectral feature is scaled according to the second duration and the third duration of the first phonemes to obtain a second spectral feature; and the second spectral feature is synthesized with the first fundamental frequency to obtain the synthesized singing voice. Because the first spectral feature is obtained from the broadcast voice and scaled according to the second duration of each first phoneme in the target song and the third duration of that phoneme in the broadcast voice, the resulting second spectral feature carries the rhythm of the target song and matches the way a person sings, and synthesizing it with the first fundamental frequency of the target song yields the singing voice. Song synthesis can therefore be achieved without collecting a large amount of recording data, which reduces the cost of song synthesis. Moreover, because the singing voice is synthesized with the first fundamental frequency of the target song itself, the synthesized singing voice is more natural.
Based on the same inventive concept as one of the singing voice synthesizing methods in the foregoing embodiments, the present invention also provides a singing voice synthesizing apparatus having stored thereon a computer program that, when executed by a processor, implements the steps of any one of the foregoing singing voice synthesizing methods.
As shown in fig. 3, a bus architecture is represented by bus 300. Bus 300 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 302, and memory, represented by memory 304. Bus 300 may also link together various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and are therefore not described further here. A bus interface 306 provides an interface between bus 300 and a receiver 301 and a transmitter 303. The receiver 301 and the transmitter 303 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium.
The processor 302 is responsible for managing the bus 300 and general processing, and the memory 304 may be used for storing data used by the processor 302 in performing operations.
Based on the same inventive concept as one of the singing voice synthesizing methods in the foregoing embodiments, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, realizes the steps of:
acquiring a target song, a music score of the target song and broadcast voice of the target song, wherein the music score comprises lyrics and first duration corresponding to characters in the lyrics; determining a first fundamental frequency of a target song and a first spectrum characteristic of broadcast voice; determining a first phoneme corresponding to the character according to the initial consonant and the final sound corresponding to the character; determining a second duration of the first phoneme according to the first phoneme corresponding to the character, a first preset duration proportion threshold of each first phoneme and a first duration corresponding to the character; determining a third duration of the first phoneme according to the broadcast voice and the lyrics; scaling the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature; and synthesizing the second spectral feature and the first fundamental frequency to obtain the synthesized singing voice.
In a specific implementation, when the program is executed by a processor, any method step in the first embodiment may be further implemented.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable information processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable information processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable information processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable information processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (10)
1. A singing voice synthesizing method, comprising:
acquiring a target song, a music score of the target song and broadcasting voice of lyrics of the target song, wherein the music score comprises the lyrics and first duration corresponding to characters in the lyrics;
determining a first fundamental frequency of the target song and a first spectrum characteristic of the broadcast voice;
determining a first phoneme corresponding to the character according to the initial consonant and the final sound corresponding to the character;
determining a second duration of the first phoneme according to the first phoneme corresponding to the character, a first preset duration proportion threshold of each first phoneme and a first duration corresponding to the character;
determining a third duration of the first phoneme according to the broadcasting voice and the lyrics;
scaling the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature;
and synthesizing the second spectral feature and the first fundamental frequency to obtain synthesized singing voice.
2. The method of synthesizing singing voice according to claim 1, wherein determining the first fundamental frequency of the target song comprises:
carrying out audio track separation on the target song to obtain dry sound;
and extracting the third fundamental frequency of the dry sound to obtain the first fundamental frequency of the target song.
3. The method of synthesizing singing voice according to claim 1, wherein said determining a third duration of the first phoneme according to the broadcast voice and the lyrics comprises:
analyzing the music score to obtain the lyrics;
inputting the broadcasting voice and the lyrics into a preset voice recognition model;
and labeling the first phoneme corresponding to the characters in the lyrics of the broadcast voice through the voice recognition model to obtain a third duration of the first phoneme.
4. The singing voice synthesis method of claim 1, wherein the scaling the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature comprises:
labeling the first spectral feature according to a third duration of the first phoneme to obtain a third spectral feature of the first phoneme;
calculating a scaling ratio according to the second duration of the first phoneme and the third duration of the first phoneme;
and carrying out scaling processing on the third spectral feature according to the scaling ratio to obtain the second spectral feature.
5. The singing voice synthesis method according to any one of claims 1-4, further comprising, after the determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice:
labeling the first fundamental frequency according to the second duration of the first phoneme to obtain a second fundamental frequency of the first phoneme;
determining a second phoneme which does not contain fundamental frequency information in the first phoneme according to the pronunciation rule of the first phoneme;
adjusting a second fundamental frequency corresponding to the second phoneme to be zero;
and re-determining the first fundamental frequency of the target song according to the adjusted second fundamental frequency.
6. The singing voice synthesizing method according to claim 5, wherein the musical score further includes pitches corresponding to characters in the lyrics,
after the determining the first fundamental frequency of the target song and the first spectral feature of the broadcast voice, the method further comprises:
determining a fundamental frequency mean value and a fundamental frequency variance of the character according to a first phoneme corresponding to the character and a second fundamental frequency of the first phoneme;
calculating the fundamental frequency ratio of the character according to the fundamental frequency mean value of the character and the pitch corresponding to the character;
when the fundamental frequency ratio and/or the fundamental frequency variance are not within a preset threshold range, smoothing the second fundamental frequency of the first phoneme;
and re-determining the first fundamental frequency of the target song according to the smoothed second fundamental frequency of the first phoneme.
7. The method of synthesizing singing voice according to claim 1, further comprising, before said synthesizing said second spectral feature and said first fundamental frequency to obtain a synthesized singing voice:
determining a third fundamental frequency of the broadcast voice;
determining a zero value in the third fundamental frequency;
interpolating zero values in the first fundamental frequency to non-zero values;
adjusting a non-zero value in the first fundamental frequency based on a zero value in the third fundamental frequency.
8. A singing voice synthesizing apparatus, comprising:
the device comprises an acquisition unit, a display unit and a control unit, wherein the acquisition unit is used for acquiring a target song, a music score of the target song and broadcasting voice of the target song, and the music score comprises lyrics and first duration corresponding to characters in the lyrics;
the first determining unit is used for determining a first fundamental frequency of the target song and a first spectrum characteristic of the broadcast voice;
the second determining unit is used for determining a first phoneme corresponding to the character according to the initial consonant and the final consonant corresponding to the character;
a third determining unit, configured to determine a second duration of the first phoneme according to the first phoneme corresponding to the character, a first preset duration proportion threshold of each first phoneme, and the first duration corresponding to the character;
a fourth determining unit, configured to determine a third duration of the first phoneme according to the broadcast voice and the lyrics;
the processing unit is used for carrying out scaling processing on the first spectral feature according to the second duration of the first phoneme and the third duration of the first phoneme to obtain a second spectral feature;
and the synthesis unit is used for synthesizing the second spectral feature and the first fundamental frequency to obtain synthesized singing voice.
9. A singing voice synthesizing apparatus, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the singing voice synthesis method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to execute the singing voice synthesizing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011537625.3A CN112750420B (en) | 2020-12-23 | 2020-12-23 | Singing voice synthesis method, device and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112750420A true CN112750420A (en) | 2021-05-04 |
CN112750420B CN112750420B (en) | 2023-01-31 |
Family
ID=75646205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011537625.3A Active CN112750420B (en) | 2020-12-23 | 2020-12-23 | Singing voice synthesis method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112750420B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103035235A (en) * | 2011-09-30 | 2013-04-10 | 西门子公司 | Method and device for transforming voice into melody |
CN106373580A (en) * | 2016-09-05 | 2017-02-01 | 北京百度网讯科技有限公司 | Singing synthesis method based on artificial intelligence and device |
CN111402858A (en) * | 2020-02-27 | 2020-07-10 | 平安科技(深圳)有限公司 | Singing voice synthesis method and device, computer equipment and storage medium |
CN111681637A (en) * | 2020-04-28 | 2020-09-18 | 平安科技(深圳)有限公司 | Song synthesis method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112750420B (en) | 2023-01-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20221229. Address after: Floor 10, Building D11, Hongfeng Science Park, Nanjing Economic and Technological Development Zone, 210000 Jiangsu Province. Applicant after: New Technology Co.,Ltd. Address before: 215000 unit 4-b404, creative industry park, 328 Xinghu street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant before: Go out and ask (Suzhou) Information Technology Co.,Ltd. |
GR01 | Patent grant | |