CN111681637A - Song synthesis method, device, equipment and storage medium - Google Patents


Info

Publication number
CN111681637A
CN111681637A (application CN202010350256.0A)
Authority
CN
China
Prior art keywords
duration
initial
singing
lyric
fundamental frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010350256.0A
Other languages
Chinese (zh)
Other versions
CN111681637B (en)
Inventor
朱清影
韩宝强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority: CN202010350256.0A
Publication of CN111681637A
PCT filing: PCT/CN2020/131663 (published as WO2021218138A1)
Application granted
Publication of CN111681637B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Abstract

The invention relates to artificial intelligence and discloses a song synthesis method comprising the following steps: acquiring lyric recitation audio and music score information; performing duration labeling on the lyric recitation audio through a preset speech recognition model and a lyric pinyin text to obtain recitation durations; analyzing initial acoustic parameters from the lyric recitation audio through a preset vocoder; extracting singing durations from the lyric pinyin text according to a preset initial-consonant variable-speed dictionary, rhythm information and beat information; performing variable-speed processing on the initial acoustic parameters according to a preset variable-speed algorithm, the recitation durations and the singing durations; performing formant enhancement on the variable-speed spectral envelope to obtain an enhanced spectral envelope; correcting the pitch information, the singing durations and the variable-speed fundamental frequency to obtain a corrected fundamental frequency; and performing song synthesis on the processed acoustic parameters through the preset vocoder. The invention also relates to a blockchain in which the synthesized song is stored.

Description

Song synthesis method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech signal processing, and in particular, to a song synthesis method, apparatus, device, and storage medium.
Background
In recent decades, singing synthesis technology has received increasing attention from industry. Inspired by speech synthesis technology, singing synthesis techniques based on waveform concatenation and parametric synthesis have gradually appeared, but most related research focuses on synthesizing singing from text, that is, converting text information into singing audio rather than converting speech audio directly into singing audio.
An algorithm for automatically distinguishing speech from singing has also been developed in industry, but that technique has not been further applied to recitation-to-singing synthesis, which gives natural speech a melody and converts it directly into singing. Traditional variable-speed, pitch-changing algorithms are based on overlap-add operations at the waveform level and suffer from waveform breaks and unnatural transitions.
Disclosure of Invention
The invention mainly aims to solve the technical problems of waveform breaks and unnatural transitions in traditional variable-speed, pitch-changing algorithms based on overlap-add operations at the waveform level.
In order to achieve the above object, a first aspect of the present invention provides a song synthesis method, including: acquiring lyric recitation audio and music score information of a target song, wherein the music score information comprises a lyric pinyin text, beat information, rhythm information and pitch information; performing duration labeling on the phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, wherein the recitation duration of a phoneme comprises an initial recitation duration and a final recitation duration; analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, wherein the initial acoustic parameters comprise a fundamental frequency, a spectral envelope and an aperiodic sequence; extracting the singing durations of the phonemes from the lyric pinyin text according to a preset initial-consonant variable-speed dictionary, the rhythm information and the beat information, wherein the singing duration of a phoneme comprises an initial singing duration and a final singing duration; performing variable-speed processing on the initial acoustic parameters according to a preset variable-speed algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, wherein the target acoustic parameters comprise a variable-speed fundamental frequency, a variable-speed spectral envelope and a variable-speed aperiodic sequence; performing formant enhancement on the variable-speed spectral envelope to obtain an enhanced spectral envelope; correcting the pitch information, the singing durations and the variable-speed fundamental frequency to obtain a corrected fundamental frequency; and performing song synthesis on the variable-speed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
Optionally, in a first implementation manner of the first aspect of the present invention, the performing duration labeling on a phoneme in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain a recitation duration of the phoneme, where the recitation duration of the phoneme includes an initial recitation duration and a final recitation duration, includes: analyzing the music score information, and reading the lyric pinyin text from the analyzed music score information; inputting the lyric recitation audio and the lyric pinyin text into a preset voice recognition model, and carrying out voice analysis on the lyric recitation audio through the preset voice recognition model; marking phonemes in the lyric reciting audio after voice analysis according to the lyric pinyin text through a preset voice recognition model to obtain a time stamp and a duration of the phonemes, wherein the phonemes comprise initials and finals; and determining the recitation duration of the phoneme according to the time stamp and the duration of the phoneme, wherein the recitation duration of the phoneme comprises the initial recitation duration and the final recitation duration.
Optionally, in a second implementation manner of the first aspect of the present invention, the extracting, from the lyric pinyin text, the singing duration of the phoneme including an initial singing duration and a final singing duration according to a preset initial-consonant variable-speed dictionary, the rhythm information, and the beat information includes: extracting a singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information; querying the preset initial-consonant variable-speed dictionary according to the singing duration t of each character to obtain the singing duration t1 of the initial consonant of each character; subtracting the singing duration t1 of the initial consonant of each character from the singing duration t of each character to obtain the singing duration t2 of the vowel of each character, where t2 = t - t1; and setting the singing duration of the initial consonant of each character and the singing duration of the vowel of each character as the singing durations of the phonemes corresponding to each character.
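The duration split above can be sketched as follows. The dictionary values here are purely illustrative stand-ins for the preset initial-consonant variable-speed dictionary, whose real entries the patent does not list.

```python
# Hypothetical stand-in for the preset initial-consonant variable-speed
# dictionary; real values would come from recorded singing data.
INITIAL_DURATIONS = {"x": 0.08, "w": 0.05, "zh": 0.10}

def split_singing_duration(initial, t):
    """Return (t1, t2): initial and vowel singing durations for one character."""
    t1 = INITIAL_DURATIONS.get(initial, 0.0)  # no initial -> all time to the vowel
    t1 = min(t1, t)                           # an initial cannot outlast the character
    t2 = t - t1                               # t2 = t - t1, as in the method
    return t1, t2
```

A character sung for 0.5 s with initial "x" would thus get 0.08 s for the initial and 0.42 s for the vowel under these assumed dictionary values.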
Optionally, in a third implementation manner of the first aspect of the present invention, the performing variable-speed processing on the initial acoustic parameters according to a preset variable-speed algorithm, the recitation duration, and the singing duration to obtain target acoustic parameters, where the target acoustic parameters include a variable-speed fundamental frequency, a variable-speed spectral envelope, and a variable-speed aperiodic sequence, includes: calculating a speed-change ratio r of the phoneme according to the initial recitation duration, the final recitation duration, the initial singing duration, and the final singing duration, where r is greater than 0; performing variable-speed processing on the initial acoustic parameters according to the speed-change ratio r through a preset variable-speed algorithm to obtain variable-speed acoustic parameters; and concatenating the variable-speed acoustic parameters in series to obtain the target acoustic parameters, where the target acoustic parameters include the variable-speed fundamental frequency, the variable-speed spectral envelope, and the variable-speed aperiodic sequence.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the performing variable-speed processing on the initial acoustic parameters according to the speed-change ratio r through a preset variable-speed algorithm to obtain variable-speed acoustic parameters includes: when r is equal to 1, taking the initial acoustic parameters as the variable-speed acoustic parameters; when r is equal to 2, doubling the length of the initial acoustic parameters to obtain the variable-speed acoustic parameters; when r is less than 2 and not equal to 1, performing variable-speed processing on the initial acoustic parameters by a preset equal-ratio frame addition and subtraction algorithm to obtain the variable-speed acoustic parameters; and when r is greater than 2, lengthening the initial acoustic parameters by more than two times to obtain the variable-speed acoustic parameters.
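The case split on r can be sketched at the frame level as below. The exact frame-selection rule for the non-integer cases is a simplifying assumption; the patent only names an equal-ratio frame addition/subtraction algorithm without specifying it.

```python
# Frame-level speed change driven by the ratio r. "Frames" are vocoder
# analysis frames (e.g. per-frame F0 values). The r == 1 and r == 2
# branches mirror the text; the general branch is an assumed evenly
# spaced resampling that stands in for equal-ratio frame add/subtract.
def time_scale(frames, r):
    """Stretch or compress a frame sequence to round(len(frames) * r) frames."""
    if r <= 0:
        raise ValueError("speed-change ratio r must be positive")
    if r == 1:                       # unchanged
        return list(frames)
    if r == 2:                       # exact doubling: repeat every frame once
        return [f for f in frames for _ in (0, 1)]
    n_out = round(len(frames) * r)   # general case: pick evenly spaced source frames
    return [frames[min(int(i / r), len(frames) - 1)] for i in range(n_out)]
```

For example, four frames at r = 1.5 become six frames, while r = 0.5 halves them to two.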
Optionally, in a fifth implementation manner of the first aspect of the present invention, the performing a formant enhancement process on the frequency spectrum envelope after the speed change to obtain an enhanced frequency spectrum envelope includes: inquiring a formant in a frequency band of about 3 kilohertz from the frequency spectrum envelope after the speed change, and recording the central frequency and the amplitude of the formant; determining the intensity coefficient of a boosting filter and the central frequency to be enhanced according to the central frequency and the amplitude of the formants; carrying out formant enhancement according to the intensity coefficient of the boost filter and the central frequency to be enhanced to obtain a formant enhancement spectrum; and carrying out filtering processing on the formant enhanced spectrum to obtain an enhanced spectrum envelope.
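The formant step above can be sketched as locating the envelope peak near 3 kHz and applying a boost centered there. The band limits, gain, and Gaussian bump shape are illustrative assumptions; the patent does not specify the boost filter's form.

```python
import numpy as np

# Minimal sketch of formant enhancement near 3 kHz: find the spectral
# envelope peak inside an assumed 2-4 kHz band and multiply in a
# Gaussian boost centered on it. Gain and bandwidth are illustrative.
def enhance_formant(envelope, freqs, band=(2000.0, 4000.0), gain_db=6.0, bw=300.0):
    envelope = np.asarray(envelope, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    idx = np.flatnonzero((freqs >= band[0]) & (freqs <= band[1]))
    peak = idx[np.argmax(envelope[idx])]          # formant center bin
    fc = freqs[peak]                              # center frequency to enhance
    bump = (10 ** (gain_db / 20) - 1) * np.exp(-0.5 * ((freqs - fc) / bw) ** 2)
    return envelope * (1.0 + bump)                # boosted spectral envelope
```

Frequencies far from the detected center are left essentially untouched, so only the singer's-formant region gains energy.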
Optionally, in a sixth implementation manner of the first aspect of the present invention, the synthesized song is stored in a blockchain, and the performing correction processing based on the pitch information, the singing duration, and the variable-speed fundamental frequency to obtain a corrected fundamental frequency includes: generating a fundamental frequency of the song based on the pitch information, the singing duration, and the variable-speed fundamental frequency; superposing the fundamental frequencies in the initial acoustic parameters and calculating an average value to obtain an average fundamental frequency; performing pitch-raising or pitch-lowering processing on the fundamental frequency of the song based on the average fundamental frequency to obtain an initial fundamental frequency sequence, wherein the initial fundamental frequency sequence comprises pitches and notes; when detecting that the same character corresponds to different pitches in the initial fundamental frequency sequence, smoothing the notes corresponding to the same pitch; and when detecting that a pitch change exists between adjacent notes in the initial fundamental frequency sequence, applying preparation and overshoot to the adjacent notes through a preset formula
H(s) = kω² / (s² + 2ξωs + ω²)
wherein s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain; when detecting that the duration of a note in the initial fundamental frequency sequence is greater than a preset threshold, adding vibrato to the part of the initial fundamental frequency sequence corresponding to the note; and when detecting that a note in the initial fundamental frequency sequence is excessively smooth, adding white noise to the initial fundamental frequency sequence to obtain the corrected fundamental frequency.
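The preparation/overshoot behavior at a pitch change can be illustrated as the step response of a second-order system with the natural frequency ω, damping coefficient ξ, and gain k named in the text. The forward-Euler discretization and all numeric values below are assumptions for illustration only.

```python
# Step response of an underdamped second-order system: when the target
# pitch jumps, the output overshoots past it and settles, mimicking the
# preparation/overshoot effect at note transitions. Parameters w (natural
# frequency), xi (damping), k (gain), dt and steps are illustrative.
def note_transition(f0_from, f0_to, w=60.0, xi=0.3, k=1.0, dt=0.005, steps=200):
    y, v = f0_from, 0.0                              # pitch and its rate of change
    out = []
    for _ in range(steps):
        a = w * w * (k * f0_to - y) - 2 * xi * w * v  # y'' of the 2nd-order ODE
        v += a * dt                                   # forward-Euler integration
        y += v * dt
        out.append(y)
    return out
```

With ξ < 1 the trajectory rises above the target pitch before settling on it, which is exactly the overshoot the correction step adds between adjacent notes.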
A second aspect of the present invention provides a song synthesizing apparatus comprising: the acquisition module is used for acquiring lyric recitation audio frequency and music score information of a target song, wherein the music score information comprises a lyric pinyin text, beat information, rhythm information and pitch information; the marking module is used for marking the time length of a phoneme in the lyric recitation audio through a preset voice recognition model and the lyric pinyin text to obtain the recitation time length of the phoneme, wherein the recitation time length of the phoneme comprises an initial recitation time length and a final recitation time length; the analysis module is used for analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, wherein the initial acoustic parameters comprise a fundamental frequency, a frequency spectrum envelope and a non-periodic sequence; the extraction module is used for extracting the singing duration of the phoneme from the lyric pinyin text according to a preset initial consonant variable speed dictionary, the rhythm information and the beat information, wherein the singing duration of the phoneme comprises an initial consonant singing duration and a vowel singing duration; the speed change module is used for carrying out speed change processing on the initial acoustic parameters according to a preset speed change algorithm, the recitation duration and the singing duration to obtain target acoustic parameters, and the target acoustic parameters comprise a base frequency after speed change, a frequency spectrum envelope after speed change and a non-periodic sequence after speed change; the enhancement module is used for carrying out formant enhancement processing on the frequency spectrum envelope after the speed change to obtain an enhanced frequency spectrum envelope; the correcting module is used for correcting and processing 
the pitch information, the singing duration and the variable-speed fundamental frequency to obtain a corrected fundamental frequency; and the synthesis module is used for carrying out song synthesis processing on the variable-speed non-periodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
Optionally, in a first implementation manner of the second aspect of the present invention, the tagging module is specifically configured to: analyzing the music score information, and reading the lyric pinyin text from the analyzed music score information; inputting the lyric recitation audio and the lyric pinyin text into a preset voice recognition model, and carrying out voice analysis on the lyric recitation audio through the preset voice recognition model; marking phonemes in the lyric reciting audio after voice analysis according to the lyric pinyin text through a preset voice recognition model to obtain a time stamp and a duration of the phonemes, wherein the phonemes comprise initials and finals; and determining the recitation duration of the phoneme according to the time stamp and the duration of the phoneme, wherein the recitation duration of the phoneme comprises the initial recitation duration and the final recitation duration.
Optionally, in a second implementation manner of the second aspect of the present invention, the extraction module is specifically configured to: extract a singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information; query the preset initial-consonant variable-speed dictionary according to the singing duration t of each character to obtain the singing duration t1 of the initial consonant of each character; subtract the singing duration t1 of the initial consonant of each character from the singing duration t of each character to obtain the singing duration t2 of the vowel of each character, where t2 = t - t1; and set the singing duration of the initial consonant of each character and the singing duration of the vowel of each character as the singing durations of the phonemes corresponding to each character.
Optionally, in a third implementation manner of the second aspect of the present invention, the speed changing module includes: the calculating unit is used for calculating the speed change rate r of the phoneme according to the initial reading duration, the final reading duration, the initial singing duration and the final singing duration, and the r is greater than 0; the speed change unit is used for carrying out speed change processing on the initial acoustic parameters according to the speed change multiplying power r through a preset speed change algorithm to obtain acoustic parameters after speed change; and the splicing unit is used for splicing the acoustic parameters after the speed change in series to obtain target acoustic parameters, and the target acoustic parameters comprise the fundamental frequency after the speed change, the frequency spectrum envelope after the speed change and the non-periodic sequence after the speed change.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the speed change unit is specifically configured to: when the r is equal to 1, determining the initial acoustic parameter as the acoustic parameter after the speed change; when the r is equal to 2, performing twice lengthening processing on the initial acoustic parameter to obtain the acoustic parameter after speed change; when the r is smaller than 2 and is not equal to 1, carrying out variable speed processing on the initial acoustic parameter by adopting a preset equal ratio addition and subtraction frame algorithm to obtain an acoustic parameter after variable speed; and when the r is more than 2, prolonging the initial acoustic parameters by more than two times to obtain the acoustic parameters after speed change.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the enhancing module is specifically configured to: inquiring a formant in a frequency band of about 3 kilohertz from the frequency spectrum envelope after the speed change, and recording the central frequency and the amplitude of the formant; determining the intensity coefficient of a boosting filter and the central frequency to be enhanced according to the central frequency and the amplitude of the formants; carrying out formant enhancement according to the intensity coefficient of the boost filter and the central frequency to be enhanced to obtain a formant enhancement spectrum; and carrying out filtering processing on the formant enhanced spectrum to obtain an enhanced spectrum envelope.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the synthesized song is stored in a blockchain, and the correction module is specifically configured to: generate a fundamental frequency of the song based on the pitch information, the singing duration, and the variable-speed fundamental frequency; superpose the fundamental frequencies in the initial acoustic parameters and calculate an average value to obtain an average fundamental frequency; perform pitch-raising or pitch-lowering processing on the fundamental frequency of the song based on the average fundamental frequency to obtain an initial fundamental frequency sequence, wherein the initial fundamental frequency sequence comprises pitches and notes; when detecting that the same character corresponds to different pitches in the initial fundamental frequency sequence, smooth the notes corresponding to the same pitch; and when detecting that a pitch change exists between adjacent notes in the initial fundamental frequency sequence, apply preparation and overshoot to the adjacent notes through a preset formula
H(s) = kω² / (s² + 2ξωs + ω²)
wherein s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain; when detecting that the duration of a note in the initial fundamental frequency sequence is greater than a preset threshold, add vibrato to the part of the initial fundamental frequency sequence corresponding to the note; and when detecting that a note in the initial fundamental frequency sequence is excessively smooth, add white noise to the initial fundamental frequency sequence to obtain a corrected fundamental frequency.
A third aspect of the present invention provides a song synthesizing apparatus comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor invokes the instructions in the memory to cause the song composition apparatus to perform the song composition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-described song composition method.
In the technical scheme provided by the invention, the lyric recitation audio and music score information of a target song are acquired, wherein the music score information comprises a lyric pinyin text, beat information, rhythm information and pitch information; duration labeling is performed on the phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, wherein the recitation duration of a phoneme comprises an initial recitation duration and a final recitation duration; the lyric recitation audio is analyzed through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, wherein the initial acoustic parameters comprise a fundamental frequency, a spectral envelope and an aperiodic sequence; the singing durations of the phonemes are extracted from the lyric pinyin text according to a preset initial-consonant variable-speed dictionary, the rhythm information and the beat information, wherein the singing duration of a phoneme comprises an initial singing duration and a final singing duration; variable-speed processing is performed on the initial acoustic parameters according to a preset variable-speed algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, wherein the target acoustic parameters comprise a variable-speed fundamental frequency, a variable-speed spectral envelope and a variable-speed aperiodic sequence; formant enhancement is performed on the variable-speed spectral envelope to obtain an enhanced spectral envelope; the pitch information, the singing durations and the variable-speed fundamental frequency are corrected to obtain a corrected fundamental frequency; and song synthesis is performed on the variable-speed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song. In the embodiment of the invention, acoustic parameters are analyzed from the lyric recitation audio, variable-speed processing and concatenation are realized at the acoustic-parameter level through the vocoder based on the music score information, and speaking voice is converted into singing voice, so that song synthesis is realized while keeping the user's original timbre and vocal range. This improves the naturalness of the singing voice; moreover, songs can be synthesized without collecting a large amount of singing data, which reduces the data-collection cost of song synthesis.
Drawings
FIG. 1 is a diagram of an embodiment of a song synthesis method in an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of the song synthesis method in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a song synthesis apparatus in an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of the song synthesis apparatus in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of song synthesis equipment in an embodiment of the present invention.
Detailed Description
The embodiments of the invention provide a song synthesis method, apparatus, device, and storage medium, which analyze acoustic parameters from lyric recitation audio, perform variable-speed processing and concatenation at the acoustic-parameter level through a vocoder based on music score information, and convert speaking voice into singing voice, realizing song synthesis while keeping the user's original timbre and vocal range. This improves the naturalness of the singing voice, allows songs to be synthesized without collecting a large amount of singing data, and reduces the data-collection cost of song synthesis.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a song synthesis method according to an embodiment of the present invention includes:
101. Acquire the lyric recitation audio and music score information of the target song, where the music score information includes the lyric pinyin text, beat information, rhythm information, and pitch information.
It is to be understood that the executing subject of the present invention may be a song synthesizing apparatus, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
The lyric recitation audio and music score information of the target song are stored in advance in a preset data table and associated through a unique identifier, and the lyrics are recorded in the music score information in pinyin form. Specifically, the server acquires the unique identifier of the target song; the server generates a query statement according to the grammar rules of the structured query language and the unique identifier; and the server executes the query statement to obtain the lyric recitation audio and music score information of the target song, where the music score information includes the lyric pinyin text, beat information, rhythm information, and pitch information. For example, "wo de zu guo" ("my motherland") serves as lyric pinyin text; the beat information refers to the total note length of each bar in the music score, such as 4/4 time or 6/8 time; the rhythm information indicates the length and strength of the notes; and the pitch information indicates how high or low the voice should be when singing the target song.
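The identifier-keyed lookup described above can be sketched with a parameterized SQL query. The table name and columns here are hypothetical; the patent only says the audio and score are associated in a preset data table through a unique identifier.

```python
import sqlite3

# Hypothetical schema: recitation audio and score info in one table,
# keyed by the song's unique identifier.
def fetch_song_assets(conn, song_id):
    query = "SELECT recitation_audio, score_info FROM songs WHERE song_id = ?"
    row = conn.execute(query, (song_id,)).fetchone()  # parameterized, not string-built
    return row  # (audio_blob, score_text) or None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id TEXT PRIMARY KEY, recitation_audio BLOB, score_info TEXT)")
conn.execute("INSERT INTO songs VALUES (?, ?, ?)", ("song-001", b"\x00\x01", '{"meter": "4/4"}'))
```

Using placeholders rather than concatenating the identifier into the statement keeps the generated query safe regardless of the identifier's contents.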
102. Labeling the durations of the phonemes in the lyric recitation audio through a preset voice recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, where the recitation duration of a phoneme includes an initial recitation duration and a final recitation duration.
Each character of the lyrics in the lyric recitation audio is represented by pinyin, each pinyin syllable corresponds to at least one phoneme, and the phonemes comprise initials and finals, so the server performs speech analysis on the lyric recitation audio through the preset voice recognition model and labels the durations of the phonemes in it. It will be appreciated that a pinyin syllable is decomposed into two phonemes, an initial and a final; for example, "xiang" is decomposed into "x" and "iang", and the preset voice recognition model outputs the recitation durations of these two phonemes in the recitation audio, namely the initial recitation duration and the final recitation duration.
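The initial/final decomposition described above can be sketched as follows. This is a minimal Python illustration with a hand-written initial list, not the patent's actual segmentation logic; the two-letter initials must be tested before their one-letter prefixes:

```python
# Split a Mandarin pinyin syllable into initial (consonant) and final (vowel part).
# "zh"/"ch"/"sh" are checked before "z"/"c"/"s" so that e.g. "zhong" matches "zh".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable: str):
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable such as "an"

print(split_pinyin("xiang"))  # ('x', 'iang')
print(split_pinyin("zhong"))  # ('zh', 'ong')
```

In a real system the decomposition would come from the speech recognition model's pronunciation lexicon rather than string matching.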
103. Analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, where the initial acoustic parameters include a fundamental frequency, a spectral envelope and an aperiodic sequence.
The preset vocoder includes the vocoder WORLD. Further, the server performs data processing on the duration information of the phonemes in the lyric recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, where the data processing includes filtering, standard deviation calculation and smoothing. The lyric recitation audio is a signal composed of sine waves: a sound signal produced by vibration can be decomposed into several sine waves of different frequencies, among which the sine wave with the lowest frequency is the fundamental frequency F0, and the others are harmonics, i.e., overtones. The spectral envelope SP is the curve obtained by connecting the amplitude peaks at different frequencies with a smooth line. The aperiodic sequence AP corresponds to the aperiodic pulse sequence of the mixed-excitation part, where mixed excitation means controlling periodic excitation, noise and aperiodic signals through various parameters.
104. Extracting the singing durations of the phonemes from the lyric pinyin text according to a preset initial variable speed dictionary, the rhythm information and the beat information, where the singing duration of a phoneme includes an initial singing duration and a final singing duration.
The server counts, in advance, the pronunciation duration patterns of each initial under different tempos and different pronunciation durations, and formulates the initial variable speed dictionary from these patterns; that is, the initial variable speed dictionary is preset. Specifically, the server queries the preset initial variable speed dictionary to obtain the initial singing duration of a phoneme, and calculates the final singing duration from the singing duration of the phoneme and the initial singing duration, where the singing duration of the phoneme is the sum of the initial singing duration and the final singing duration.
105. Performing variable speed processing on the initial acoustic parameters according to a preset variable speed algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, where the target acoustic parameters include the fundamental frequency after the speed change, the spectral envelope after the speed change and the non-periodic sequence after the speed change.
Because the durations of phonemes in recited speech differ from their durations in singing, each phoneme's duration in the recited speech must be extended or shortened according to the singing duration. Moreover, because the initial and the final of the same character take different shares of the total duration at different pronunciation lengths, the durations of the initial and the final need to be adjusted separately. The server therefore adjusts the durations of the initial acoustic parameters to obtain the target acoustic parameters, which include the fundamental frequency after the speed change, the spectral envelope after the speed change and the non-periodic sequence after the speed change.
It can be understood that this scheme is not only simple in principle and easy to implement, but also avoids overlap-add operations at the waveform level, and thus avoids the low accuracy of acoustic parameter extraction caused by waveform damage. The object of the variable speed algorithm changes from the waveform to the acoustic parameters, unifying it with the subsequent pitch-shifting algorithm and effectively improving the controllability of the system.
106. Performing formant enhancement processing on the spectral envelope after the speed change to obtain the enhanced spectral envelope.
Formants are the natural spectral peaks of the voice. Compared with speech, the spectral envelope of singing audio has a sharp peak in the band around 3 kHz; this peak is unique to singing and is therefore called the "singing formant". To make the converted audio more natural, a singing formant is added to the spectral envelope of the audio, that is, the amplitude of the spectral envelope after the speed change in the band around 3 kHz is boosted to obtain the enhanced spectral envelope.
107. Performing correction processing based on the pitch information, the singing durations and the fundamental frequency after the speed change to obtain the corrected fundamental frequency.
To make the synthesized song match the vocal range of the recitation as closely as possible and reduce timbre distortion caused by pitch shifting, the server generates the fundamental frequency of the song according to the pitch information and the singing durations, and then shifts the song's fundamental frequency as a whole according to the fundamental frequency after the speed change, so that the average fundamental frequency of the song is as close as possible to the average fundamental frequency of the speech.
If accompaniment is to be matched with the synthesized singing, the accompaniment needs corresponding pitch-shifting processing, where the pitch shifting is either upward or downward.
108. Performing song synthesis processing on the variable-speed non-periodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
That is, the server finally inputs three acoustic features, namely the corrected fundamental frequency, the enhanced spectral envelope and the variable-speed non-periodic sequence, into the preset vocoder, which synthesizes and outputs the synthesized song. The synthesized song is a waveform signal whose timbre and vocal range are consistent with those of the lyric recitation audio, so the singing voice sounds more natural.
It is emphasized that, to further ensure the privacy and security of the synthesized song, the synthesized song may also be stored in a node of a blockchain.
In the embodiment of the invention, acoustic parameters are extracted from the lyric recitation audio, and variable speed processing and splicing are performed at the acoustic parameter level through the vocoder based on the music score information, converting speaking voice into singing voice. Song synthesis is achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; at the same time, songs can be synthesized without collecting a large amount of singing data, reducing the data collection cost of song synthesis.
Referring to fig. 2, another embodiment of the song synthesizing method according to the embodiment of the present invention includes:
201. Acquiring lyric recitation audio and music score information of a target song, where the music score information includes a lyric pinyin text, beat information, rhythm information and pitch information.
Specifically, the server acquires the unique identifier of the target song, where the unique identifier associates the lyric recitation audio with the music score information; for example, the unique identifier is s_1, the target song is A, and there is a one-to-one correspondence between s_1 and A. The server generates a query statement according to the grammar rules of the structured query language and the unique identifier, for example, select * from ranges_table where id = 's_1'. The server executes the query statement to obtain the lyric recitation audio and the music score information of the target song, where the music score information includes the lyric pinyin text, the beat information, the rhythm information, the pitch and the syllable singing durations, and the lyrics are recorded in the music score information in pinyin form.
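A minimal sketch of this lookup by unique identifier, using Python's sqlite3. The table layout, column names and paths below are illustrative assumptions; the patent only says the audio and score are stored in a preset data table keyed by a unique identifier:

```python
import sqlite3

# Hypothetical schema: id, path of the recitation audio, serialized score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ranges_table (id TEXT PRIMARY KEY, audio_path TEXT, score TEXT)")
conn.execute("INSERT INTO ranges_table VALUES ('s_1', '/data/recite_s1.wav', '<score xml>')")

# Retrieve the lyric recitation audio and score information for song s_1.
song_id = "s_1"
row = conn.execute("SELECT * FROM ranges_table WHERE id = ?", (song_id,)).fetchone()
print(row[0], row[1])  # s_1 /data/recite_s1.wav
```

Using a parameterized query (`?`) rather than string concatenation avoids SQL injection while following the same grammar-rule idea.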
202. Labeling the durations of the phonemes in the lyric recitation audio through the preset voice recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, where the recitation duration of a phoneme includes an initial recitation duration and a final recitation duration.
Specifically, the server parses the music score information and reads the lyric pinyin text from the parsed music score information. The server inputs the lyric recitation audio and the lyric pinyin text into the preset voice recognition model and performs speech analysis on the lyric recitation audio through the model. The server then labels the phonemes in the analyzed lyric recitation audio according to the lyric pinyin text, obtaining a time stamp and a duration for each phoneme, where the phonemes comprise initials and finals. Finally, the server determines the recitation duration of each phoneme in the lyric recitation audio from its time stamp and duration, where the recitation duration of a phoneme includes the initial recitation duration and the final recitation duration.
It will be appreciated that the time stamps serve to accurately mark the relative positions of the phonemes in the lyric recitation audio, and that the initial recitation duration and the final recitation duration can be determined in conjunction with the duration of the phonemes.
203. Analyzing the lyric recitation audio through the preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, where the initial acoustic parameters include a fundamental frequency, a spectral envelope and an aperiodic sequence.
The preset vocoder includes the vocoder WORLD. Further, the server performs data processing on the duration information of the phonemes in the lyric recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, where the data processing includes filtering, standard deviation calculation and smoothing. The lyric recitation audio is a signal composed of sine waves: a sound signal produced by vibration can be decomposed into several sine waves of different frequencies, among which the sine wave with the lowest frequency is the fundamental frequency F0, and the others are harmonics, i.e., overtones. The spectral envelope SP is the curve obtained by connecting the amplitude peaks at different frequencies with a smooth line. The aperiodic sequence AP corresponds to the aperiodic pulse sequence of the mixed-excitation part, where mixed excitation means controlling periodic excitation, noise and aperiodic signals through various parameters.
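The patent extracts F0, SP and AP with the WORLD vocoder. The following library-free sketch only illustrates the fundamental-frequency idea from this paragraph, the lowest-frequency sine in a harmonic mixture, via autocorrelation on a synthetic tone; it is not WORLD's algorithm, and all constants are illustrative:

```python
import numpy as np

sr = 8000
n = 4000
t = np.arange(n) / sr
# A 220 Hz "voiced" signal plus two overtones; the lowest sine is the fundamental.
x = (np.sin(2 * np.pi * 220 * t)
     + 0.5 * np.sin(2 * np.pi * 440 * t)
     + 0.25 * np.sin(2 * np.pi * 660 * t))

# Autocorrelation: the strongest peak at a nonzero lag marks the period.
ac = np.correlate(x, x, mode="full")[n - 1:]
lo, hi = sr // 500, sr // 80        # restrict the search to 80-500 Hz
lag = lo + int(np.argmax(ac[lo:hi]))
f0 = sr / lag
print(round(f0))  # close to 220
```

WORLD's F0 estimators (DIO/Harvest) are far more robust on real speech, and SP/AP extraction requires the full vocoder analysis.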
204. Extracting the singing durations of the phonemes from the lyric pinyin text according to the preset initial variable speed dictionary, the rhythm information and the beat information, where the singing duration of a phoneme includes an initial singing duration and a final singing duration.
Specifically, the server extracts the singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information; the server queries the preset initial variable speed dictionary according to the singing duration t of each character to obtain the initial singing duration t1 of each character; the server subtracts the initial singing duration t1 from the singing duration t to obtain the final singing duration t2 of each character, where t2 = t - t1; and the server takes the initial singing duration and the final singing duration of each character as the singing durations of its phonemes. For example, for the character "incense", the corresponding pinyin is "xiang", which can be decomposed into "x" and "iang"; if the server determines that the singing duration of "xiang" is 1 second, then "x" lasts 0.3 second and "iang" lasts 0.7 second.
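The t2 = t - t1 split above can be sketched as follows; the dictionary values here are made-up placeholders, whereas the patent derives them from statistics over recordings at different tempos:

```python
# Hypothetical initial variable speed dictionary: initial consonant -> its
# singing duration in seconds (real values would be tempo-dependent statistics).
INITIAL_DURATION = {"x": 0.3, "zh": 0.25, "m": 0.2}

def phoneme_sing_durations(initial: str, syllable_duration: float):
    t1 = INITIAL_DURATION.get(initial, 0.0)   # initial singing duration t1
    t2 = syllable_duration - t1               # final singing duration t2 = t - t1
    return t1, t2

print(phoneme_sing_durations("x", 1.0))  # (0.3, 0.7)
```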
205. Calculating the speed change ratio r of a phoneme according to the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, where r is greater than 0.
Further, for a final, the server calculates the speed change ratio r from the final recitation duration and the final singing duration, where r = final singing duration / final recitation duration and r is greater than 0; for an initial, the server calculates r from the initial recitation duration and the initial singing duration, where r = initial singing duration / initial recitation duration and r is greater than 0.
206. Performing variable speed processing on the initial acoustic parameters according to the speed change ratio r through the preset variable speed algorithm to obtain the acoustic parameters after the speed change.
First, when r is equal to 1, the server takes the initial acoustic parameters corresponding to the current phoneme as the acoustic parameters after the speed change. Second, when r is equal to 2, the server doubles the length of the initial acoustic parameters corresponding to the current phoneme to obtain the acoustic parameters after the speed change; further, the server performs this doubling with a preset average frame insertion algorithm, that is, one new frame is inserted between every two adjacent frames of the initial acoustic parameters, and the value of each inserted frame is the average of the two adjacent frames between which it is inserted.
Then, when r is less than 2 and not equal to 1, the server performs variable speed processing on the initial acoustic parameters corresponding to the current phoneme with a preset equal-ratio frame addition and subtraction algorithm to obtain the acoustic parameters after the speed change. Further, if the sequence length corresponding to the initial acoustic parameters is l before the speed change, the sequence length after the speed change is l × r. Specifically, the server takes the integer sequence from 0 to l × r, divides each integer in the sequence by r and rounds the result, and then uses the resulting rounded sequence as indices to take values from the sequence of the initial acoustic parameters, obtaining a new sequence of length l × r, i.e., the acoustic parameters after the speed change.
Finally, when r is greater than 2, the server extends the initial acoustic parameters corresponding to the current phoneme by more than a factor of two to obtain the acoustic parameters after the speed change. Specifically, the server first performs the step for r equal to 2 to double the data, and then repeats the whole variable speed processing on the result with the new speed change ratio r/2.
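The four cases above can be sketched in NumPy as follows. This is a reading of the described procedure, not the patent's actual code; note that the average-frame step for r = 2 yields 2l - 1 frames (one inserted between each adjacent pair):

```python
import numpy as np

def time_stretch(params: np.ndarray, r: float) -> np.ndarray:
    """Stretch or compress a per-frame parameter sequence by ratio r (r > 0)."""
    if r == 1:
        return params.copy()
    if r == 2:
        # Average-frame insertion: put the mean of each adjacent pair between them.
        mids = (params[:-1] + params[1:]) / 2.0
        out = np.empty(2 * len(params) - 1, dtype=params.dtype)
        out[0::2] = params
        out[1::2] = mids
        return out
    if r < 2:
        # Equal-ratio frame addition/subtraction: index resampling with rounding.
        l = len(params)
        idx = np.minimum((np.arange(int(l * r)) / r).round().astype(int), l - 1)
        return params[idx]
    # r > 2: double once with the r == 2 step, then recurse with ratio r / 2.
    return time_stretch(time_stretch(params, 2), r / 2)

seq = np.array([0.0, 1.0, 2.0, 3.0])
print(len(time_stretch(seq, 0.5)))  # 2
print(len(time_stretch(seq, 2)))    # 7
```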
207. Splicing the acoustic parameters after the speed change in series to obtain the target acoustic parameters, where the target acoustic parameters include the fundamental frequency after the speed change, the spectral envelope after the speed change and the non-periodic sequence after the speed change.
It should be noted that conventional speech speed change algorithms, such as WSOLA and the phase vocoder, perform time-scale modification on the audio itself; their basic idea is to frame the waveform, adjust the frame shift, and then overlap-add the frames again. However, the transition at the overlapped portions of the waveform is unnatural, so the acoustic parameters cannot be extracted properly during subsequent pitch shifting. Therefore, the acoustic parameters after the speed change are spliced in series here, which avoids overlap-add operations at the waveform level and thus the low accuracy of acoustic parameter extraction caused by waveform damage; the object of the variable speed algorithm is the acoustic parameters themselves.
208. Performing formant enhancement processing on the spectral envelope after the speed change to obtain the enhanced spectral envelope.
Specifically, the server searches the spectral envelope after the speed change for the formant in the band around 3 kHz and records its center frequency and amplitude; the server determines the intensity coefficient of the boost filter and the center frequency to be enhanced from the formant's center frequency and amplitude; the server performs formant enhancement according to the intensity coefficient of the boost filter and the center frequency to be enhanced to obtain a formant-enhanced spectrum; and the server filters the formant-enhanced spectrum to obtain the enhanced spectral envelope.
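A minimal sketch of boosting the "singing formant" band. The Gaussian-shaped boost, the 3 kHz center, bandwidth and gain below are illustrative assumptions standing in for the patent's boost filter:

```python
import numpy as np

def enhance_singing_formant(envelope: np.ndarray, sr: int = 16000,
                            center_hz: float = 3000.0, width_hz: float = 400.0,
                            gain: float = 1.5) -> np.ndarray:
    """Boost a spectral envelope around ~3 kHz with a Gaussian-shaped gain curve."""
    n_bins = len(envelope)
    freqs = np.linspace(0, sr / 2, n_bins)
    boost = 1.0 + (gain - 1.0) * np.exp(-0.5 * ((freqs - center_hz) / width_hz) ** 2)
    return envelope * boost

flat = np.ones(513)                       # a flat envelope, for illustration
out = enhance_singing_formant(flat)
peak_bin = int(np.argmax(out))
print(round(peak_bin * 8000 / 512))       # 3000 (the boosted band's center)
```

A real implementation would locate the actual formant peak first and set the center and intensity from it, as the step above describes.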
It will be appreciated that the spectral envelope is the curve formed by joining the amplitude peaks at different frequencies. The spectrum is a collection of many different frequencies spanning a wide frequency range, and their amplitudes may differ.
209. Performing correction processing based on the pitch information, the singing durations and the fundamental frequency after the speed change to obtain the corrected fundamental frequency.
Specifically, the server generates the fundamental frequency of the song based on the pitch information, the singing durations and the fundamental frequency after the speed change. The server averages the fundamental frequency in the initial acoustic parameters to obtain the average fundamental frequency, and shifts the song's fundamental frequency up or down based on this average to obtain an initial fundamental frequency sequence, where the initial fundamental frequency sequence contains pitches and notes. It should be explained that, in Mandarin, sounds can be divided into unvoiced and voiced according to whether the vocal cords vibrate during articulation. The vocal cord vibration frequency directly determines the fundamental frequency during pronunciation, and no vocal cord vibration means no fundamental frequency, i.e., a fundamental frequency of 0; therefore, when generating the new fundamental frequency sequence, the fundamental frequency of unvoiced initials needs to be set to 0. When the server detects that the same character corresponds to different pitches in the initial fundamental frequency sequence, it smooths the notes concerned; when pitch changes between adjacent notes are detected in the initial fundamental frequency sequence, the server applies preparation and overshoot processing to the adjacent notes through a preset formula, a second-order system of the form

H(s) = k·ω² / (s² + 2ξωs + ω²)

where s denotes the (Laplace-domain) input built from the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain. When the server detects that the duration of a note in the initial fundamental frequency sequence is greater than a preset threshold, it adds a vibrato to the portion of the initial fundamental frequency sequence corresponding to that note; and when the initial fundamental frequency sequence is overly smooth, the server adds white noise to it, finally obtaining the corrected fundamental frequency sequence.
It should be noted that one character often corresponds to several notes of different pitches in a song; in this case the server adds a smooth glide between two notes, which better matches how real singers sing and improves the naturalness of the result. For example, in the lyrics of "My Motherland", the character "wave" corresponds to 4 different pitches; the pitch transitions before smoothing are abrupt, while after smoothing they are smoother and closer to a real person's singing.
It is understood that vibrato is a common singing technique, occurring mainly in sustained notes and appearing as small, sine-like tremors in the fundamental frequency. If the duration of a note exceeds a preset threshold x, a vibrato is added to the initial fundamental frequency sequence of that note. Further, three parameters are considered when adding vibrato: the vibrato adding point a, between 0 and 1, which indicates from which moment of the note the vibrato starts; the vibrato amplitude extend; and the vibrato frequency rate. The values of x, a, extend and rate vary with the singing style: for example, in pop singing, x and a are larger and extend and rate are smaller than in bel canto. For example, after vibrato is added to the character "wide" in the lyrics of "My Motherland", it reflects the way vibrato is added in pop singing.
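Adding a sine-shaped vibrato to a sustained note's F0 contour, with the adding point a, depth (extend) and rate parameters described above, can be sketched as follows; the parameter values are illustrative, since the patent chooses them per singing style:

```python
import numpy as np

def add_vibrato(f0: np.ndarray, frame_s: float = 0.005, a: float = 0.5,
                extend_hz: float = 3.0, rate_hz: float = 5.5) -> np.ndarray:
    """Add sine-shaped vibrato to a note's F0 contour from point a (0..1) onward."""
    out = f0.copy()
    start = int(a * len(f0))                       # vibrato adding point
    t = np.arange(len(f0) - start) * frame_s
    out[start:] += extend_hz * np.sin(2 * np.pi * rate_hz * t)
    return out

note = np.full(400, 220.0)                         # a 2 s sustained note at 220 Hz
with_vib = add_vibrato(note)
print(with_vib[:200].max() == with_vib[:200].min())  # True: flat before the adding point
```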
210. Performing song synthesis processing on the non-periodic sequence after the speed change, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.
That is, the server finally inputs three acoustic features, namely the corrected fundamental frequency, the enhanced spectral envelope and the variable-speed non-periodic sequence, into the vocoder WORLD, which synthesizes and outputs the synthesized song; the synthesized song is a waveform signal.
It should be noted that the vocoder WORLD converts the parameters into audio close to human pronunciation based on the human speech spectrum: WORLD treats each pinyin syllable as a sequence, predicts the spectrum of each segment of speech to be synthesized from the non-periodic sequence after the speed change, the enhanced spectral envelope and the corrected fundamental frequency, and then converts the predicted speech spectrum into the singing waveform.
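The actual synthesis step is done by WORLD in the patent. As a toy illustration of the parameters-to-waveform direction only (and not WORLD's algorithm), a single sinusoid can be driven by a frame-level F0 contour, with unvoiced frames (F0 = 0) rendered as silence; WORLD additionally shapes harmonics with SP and mixes noise with AP:

```python
import numpy as np

def sinusoidal_synth(f0: np.ndarray, frame_s: float = 0.005, sr: int = 16000) -> np.ndarray:
    """Toy resynthesis: one sinusoid following the frame-level F0 contour."""
    samples_per_frame = int(frame_s * sr)
    inst_f0 = np.repeat(f0, samples_per_frame)       # sample-level F0
    phase = 2 * np.pi * np.cumsum(inst_f0) / sr      # integrate frequency to phase
    return np.where(inst_f0 > 0, np.sin(phase), 0.0)  # F0 = 0 (unvoiced) -> silence

f0 = np.array([220.0] * 100 + [0.0] * 50 + [247.0] * 100)  # voiced-unvoiced-voiced
y = sinusoidal_synth(f0)
print(len(y) == 250 * 80)  # True: 250 frames x 80 samples each
```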
In the embodiment of the invention, acoustic parameters are extracted from the lyric recitation audio, and variable speed processing and splicing are performed at the acoustic parameter level through the vocoder based on the music score information, converting speaking voice into singing voice. Song synthesis is achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; at the same time, songs can be synthesized without collecting a large amount of singing data, reducing the data collection cost of song synthesis.
With reference to fig. 3, the song synthesizing method in the embodiment of the present invention is described above, and a song synthesizing apparatus in the embodiment of the present invention is described below, where an embodiment of the song synthesizing apparatus in the embodiment of the present invention includes:
the acquisition module 301 is configured to acquire lyric reciting audio and music score information of a target song, where the music score information includes a lyric pinyin text, beat information, rhythm information, and pitch information;
the labeling module 302 is configured to perform time length labeling on a phoneme in the lyric recitation audio through a preset voice recognition model and a lyric pinyin text to obtain a recitation time length of the phoneme, where the recitation time length of the phoneme includes an initial recitation time length and a final recitation time length;
the analysis module 303 is configured to analyze the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to phonemes, where the initial acoustic parameters include a fundamental frequency, a spectral envelope, and a non-periodic sequence;
the extracting module 304 is configured to extract a singing duration of a phoneme from the lyric pinyin text according to a preset initial variable speed dictionary, rhythm information and tempo information, where the singing duration of the phoneme includes an initial singing duration and a final singing duration;
a speed change module 305, configured to perform speed change processing on the initial acoustic parameter according to a preset speed change algorithm, the reciting duration and the singing duration to obtain a target acoustic parameter, where the target acoustic parameter includes a base frequency after speed change, a frequency spectrum envelope after speed change and a non-periodic sequence after speed change;
an enhancement module 306, configured to perform formant enhancement processing on the frequency spectrum envelope after the speed change, to obtain an enhanced frequency spectrum envelope;
a correction module 307, configured to perform correction processing based on the pitch information, the singing duration, and the variable-speed fundamental frequency to obtain a corrected fundamental frequency;
and a synthesis module 308, configured to perform song synthesis processing on the variable-speed non-periodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through a preset vocoder, so as to obtain a synthesized song.
It is emphasized that, to further ensure the privacy and security of the synthesized song, the synthesized song may also be stored in a node of a blockchain.
In the embodiment of the invention, acoustic parameters are extracted from the lyric recitation audio, and variable speed processing and splicing are performed at the acoustic parameter level through the vocoder based on the music score information, converting speaking voice into singing voice. Song synthesis is achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; at the same time, songs can be synthesized without collecting a large amount of singing data, reducing the data collection cost of song synthesis.
Referring to fig. 4, another embodiment of the song synthesizing apparatus according to the embodiment of the present invention includes:
the acquisition module 301 is configured to acquire lyric reciting audio and music score information of a target song, where the music score information includes a lyric pinyin text, beat information, rhythm information, and pitch information;
the labeling module 302 is configured to perform time length labeling on a phoneme in the lyric recitation audio through a preset voice recognition model and a lyric pinyin text to obtain a recitation time length of the phoneme, where the recitation time length of the phoneme includes an initial recitation time length and a final recitation time length;
the analysis module 303 is configured to analyze the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to phonemes, where the initial acoustic parameters include a fundamental frequency, a spectral envelope, and a non-periodic sequence;
the extracting module 304 is configured to extract a singing duration of a phoneme from the lyric pinyin text according to a preset initial variable speed dictionary, rhythm information and tempo information, where the singing duration of the phoneme includes an initial singing duration and a final singing duration;
a speed change module 305, configured to perform speed change processing on the initial acoustic parameter according to a preset speed change algorithm, the reciting duration and the singing duration to obtain a target acoustic parameter, where the target acoustic parameter includes a base frequency after speed change, a frequency spectrum envelope after speed change and a non-periodic sequence after speed change;
an enhancement module 306, configured to perform formant enhancement processing on the frequency spectrum envelope after the speed change, to obtain an enhanced frequency spectrum envelope;
a correction module 307, configured to perform correction processing based on the pitch information, the singing duration, and the variable-speed fundamental frequency to obtain a corrected fundamental frequency;
and a synthesis module 308, configured to perform song synthesis processing on the variable-speed non-periodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through a preset vocoder, so as to obtain a synthesized song.
Optionally, the labeling module 302 may be further specifically configured to:
analyzing the music score information, and reading a lyric pinyin text from the analyzed music score information;
inputting the lyric recitation audio and the lyric pinyin text into a preset voice recognition model, and performing voice analysis on the lyric recitation audio through the preset voice recognition model;
marking phonemes in the lyric reciting audio after the voice analysis according to a lyric pinyin text through a preset voice recognition model to obtain a time stamp and a duration of the phonemes, wherein the phonemes comprise initials and finals;
and determining the recitation duration of the phoneme according to the time stamp and the duration of the phoneme, wherein the recitation duration of the phoneme comprises the initial recitation duration and the final recitation duration.
Optionally, the extracting module 304 may be further specifically configured to:
extracting singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information;
querying the preset initial consonant variable speed dictionary according to the singing duration t of each character to obtain the singing duration t1 of the initial consonant of each character;
subtracting the singing duration t1 of the initial consonant from the singing duration t of each character to obtain the singing duration t2 of the final of each character, wherein t2 = t - t1;
and setting the singing duration of the initial of each character and the singing duration of the final of each character as the singing durations of the phonemes corresponding to each character.
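The split described above (t1 looked up per initial, t2 = t - t1 for the final) can be sketched as follows. The dictionary contents and the clamping rule are illustrative assumptions; the real table is the patent's preset initial consonant variable speed dictionary:

```python
# Illustrative sketch of the t / t1 / t2 split. The per-initial
# durations are made-up values, not from the patent.

INITIAL_DURATION = {"n": 0.09, "h": 0.07, "zh": 0.11}  # seconds, assumed

def split_singing_duration(initial, t):
    """initial: the character's initial consonant ('' for zero-initial
    syllables); t: total singing duration of the character (seconds).
    Returns (t1, t2) with t2 = t - t1."""
    t1 = INITIAL_DURATION.get(initial, 0.0)
    t1 = min(t1, t)        # never let the initial exceed the character
    return t1, t - t1

t1, t2 = split_singing_duration("n", 0.50)  # character sung for 0.5 s
```

Keeping t1 roughly constant while t2 absorbs the note length reflects the intuition that sung notes are lengthened mostly on the vowel, not the consonant.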
Optionally, the speed changing module 305 includes:
the calculating unit 3051, configured to calculate a speed change magnification r of the phoneme according to the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, where r is greater than 0;
the speed change unit 3052, configured to perform speed change processing on the initial acoustic parameters according to a preset speed change algorithm and the speed change magnification r to obtain acoustic parameters after the speed change;
and the splicing unit 3053 is configured to splice the acoustic parameters after the speed change in series to obtain target acoustic parameters, where the target acoustic parameters include a fundamental frequency after the speed change, a frequency spectrum envelope after the speed change, and a non-periodic sequence after the speed change.
Optionally, the speed change unit 3052 may be further specifically configured to:
when r is equal to 1, determine the initial acoustic parameters as the acoustic parameters after the speed change;
when r is equal to 2, perform two-fold extension processing on the initial acoustic parameters to obtain the acoustic parameters after the speed change;
when r is smaller than 2 and not equal to 1, perform speed change processing on the initial acoustic parameters by using a preset proportional frame addition and deletion algorithm to obtain the acoustic parameters after the speed change;
and when r is larger than 2, extend the initial acoustic parameters by more than two times to obtain the acoustic parameters after the speed change.
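A minimal sketch of the per-phoneme speed change, under two assumptions: that r is the ratio of singing duration to recitation duration (the patent does not give the formula explicitly), and that a single nearest-neighbour frame resampler is an acceptable stand-in for the per-case rules above (r == 1 copies, r == 2 repeats every frame twice, other ratios insert or drop frames proportionally):

```python
import numpy as np

def speed_change_ratio(recite_dur, sing_dur):
    """Assumed definition: the factor that maps the recited length
    of a phoneme onto its singing length."""
    return sing_dur / recite_dur

def change_speed(frames, r):
    """frames: (T, D) per-frame acoustic parameters (f0 / spectral
    envelope / aperiodicity rows). Nearest-neighbour resampling that
    simplifies the patent's case-by-case rules into one mapping:
    output frame i is taken from input frame floor(i / r)."""
    frames = np.asarray(frames)
    if r == 1:
        return frames.copy()
    n = len(frames)
    m = max(1, int(round(n * r)))                 # output length
    idx = np.minimum((np.arange(m) / r).astype(int), n - 1)
    return frames[idx]

stretched = change_speed(np.arange(5, dtype=float)[:, None], 2.0)
```

The per-phoneme outputs would then be concatenated in order, matching the splicing unit's series concatenation.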
Optionally, the enhancing module 306 may be further specifically configured to:
searching the variable-speed spectral envelope for a formant in the frequency band around 3 kilohertz, and recording the center frequency and the amplitude of the formant;
determining the intensity coefficient of the boosting filter and the central frequency to be enhanced according to the central frequency and the amplitude of the formants;
carrying out formant enhancement according to the intensity coefficient of the boost filter and the central frequency to be enhanced to obtain a formant enhancement spectrum;
and filtering the formant enhancement spectrum to obtain an enhanced spectrum envelope.
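The enhancement step can be sketched as below. The search bandwidth, the Gaussian boost shape and the strength value are assumptions; the source only specifies locating a formant around 3 kHz and applying a boosting filter with an intensity coefficient at its center frequency:

```python
import numpy as np

def enhance_formant(envelope, freqs, centre_hz=3000.0, search_bw=1000.0,
                    strength=1.5, boost_bw=200.0):
    """envelope: linear-amplitude spectral envelope sampled at `freqs`
    (Hz). Finds the strongest peak near centre_hz and multiplies in a
    Gaussian-shaped gain centred on it. Returns the boosted envelope
    plus the recorded centre frequency and amplitude."""
    band = np.flatnonzero((freqs > centre_hz - search_bw)
                          & (freqs < centre_hz + search_bw))
    peak = band[np.argmax(envelope[band])]
    fc, amp = freqs[peak], envelope[peak]     # centre frequency, amplitude
    gain = 1.0 + (strength - 1.0) * np.exp(-0.5 * ((freqs - fc) / boost_bw) ** 2)
    return envelope * gain, fc, amp

freqs = np.linspace(0.0, 8000.0, 801)         # 10 Hz grid
envelope = 0.1 + np.exp(-0.5 * ((freqs - 3000.0) / 300.0) ** 2)
enhanced, fc, amp = enhance_formant(envelope, freqs)
```

Boosting the band near 3 kHz approximates the "singer's formant" that distinguishes sung from spoken voice, which is presumably why this frequency region is singled out.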
Optionally, the correction module 307 may be further specifically configured to:
generating a fundamental frequency of the song based on the pitch information, the singing duration and the variable-speed fundamental frequency;
superposing the fundamental frequencies in the initial acoustic parameters and calculating an average value to obtain an average fundamental frequency;
performing tone-up or tone-down processing on the fundamental frequency of the song based on the average fundamental frequency to obtain an initial fundamental frequency sequence, wherein the initial fundamental frequency sequence comprises a pitch and notes;
when detecting that the same character corresponds to different pitches in the initial fundamental frequency sequence, smoothing notes corresponding to the same pitch;
when a pitch change between adjacent notes in the initial fundamental frequency sequence is detected, generating preparation and overshoot between the adjacent notes in the initial fundamental frequency sequence through a preset formula
kω²/(s² + 2ξωs + ω²)
wherein s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain;
when detecting that the duration of a note in the initial fundamental frequency sequence is greater than a preset threshold value, adding vibrato to the initial fundamental frequency sequence corresponding to the note;
and when detecting that the notes in the initial fundamental frequency sequence are excessively smooth, adding white noise to the initial fundamental frequency sequence to obtain a corrected fundamental frequency sequence.
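The correction chain above can be sketched on a frame-rate F0 contour. The second-order system kω²/(s² + 2ξωs + ω²) is discretised with semi-implicit Euler integration; all numeric constants are assumptions, and for brevity the vibrato here is applied globally rather than only to notes exceeding the duration threshold:

```python
import numpy as np

def correct_f0(f0_target, frame_rate=200.0, omega=40.0, xi=0.8, k=1.0,
               vibrato_hz=5.5, vibrato_cents=30.0, noise_cents=3.0, seed=0):
    """f0_target: per-frame target fundamental frequency (Hz).
    (1) Track the contour through a second-order damped system, which
        produces the preparation/overshoot around note boundaries.
    (2) Superimpose vibrato and a little white noise, expressed in
        cents so the modulation depth scales with pitch."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / frame_rate
    y, v = f0_target[0], 0.0
    tracked = np.empty_like(f0_target)
    for i, target in enumerate(f0_target):
        # acceleration of the damped oscillator driven toward `target`
        a = k * omega ** 2 * (target - y) - 2.0 * xi * omega * v
        v += a * dt
        y += v * dt
        tracked[i] = y
    t = np.arange(len(f0_target)) * dt
    cents = vibrato_cents * np.sin(2.0 * np.pi * vibrato_hz * t)
    cents += rng.normal(0.0, noise_cents, len(f0_target))
    return tracked * 2.0 ** (cents / 1200.0)

# two notes: A3 (220 Hz) then C#4 (277 Hz), one second each at 200 fps
contour = np.concatenate([np.full(200, 220.0), np.full(200, 277.0)])
corrected = correct_f0(contour)
```

With ξ = 0.8 the step between the two notes settles with a small overshoot rather than an instantaneous jump, which is the perceptual cue the preparation/overshoot model is after.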
In the embodiment of the invention, acoustic parameters are analyzed from the lyric recitation audio, and speed change and splicing are performed at the acoustic-parameter level through the vocoder based on the music score information, so that speaking voice is converted into singing voice. Song synthesis is thus achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; meanwhile, songs can be synthesized without collecting a large amount of singing data, which reduces the data collection cost of song synthesis.
Fig. 3 and Fig. 4 above describe the song synthesizing apparatus in the embodiment of the present invention in detail from the perspective of the modular functional entity; the song synthesizing apparatus in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a song composition apparatus 500 according to an embodiment of the present invention. The song composition apparatus 500 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing an application 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the song composition apparatus 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 to execute the series of instruction operations in the storage medium 530 on the song composition apparatus 500.
The song composition apparatus 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the song composition apparatus configuration shown in Fig. 5 does not constitute a limitation on the song composition apparatus, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, having stored therein instructions that, when run on a computer, cause the computer to perform the steps of the song synthesizing method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated using cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A song synthesizing method, characterized in that the song synthesizing method comprises:
acquiring lyric recitation audio frequency and music score information of a target song, wherein the music score information comprises lyric pinyin text, beat information, rhythm information and pitch information;
performing duration labeling on a phoneme in the lyric recitation audio through a preset voice recognition model and the lyric pinyin text to obtain the recitation duration of the phoneme, wherein the recitation duration of the phoneme comprises initial recitation duration and final recitation duration;
analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, wherein the initial acoustic parameters comprise a fundamental frequency, a frequency spectrum envelope and a non-periodic sequence;
extracting the singing duration of the phoneme from the lyric pinyin text according to a preset initial variable speed dictionary, the rhythm information and the beat information, wherein the singing duration of the phoneme comprises an initial singing duration and a final singing duration;
performing variable speed processing on the initial acoustic parameters according to a preset variable speed algorithm, the recitation duration and the singing duration to obtain target acoustic parameters, wherein the target acoustic parameters comprise variable-speed fundamental frequency, variable-speed frequency spectrum envelope and variable-speed aperiodic sequence;
carrying out formant enhancement processing on the frequency spectrum envelope after the speed change to obtain an enhanced frequency spectrum envelope;
correcting the pitch information, the singing duration and the variable-speed fundamental frequency to obtain a corrected fundamental frequency;
and performing song synthesis processing on the variable-speed non-periodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
2. The song synthesis method of claim 1, wherein the time duration labeling of the phoneme in the lyric recitation audio through the preset speech recognition model and the lyric pinyin text results in the recited time duration of the phoneme, the recited time duration of the phoneme including the initial recited time duration and the final recited time duration, comprising:
analyzing the music score information, and reading the lyric pinyin text from the analyzed music score information;
inputting the lyric recitation audio and the lyric pinyin text into a preset voice recognition model, and carrying out voice analysis on the lyric recitation audio through the preset voice recognition model;
marking the phonemes in the analyzed lyric recitation audio according to the lyric pinyin text through the preset voice recognition model to obtain the time stamp and duration of each phoneme, wherein the phonemes comprise the initials and the finals;
and determining the recitation duration of the phoneme according to the time stamp and the duration of the phoneme, wherein the recitation duration of the phoneme comprises the initial recitation duration and the final recitation duration.
3. The song synthesizing method according to claim 1, wherein the extracting of the singing duration of the phoneme from the lyric pinyin text based on a preset initial variable speed dictionary, the rhythm information and the beat information, the singing duration of the phoneme including an initial singing duration and a final singing duration, comprises:
extracting singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information;
querying a preset initial consonant variable speed dictionary according to the singing duration t of each character to obtain the singing duration t1 of the initial consonant of each character;
subtracting the singing duration t1 of the initial consonant from the singing duration t of each character to obtain the singing duration t2 of the final of each character, wherein t2 = t - t1;
and setting the singing duration of the initial of each character and the singing duration of the final of each character as the singing durations of the phonemes corresponding to each character.
4. The song synthesizing method of claim 1, wherein the variable-speed processing of the initial acoustic parameters according to a preset variable-speed algorithm, the recited time length and the song singing time length to obtain target acoustic parameters, wherein the target acoustic parameters include a variable-speed fundamental frequency, a variable-speed spectral envelope and a variable-speed aperiodic sequence, and comprises:
calculating the speed change multiplying power r of the phoneme according to the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, wherein the r is greater than 0;
carrying out speed change processing on the initial acoustic parameters according to the speed change multiplying power r through a preset speed change algorithm to obtain acoustic parameters after speed change;
and performing series splicing on the acoustic parameters after the speed change to obtain target acoustic parameters, wherein the target acoustic parameters comprise the fundamental frequency after the speed change, the frequency spectrum envelope after the speed change and the non-periodic sequence after the speed change.
5. The song synthesizing method according to claim 4, wherein the obtaining of the acoustic parameters after speed change by performing speed change processing on the initial acoustic parameters according to the speed change multiplying factor r through a preset speed change algorithm comprises:
when the r is equal to 1, determining the initial acoustic parameter as the acoustic parameter after the speed change;
when the r is equal to 2, performing twice lengthening processing on the initial acoustic parameter to obtain the acoustic parameter after speed change;
when the r is smaller than 2 and is not equal to 1, carrying out variable speed processing on the initial acoustic parameter by adopting a preset equal ratio addition and subtraction frame algorithm to obtain an acoustic parameter after variable speed;
and when the r is more than 2, prolonging the initial acoustic parameters by more than two times to obtain the acoustic parameters after speed change.
6. The song synthesizing method according to claim 1, wherein the performing a formant enhancement process on the varied-speed spectral envelope to obtain an enhanced spectral envelope comprises:
searching the variable-speed spectral envelope for a formant in the frequency band around 3 kilohertz, and recording the center frequency and the amplitude of the formant;
determining the intensity coefficient of a boosting filter and the central frequency to be enhanced according to the central frequency and the amplitude of the formants;
carrying out formant enhancement according to the intensity coefficient of the boost filter and the central frequency to be enhanced to obtain a formant enhancement spectrum;
and carrying out filtering processing on the formant enhanced spectrum to obtain an enhanced spectrum envelope.
7. The song synthesizing method according to claim 1 or 3, wherein the synthesized song is stored in a blockchain, and the performing of correction processing based on the pitch information, the singing duration and the variable-speed fundamental frequency to obtain a corrected fundamental frequency comprises:
generating a fundamental frequency of a song based on the pitch information, the singing duration and the varied fundamental frequency;
superposing the fundamental frequencies in the initial acoustic parameters and calculating an average value to obtain an average fundamental frequency;
performing tone-up or tone-down processing on the fundamental frequency of the song based on the average fundamental frequency to obtain an initial fundamental frequency sequence, wherein the initial fundamental frequency sequence comprises a pitch and notes;
when detecting that the same character corresponds to different pitches in the initial fundamental frequency sequence, smoothing notes corresponding to the same pitch;
when detecting that a pitch change exists between adjacent notes in the initial fundamental frequency sequence, generating preparation and overshoot between the adjacent notes through a preset formula
kω²/(s² + 2ξωs + ω²)
wherein s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain;
when detecting that the preset duration of the notes in the initial fundamental frequency sequence is greater than a preset threshold value, adding vibrato to the initial fundamental frequency sequence corresponding to the notes;
and when detecting that the notes in the initial fundamental frequency sequence are excessively smooth, adding white noise to the initial fundamental frequency sequence to obtain the corrected fundamental frequency.
8. A song synthesizing apparatus, characterized in that the song synthesizing apparatus comprises:
the acquisition module is used for acquiring lyric recitation audio frequency and music score information of a target song, wherein the music score information comprises a lyric pinyin text, beat information, rhythm information and pitch information;
the marking module is used for marking the time length of a phoneme in the lyric recitation audio through a preset voice recognition model and the lyric pinyin text to obtain the recitation time length of the phoneme, wherein the recitation time length of the phoneme comprises an initial recitation time length and a final recitation time length;
the analysis module is used for analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, wherein the initial acoustic parameters comprise a fundamental frequency, a frequency spectrum envelope and a non-periodic sequence;
the extraction module is used for extracting the singing duration of the phoneme from the lyric pinyin text according to a preset initial consonant variable speed dictionary, the rhythm information and the beat information, wherein the singing duration of the phoneme comprises an initial consonant singing duration and a vowel singing duration;
the speed change module is used for carrying out speed change processing on the initial acoustic parameters according to a preset speed change algorithm, the recitation duration and the singing duration to obtain target acoustic parameters, and the target acoustic parameters comprise a base frequency after speed change, a frequency spectrum envelope after speed change and a non-periodic sequence after speed change;
the enhancement module is used for carrying out formant enhancement processing on the frequency spectrum envelope after the speed change to obtain an enhanced frequency spectrum envelope;
the correcting module is used for correcting and processing the pitch information, the singing duration and the variable-speed fundamental frequency to obtain a corrected fundamental frequency;
and the synthesis module is used for carrying out song synthesis processing on the variable-speed non-periodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
9. A song synthesizing apparatus characterized by comprising: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the song synthesis apparatus to perform the song synthesis method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a song composition method according to any one of claims 1 to 7.
CN202010350256.0A 2020-04-28 2020-04-28 Song synthesis method, device, equipment and storage medium Active CN111681637B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010350256.0A CN111681637B (en) 2020-04-28 2020-04-28 Song synthesis method, device, equipment and storage medium
PCT/CN2020/131663 WO2021218138A1 (en) 2020-04-28 2020-11-26 Song synthesis method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010350256.0A CN111681637B (en) 2020-04-28 2020-04-28 Song synthesis method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111681637A true CN111681637A (en) 2020-09-18
CN111681637B CN111681637B (en) 2024-03-22

Family

ID=72452279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010350256.0A Active CN111681637B (en) 2020-04-28 2020-04-28 Song synthesis method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111681637B (en)
WO (1) WO2021218138A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164387A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN112231512A (en) * 2020-10-20 2021-01-15 标贝(北京)科技有限公司 Song annotation detection method, device and system and storage medium
CN112289300A (en) * 2020-10-28 2021-01-29 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, electronic equipment and computer readable storage medium
CN112309410A (en) * 2020-10-30 2021-02-02 北京有竹居网络技术有限公司 Song sound repairing method and device, electronic equipment and storage medium
CN112542155A (en) * 2020-11-27 2021-03-23 北京百度网讯科技有限公司 Song synthesis method, model training method, device, equipment and storage medium
CN112562633A (en) * 2020-11-30 2021-03-26 北京有竹居网络技术有限公司 Singing synthesis method and device, electronic equipment and storage medium
CN112700762A (en) * 2020-12-23 2021-04-23 武汉理工大学 Automobile sound synthesis method and device based on cylinder pressure signal
CN112750422A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750420A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN113066459A (en) * 2021-03-24 2021-07-02 平安科技(深圳)有限公司 Melody-based song information synthesis method, melody-based song information synthesis device, melody-based song information synthesis equipment and storage medium
CN113140204A (en) * 2021-04-23 2021-07-20 中国搜索信息科技股份有限公司 Digital music synthesis method and equipment for pulsar signal control
CN113160849A (en) * 2021-03-03 2021-07-23 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesis method and device, electronic equipment and computer readable storage medium
CN113223486A (en) * 2021-04-29 2021-08-06 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN113257222A (en) * 2021-04-13 2021-08-13 腾讯音乐娱乐科技(深圳)有限公司 Method, terminal and storage medium for synthesizing song audio
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113421589A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN113488007A (en) * 2021-07-07 2021-10-08 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
WO2021218138A1 (en) * 2020-04-28 2021-11-04 平安科技(深圳)有限公司 Song synthesis method, apparatus and device, and storage medium
CN113781993A (en) * 2021-01-20 2021-12-10 北京沃东天骏信息技术有限公司 Method and device for synthesizing customized tone singing voice, electronic equipment and storage medium
CN113160849B (en) * 2021-03-03 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114446268B (en) * 2022-01-28 2023-04-28 北京百度网讯科技有限公司 Audio data processing method, device, electronic equipment, medium and program product

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1162167A (en) * 1996-01-18 1997-10-15 雅马哈株式会社 Formant conversion device for correcting singing sound for imitating standard sound
JP2007240564A (en) * 2006-03-04 2007-09-20 Yamaha Corp Singing synthesis device and program
CN101399036A (en) * 2007-09-30 2009-04-01 三星电子株式会社 Device and method for conversing voice to be rap music
CN103440862A (en) * 2013-08-16 2013-12-11 北京奇艺世纪科技有限公司 Method, device and equipment for synthesizing voice and music
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106373580A (en) * 2016-09-05 2017-02-01 北京百度网讯科技有限公司 Singing synthesis method based on artificial intelligence and device
CN108053814A (en) * 2017-11-06 2018-05-18 芋头科技(杭州)有限公司 A kind of speech synthesis system and method for analog subscriber song
CN109147757A (en) * 2018-09-11 2019-01-04 广州酷狗计算机科技有限公司 Song synthetic method and device
CN110164460A (en) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Sing synthetic method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6304846B1 (en) * 1997-10-22 2001-10-16 Texas Instruments Incorporated Singing voice synthesis
KR100658869B1 (en) * 2005-12-21 2006-12-15 엘지전자 주식회사 Music generating device and operating method thereof
CN103295574B (en) * 2012-03-02 2018-09-18 上海果壳电子有限公司 Singing speech apparatus and its method
CN111681637B (en) * 2020-04-28 2024-03-22 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021218138A1 (en) * 2020-04-28 2021-11-04 平安科技(深圳)有限公司 Song synthesis method, apparatus and device, and storage medium
CN112164387A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN112231512A (en) * 2020-10-20 2021-01-15 标贝(北京)科技有限公司 Song annotation detection method, device and system and storage medium
CN112231512B (en) * 2020-10-20 2023-11-14 标贝(青岛)科技有限公司 Song annotation detection method, device and system and storage medium
CN112289300A (en) * 2020-10-28 2021-01-29 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, electronic equipment and computer readable storage medium
CN112289300B (en) * 2020-10-28 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method and device, electronic equipment and computer readable storage medium
CN112309410A (en) * 2020-10-30 2021-02-02 北京有竹居网络技术有限公司 Song sound repairing method and device, electronic equipment and storage medium
CN112542155B (en) * 2020-11-27 2021-09-21 北京百度网讯科技有限公司 Song synthesis method, model training method, device, equipment and storage medium
CN112542155A (en) * 2020-11-27 2021-03-23 北京百度网讯科技有限公司 Song synthesis method, model training method, device, equipment and storage medium
CN112562633A (en) * 2020-11-30 2021-03-26 北京有竹居网络技术有限公司 Singing synthesis method and device, electronic equipment and storage medium
CN112750422B (en) * 2020-12-23 2023-01-31 出门问问创新科技有限公司 Singing voice synthesis method, device and equipment
CN112700762A (en) * 2020-12-23 2021-04-23 武汉理工大学 Automobile sound synthesis method and device based on cylinder pressure signal
CN112750422A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750420B (en) * 2020-12-23 2023-01-31 出门问问创新科技有限公司 Singing voice synthesis method, device and equipment
CN112750420A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750421B (en) * 2020-12-23 2022-12-30 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN112700762B (en) * 2020-12-23 2022-10-04 武汉理工大学 Automobile sound synthesis method and device based on cylinder pressure signal
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN113781993A (en) * 2021-01-20 2021-12-10 北京沃东天骏信息技术有限公司 Method and device for synthesizing customized tone singing voice, electronic equipment and storage medium
CN113160849A (en) * 2021-03-03 2021-07-23 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesis method and device, electronic equipment and computer readable storage medium
CN113160849B (en) * 2021-03-03 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, electronic equipment and computer readable storage medium
CN113066459B (en) * 2021-03-24 2023-05-30 平安科技(深圳)有限公司 Song information synthesis method, device, equipment and storage medium based on melody
CN113066459A (en) * 2021-03-24 2021-07-02 平安科技(深圳)有限公司 Melody-based song information synthesis method, melody-based song information synthesis device, melody-based song information synthesis equipment and storage medium
CN113257222A (en) * 2021-04-13 2021-08-13 腾讯音乐娱乐科技(深圳)有限公司 Method, terminal and storage medium for synthesizing song audio
CN113140204B (en) * 2021-04-23 2021-10-15 中国搜索信息科技股份有限公司 Digital music synthesis method and equipment for pulsar signal control
CN113140204A (en) * 2021-04-23 2021-07-20 中国搜索信息科技股份有限公司 Digital music synthesis method and equipment for pulsar signal control
CN113223486A (en) * 2021-04-29 2021-08-06 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113327586B (en) * 2021-06-01 2023-11-28 深圳市北科瑞声科技股份有限公司 Voice recognition method, device, electronic equipment and storage medium
CN113421589A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN113421589B (en) * 2021-06-30 2024-03-01 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN113488007A (en) * 2021-07-07 2021-10-08 北京灵动音科技有限公司 Information processing method, information processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111681637B (en) 2024-03-22
WO2021218138A1 (en) 2021-11-04

Similar Documents

Publication Publication Date Title
CN111681637B (en) Song synthesis method, device, equipment and storage medium
US5642470A (en) Singing voice synthesizing device for synthesizing natural chorus voices by modulating synthesized voice with fluctuation and emphasis
Saitou et al. Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices
Bonada et al. Expressive singing synthesis based on unit selection for the singing synthesis challenge 2016
EP1872361A1 (en) Hybrid speech synthesizer, method and use
EP1701336B1 (en) Sound processing apparatus and method, and program therefor
CN112382257B (en) Audio processing method, device, equipment and medium
CN112331222A (en) Method, system, equipment and storage medium for converting song tone
CN103915093A (en) Method and device for realizing voice singing
US6944589B2 (en) Voice analyzing and synthesizing apparatus and method, and program
JP2003345400A (en) Method, device, and program for pitch conversion
US20220084492A1 (en) Generative model establishment method, generative model establishment system, recording medium, and training data preparation method
Saitou et al. Analysis of acoustic features affecting "singing-ness" and its application to singing-voice synthesis from speaking-voice
JP5360489B2 (en) Phoneme code converter and speech synthesizer
JP3540159B2 (en) Voice conversion device and voice conversion method
JP6578544B1 (en) Audio processing apparatus and audio processing method
RU68691U1 (en) VOICE TRANSFORMATION SYSTEM IN THE SOUND OF MUSICAL INSTRUMENTS
CN109712634A (en) Automatic voice conversion method
JP2020204755A (en) Speech processing device and speech processing method
Pucher et al. Development of a statistical parametric synthesis system for operatic singing in German
JP4695781B2 (en) Method for encoding an acoustic signal
JP2000003200A (en) Voice signal processor and voice signal processing method
JP5810947B2 (en) Speech segment specifying device, speech parameter generating device, and program
Gu et al. Mandarin singing voice synthesis using ANN vibrato parameter models
JP3776782B2 (en) Method for encoding an acoustic signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40030819
Country of ref document: HK

GR01 Patent grant