WO2021218138A1 - Song synthesis method, apparatus, device, and storage medium - Google Patents

Song synthesis method, apparatus, device, and storage medium - Download PDF

Info

Publication number
WO2021218138A1
WO2021218138A1 (PCT/CN2020/131663)
Authority
WO
WIPO (PCT)
Prior art keywords
initial
recitation
duration
singing
preset
Application number
PCT/CN2020/131663
Other languages
English (en)
French (fr)
Inventor
朱清影
韩宝强
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021218138A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335: Pitch control
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Definitions

  • This application relates to the field of speech signal processing, and in particular to a song synthesis method, apparatus, device, and storage medium.
  • In recent decades, singing synthesis technology has gradually attracted the attention of the industry. Inspired by speech synthesis technology, singing synthesis based on waveform concatenation and parametric synthesis has gradually appeared, but most related research has focused on synthesizing singing from text or from lyrics, that is, converting text information into singing audio rather than converting speech audio directly into singing audio. Recitation-to-singing synthesis, by contrast, directly sets the natural speaking voice to a tune and converts it into a singing voice.
  • The main purpose of this application is to solve the technical problems that traditional speed- and pitch-change algorithms, being based on overlap operations at the waveform level, suffer from waveform discontinuities and unnatural transitions.
  • The first aspect of the present application provides a song synthesis method, including: obtaining the lyrics recitation audio and score information of a target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; labeling the durations of the phonemes in the lyrics recitation audio by means of a preset speech recognition model and the lyric pinyin text to obtain the recitation duration of each phoneme, the recitation duration of a phoneme including the initial (consonant) recitation duration and the final (vowel) recitation duration; analyzing the lyrics recitation audio through a preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope, and the aperiodic sequence; extracting the singing duration of each phoneme from the lyric pinyin text according to a preset initial-consonant speed-change dictionary, the rhythm information, and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration; performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations, and the singing durations to obtain target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope; performing correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain a corrected fundamental frequency; and performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
  • The second aspect of the present application provides a song synthesis device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor; when executing the computer-readable instructions, the processor implements the following steps: obtaining the lyrics recitation audio and score information of the target song; labeling the phoneme durations in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, including the initial and final recitation durations; analyzing the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters, including the fundamental frequency, the spectral envelope, and the aperiodic sequence; extracting the phoneme singing durations, including the initial and final singing durations, from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; performing speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope; performing correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency; and performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
  • The third aspect of the present application provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer executes the same steps: obtaining the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; labeling the phoneme durations in the lyrics recitation audio to obtain the recitation durations, including the initial and final recitation durations; analyzing the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters, including the fundamental frequency, the spectral envelope, and the aperiodic sequence; extracting the phoneme singing durations, including the initial and final singing durations, from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; performing speed-change processing on the initial acoustic parameters to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope; correcting the fundamental frequency based on the pitch information, the singing durations, and the speed-changed fundamental frequency; and performing song synthesis through the preset vocoder to obtain a synthesized song.
  • The fourth aspect of the present application provides a song synthesis apparatus, including: an obtaining module for obtaining the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; a labeling module for labeling the durations of the phonemes in the lyrics recitation audio by means of a preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, including the initial and final recitation durations; an analysis module for analyzing the lyrics recitation audio through a preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, including the fundamental frequency, the spectral envelope, and the aperiodic sequence; an extraction module for extracting the phoneme singing durations, including the initial and final singing durations, from the lyric pinyin text according to a preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; a speed-change module for performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; an enhancement module for performing formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope; a correction module for performing correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency; and a synthesis module for performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
  • In the technical solution provided by this application, the lyrics recitation audio and score information of the target song are obtained; the phoneme durations in the recitation audio are labeled with a preset speech recognition model and the lyric pinyin text to obtain the recitation durations; the initial acoustic parameters (fundamental frequency, spectral envelope, and aperiodic sequence) are analyzed through a preset vocoder; the phoneme singing durations are extracted from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; the initial acoustic parameters are speed-changed according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters; the speed-changed spectral envelope is formant-enhanced; the fundamental frequency is corrected based on the pitch information, the singing durations, and the speed-changed fundamental frequency; and the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency are synthesized into a song through the preset vocoder. In this way, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice. Song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
  • Figure 1 is a schematic diagram of an embodiment of the song synthesis method in an embodiment of this application;
  • Figure 2 is a schematic diagram of another embodiment of the song synthesis method in an embodiment of this application;
  • Figure 3 is a schematic diagram of an embodiment of the song synthesis apparatus in an embodiment of this application;
  • Figure 4 is a schematic diagram of another embodiment of the song synthesis apparatus in an embodiment of this application;
  • Figure 5 is a schematic diagram of an embodiment of the song synthesis device in an embodiment of this application.
  • The embodiments of this application provide a song synthesis method, apparatus, device, and storage medium, used to analyze acoustic parameters from lyrics recitation audio and, based on score information, perform speed change and splicing at the acoustic-parameter level through a vocoder, converting the speaking voice into a singing voice. Song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
  • Referring to Figure 1, an embodiment of the song synthesis method in the embodiments of the present application includes:
  • 101. Obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information.
  • It is understandable that the execution subject of this application may be a song synthesis apparatus, and may also be a terminal or a server, which is not specifically limited here. The embodiments of this application are described by taking a server as the execution subject.
  • The lyrics recitation audio and score information of the target song are stored in advance in a preset data table and associated through a unique identifier, and the lyrics are recorded in pinyin form in the score information. Specifically, the server obtains the unique identifier of the target song, generates a query statement according to structured query language grammar rules and the unique identifier, and executes the query statement to obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information. For example, 'wo de zu guo' represents 'my motherland' and is used as the lyric pinyin text; the beat information refers to the total note length of each measure in the score, such as 1/4 time and 6/8 time; the rhythm information indicates the length and strength of the notes; and the pitch information indicates how high the voice is when the target song is sung. One possible layout of this information is sketched below.
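  • The following minimal Python sketch (not taken from the patent; all field names are hypothetical) shows one way the four kinds of score information for 'wo de zu guo' might be organized:

      # Hypothetical layout of the score information; field names are illustrative.
      score_info = {
          "lyric_pinyin": ["wo", "de", "zu", "guo"],  # lyric pinyin text
          "time_signature": (6, 8),                   # beat information, e.g. 6/8 time
          "rhythm": [0.5, 0.5, 1.0, 2.0],             # note length per syllable, in beats
          "pitch": ["G4", "A4", "C5", "A4"],          # pitch information per note
      }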
  • 102. Label the durations of the phonemes in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration.
  • Each character of the lyrics in the recitation audio is represented by its pinyin, which corresponds to at least one phoneme, and the phonemes include initials and finals. The server therefore performs speech analysis on the lyrics recitation audio through the preset speech recognition model and then labels the durations of the phonemes in the audio. It is understandable that a pinyin syllable is decomposed into two phonemes, an initial and a final; for example, 'xiang' is decomposed into the two phonemes 'x' and 'iang', and the preset speech recognition model outputs the recitation durations of these two phonemes in the speech audio, that is, the initial recitation duration and the final recitation duration.
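  • A minimal sketch of this decomposition follows; the initials table is from standard Mandarin pinyin phonology rather than anything specified in the patent:

      # Split a pinyin syllable into initial and final, e.g. "xiang" -> ("x", "iang").
      INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                  "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

      def split_pinyin(syllable: str) -> tuple[str, str]:
          for ini in INITIALS:  # multi-letter initials are listed first
              if syllable.startswith(ini):
                  return ini, syllable[len(ini):]
          return "", syllable   # zero-initial syllable such as "ai"

      print(split_pinyin("xiang"))  # ('x', 'iang')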
  • 103. Analyze the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope, and the aperiodic sequence.
  • The preset vocoder includes the WORLD vocoder. Further, the server performs data processing on the duration information of the phonemes in the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes; the data processing includes filtering, standard-deviation calculation, and smoothing. The lyrics recitation audio is a signal composed of sine waves. The fundamental frequency F0 is defined for a sound signal produced by vibration: such a signal can be composed of many sine waves of different frequencies, among which the sine wave with the lowest frequency is the fundamental and the others are harmonics, that is, overtones. The spectral envelope SP is the envelope obtained by connecting the amplitude peaks at different frequencies with a smooth curve. The aperiodic sequence AP corresponds to the aperiodic pulse sequence of the mixed-excitation part, where mixed excitation refers to controlling the periodic excitation, noise, and aperiodic signals through multiple parameters.
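  • For reference, this analysis step can be reproduced with the pyworld Python binding of the WORLD vocoder, as in the sketch below; the file name and parameter choices are illustrative, not taken from the patent:

      import numpy as np
      import pyworld as pw
      import soundfile as sf

      x, fs = sf.read("lyrics_recitation.wav")  # hypothetical input file
      x = x.astype(np.float64)                  # WORLD expects float64 samples

      f0, t = pw.harvest(x, fs)                 # fundamental frequency F0 per frame
      sp = pw.cheaptrick(x, f0, t, fs)          # spectral envelope SP
      ap = pw.d4c(x, f0, t, fs)                 # aperiodic sequence AP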
  • 104. Extract the singing durations of the phonemes from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration.
  • The server derives in advance the pronunciation-duration pattern of each initial under different rhythms and different pronunciation durations, and builds the initial-consonant speed-change dictionary from these patterns, that is, the preset initial-consonant speed-change dictionary. Specifically, the server queries the preset dictionary for the initial singing duration of a phoneme and computes the final singing duration from the phoneme's singing duration and its initial singing duration, where the singing duration of a phoneme is the sum of the initial singing duration and the final singing duration.
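  • A minimal sketch of this lookup and subtraction follows; the dictionary entries are invented for illustration:

      # Hypothetical initial-consonant speed-change dictionary (durations in seconds).
      initial_singing_dict = {"x": 0.3, "g": 0.1, "zh": 0.15}

      def singing_durations(initial: str, total: float) -> tuple[float, float]:
          # total singing duration = initial singing duration + final singing duration
          t1 = initial_singing_dict.get(initial, 0.0)
          return t1, total - t1

      print(singing_durations("x", 1.0))  # (0.3, 0.7)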
  • 105. Perform speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence.
  • Because a phoneme's duration during recitation differs from its duration during singing, the phoneme durations in the recited speech can be lengthened or shortened according to the singing durations. Since the durations and proportions of the initial and the final of the same character differ across pronunciation lengths, the initial and the final must be adjusted separately. Further, the server adjusts the durations of the initial acoustic parameters to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence.
  • It is understandable that this scheme is not only simple in principle and easy to implement, but also avoids overlap operations at the waveform level, thereby avoiding the low acoustic-parameter-extraction accuracy caused by waveform damage. The object of the speed-change algorithm changes from the waveform to the acoustic parameters, unifying it with the subsequent pitch-change algorithm and effectively improving the controllability of the system.
  • 106. Perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope. A formant is a natural peak in the spectrum of a sound. Compared with speech, the spectral envelope of singing audio has a pronounced peak in the frequency band around 3 kHz; this peak is unique to singing and is therefore called the 'singing formant'. To make the converted audio more natural, a singing formant is added to the spectral envelope of the audio, that is, the amplitude of the speed-changed spectral envelope in the band around 3 kHz is boosted to obtain the enhanced spectral envelope.
  • 107. Perform correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency. To make the synthesized song fit the vocal range of the recitation as closely as possible and to reduce the timbre distortion caused by pitch shifting, the server generates the song's fundamental frequency according to the pitch information and the singing durations, and then pitch-shifts the song's fundamental frequency as a whole according to the speed-changed fundamental frequency, so that the average fundamental frequency of the song is as close as possible to the average fundamental frequency of the speech, as sketched below. It should be noted that if the synthesized singing is given an accompaniment, the accompaniment also needs corresponding pitch-change processing, where the pitch-change processing includes raising or lowering the pitch.
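  • The sketch below illustrates one way to implement the global shift so that the song's average fundamental frequency approaches the speech average; shifting by a whole number of semitones is an assumption, not specified by the patent:

      import numpy as np

      def match_average_f0(song_f0, speech_f0):
          voiced_song = song_f0[song_f0 > 0]        # F0 = 0 marks unvoiced frames
          voiced_speech = speech_f0[speech_f0 > 0]
          semitones = 12 * np.log2(voiced_speech.mean() / voiced_song.mean())
          ratio = 2.0 ** (round(semitones) / 12)    # shift by whole semitones
          return np.where(song_f0 > 0, song_f0 * ratio, 0.0)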
  • 108. Perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song. That is, the server finally inputs the three acoustic features, the corrected fundamental frequency, the enhanced spectral envelope, and the speed-changed aperiodic sequence, into the preset vocoder, and the synthesized song is obtained from the preset vocoder's synthesis output. The synthesized song is a waveform signal, consistent with the timbre and vocal range of the lyrics recitation audio, and the singing voice is more natural. It should be emphasized that, to further ensure the privacy and security of the synthesized song, the synthesized song may also be stored in a node of a blockchain.
  • In this embodiment of the application, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice. Song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
  • Referring to Figure 2, another embodiment of the song synthesis method in the embodiments of the present application includes:
  • 201. Obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information.
  • Specifically, the server obtains the unique identifier of the target song, which is used to associate the lyrics recitation audio with the score information; for example, the unique identifier is s_1, the target song is A, and there is a one-to-one correspondence between s_1 and A. The server generates a query statement according to structured query language grammar rules and the unique identifier, for example, select * from songs_table where id='s_1'. The server executes the query statement to obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, pitch, and syllable singing durations, where the lyrics are recorded in pinyin form in the score information.
  • 202. Label the durations of the phonemes in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration.
  • Specifically, the server parses the score information and reads the lyric pinyin text from the parsed score information; the server inputs the lyrics recitation audio and the lyric pinyin text into the preset speech recognition model and performs speech analysis on the lyrics recitation audio through the model; the server uses the model to label the phonemes in the analyzed audio according to the lyric pinyin text, obtaining the timestamp and duration of each phoneme, where the phonemes include initials and finals; and the server determines the recitation durations of the phonemes in the lyrics recitation audio from their timestamps and durations, the recitation durations including the initial recitation duration and the final recitation duration.
  • It is understandable that the function of the timestamp is to precisely mark the relative position of a phoneme in the lyrics recitation audio; combined with the phoneme's duration, the initial and final recitation durations can be determined, as in the small illustration below.
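  • Given recognizer output in a hypothetical (phoneme, timestamp, duration) form, the two recitation durations fall out directly:

      # Hypothetical alignment output for one character: (phoneme, start s, duration s)
      labels = [("x", 0.52, 0.11), ("iang", 0.63, 0.24)]

      initial_recitation = labels[0][2]  # 0.11 s for the initial "x"
      final_recitation = labels[1][2]    # 0.24 s for the final "iang"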
  • 203. Analyze the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope, and the aperiodic sequence.
  • As in step 103, the preset vocoder includes the WORLD vocoder; the server performs data processing (filtering, standard-deviation calculation, and smoothing) on the duration information of the phonemes in the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters. The lyrics recitation audio is a signal composed of sine waves: the fundamental frequency F0 is the lowest-frequency sine wave of a vibration-produced signal, the others being harmonics (overtones); the spectral envelope SP is the envelope obtained by connecting the amplitude peaks at different frequencies with a smooth curve; and the aperiodic sequence AP corresponds to the aperiodic pulse sequence of the mixed-excitation part, where mixed excitation refers to controlling the periodic excitation, noise, and aperiodic signals through multiple parameters.
  • 204. Extract the singing durations of the phonemes from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration.
  • Specifically, the server extracts the singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information; queries the preset initial-consonant speed-change dictionary according to each character's singing duration t to obtain the character's initial singing duration t1; computes the character's final singing duration as the difference t2 = t - t1; and sets each character's initial singing duration and final singing duration as the phoneme singing durations. For example, for the character '香', the corresponding pinyin 'xiang' can be decomposed into 'x' and 'iang'; if the server determines that the singing duration of 'xiang' is 1 second and that of 'x' is 0.3 seconds, then that of 'iang' is 0.7 seconds.
  • 205. Calculate the speed-change ratio r of the phoneme from the initial recitation duration, the final recitation duration, the initial singing duration, and the final singing duration, with r > 0. Further, the server computes r = final singing duration / final recitation duration, with r > 0; or the server computes r = initial singing duration / initial recitation duration, with r > 0.
  • 206. Perform speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain the speed-changed acoustic parameters.
  • First, when r equals 1, the server takes the initial acoustic parameters corresponding to the current phoneme as the speed-changed acoustic parameters. Second, when r equals 2, the server lengthens the initial acoustic parameters corresponding to the current phoneme twofold to obtain the speed-changed acoustic parameters; further, the server uses a preset averaging frame-insertion algorithm for the twofold lengthening, that is, a new frame of data is inserted between every two adjacent frames of the initial acoustic parameters, and the value of each inserted frame is the average of the two adjacent frames around it.
  • Then, when r is less than 2 and not equal to 1, the server performs speed-change processing on the initial acoustic parameters corresponding to the current phoneme using a preset geometric frame addition/removal algorithm to obtain the speed-changed acoustic parameters. Further, suppose the sequence length corresponding to the initial acoustic parameters before the speed change is l; the sequence length after the speed change is then l*r. Specifically, the server obtains the integer sequence from 0 to l*r, scales the values of this integer sequence down by a factor of r and rounds them, uses the resulting integer sequence as indices, and takes values from the sequence of the initial acoustic parameters; the new sequence of length l*r thus obtained is the speed-changed acoustic parameters.
  • Finally, when r is greater than 2, the server lengthens the initial acoustic parameters corresponding to the current phoneme by more than a factor of two to obtain the speed-changed acoustic parameters. Specifically, the server first executes the step for r equal to 2 to obtain intermediate speed-changed data, and then repeats the whole speed-change procedure on that data with the new speed-change ratio r/2.
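  • The following Python sketch puts the four cases together for a per-frame parameter sequence (F0, SP, or AP); it is one reading of the scheme above, with the caveat that exact frame counts and edge handling are not specified by the patent:

      import numpy as np

      def speed_change(params: np.ndarray, r: float) -> np.ndarray:
          """Time-scale a (frames, ...) parameter sequence by ratio r > 0."""
          if r == 1:
              return params                     # case 1: unchanged
          if r == 2:
              # case 2: insert the average of each adjacent pair between the pair
              mids = (params[:-1] + params[1:]) / 2
              out = np.empty((2 * len(params) - 1,) + params.shape[1:], params.dtype)
              out[0::2] = params
              out[1::2] = mids
              return out
          if r < 2:
              # case 3: geometric frame addition/removal by index resampling
              l = len(params)
              idx = np.round(np.arange(int(l * r)) / r).astype(int)
              return params[np.clip(idx, 0, l - 1)]
          # case 4 (r > 2): double once, then repeat with the new ratio r / 2
          return speed_change(speed_change(params, 2), r / 2)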
  • 207. Concatenate the speed-changed acoustic parameters in series to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence.
  • It should be noted that traditional speech speed-change algorithms, for example WSOLA (time-scale modification of audio without changing pitch) and the phase vocoder, share the basic idea of dividing the waveform into frames, adjusting the frame shift, and then overlapping and splicing the frames again. Both suffer from unnatural waveforms at the overlaps, which prevents the acoustic parameters from being extracted normally during subsequent pitch changing. Concatenating the speed-changed acoustic parameters in series therefore avoids overlap operations at the waveform level, and in turn avoids the low acoustic-parameter-extraction accuracy caused by waveform damage, so that the speed-change algorithm is applied to acoustic parameters.
  • 208. Perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope.
  • Specifically, the server locates the formant in the band around 3 kHz in the speed-changed spectral envelope and records the formant's center frequency and amplitude; the server determines the intensity coefficient of the boost filter and the center frequency to be enhanced from the formant's center frequency and amplitude; the server performs formant enhancement according to the boost filter's intensity coefficient and the center frequency to be enhanced, obtaining the formant-enhanced spectrum; and the server filters the formant-enhanced spectrum to obtain the enhanced spectral envelope.
  • It is understandable that the spectral envelope is the curve formed by connecting the amplitude peaks at different frequencies. The spectrum is a collection of many different frequencies spanning a wide frequency range, and different frequencies may have different amplitudes.
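  • A minimal sketch of the boost is given below; since the patent specifies only a centre frequency around 3 kHz, an intensity coefficient, and a boost filter, the Gaussian filter shape and the default values here are assumptions:

      import numpy as np

      def enhance_singing_formant(sp, fs, center_hz=3000.0, gain_db=6.0, bw_hz=500.0):
          """sp: (frames, bins) WORLD spectral envelope (power spectrum)."""
          freqs = np.linspace(0, fs / 2, sp.shape[1])
          boost_db = gain_db * np.exp(-0.5 * ((freqs - center_hz) / bw_hz) ** 2)
          return sp * 10.0 ** (boost_db / 10.0)  # apply the bump around 3 kHz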
  • 209. Perform correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency.
  • Specifically, the server generates the song's fundamental frequency based on the pitch information, the singing durations, and the speed-changed fundamental frequency; the server sums the fundamental-frequency values in the initial acoustic parameters and computes their average to obtain the average fundamental frequency; and the server raises or lowers the song's fundamental frequency based on the average fundamental frequency to obtain the initial fundamental-frequency sequence, which includes pitches and notes. It should be noted that in Mandarin, initials can be divided into unvoiced and voiced according to whether the vocal cords vibrate during phonation, and the vocal-cord vibration frequency is directly related to the fundamental frequency of the pronunciation: no vibration means no fundamental frequency, that is, F0 is 0, so when generating the new fundamental-frequency sequence, the fundamental frequency of unvoiced initials must be set to 0. When the server detects that the same character corresponds to different pitches in the initial fundamental-frequency sequence, it smooths the notes of the same pitch; songs often assign one character several notes of different pitches, and the added smoothing makes the transition closer to a real singer's habit (in part of the lyrics of 'My Motherland', for example, the character '浪' corresponds to four different pitches, and the pitch transitions are stiff and abrupt before smoothing but gradual and closer to real singing after it). When the server detects a pitch change between adjacent notes in the initial fundamental-frequency sequence, it applies preparation and overshoot processing between the adjacent notes through a preset formula.
  • The preset formula takes the form of a second-order damped system driven by the initial sequence, f''(t) + 2ξω·f'(t) + ω²·f(t) = k·ω²·s(t), where s is the initial fundamental-frequency sequence, f is the processed fundamental-frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain. When the server detects that the duration of a note in the initial fundamental-frequency sequence is greater than a preset threshold, it adds vibrato to that note's initial fundamental-frequency sequence; and when it detects over-smoothing in the initial fundamental-frequency sequence, it adds white noise to the initial fundamental-frequency sequence, obtaining the corrected fundamental-frequency sequence.
  • It is understandable that vibrato is a common singing technique that mainly appears on sustained notes and manifests as a small, roughly sinusoidal tremor of the fundamental frequency. If a note's duration exceeds the preset threshold x, vibrato is added to that note's initial fundamental-frequency sequence. Further, three parameters are considered when adding vibrato: the vibrato onset point a, with a between 0 and 1, indicating from which moment within the note the vibrato is added; the vibrato amplitude, extent; and the vibrato frequency, rate. The values of x, a, extent, and rate vary across singing styles; for example, compared with bel canto, popular singing uses larger values of x and a and smaller values of extent and rate. Adding vibrato to the character '宽' in the lyrics of 'My Motherland', for instance, reflects the way vibrato is added in popular singing.
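  • The sketch below adds vibrato to one note's F0 contour using the three parameters named above (onset point a, amplitude extent, frequency rate); the default values are merely plausible, not taken from the patent:

      import numpy as np

      def add_vibrato(note_f0, frame_period_s, a=0.4, extent_hz=3.0, rate_hz=5.5):
          out = note_f0.copy()
          start = int(a * len(note_f0))   # vibrato begins at fraction a of the note
          t = np.arange(len(note_f0) - start) * frame_period_s
          out[start:] += extent_hz * np.sin(2 * np.pi * rate_hz * t)
          return out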
  • 210. Perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.
  • That is, the server finally inputs the three acoustic features, the corrected fundamental frequency, the enhanced spectral envelope, and the speed-changed aperiodic sequence, into WORLD, and the synthesized song, a waveform signal, is obtained from WORLD's synthesis output. It should be noted that the WORLD vocoder converts text into sounds similar to human pronunciation based on the human pronunciation spectrum; that is, WORLD treats each pinyin syllable as a sequence, predicts each segment of the sequence to be synthesized according to the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency, and then converts the predicted spectrum into a singing waveform.
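  • Continuing the earlier pyworld sketches, the final synthesis step might look as follows, where corrected_f0, enhanced_sp, and shifted_ap stand for the three features described above:

      import pyworld as pw
      import soundfile as sf

      frame_period = 5.0  # ms; must match the analysis frame period
      y = pw.synthesize(corrected_f0, enhanced_sp, shifted_ap, fs, frame_period)
      sf.write("synthesized_song.wav", y, fs)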
  • In this embodiment of the application, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice. Song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of song data, reducing the data-collection cost of song synthesis.
  • Referring to Figure 3, an embodiment of the song synthesis apparatus in the embodiments of the present application includes:
  • the obtaining module 301, configured to obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information;
  • the labeling module 302, configured to label the durations of the phonemes in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, including the initial recitation duration and the final recitation duration;
  • the analysis module 303, configured to analyze the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, including the fundamental frequency, the spectral envelope, and the aperiodic sequence;
  • the extraction module 304, configured to extract the phoneme singing durations, including the initial singing duration and the final singing duration, from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information;
  • the speed-change module 305, configured to perform speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence;
  • the enhancement module 306, configured to perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope;
  • the correction module 307, configured to perform correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency;
  • the synthesis module 308, configured to perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.
  • It should be emphasized that, to further ensure the privacy and security of the synthesized song, the synthesized song may also be stored in a node of a blockchain.
  • In this embodiment of the application, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice. Song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
  • Referring to FIG. 4, another embodiment of the song synthesis apparatus in the embodiments of the present application includes the obtaining module 301, the labeling module 302, the analysis module 303, the extraction module 304, the speed-change module 305, the enhancement module 306, the correction module 307, and the synthesis module 308, each configured as in the Figure 3 embodiment described above.
  • Optionally, the labeling module 302 may be specifically configured to: parse the score information and read the lyric pinyin text from the parsed score information; input the lyrics recitation audio and the lyric pinyin text into the preset speech recognition model and perform speech analysis on the lyrics recitation audio through the model; label the phonemes in the analyzed audio according to the lyric pinyin text to obtain their timestamps and durations, where the phonemes include initials and finals; and determine the phoneme recitation durations, including the initial recitation duration and the final recitation duration, from the timestamps and durations.
  • Optionally, the extraction module 304 may be specifically configured to: extract the singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information; query the preset initial-consonant speed-change dictionary for each character's initial singing duration t1; compute each character's final singing duration t2 = t - t1; and set each character's initial and final singing durations as the phoneme singing durations.
  • Optionally, the speed-change module 305 includes:
  • the calculation unit 3051, configured to calculate the speed-change ratio r of a phoneme from the initial recitation duration, the final recitation duration, the initial singing duration, and the final singing duration, with r > 0;
  • the speed-change unit 3052, configured to perform speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain the speed-changed acoustic parameters;
  • the splicing unit 3053, configured to concatenate the speed-changed acoustic parameters in series to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence.
  • Optionally, the speed-change unit 3052 may be specifically configured to: when r equals 1, take the initial acoustic parameters of the current phoneme as the speed-changed acoustic parameters; when r equals 2, lengthen the initial acoustic parameters twofold using the preset averaging frame-insertion algorithm; when r is less than 2 and not equal to 1, perform speed-change processing on the initial acoustic parameters using the preset geometric frame addition/removal algorithm to obtain the speed-changed acoustic parameters; and when r is greater than 2, lengthen the initial acoustic parameters by more than a factor of two to obtain the speed-changed acoustic parameters.
  • Optionally, the enhancement module 306 may be specifically configured to: locate the formant in the band around 3 kHz in the speed-changed spectral envelope and record the formant's center frequency and amplitude; determine the intensity coefficient of the boost filter and the center frequency to be enhanced; perform formant enhancement accordingly to obtain the formant-enhanced spectrum; and filter the formant-enhanced spectrum to obtain the enhanced spectral envelope.
  • Optionally, the correction module 307 may be specifically configured to: generate the song's fundamental frequency based on the pitch information, the singing durations, and the speed-changed fundamental frequency; average the fundamental frequency in the initial acoustic parameters to obtain the average fundamental frequency; raise or lower the song's fundamental frequency based on the average fundamental frequency to obtain the initial fundamental-frequency sequence, which includes pitches and notes; smooth the notes when the same character corresponds to different pitches; apply preparation and overshoot processing between adjacent notes through the preset formula given above, where s is the initial fundamental-frequency sequence, ω is the natural frequency, ξ is the damping coefficient, and k is the proportional gain; add vibrato to notes whose duration exceeds the preset threshold; and add white noise when over-smoothing is detected, obtaining the corrected fundamental-frequency sequence.
  • In this embodiment of the application, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice. Song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
  • FIG. 5 is a schematic structural diagram of a song synthesis device provided by an embodiment of the present application. The song synthesis device 500 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass-storage devices) storing application programs 533 or data 532. The memory 520 and the storage media 530 may provide short-term or persistent storage. The programs stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the song synthesis device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute, on the song synthesis device 500, the series of instruction operations in the storage medium 530.
  • The song synthesis device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD.
  • Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the song synthesis device, which may include more or fewer components than shown, or combine certain components, or arrange the components differently.
  • The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are run on a computer, the computer executes the following steps: obtaining the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; labeling the phoneme durations in the lyrics recitation audio to obtain the recitation durations, including the initial recitation duration and the final recitation duration; analyzing the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters; extracting the phoneme singing durations from the lyric pinyin text; performing speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope; correcting the fundamental frequency based on the pitch information, the singing durations, and the speed-changed fundamental frequency; and performing song synthesis through the preset vocoder to obtain the synthesized song.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks associated through cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.

Abstract

A song synthesis method, including: obtaining lyrics recitation audio and score information (101); labeling the durations in the lyrics recitation audio by means of a preset speech recognition model and the lyric pinyin text to obtain recitation durations (102); analyzing initial acoustic parameters from the lyrics recitation audio through a preset vocoder (103); extracting singing durations from the lyric pinyin text according to a preset initial-consonant speed-change dictionary, rhythm information, and beat information (104); performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations, and the singing durations (105); performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope (106); performing correction based on pitch information, the singing durations, and the speed-changed fundamental frequency to obtain a corrected fundamental frequency (107); and performing song synthesis on the processed acoustic parameters through the preset vocoder (108). Blockchain technology is also involved: the synthesized song is stored in a blockchain.

Description

Song synthesis method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on April 28, 2020, with application number 202010350256.0 and invention title "Song synthesis method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech signal processing, and in particular to a song synthesis method, apparatus, device, and storage medium.
Background
In recent decades, singing synthesis technology has gradually attracted the attention of the industry. Inspired by speech synthesis technology, singing synthesis based on waveform concatenation and parametric synthesis has gradually appeared, but most related research has focused on synthesizing singing from text or from lyrics, that is, converting text information into singing audio rather than converting speech audio directly into singing audio.
An algorithm for automatically distinguishing speech from singing has also been developed in the industry, but the technology was not further applied to recitation-to-singing synthesis, which directly sets the natural speaking voice to a tune and converts it into a singing voice. The inventors realized that traditional speed- and pitch-change algorithms are based on overlap operations at the waveform level and suffer from waveform discontinuities and unnatural transitions.
Summary
The main purpose of this application is to solve the technical problems that traditional speed- and pitch-change algorithms, being based on overlap operations at the waveform level, suffer from waveform discontinuities and unnatural transitions.
To achieve the above purpose, the first aspect of this application provides a song synthesis method, including: obtaining the lyrics recitation audio and score information of a target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; labeling the durations of the phonemes in the lyrics recitation audio by means of a preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration; analyzing the lyrics recitation audio through a preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope, and the aperiodic sequence; extracting the singing durations of the phonemes from the lyric pinyin text according to a preset initial-consonant speed-change dictionary, the rhythm information, and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration; performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations, and the singing durations to obtain target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope; performing correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain a corrected fundamental frequency; and performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
The second aspect of this application provides a song synthesis device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor; when executing the computer-readable instructions, the processor implements the following steps: obtaining the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; labeling the phoneme durations in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, including the initial and final recitation durations; analyzing the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters, including the fundamental frequency, the spectral envelope, and the aperiodic sequence; extracting the phoneme singing durations, including the initial and final singing durations, from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; performing speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope; performing correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency; and performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
The third aspect of this application provides a computer-readable storage medium storing computer instructions; when the computer instructions are run on a computer, the computer executes the same steps: obtaining the lyrics recitation audio and score information of the target song; labeling the phoneme durations to obtain the recitation durations, including the initial and final recitation durations; analyzing the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters, including the fundamental frequency, the spectral envelope, and the aperiodic sequence; extracting the phoneme singing durations, including the initial and final singing durations, from the lyric pinyin text; performing speed-change processing on the initial acoustic parameters to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; performing formant enhancement on the speed-changed spectral envelope; correcting the fundamental frequency based on the pitch information, the singing durations, and the speed-changed fundamental frequency; and performing song synthesis through the preset vocoder to obtain a synthesized song.
The fourth aspect of this application provides a song synthesis apparatus, including: an obtaining module for obtaining the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information; a labeling module for labeling the durations of the phonemes in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, including the initial and final recitation durations; an analysis module for analyzing the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters, including the fundamental frequency, the spectral envelope, and the aperiodic sequence; an extraction module for extracting the phoneme singing durations, including the initial and final singing durations, from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; a speed-change module for performing speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence; an enhancement module for performing formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope; a correction module for performing correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency; and a synthesis module for performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
In the technical solution provided by this application, the lyrics recitation audio and score information of the target song are obtained; the phoneme durations in the recitation audio are labeled with the preset speech recognition model and the lyric pinyin text to obtain the recitation durations; the initial acoustic parameters (fundamental frequency, spectral envelope, and aperiodic sequence) are analyzed through the preset vocoder; the phoneme singing durations are extracted from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information; the initial acoustic parameters are speed-changed according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters; the speed-changed spectral envelope is formant-enhanced; the fundamental frequency is corrected based on the pitch information, the singing durations, and the speed-changed fundamental frequency; and the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency are synthesized into a song through the preset vocoder. In this application, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice; song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
Brief Description of the Drawings
Figure 1 is a schematic diagram of an embodiment of the song synthesis method in an embodiment of this application;
Figure 2 is a schematic diagram of another embodiment of the song synthesis method in an embodiment of this application;
Figure 3 is a schematic diagram of an embodiment of the song synthesis apparatus in an embodiment of this application;
Figure 4 is a schematic diagram of another embodiment of the song synthesis apparatus in an embodiment of this application;
Figure 5 is a schematic diagram of an embodiment of the song synthesis device in an embodiment of this application.
Detailed Description of the Embodiments
The embodiments of this application provide a song synthesis method, apparatus, device, and storage medium, used to analyze acoustic parameters from lyrics recitation audio and, based on score information, perform speed change and splicing at the acoustic-parameter level through a vocoder, converting the speaking voice into a singing voice; song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
The terms "first", "second", "third", "fourth", etc. (if present) in the specification and claims of this application and in the above drawings are used to distinguish similar objects and need not describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
For ease of understanding, the specific flow of an embodiment of this application is described below. Referring to Figure 1, an embodiment of the song synthesis method in the embodiments of this application includes:
101. Obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information.
It is understandable that the execution subject of this application may be a song synthesis apparatus, and may also be a terminal or a server, which is not specifically limited here. The embodiments of this application are described by taking a server as the execution subject.
The lyrics recitation audio and score information of the target song are stored in advance in a preset data table and associated through a unique identifier, and the lyrics are recorded in pinyin form in the score information. Specifically, the server obtains the unique identifier of the target song, generates a query statement according to structured query language grammar rules and the unique identifier, and executes the query statement to obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information. For example, "wo de zu guo" represents 我的祖国 (My Motherland) and is used as the lyric pinyin text; the beat information refers to the total note length of each measure in the score, such as 1/4 time and 6/8 time; the rhythm information indicates the length and strength of the notes; and the pitch information indicates how high the voice is when the target song is sung.
102. Label the durations of the phonemes in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration.
Each character of the lyrics in the recitation audio is represented by its pinyin, which corresponds to at least one phoneme, and the phonemes include initials and finals. The server therefore performs speech analysis on the lyrics recitation audio through the preset speech recognition model and then labels the durations of the phonemes in the audio. It is understandable that a pinyin syllable is decomposed into an initial and a final; for example, "xiang" is decomposed into the two phonemes "x" and "iang", and the preset speech recognition model outputs the recitation durations of these two phonemes in the recitation audio, that is, the initial recitation duration and the final recitation duration.
103. Analyze the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope, and the aperiodic sequence.
The preset vocoder includes the WORLD vocoder. Further, the server performs data processing on the duration information of the phonemes in the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes; the data processing includes filtering, standard-deviation calculation, and smoothing. The lyrics recitation audio is a signal composed of sine waves: for a sound signal produced by vibration, the signal can be composed of many sine waves of different frequencies, among which the sine wave with the lowest frequency is the fundamental frequency F0, and the others are harmonics, that is, overtones; the spectral envelope SP is the envelope obtained by connecting the amplitude peaks at different frequencies with a smooth curve; and the aperiodic sequence AP corresponds to the aperiodic pulse sequence of the mixed-excitation part, where mixed excitation refers to controlling the periodic excitation, noise, and aperiodic signals through multiple parameters.
104. Extract the singing durations of the phonemes from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration.
The server derives in advance the pronunciation-duration pattern of each initial under different rhythms and different pronunciation durations, and formulates the initial-consonant speed-change dictionary from these patterns in advance, that is, the preset initial-consonant speed-change dictionary. Specifically, the server queries the preset dictionary for a phoneme's initial singing duration and computes the final singing duration from the phoneme's singing duration and its initial singing duration, where the singing duration of a phoneme is the sum of the initial singing duration and the final singing duration.
105. Perform speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations, and the singing durations to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence.
Because a phoneme's duration during recitation differs from its duration during singing, the phoneme durations in the recited speech can be lengthened or shortened according to the singing durations. Since the durations and proportions of the initial and the final of the same character differ across pronunciation lengths, the initial and the final must be adjusted separately; further, the server adjusts the durations of the initial acoustic parameters to obtain the target acoustic parameters, including the speed-changed fundamental frequency, the speed-changed spectral envelope, and the speed-changed aperiodic sequence.
It is understandable that this scheme is not only simple in principle and easy to implement, but also avoids overlap operations at the waveform level, thereby avoiding the low acoustic-parameter-extraction accuracy caused by waveform damage; the object of the speed-change algorithm changes from the waveform to the acoustic parameters, unifying it with the subsequent pitch-change algorithm and effectively improving the controllability of the system.
106. Perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope.
A formant is a natural peak in the spectrum of a sound. Compared with speech, the spectral envelope of singing audio has a pronounced peak in the frequency band around 3 kHz; this peak is unique to singing and is therefore called the "singing formant". To make the converted audio more natural, a singing formant is added to the spectral envelope of the audio, that is, the amplitude of the speed-changed spectral envelope in the band around 3 kHz is boosted to obtain the enhanced spectral envelope.
107. Perform correction based on the pitch information, the singing durations, and the speed-changed fundamental frequency to obtain the corrected fundamental frequency.
To make the synthesized song fit the vocal range of the recitation as closely as possible and to reduce the timbre distortion caused by pitch shifting, the server generates the song's fundamental frequency according to the pitch information and the singing durations, and pitch-shifts the song's fundamental frequency as a whole according to the speed-changed fundamental frequency, so that the average fundamental frequency of the song is as close as possible to the average fundamental frequency of the speech.
It should be noted that if the synthesized singing is given an accompaniment, the accompaniment also needs corresponding pitch-change processing, where the pitch-change processing includes raising or lowering the pitch.
108. Perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope, and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.
That is, the server finally inputs the three acoustic features, the corrected fundamental frequency, the enhanced spectral envelope, and the speed-changed aperiodic sequence, into the preset vocoder, and the synthesized song is obtained from the preset vocoder's synthesis output; the synthesized song is a waveform signal, consistent with the timbre and vocal range of the lyrics recitation audio, and the singing voice is more natural.
It should be emphasized that, to further ensure the privacy and security of the synthesized song, the synthesized song may also be stored in a node of a blockchain.
In this embodiment of the application, the acoustic parameters are analyzed from the lyrics recitation audio, and, based on the score information, speed change and splicing are performed at the acoustic-parameter level through the vocoder, converting the speaking voice into a singing voice; song synthesis is achieved while retaining the user's original timbre and vocal range, the naturalness of the singing voice is improved, and song synthesis is possible without collecting large amounts of singing data, reducing the data-collection cost of song synthesis.
Referring to Figure 2, another embodiment of the song synthesis method in the embodiments of this application includes:
201. Obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, and pitch information.
Specifically, the server obtains the unique identifier of the target song, which is used to associate the lyrics recitation audio with the score information; for example, the unique identifier is s_1, the target song is A, and there is a one-to-one correspondence between s_1 and A. The server generates a query statement according to structured query language grammar rules and the unique identifier, for example, select * from songs_table where id='s_1'. The server executes the query statement to obtain the lyrics recitation audio and score information of the target song, the score information including lyric pinyin text, beat information, rhythm information, pitch, and syllable singing durations, where the lyrics are recorded in pinyin form in the score information.
202. Label the durations of the phonemes in the lyrics recitation audio by means of the preset speech recognition model and the lyric pinyin text to obtain the phoneme recitation durations, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration.
Specifically, the server parses the score information and reads the lyric pinyin text from the parsed score information; the server inputs the lyrics recitation audio and the lyric pinyin text into the preset speech recognition model and performs speech analysis on the lyrics recitation audio through the model; the server uses the model to label the phonemes in the analyzed audio according to the lyric pinyin text, obtaining the timestamp and duration of each phoneme, where the phonemes include initials and finals; and the server determines the recitation durations of the phonemes in the lyrics recitation audio from their timestamps and durations, the recitation durations including the initial recitation duration and the final recitation duration.
It is understandable that the function of the timestamp is to precisely mark the relative position of a phoneme in the lyrics recitation audio; combined with the phoneme's duration, the initial and final recitation durations can be determined.
203. Analyze the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope, and the aperiodic sequence.
As in step 103, the preset vocoder includes the WORLD vocoder; the server performs data processing (filtering, standard-deviation calculation, and smoothing) on the duration information of the phonemes in the lyrics recitation audio through the preset vocoder to obtain the initial acoustic parameters. The fundamental frequency F0 is the lowest-frequency sine wave of the vibration-produced signal, the others being harmonics (overtones); the spectral envelope SP is the envelope connecting the amplitude peaks at different frequencies with a smooth curve; and the aperiodic sequence AP corresponds to the aperiodic pulse sequence of the mixed-excitation part, where mixed excitation refers to controlling the periodic excitation, noise, and aperiodic signals through multiple parameters.
204. Extract the singing durations of the phonemes from the lyric pinyin text according to the preset initial-consonant speed-change dictionary, the rhythm information, and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration.
Specifically, the server extracts the singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information; queries the preset initial-consonant speed-change dictionary according to each character's singing duration t to obtain the character's initial singing duration t1; computes the character's final singing duration as the difference t2 = t - t1; and sets each character's initial singing duration and final singing duration as the phoneme singing durations. For example, for the character 香, the corresponding pinyin "xiang" can be decomposed into "x" and "iang"; if the server determines that the singing duration of "xiang" is 1 second and that of "x" is 0.3 seconds, then that of "iang" is 0.7 seconds.
205. Calculate the speed-change ratio r of the phoneme from the initial recitation duration, the final recitation duration, the initial singing duration, and the final singing duration, with r > 0.
Further, the server computes the speed-change ratio r from the final recitation duration and the final singing duration, r = final singing duration / final recitation duration, with r > 0; or the server computes r from the initial recitation duration and the initial singing duration, r = initial singing duration / initial recitation duration, with r > 0.
206. Perform speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain speed-changed acoustic parameters.

First, when r equals 1, the server takes the initial acoustic parameters of the current phoneme as the speed-changed acoustic parameters. Second, when r equals 2, the server stretches the initial acoustic parameters of the current phoneme to twice their length to obtain the speed-changed acoustic parameters; further, the server performs this two-fold stretching with a preset averaging frame-insertion algorithm, that is, a new frame of data is inserted between every two adjacent frames of the initial acoustic parameters, where the value of each inserted frame is the average of the two adjacent frames between which it is inserted.

Then, when r is less than 2 and r is not equal to 1, the server performs speed-change processing on the initial acoustic parameters of the current phoneme with a preset proportional frame addition/removal algorithm to obtain the speed-changed acoustic parameters. Further, if the sequence length of the initial acoustic parameters before the speed change is l, the sequence length of the speed-changed acoustic parameters is l*r. Specifically, the server takes the integer sequence from 0 to l*r, scales every value of this integer sequence down by a factor of r and rounds the results; the server then uses the rounded sequence as indices to take values from the sequence of initial acoustic parameters, and the resulting new sequence of length l*r is the speed-changed acoustic parameters.

Finally, when r is greater than 2, the server stretches the initial acoustic parameters of the current phoneme to more than twice their length to obtain the speed-changed acoustic parameters. Specifically, the server first performs the step for r equal to 2, and then repeats the whole speed-change procedure on the resulting data with the new ratio r/2.
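The four cases can be sketched as one recursive routine over a frame sequence (frames along the first axis). This is an illustrative reading of the steps above under stated assumptions, not a verbatim implementation from the patent:

```python
import numpy as np

def stretch(params: np.ndarray, r: float) -> np.ndarray:
    """Speed-change a sequence of acoustic-parameter frames by ratio r."""
    if r <= 0:
        raise ValueError("r must be positive")
    if r == 1:                                   # unchanged
        return params
    if r == 2:                                   # averaging frame insertion
        mids = (params[:-1] + params[1:]) / 2.0  # mean of adjacent frames
        out = np.empty((len(params) + len(mids),) + params.shape[1:],
                       dtype=params.dtype)
        out[0::2], out[1::2] = params, mids
        return out
    if r < 2:                                    # proportional add/drop:
        # indices 0..l*r scaled down by r and rounded toward zero
        idx = (np.arange(int(len(params) * r)) / r).astype(int)
        return params[idx]
    return stretch(stretch(params, 2), r / 2.0)  # r > 2: double, then recurse
```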
207. Concatenate the speed-changed acoustic parameters in series to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence.

It should be noted that traditional speech speed-change algorithms, for example WSOLA (time-scale modification without pitch change) and the phase vocoder, all follow the basic idea of splitting the waveform into frames, adjusting the frame shift, and then overlap-adding the frames again. They all suffer from unnatural waveform transitions at the overlap points, so that acoustic parameters cannot be extracted properly during the subsequent pitch shifting. Concatenating the speed-changed acoustic parameters in series therefore avoids overlap operations at the waveform level, thereby avoiding the low accuracy of acoustic parameter extraction caused by waveform damage, and makes acoustic parameters the object of the speed-change algorithm.
208. Perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope.

Specifically, the server searches the speed-changed spectral envelope for the formant in the frequency band around 3 kHz and records the center frequency and amplitude of the formant; the server determines the strength coefficient of a boost filter and the center frequency to be enhanced from the center frequency and amplitude of the formant; the server performs formant enhancement according to the strength coefficient of the boost filter and the center frequency to be enhanced to obtain a formant-enhanced spectrum; the server filters the formant-enhanced spectrum to obtain the enhanced spectral envelope.
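One possible form of the boost is a smooth multiplicative peak centered near 3 kHz. The Gaussian shape and default gain below are illustrative assumptions; the patent instead derives the strength coefficient from the detected formant itself:

```python
import numpy as np

def enhance_singing_formant(sp, fs, center=3000.0, width=400.0, gain_db=6.0):
    """Boost the speed-changed spectral envelope `sp` (frames x bins)
    around `center` Hz to emphasize the singing formant. `sp` is treated
    here as a linear-amplitude envelope; for a power-domain envelope
    (e.g. WORLD's CheapTrick output) divide gain_db by 10 instead of 20."""
    freqs = np.linspace(0.0, fs / 2.0, sp.shape[1])
    bump = np.exp(-0.5 * ((freqs - center) / width) ** 2)  # 1 at center
    gain = 10.0 ** (gain_db * bump / 20.0)                 # +gain_db at peak
    return sp * gain                                       # broadcast frames
```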
It should be understood that the spectral envelope is the curve formed by connecting the amplitude peaks at different frequencies, that is, the spectral envelope curve. A spectrum is a collection of many different frequencies spanning a wide frequency range, and the amplitudes at different frequencies may differ.

209. Perform correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain the corrected fundamental frequency.

Specifically, the server generates the fundamental frequency of the song based on the pitch information, the singing durations and the speed-changed fundamental frequency; the server sums the fundamental frequency values in the initial acoustic parameters and computes their average to obtain the average fundamental frequency; the server raises or lowers the key of the song's fundamental frequency based on the average fundamental frequency to obtain an initial fundamental frequency sequence, the initial fundamental frequency sequence including pitches and notes. It should be noted that, in Mandarin, initials can be divided into unvoiced and voiced according to whether the vocal cords vibrate during articulation, and the vibration frequency of the vocal cords is directly related to the fundamental frequency of the utterance; no vocal-cord vibration means no fundamental frequency, that is, the fundamental frequency is 0. Therefore, when generating the new fundamental frequency sequence, the fundamental frequency of unvoiced initials must be set to 0. When detecting that one character in the initial fundamental frequency sequence corresponds to several different pitches, the server smooths the notes corresponding to these pitches; when detecting a pitch change between adjacent notes in the initial fundamental frequency sequence, the server applies preparation and overshoot processing between the adjacent notes through a preset formula, the preset formula being
kω²/(s² + 2ξωs + ω²)
where s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient and k is the proportional gain. When detecting that the duration of a note in the initial fundamental frequency sequence exceeds a preset threshold, the server adds vibrato to the initial fundamental frequency sequence of that note; when detecting over-smoothing in the initial fundamental frequency sequence, the server adds white noise to the initial fundamental frequency sequence to obtain the corrected fundamental frequency sequence.
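Reading the preset formula as the transfer function of a second-order system, the preparation and overshoot between adjacent notes can be sketched by filtering the stepwise note-level F0 contour through its discretized form. All parameter values here are illustrative assumptions, not values taken from the patent:

```python
import numpy as np
from scipy import signal

def prepare_overshoot(note_f0, omega=40.0, xi=0.6, k=1.0, frame_rate=200.0):
    """Filter a stepwise note F0 contour (Hz per frame, 5 ms frames at
    frame_rate=200) through k*omega^2 / (s^2 + 2*xi*omega*s + omega^2),
    discretized with the bilinear transform; xi < 1 gives overshoot and
    preparation around pitch changes."""
    b, a = signal.bilinear([k * omega ** 2],
                           [1.0, 2.0 * xi * omega, omega ** 2],
                           fs=frame_rate)
    # Offset by the first note so the filter does not ramp up from 0 Hz.
    return signal.lfilter(b, a, note_f0 - note_f0[0]) + note_f0[0]
```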
It should be noted that in songs one character often corresponds to several notes with different pitches. For this situation, the server adds smoothing between the notes to better match the habits of a real singer and to improve the perceived naturalness. For example, in part of the lyrics of 《我的祖国》, the character "浪" corresponds to 4 different pitches; before smoothing, the pitch transitions sound rather stiff and abrupt, whereas after smoothing the pitch transitions are smoother and closer to a real human performance.

It should be understood that vibrato is a common singing technique that appears mainly on sustained notes and manifests as small, roughly sinusoidal oscillations of the fundamental frequency. If the duration of a note exceeds the preset threshold x, vibrato is added to the initial fundamental frequency sequence of that note. Further, when adding vibrato, three parameters are considered: the vibrato insertion point a, with a between 0 and 1, indicating from which moment of the note the vibrato starts; the vibrato amplitude extent; and the vibrato frequency rate. The values of x, a, extent and rate vary across singing styles; for example, compared with bel canto, pop singing uses larger values of x and a and smaller values of extent and rate. For instance, adding vibrato to the character "宽" in the lyrics of 《我的祖国》 reflects the way vibrato is inserted in pop singing.
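Vibrato insertion on a long note can be sketched as superimposing a small sinusoidal deviation from the insertion point a onward. The default values below are illustrative of the pop-style settings described above, not values taken from the patent:

```python
import numpy as np

def add_vibrato(f0, a=0.4, extent=0.5, rate=5.5, frame_rate=200.0):
    """Add vibrato to one note's F0 contour from fraction `a` of the note
    onward; `extent` is the peak deviation in semitones, `rate` is in Hz."""
    out = f0.copy()
    start = int(len(f0) * a)
    t = np.arange(len(f0) - start) / frame_rate
    dev = extent * np.sin(2.0 * np.pi * rate * t)  # semitone deviation
    out[start:] = f0[start:] * 2.0 ** (dev / 12.0)
    return out
```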
210. Perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.

That is, the server finally feeds the three acoustic features, namely the corrected fundamental frequency, the enhanced spectral envelope and the speed-changed aperiodic sequence, into WORLD, which synthesizes and outputs the synthesized song, the synthesized song being a waveform signal.

It should be noted that the WORLD vocoder converts text into sounds similar to human pronunciation based on the spectrum of human speech; that is, WORLD treats each pinyin syllable as a sequence, predicts the sequence of each speech segment to be synthesized from the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency, and then converts the predicted spectrum into a singing waveform.
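Assuming the same pyworld binding as in the analysis sketch, the final synthesis call might look as follows; the three feature names are placeholders for the outputs of the preceding steps:

```python
import pyworld as pw
import soundfile as sf

# f0_corrected, sp_enhanced, ap_stretched: the corrected fundamental
# frequency, the enhanced spectral envelope and the speed-changed
# aperiodic sequence (contiguous float64 arrays, matching frame counts).
y = pw.synthesize(f0_corrected, sp_enhanced, ap_stretched, fs)
sf.write("synthesized_song.wav", y, fs)
```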
In the embodiments of this application, acoustic parameters are analyzed from the lyric recitation audio, and speed change and concatenation are performed at the acoustic-parameter level through a vocoder based on the score information, converting the speaking voice into a singing voice. Song synthesis is achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; at the same time, song synthesis is achieved without collecting a large amount of singing data, which reduces the data collection cost of song synthesis.
The song synthesis method in the embodiments of this application has been described above; the song synthesis apparatus in the embodiments of this application is described below. Referring to FIG. 3, one embodiment of the song synthesis apparatus in the embodiments of this application includes:

an acquisition module 301, configured to acquire the lyric recitation audio and the score information of a target song, the score information including lyric pinyin text, beat information, rhythm information and pitch information;

an annotation module 302, configured to annotate the durations of the phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration;

an analysis module 303, configured to analyze the lyric recitation audio through a preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope and the aperiodic sequence;

an extraction module 304, configured to extract the singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration;

a speed-change module 305, configured to perform speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence;

an enhancement module 306, configured to perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope;

a correction module 307, configured to perform correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain the corrected fundamental frequency;

a synthesis module 308, configured to perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.

It should be emphasized that, to further ensure the privacy and security of the synthesized song, the synthesized song may also be stored in a node of a blockchain.

In the embodiments of this application, acoustic parameters are analyzed from the lyric recitation audio, and speed change and concatenation are performed at the acoustic-parameter level through a vocoder based on the score information, converting the speaking voice into a singing voice. Song synthesis is achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; at the same time, song synthesis is achieved without collecting a large amount of singing data, which reduces the data collection cost of song synthesis.
Referring to FIG. 4, another embodiment of the song synthesis apparatus in the embodiments of this application includes:

an acquisition module 301, configured to acquire the lyric recitation audio and the score information of a target song, the score information including lyric pinyin text, beat information, rhythm information and pitch information;

an annotation module 302, configured to annotate the durations of the phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration;

an analysis module 303, configured to analyze the lyric recitation audio through a preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including the fundamental frequency, the spectral envelope and the aperiodic sequence;

an extraction module 304, configured to extract the singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme including the initial singing duration and the final singing duration;

a speed-change module 305, configured to perform speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence;

an enhancement module 306, configured to perform formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope;

a correction module 307, configured to perform correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain the corrected fundamental frequency;

a synthesis module 308, configured to perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain the synthesized song.
Optionally, the annotation module 302 may be further specifically configured to:

parse the score information and read the lyric pinyin text from the parsed score information;

input the lyric recitation audio and the lyric pinyin text into the preset speech recognition model, and parse the lyric recitation audio through the preset speech recognition model;

label the phonemes in the parsed lyric recitation audio according to the lyric pinyin text through the preset speech recognition model to obtain the timestamps and durations of the phonemes, the phonemes including initials and finals;

determine the recitation durations of the phonemes from the timestamps and durations of the phonemes, the recitation duration of a phoneme including the initial recitation duration and the final recitation duration.
Optionally, the extraction module 304 may be further specifically configured to:

extract the singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information;

look up the initial singing duration t1 of each character in the preset initial speed-change dictionary according to the singing duration t of that character;

subtract the initial singing duration t1 of each character from the singing duration t of that character to obtain the final singing duration t2 of that character, where t2 = t - t1;

set the initial singing duration and the final singing duration of each character as the singing durations of the phonemes corresponding to that character.
Optionally, the speed-change module 305 includes:

a calculation unit 3051, configured to calculate the speed-change ratio r of a phoneme from the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, with r > 0;

a speed-change unit 3052, configured to perform speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain the speed-changed acoustic parameters;

a concatenation unit 3053, configured to concatenate the speed-changed acoustic parameters in series to obtain the target acoustic parameters, the target acoustic parameters including the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence.
Optionally, the speed-change unit 3052 may be further specifically configured to:

when r equals 1, take the initial acoustic parameters as the speed-changed acoustic parameters;

when r equals 2, stretch the initial acoustic parameters to twice their length to obtain the speed-changed acoustic parameters;

when r is less than 2 and r is not equal to 1, perform speed-change processing on the initial acoustic parameters with the preset proportional frame addition/removal algorithm to obtain the speed-changed acoustic parameters;

when r is greater than 2, stretch the initial acoustic parameters to more than twice their length to obtain the speed-changed acoustic parameters.
Optionally, the enhancement module 306 may be further specifically configured to:

search the speed-changed spectral envelope for the formant in the frequency band around 3 kHz, and record the center frequency and amplitude of the formant;

determine the strength coefficient of the boost filter and the center frequency to be enhanced from the center frequency and amplitude of the formant;

perform formant enhancement according to the strength coefficient of the boost filter and the center frequency to be enhanced to obtain the formant-enhanced spectrum;

filter the formant-enhanced spectrum to obtain the enhanced spectral envelope.
Optionally, the correction module 307 may be further specifically configured to:

generate the fundamental frequency of the song based on the pitch information, the singing durations and the speed-changed fundamental frequency;

sum the fundamental frequency values in the initial acoustic parameters and compute their average to obtain the average fundamental frequency;

raise or lower the key of the song's fundamental frequency based on the average fundamental frequency to obtain the initial fundamental frequency sequence, the initial fundamental frequency sequence including pitches and notes;

when detecting that one character in the initial fundamental frequency sequence corresponds to several different pitches, smooth the notes corresponding to these pitches;

when detecting a pitch change between adjacent notes in the initial fundamental frequency sequence, apply preparation and overshoot processing between the adjacent notes in the initial fundamental frequency sequence through the preset formula, the preset formula being
kω²/(s² + 2ξωs + ω²)
where s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient and k is the proportional gain;

when detecting that the duration of a note in the initial fundamental frequency sequence exceeds the preset threshold, add vibrato to the initial fundamental frequency sequence of that note;

when detecting over-smoothing of the notes in the initial fundamental frequency sequence, add white noise to the initial fundamental frequency sequence to obtain the corrected fundamental frequency sequence.
In the embodiments of this application, acoustic parameters are analyzed from the lyric recitation audio, and speed change and concatenation are performed at the acoustic-parameter level through a vocoder based on the score information, converting the speaking voice into a singing voice. Song synthesis is achieved while preserving the user's original timbre and vocal range, which improves the naturalness of the singing voice; at the same time, song synthesis is achieved without collecting a large amount of singing data, which reduces the data collection cost of song synthesis.

FIG. 3 and FIG. 4 above describe the song synthesis apparatus in the embodiments of this application in detail from the perspective of modular functional entities; the song synthesis device in the embodiments of this application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a song synthesis device provided by an embodiment of this application. The song synthesis device 500 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. A program stored on the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations on the song synthesis device 500. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the song synthesis device 500.

The song synthesis device 500 may further include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD and so on. Those skilled in the art will understand that the song synthesis device structure shown in FIG. 5 does not constitute a limitation on the song synthesis device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
This application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:

acquiring lyric recitation audio and score information of a target song, the score information including lyric pinyin text, beat information, rhythm information and pitch information;

annotating the durations of the phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, the recitation duration of a phoneme including an initial recitation duration and a final recitation duration;

analyzing the lyric recitation audio through a preset vocoder to obtain the initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters including a fundamental frequency, a spectral envelope and an aperiodic sequence;

extracting the singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme including an initial singing duration and a final singing duration;

performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, the target acoustic parameters including a speed-changed fundamental frequency, a speed-changed spectral envelope and a speed-changed aperiodic sequence;

performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope;

performing correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain a corrected fundamental frequency;

performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.

The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (20)

  1. A song synthesis method, comprising:
    acquiring lyric recitation audio and score information of a target song, the score information comprising lyric pinyin text, beat information, rhythm information and pitch information;
    annotating durations of phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain recitation durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration;
    analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters comprising a fundamental frequency, a spectral envelope and an aperiodic sequence;
    extracting singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme comprising an initial singing duration and a final singing duration;
    performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, the target acoustic parameters comprising a speed-changed fundamental frequency, a speed-changed spectral envelope and a speed-changed aperiodic sequence;
    performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope;
    performing correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain a corrected fundamental frequency; and
    performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
  2. The song synthesis method according to claim 1, wherein annotating the durations of the phonemes in the lyric recitation audio through the preset speech recognition model and the lyric pinyin text to obtain the recitation durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration, comprises:
    parsing the score information, and reading the lyric pinyin text from the parsed score information;
    inputting the lyric recitation audio and the lyric pinyin text into the preset speech recognition model, and parsing the lyric recitation audio through the preset speech recognition model;
    labeling the phonemes in the parsed lyric recitation audio according to the lyric pinyin text through the preset speech recognition model to obtain timestamps and durations of the phonemes, the phonemes comprising initials and finals; and
    determining the recitation durations of the phonemes from the timestamps and the durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration.
  3. The song synthesis method according to claim 1, wherein extracting the singing durations of the phonemes from the lyric pinyin text according to the preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme comprising an initial singing duration and a final singing duration, comprises:
    extracting a singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information;
    looking up an initial singing duration t1 of each character in the preset initial speed-change dictionary according to the singing duration t of the character;
    subtracting the initial singing duration t1 of each character from the singing duration t of the character to obtain a final singing duration t2 of the character, where t2 = t - t1; and
    setting the initial singing duration and the final singing duration of each character as the singing durations of the phonemes corresponding to the character.
  4. The song synthesis method according to claim 1, wherein performing speed-change processing on the initial acoustic parameters according to the preset speed-change algorithm, the recitation durations and the singing durations to obtain the target acoustic parameters, the target acoustic parameters comprising the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence, comprises:
    calculating a speed-change ratio r of a phoneme from the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, with r > 0;
    performing speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain speed-changed acoustic parameters; and
    concatenating the speed-changed acoustic parameters in series to obtain the target acoustic parameters, the target acoustic parameters comprising the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence.
  5. The song synthesis method according to claim 4, wherein performing speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain the speed-changed acoustic parameters comprises:
    when r equals 1, taking the initial acoustic parameters as the speed-changed acoustic parameters;
    when r equals 2, stretching the initial acoustic parameters to twice their length to obtain the speed-changed acoustic parameters;
    when r is less than 2 and r is not equal to 1, performing speed-change processing on the initial acoustic parameters with a preset proportional frame addition/removal algorithm to obtain the speed-changed acoustic parameters; and
    when r is greater than 2, stretching the initial acoustic parameters to more than twice their length to obtain the speed-changed acoustic parameters.
  6. The song synthesis method according to claim 1, wherein performing formant enhancement on the speed-changed spectral envelope to obtain the enhanced spectral envelope comprises:
    searching the speed-changed spectral envelope for a formant in the frequency band around 3 kHz, and recording the center frequency and amplitude of the formant;
    determining a strength coefficient of a boost filter and a center frequency to be enhanced from the center frequency and amplitude of the formant;
    performing formant enhancement according to the strength coefficient of the boost filter and the center frequency to be enhanced to obtain a formant-enhanced spectrum; and
    filtering the formant-enhanced spectrum to obtain the enhanced spectral envelope.
  7. The song synthesis method according to claim 1 or 3, wherein the synthesized song is stored in a blockchain, and performing correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain the corrected fundamental frequency comprises:
    generating a fundamental frequency of the song based on the pitch information, the singing durations and the speed-changed fundamental frequency;
    summing the fundamental frequency values in the initial acoustic parameters and computing their average to obtain an average fundamental frequency;
    raising or lowering the key of the song's fundamental frequency based on the average fundamental frequency to obtain an initial fundamental frequency sequence, the initial fundamental frequency sequence comprising pitches and notes;
    when detecting that one character in the initial fundamental frequency sequence corresponds to several different pitches, smoothing the notes corresponding to these pitches;
    when detecting a pitch change between adjacent notes in the initial fundamental frequency sequence, applying preparation and overshoot processing between the adjacent notes through a preset formula, the preset formula being
    kω²/(s² + 2ξωs + ω²)
    where s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient and k is the proportional gain;
    when detecting that the duration of a note in the initial fundamental frequency sequence exceeds a preset threshold, adding vibrato to the initial fundamental frequency sequence of the note; and
    when detecting over-smoothing of the notes in the initial fundamental frequency sequence, adding white noise to the initial fundamental frequency sequence to obtain the corrected fundamental frequency.
  8. A song synthesis device, comprising a memory, a processor, and computer-readable instructions stored on the memory and runnable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring lyric recitation audio and score information of a target song, the score information comprising lyric pinyin text, beat information, rhythm information and pitch information;
    annotating durations of phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain recitation durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration;
    analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters comprising a fundamental frequency, a spectral envelope and an aperiodic sequence;
    extracting singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme comprising an initial singing duration and a final singing duration;
    performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, the target acoustic parameters comprising a speed-changed fundamental frequency, a speed-changed spectral envelope and a speed-changed aperiodic sequence;
    performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope;
    performing correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain a corrected fundamental frequency; and
    performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
  9. The song synthesis device according to claim 8, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    parsing the score information, and reading the lyric pinyin text from the parsed score information;
    inputting the lyric recitation audio and the lyric pinyin text into the preset speech recognition model, and parsing the lyric recitation audio through the preset speech recognition model;
    labeling the phonemes in the parsed lyric recitation audio according to the lyric pinyin text through the preset speech recognition model to obtain timestamps and durations of the phonemes, the phonemes comprising initials and finals; and
    determining the recitation durations of the phonemes from the timestamps and the durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration.
  10. The song synthesis device according to claim 8, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    extracting a singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information;
    looking up an initial singing duration t1 of each character in the preset initial speed-change dictionary according to the singing duration t of the character;
    subtracting the initial singing duration t1 of each character from the singing duration t of the character to obtain a final singing duration t2 of the character, where t2 = t - t1; and
    setting the initial singing duration and the final singing duration of each character as the singing durations of the phonemes corresponding to the character.
  11. The song synthesis device according to claim 8, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    calculating a speed-change ratio r of a phoneme from the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, with r > 0;
    performing speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain speed-changed acoustic parameters; and
    concatenating the speed-changed acoustic parameters in series to obtain the target acoustic parameters, the target acoustic parameters comprising the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence.
  12. The song synthesis device according to claim 11, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    when r equals 1, taking the initial acoustic parameters as the speed-changed acoustic parameters;
    when r equals 2, stretching the initial acoustic parameters to twice their length to obtain the speed-changed acoustic parameters;
    when r is less than 2 and r is not equal to 1, performing speed-change processing on the initial acoustic parameters with a preset proportional frame addition/removal algorithm to obtain the speed-changed acoustic parameters; and
    when r is greater than 2, stretching the initial acoustic parameters to more than twice their length to obtain the speed-changed acoustic parameters.
  13. The song synthesis device according to claim 8, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    searching the speed-changed spectral envelope for a formant in the frequency band around 3 kHz, and recording the center frequency and amplitude of the formant;
    determining a strength coefficient of a boost filter and a center frequency to be enhanced from the center frequency and amplitude of the formant;
    performing formant enhancement according to the strength coefficient of the boost filter and the center frequency to be enhanced to obtain a formant-enhanced spectrum; and
    filtering the formant-enhanced spectrum to obtain the enhanced spectral envelope.
  14. The song synthesis device according to claim 8 or 10, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    generating a fundamental frequency of the song based on the pitch information, the singing durations and the speed-changed fundamental frequency;
    summing the fundamental frequency values in the initial acoustic parameters and computing their average to obtain an average fundamental frequency;
    raising or lowering the key of the song's fundamental frequency based on the average fundamental frequency to obtain an initial fundamental frequency sequence, the initial fundamental frequency sequence comprising pitches and notes;
    when detecting that one character in the initial fundamental frequency sequence corresponds to several different pitches, smoothing the notes corresponding to these pitches;
    when detecting a pitch change between adjacent notes in the initial fundamental frequency sequence, applying preparation and overshoot processing between the adjacent notes through a preset formula, the preset formula being
    kω²/(s² + 2ξωs + ω²)
    where s is the initial fundamental frequency sequence, ω is the natural frequency, ξ is the damping coefficient and k is the proportional gain;
    when detecting that the duration of a note in the initial fundamental frequency sequence exceeds a preset threshold, adding vibrato to the initial fundamental frequency sequence of the note; and
    when detecting over-smoothing of the notes in the initial fundamental frequency sequence, adding white noise to the initial fundamental frequency sequence to obtain the corrected fundamental frequency.
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring lyric recitation audio and score information of a target song, the score information comprising lyric pinyin text, beat information, rhythm information and pitch information;
    annotating durations of phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain recitation durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration;
    analyzing the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters comprising a fundamental frequency, a spectral envelope and an aperiodic sequence;
    extracting singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme comprising an initial singing duration and a final singing duration;
    performing speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, the target acoustic parameters comprising a speed-changed fundamental frequency, a speed-changed spectral envelope and a speed-changed aperiodic sequence;
    performing formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope;
    performing correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain a corrected fundamental frequency; and
    performing song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    parsing the score information, and reading the lyric pinyin text from the parsed score information;
    inputting the lyric recitation audio and the lyric pinyin text into the preset speech recognition model, and parsing the lyric recitation audio through the preset speech recognition model;
    labeling the phonemes in the parsed lyric recitation audio according to the lyric pinyin text through the preset speech recognition model to obtain timestamps and durations of the phonemes, the phonemes comprising initials and finals; and
    determining the recitation durations of the phonemes from the timestamps and the durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration.
  17. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    extracting a singing duration t of each character from the lyric pinyin text according to the rhythm information and the beat information;
    looking up an initial singing duration t1 of each character in the preset initial speed-change dictionary according to the singing duration t of the character;
    subtracting the initial singing duration t1 of each character from the singing duration t of the character to obtain a final singing duration t2 of the character, where t2 = t - t1; and
    setting the initial singing duration and the final singing duration of each character as the singing durations of the phonemes corresponding to the character.
  18. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    calculating a speed-change ratio r of a phoneme from the initial recitation duration, the final recitation duration, the initial singing duration and the final singing duration, with r > 0;
    performing speed-change processing on the initial acoustic parameters at the speed-change ratio r through the preset speed-change algorithm to obtain speed-changed acoustic parameters; and
    concatenating the speed-changed acoustic parameters in series to obtain the target acoustic parameters, the target acoustic parameters comprising the speed-changed fundamental frequency, the speed-changed spectral envelope and the speed-changed aperiodic sequence.
  19. The computer-readable storage medium according to claim 18, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    when r equals 1, taking the initial acoustic parameters as the speed-changed acoustic parameters;
    when r equals 2, stretching the initial acoustic parameters to twice their length to obtain the speed-changed acoustic parameters;
    when r is less than 2 and r is not equal to 1, performing speed-change processing on the initial acoustic parameters with a preset proportional frame addition/removal algorithm to obtain the speed-changed acoustic parameters; and
    when r is greater than 2, stretching the initial acoustic parameters to more than twice their length to obtain the speed-changed acoustic parameters.
  20. A song synthesis apparatus, wherein the song synthesis apparatus comprises:
    an acquisition module, configured to acquire lyric recitation audio and score information of a target song, the score information comprising lyric pinyin text, beat information, rhythm information and pitch information;
    an annotation module, configured to annotate durations of phonemes in the lyric recitation audio through a preset speech recognition model and the lyric pinyin text to obtain recitation durations of the phonemes, the recitation duration of a phoneme comprising an initial recitation duration and a final recitation duration;
    an analysis module, configured to analyze the lyric recitation audio through a preset vocoder to obtain initial acoustic parameters corresponding to the phonemes, the initial acoustic parameters comprising a fundamental frequency, a spectral envelope and an aperiodic sequence;
    an extraction module, configured to extract singing durations of the phonemes from the lyric pinyin text according to a preset initial speed-change dictionary, the rhythm information and the beat information, the singing duration of a phoneme comprising an initial singing duration and a final singing duration;
    a speed-change module, configured to perform speed-change processing on the initial acoustic parameters according to a preset speed-change algorithm, the recitation durations and the singing durations to obtain target acoustic parameters, the target acoustic parameters comprising a speed-changed fundamental frequency, a speed-changed spectral envelope and a speed-changed aperiodic sequence;
    an enhancement module, configured to perform formant enhancement on the speed-changed spectral envelope to obtain an enhanced spectral envelope;
    a correction module, configured to perform correction processing based on the pitch information, the singing durations and the speed-changed fundamental frequency to obtain a corrected fundamental frequency; and
    a synthesis module, configured to perform song synthesis on the speed-changed aperiodic sequence, the enhanced spectral envelope and the corrected fundamental frequency through the preset vocoder to obtain a synthesized song.
PCT/CN2020/131663 2020-04-28 2020-11-26 Song synthesis method, apparatus, device and storage medium WO2021218138A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010350256.0 2020-04-28
CN202010350256.0A CN111681637B (zh) 2020-04-28 2020-04-28 Song synthesis method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2021218138A1 true WO2021218138A1 (zh) 2021-11-04

Family

ID=72452279

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131663 WO2021218138A1 (zh) 2020-04-28 2020-11-26 歌曲合成方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111681637B (zh)
WO (1) WO2021218138A1 (zh)


Also Published As

Publication number Publication date
CN111681637A (zh) 2020-09-18
CN111681637B (zh) 2024-03-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933010

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933010

Country of ref document: EP

Kind code of ref document: A1