WO2020140390A1 - 颤音建模方法、装置、计算机设备及存储介质 - Google Patents

颤音建模方法、装置、计算机设备及存储介质 (Vibrato modeling method, apparatus, computer device, and storage medium) Download PDF

Info

Publication number
WO2020140390A1
Authority
WO
WIPO (PCT)
Prior art keywords
vibrato
fundamental frequency
segment
features
feature
Prior art date
Application number
PCT/CN2019/091093
Other languages
English (en)
French (fr)
Inventor
朱清影
程宁
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020140390A1 publication Critical patent/WO2020140390A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/02 Means for controlling the tone frequencies, e.g. attack or decay; means for producing special musical effects, e.g. vibratos or glissandos
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals using source filter models or psychoacoustic analysis
    • G10L 19/02 Analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Techniques characterised by the type of extracted parameters
    • G10L 25/21 Techniques in which the extracted parameters are power information
    • G10L 25/24 Techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 Techniques characterised by the analysis technique

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular, to a vibrato modeling method, device, computer equipment, and storage medium.
  • Embodiments of the present application provide a vibrato modeling method, device, computer equipment, and storage medium, which can retain vibrato characteristics to improve the naturalness of synthesized songs.
  • an embodiment of the present application provides a method for modeling vibrato.
  • the method includes:
  • Acquire song data of multiple songs, where the song data of each song includes a musical score marked with lyrics and an a cappella recording that matches the score; extract the linguistic features and musical features of the score; extract the acoustic features of the a cappella recording; extract the vibrato features of the a cappella recording according to the acoustic features; and, based on a Hidden Markov Model, train a vibrato model that takes the linguistic and musical features of the score as input and the acoustic and vibrato features of the a cappella recording as output.
  • an embodiment of the present invention provides a vibrato modeling apparatus, and the vibrato modeling apparatus includes a unit corresponding to the method described in the first aspect.
  • an embodiment of the present invention provides a computer device, where the computer device includes a memory and a processor connected to the memory;
  • the memory is used to store a computer program, and the processor is used to run the computer program stored in the memory to perform the method described in the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium that stores a computer program.
  • When the computer program is executed by a processor, the method according to the first aspect described above is implemented.
  • FIG. 1 is a schematic flowchart of a vibrato modeling method provided by an embodiment of the present application
  • FIG. 2 is tag pair data corresponding to a musical note provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of the fundamental frequency corresponding to a piece of a cappella recording provided by an embodiment of the present application
  • FIG. 4 is a schematic diagram of a sub-process of a vibrato modeling method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a sub-flow of a method for modeling vibrato provided by another embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a vibrato modeling method provided by another embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a vibrato modeling device provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a vibrato positioning unit provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a vibrato positioning unit provided by another embodiment of the present application.
  • FIG. 10 is a schematic block diagram of a vibrato modeling apparatus provided by another embodiment of the present application.
  • FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a vibrato modeling method provided by an embodiment of the present application. As shown in FIG. 1, the method includes S101-S105.
  • S101 Acquire song data of multiple songs, where the song data of each song includes a musical score marked with lyrics and an a cappella recording that matches the musical score.
  • The a cappella recording of each song is the a cappella recording corresponding to that song's musical score, that is, the score and the a cappella recording match each other.
  • The musical scores exist in the form of text files. Typically, when electronic scores are created with software such as MuseScore, the score files can be saved directly in a preset format such as MusicXML; files of this type are essentially text files with a special structure.
  • A musical score file usually includes the key, clef, time signature, tempo, display information (the display information relates only to how the notes are shown in the score-editing software, has nothing to do with model training, and can be ignored), multiple measures, and other information.
  • each bar includes multiple notes.
  • Each note includes information such as pitch, duration, voice, note type, and lyrics.
  • the pitch includes information such as scale and octave, and the lyrics include information such as syllable and text.
  • FIG. 2 shows the tag pair data corresponding to one musical note provided by an embodiment of the present application. Note that the Chinese characters in FIG. 2 are annotations that explain the meaning of the tag pair data for this note. As shown in FIG. 2, the <note> and </note> tags enclose all the information about the note.
  • The pitch information of the note is recorded between <pitch> and </pitch>, the duration between <duration> and </duration>, the voice (part) between <voice> and </voice>, the note type between <type> and </type>, and the display information between <stem> and </stem> (the display information relates only to how the note is shown in the score-editing software, is unrelated to model training, and can be ignored).
  • The lyric information of the note is recorded between <lyric> and </lyric>.
  • Within <pitch> and </pitch>, the scale step is recorded between <step> and </step> and the octave between <octave> and </octave>; within <lyric> and </lyric>, the syllable is recorded between <syllabic> and </syllabic> and the lyric text between <text> and </text>.
  • By searching for tag pairs, the specific keywords in each pair (such as pitch, step, and lyric) and their corresponding values can be read.
  • Step S102, that is, extracting the linguistic features and musical features of the score, includes: acquiring the tag pairs in the score file; and parsing the tag pairs to extract the values of the musical features and linguistic features that they contain.
  • The linguistic features include the pronunciation of the lyrics and the contextual relationships between the preceding and following text.
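  • As an illustration of the tag-pair parsing just described, the following Python sketch reads the note-level values from an uncompressed MusicXML file using the standard library's ElementTree. It assumes the elements carry no XML namespaces; the function name and field selection are illustrative only and are not part of the patent.

```python
# Minimal sketch: reading note-level tag pairs from a MusicXML score file.
# Assumes an uncompressed .musicxml/.xml file whose elements carry no namespaces.
import xml.etree.ElementTree as ET

def extract_note_features(score_path):
    """Return one dict of musical/linguistic values per <note> element."""
    root = ET.parse(score_path).getroot()
    notes = []
    for note in root.iter("note"):
        def text(path):
            element = note.find(path)
            return element.text if element is not None else None
        notes.append({
            "step": text("pitch/step"),          # scale step, e.g. "C"
            "octave": text("pitch/octave"),      # octave number
            "duration": text("duration"),        # note length in divisions
            "voice": text("voice"),              # voice (part) index
            "type": text("type"),                # note type, e.g. "quarter"
            "syllabic": text("lyric/syllabic"),  # syllable position within the word
            "text": text("lyric/text"),          # lyric text (linguistic feature)
        })
    return notes
```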
  • the acoustic characteristics include fundamental frequency and Mel spectrum coefficient.
  • Each frame corresponds to a set of features.
  • The frequency of the fundamental tone is called the fundamental frequency; the fundamental tone refers to the periodic vibration of the vocal cords during voiced sounds.
  • The fundamental frequency is the pure tone with the lowest frequency in speech, yet its amplitude is the largest, and it determines the pitch of the whole sound.
  • As a principal feature of speech, the fundamental frequency is widely used in speech coding, speech recognition, speech synthesis, and other fields.
  • There are many methods for extracting the fundamental frequency of the a cappella recording, which fall roughly into three classes: time-domain algorithms, such as the autocorrelation function (ACF) method and the average magnitude difference function (AMDF) method; frequency-domain algorithms, such as the cepstrum (CEP) method; and combined time-frequency algorithms, such as wavelet analysis. Other fundamental frequency extraction methods can also be used.
  • The extracted fundamental frequency of the a cappella recording forms a fundamental frequency sequence, with one value per frame.
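  • The patent does not prescribe a particular extraction algorithm; as one hedged illustration, a greatly simplified frame-wise autocorrelation (ACF) tracker could look like the sketch below. The frame length, hop size, search range, and silence threshold are assumptions chosen for the example, and practical extractors add voicing decisions and smoothing.

```python
import numpy as np

def f0_autocorrelation(signal, sr, frame_len=1024, hop=256, fmin=80.0, fmax=1000.0):
    """Simplified ACF pitch tracker: one F0 value (Hz) per frame, 0.0 for frames
    treated as silent, so the result is the per-frame fundamental frequency sequence."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    f0 = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        if np.max(np.abs(frame)) < 1e-3:                     # crude silence gate (assumed threshold)
            f0.append(0.0)
            continue
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0.append(sr / best_lag)
    return np.array(f0)
```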
  • Mel cepstral coefficients are cepstral parameters extracted in the Mel-scale frequency domain; the Mel scale describes the nonlinear frequency perception of the human ear.
  • The main steps for extracting Mel cepstral coefficients are: sample the speech signal of the a cappella recording to obtain a digital speech signal; pre-emphasize the digital speech signal; divide it into frames; apply a window to each frame; apply a fast Fourier transform to the windowed frames to obtain the frequency-domain signal; filter the frequency-domain signal with a bank of p triangular band-pass filters, each filter producing its own output; take the logarithm of each filter output to obtain p log energies; and apply a discrete cosine transform to the p log energies to obtain the p-order components of the Mel-frequency cepstral coefficients. The value of p is typically chosen in the range 22 to 26, although other suitable values can be used.
  • Pre-emphasis passes the digital speech signal of the a cappella recording through a high-pass filter. Its purpose is to boost the high-frequency part so that the spectrum of the signal becomes flat, removing the spectral tilt to compensate for the high-frequency components that are suppressed by the articulation system; it also reduces the effects of the vocal cords and lips during production.
  • The purpose of the triangular band-pass filters is to smooth the spectrum and eliminate the effect of harmonics, and they also reduce the amount of computation.
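  • A minimal sketch of the Mel-cepstrum pipeline described above, built from librosa and SciPy primitives (pre-emphasis, framing/windowing/FFT via the STFT, a bank of p triangular Mel filters, log energies, and a discrete cosine transform). The pre-emphasis coefficient 0.97, p = 24, and the FFT/hop sizes are assumptions for the example rather than values fixed by the patent.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mel_cepstral_coeffs(wav_path, p=24, n_fft=1024, hop=256):
    """Illustrative pipeline (parameter values are assumptions):
    pre-emphasis -> STFT power spectrum -> p triangular Mel filters -> log -> DCT."""
    y, sr = librosa.load(wav_path, sr=None)                      # sampled digital speech signal
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                   # pre-emphasis (high-pass)
    power_spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=p)   # p triangular band-pass filters
    log_energies = np.log(mel_fb @ power_spec + 1e-10)           # p log energies per frame
    return dct(log_energies, axis=0, norm="ortho")[:p]           # p-order Mel cepstral coefficients
```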
  • In this embodiment, vibrato is regarded as a small sinusoidal fluctuation superimposed on the fundamental frequency sequence, and the vibrato features include amplitude and frequency. Extracting the vibrato features of the a cappella recording according to the acoustic features includes extracting the vibrato features from the fundamental frequency in the acoustic features.
  • Step S104 includes: locating the vibrato segments in the fundamental frequency sequence corresponding to the fundamental frequency; and calculating the amplitude and frequency of each vibrato segment and using them as the vibrato feature of every frame in that segment.
  • The amplitude and frequency of the non-vibrato segments in the fundamental frequency also need to be set to zero in order to obtain the vibrato features of the a cappella recording. That is, the amplitude and frequency of each vibrato segment are calculated and used as its vibrato feature, while for non-vibrato segments of the a cappella recording, i.e. segments without vibrato or silent segments, the amplitude and frequency are set to zero.
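  • The patent specifies that each located vibrato segment is summarized by an amplitude and a frequency that are copied to every frame of the segment, but it does not fix the exact formulas. The sketch below is one hypothetical estimate (crest count divided by duration for the frequency, mean crest deviation from the segment average for the amplitude) and should be read as an assumption, not the patented computation.

```python
import numpy as np
from scipy.signal import find_peaks

def vibrato_amp_freq(f0_segment, frame_rate):
    """Hypothetical estimate, not the patented formula.
    f0_segment: F0 values of one detected vibrato segment (no zeros inside);
    frame_rate: number of F0 frames per second. Returns (amplitude, frequency)."""
    mean_f0 = f0_segment.mean()
    crests, _ = find_peaks(f0_segment)
    duration = len(f0_segment) / frame_rate                  # segment length in seconds
    freq = len(crests) / duration if duration > 0 else 0.0   # oscillations per second (Hz)
    amp = float(np.mean(np.abs(f0_segment[crests] - mean_f0))) if len(crests) else 0.0
    return amp, freq   # assigned to every frame of the segment; non-vibrato frames stay at 0, 0
```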
  • FIG. 3 is a schematic diagram of a fundamental frequency corresponding to a piece of a cappella recording provided by an embodiment of the present application.
  • In FIG. 3, the horizontal axis is time in units of 0.5 milliseconds, and the vertical axis is the fundamental frequency in cents.
  • The part circled by the ellipse is a vibrato segment; the method for locating vibrato segments in the fundamental frequency sequence is described in detail below. It can be seen that vibrato appears as a small sinusoidal fluctuation on the fundamental frequency sequence. For the other segments, which contain no vibrato or are silent, the amplitude and frequency are set to zero.
  • locating the vibrato segment in the fundamental frequency sequence corresponding to the fundamental frequency includes S401-S405.
  • S401: Detect whether the number of consecutive troughs or crests in the fundamental frequency sequence reaches a preset count.
  • Consecutive troughs or crests imply that the fundamental frequency sequence segment is uninterrupted; in some embodiments this can also be understood to mean that no frame in the segment has a frequency of 0. What is detected is whether the number of troughs or crests occurring in an uninterrupted fundamental frequency sequence segment reaches the preset count.
  • The number of consecutive troughs or crests may be exactly the preset count or may exceed it; either case counts as reaching the preset count. The preset count may be set to 5, or to another value.
  • If the number of consecutive troughs or crests in the fundamental frequency sequence reaches the preset count, step S402 is executed; otherwise, step S405 is executed.
  • S402: Acquire the corresponding fundamental frequency sequence segment and compute the average frequency within that segment.
  • The average frequency of the segment is obtained by taking the frequency of every frame in the segment, summing these frame frequencies, and dividing the sum by the total number of frames in the segment.
  • S403: For each of the preset number of transitions from trough to crest, or from crest to trough, detect whether the frequency at the trough is below the average frequency and the frequency at the crest is above the average frequency.
  • If this holds for every transition, step S404 is executed; otherwise, step S405 is executed.
  • S404: Determine that the fundamental frequency sequence segment is a vibrato segment. S405: Determine that the fundamental frequency sequence segment is a non-vibrato segment.
  • In this embodiment, it is determined whether the fundamental frequency in the sequence crosses the segment's average fundamental frequency from high to low, or from low to high, a preset number of consecutive times or more. If so, the fundamental frequency sequence segment is considered a vibrato segment; otherwise, it is considered a non-vibrato segment. As shown in FIG. 3, the fundamental frequency of the identified vibrato segment crosses the segment's average fundamental frequency from high to low a preset number of consecutive times or more.
  • In one embodiment, the preset count can be checked by counting how many times the fundamental frequency sequence goes continuously from a trough to a crest, or continuously from a crest to a trough, that is, by determining whether the number of such consecutive passes reaches the preset count.
  • Continuously going from trough to crest means that the fundamental frequency rises from one trough to a crest (and falls back to a trough), then immediately rises from that trough to another crest (and falls back again), and so on; what is checked is whether the number of such consecutive trough-to-crest passes reaches the preset count.
  • Note that the fundamental frequency is not interrupted during this process.
  • The preset count may be set to 5.
  • Counting continuous crest-to-trough passes is understood in the same way, except that what is counted is the number of times the fundamental frequency falls from a crest to a trough.
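  • Putting the FIG. 4 criterion together, the check on one uninterrupted fundamental frequency segment can be sketched as follows. SciPy's find_peaks stands in for the trough/crest detection, and the preset count of 5 is the example value mentioned above.

```python
import numpy as np
from scipy.signal import find_peaks

def is_vibrato_segment(f0_segment, min_count=5):
    """FIG. 4 style test: enough consecutive crests/troughs, with every crest above
    and every trough below the segment's average frequency (min_count=5 is the
    example preset count)."""
    if np.any(f0_segment <= 0):                      # an unvoiced frame interrupts the segment
        return False
    mean_f0 = f0_segment.mean()
    crests, _ = find_peaks(f0_segment)
    troughs, _ = find_peaks(-f0_segment)
    if min(len(crests), len(troughs)) < min_count:   # preset count not reached
        return False
    return bool(np.all(f0_segment[crests] > mean_f0) and
                np.all(f0_segment[troughs] < mean_f0))
```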
  • Because the frequency range of vibrato lies within a preset band in Hertz, for example 5 to 8 Hz, if a fundamental frequency sequence segment is a vibrato segment, then the fundamental frequency of that segment oscillates at some frequency f between 5 and 8 Hz.
  • locating the vibrato segment in the fundamental frequency sequence corresponding to the fundamental frequency includes S501-S508.
  • S501 Perform a short-time Fourier transform on the fundamental frequency sequence segment in the fundamental frequency sequence to obtain a power spectrum of the fundamental frequency sequence segment.
  • the power spectrum of the fundamental frequency is expressed as X(f, t), where f is the vibration frequency of the fundamental frequency and t is time. Then X(f, t) can be understood as the amplitude of the waveform with the frequency f (the fundamental frequency curve shown in FIG. 3) at time t.
  • S502: Regularize (normalize) the power spectrum to obtain a regularization function.
  • S503: Integrate the regularization function over the preset Hertz frequency band to obtain the power of the fundamental frequency sequence segment, where F L denotes the lowest frequency and F H the highest frequency of the preset band (for example, 5 Hz and 8 Hz).
  • S504: Calculate the slope change of the regularization function within the fundamental frequency sequence segment; the larger the slope change, the sharper the peak in the function values.
  • S505: Determine the vibrato probability from the calculated power and slope change of the fundamental frequency sequence segment; at time t the vibrato probability is the product of the slope change and the power, and the larger its value, the more likely the time point belongs to a vibrato segment.
  • S506: Judge whether the vibrato probability exceeds a preset value. If it does, the time point is considered to belong to a vibrato segment; if several consecutive points are judged to belong to a vibrato segment, the fundamental frequency sequence segment is a vibrato segment.
  • If the vibrato probability exceeds the preset value, step S507 is executed (the segment is determined to be a vibrato segment); if it does not, step S508 is executed (the segment is determined to be a non-vibrato segment).
  • the order of the steps of calculating the slope change and calculating the power is not specifically limited, and the power may be calculated before the slope change.
  • In this embodiment it can be understood that if, within a fundamental frequency sequence segment, the power of the preset band (for example, the 5 to 8 Hz band) is higher than that of the other bands, i.e. the amplitude of the waveform component at a frequency f between 5 and 8 Hz is clearly greater than the amplitudes of the other components, then the fundamental frequency sequence segment is considered a vibrato segment.
  • The judgment methods of the two embodiments of FIG. 4 and FIG. 5 can also be combined to locate vibrato segments in the fundamental frequency sequence: if the fundamental frequency crosses the segment's average fundamental frequency from high to low, or from low to high, a preset number of consecutive times or more, and the computed vibrato probability also exceeds the preset value, then the fundamental frequency sequence segment at that time point is confirmed to be a vibrato segment.
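  • A hedged sketch of the FIG. 5 approach applied to an F0 sequence sampled at frame_rate frames per second: a short-time Fourier transform, per-slice normalisation of the power spectrum, band power over the preset 5 to 8 Hz range as the power term, and a peak-sharpness term that merely stands in for the patent's slope-change measure, whose exact formula is not reproduced in this text. The window length and the sharpness proxy are assumptions.

```python
import numpy as np
from scipy.signal import stft

def vibrato_probability(f0_seq, frame_rate, band=(5.0, 8.0), win=128):
    """Returns a vibrato-probability-like score per STFT time slice:
    (band power of the normalised spectrum) * (peak sharpness in the band)."""
    freqs, times, X = stft(f0_seq, fs=frame_rate, nperseg=win)
    power = np.abs(X) ** 2
    norm = power / (power.sum(axis=0, keepdims=True) + 1e-10)   # regularised (normalised) spectrum
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_power = norm[in_band].sum(axis=0)                      # power term over the 5-8 Hz band
    slope = np.gradient(norm, freqs, axis=0)                    # d/df of the normalised spectrum
    sharpness = np.abs(slope[in_band]).sum(axis=0)              # stand-in for the slope-change term
    return band_power * sharpness                               # product, compared with a preset value
```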
  • Understandably, the linguistic features and musical features of the score are fed as input into the Hidden Markov Model, and the model is trained so that its output is the acoustic features and vibrato features of the a cappella recording.
  • The Hidden Markov Model obtained by training is used as the vibrato model.
  • The above method embodiment extracts the vibrato features of the a cappella recording and, based on the Hidden Markov Model, takes the linguistic and musical features of the song's score as input and the acoustic and vibrato features of the song's a cappella recording as output.
  • In this way, a vibrato model is obtained whose input is the linguistic and musical features of the score and whose output is the acoustic and vibrato features of the a cappella recording.
  • the above method embodiments implement the training of the vibrato model.
  • the vibrato model can effectively preserve the characteristics of vibrato to improve the naturalness of the synthesized song when synthesizing the song.
  • FIG. 6 is a schematic flowchart of a vibrato modeling method provided by another embodiment of the present application. As shown in FIG. 6, this embodiment includes steps S601-S609. The difference between this embodiment and the embodiment of FIG. 1 is that steps S606-S609 are added, and other steps S601-S605 correspond to steps S101-S105 of the embodiment of FIG. 1, respectively. The difference between this embodiment and the embodiment of FIG. 1 will be described in detail below.
  • S606: Extract the linguistic features and musical features of the musical score of the song to be synthesized, in the same way as described for step S102.
  • S607: Input the extracted linguistic features and musical features into the trained vibrato model, so as to obtain the acoustic features and vibrato features of an a cappella recording that matches the musical score of the song to be synthesized.
  • S608: Add the obtained vibrato features to the obtained acoustic features. First, each frame of the song to be synthesized is checked for silence: if a frame is silent, its fundamental frequency is left unchanged; if it is not silent, the obtained vibrato feature is added to the fundamental frequency. Specifically, adding the obtained vibrato feature to the fundamental frequency means adding a sine wave corresponding to the amplitude and frequency of the obtained vibrato feature, so as to simulate the vibrato.
  • Here "corresponding" is understood as "identical"; that is, a sine wave with the same amplitude and frequency as the obtained vibrato feature is added to the fundamental frequency to simulate the vibrato.
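  • A minimal sketch of step S608, assuming the model outputs a per-frame vibrato amplitude (in the same units as the fundamental frequency) and a per-frame vibrato rate in Hz: silent frames keep their fundamental frequency, and voiced frames get a sine wave of matching amplitude and frequency superimposed.

```python
import numpy as np

def add_vibrato(f0, amp, rate, frame_rate):
    """Assumes per-frame arrays for f0, vibrato amplitude and vibrato rate;
    frame_rate is the number of F0 frames per second. Returns the fundamental
    frequency contour with the predicted vibrato superimposed."""
    phase = 2 * np.pi * np.cumsum(rate) / frame_rate   # accumulated vibrato phase
    voiced = f0 > 0                                     # silent frames are left unchanged
    out = f0.astype(float).copy()
    out[voiced] += amp[voiced] * np.sin(phase[voiced])
    return out
```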
  • the acoustic feature added with the vibrato feature is input to the vocoder to synthesize the song.
  • The acoustic features input to the vocoder include the fundamental frequency with the vibrato feature added, and the Mel spectral coefficients.
  • A vocoder is a speech analysis and synthesis system based on a model of the speech signal. Only the model's characteristic parameters are used in transmission; during encoding and decoding, the model's parameter estimation and speech synthesis are applied. In other words, it is an encoder and decoder for analysing and synthesising speech.
  • This embodiment further uses the trained vibrato model to obtain the acoustic features and vibrato features of an a cappella recording that matches the score of the song to be synthesized, and adds the obtained vibrato features to the fundamental frequency of the acoustic features.
  • In this way, the acoustic features (including the fundamental frequency and Mel spectral coefficients) can express the vibrato, and when they are used to synthesize the song, the vibrato is well reflected, which significantly improves the naturalness of the generated song.
  • the device includes a unit corresponding to the above vibrato modeling method.
  • the device 70 includes a song data acquisition unit 701, a music feature extraction unit 702, an acoustic feature extraction unit 703, a vibrato feature extraction unit 704, and a model establishment unit 705.
  • the song data acquiring unit 701 is used to acquire song data of multiple songs, wherein the song data of each song includes a musical score marked with lyrics and an a cappella recording consistent with the musical score.
  • the music feature extraction unit 702 is used to extract linguistic features and music features of the music score.
  • the music feature extraction unit 702 includes a tag pair acquisition unit and an analysis unit.
  • the tag pair acquiring unit is used to acquire tag pairs in the music score file corresponding to the music score.
  • the parsing unit is used to parse the tag pair to extract the values of the music features and linguistic features corresponding to the tag pair.
  • An acoustic feature extraction unit 703 is used to extract the acoustic features of the a cappella recording.
  • The acoustic features include the fundamental frequency and the Mel spectral coefficients.
  • the vibrato feature extraction unit 704 is configured to extract the vibrato feature of the a cappella recording according to the acoustic feature. Specifically, the vibrato feature of the a cappella recording is extracted according to the fundamental frequency in the acoustic feature. Vibrato characteristics include amplitude and frequency.
  • the vibrato feature extraction unit 704 includes a vibrato positioning unit and a vibrato feature determination unit.
  • the vibrato positioning unit is used to locate the vibrato segment in the fundamental frequency sequence corresponding to the fundamental frequency.
  • the vibrato feature determination unit is configured to calculate the amplitude and frequency of the vibrato segment, and use the calculated amplitude and frequency of the vibrato segment as the vibrato feature of each frame in the vibrato segment.
  • the vibrato feature extraction unit 704 further includes a setting unit.
  • the setting unit is configured to set the amplitude and frequency of the non-vibrato segment in the fundamental frequency to zero. In this way, the vibrato characteristic of the a cappella recording is obtained.
  • The vibrato positioning unit 80 includes a number detection unit 801, a statistics unit 802, a frequency detection unit 803, and a first vibrato determination unit 804.
  • the number detection unit 801 is configured to detect whether the number of consecutive occurrences of troughs or peaks in the fundamental frequency sequence reaches a preset number.
  • the statistics unit 802 is configured to obtain the corresponding fundamental frequency sequence segment if the preset number of times is reached, and count the average frequency in the fundamental frequency sequence segment.
  • the frequency detection unit 803 is configured to detect whether the frequency corresponding to the valley is less than the average frequency and whether the frequency corresponding to the peak is greater than the average frequency in the process of changing the frequency from the trough to the peak or the frequency from the crest to the trough each time in the preset number of times.
  • the first vibrato determining unit 804 is configured to determine that the fundamental frequency sequence segment is a vibrato segment if the frequency corresponding to each trough is less than the average frequency and the frequency corresponding to the peak is greater than the average frequency.
  • The first vibrato determining unit 804 is also used to determine that the fundamental frequency sequence segment is a non-vibrato segment if the number of consecutive troughs or crests in the fundamental frequency sequence does not reach the preset count, or if the preset count is reached but it is not the case that every trough frequency is below the average frequency and every crest frequency is above the average frequency.
  • the vibrato positioning unit 90 includes a transformation unit 901, a regularization unit 902, a power calculation unit 903, a slope change calculation unit 904, a probability calculation unit 905, a probability judgment unit 906, and a second vibrato Determination unit 907.
  • the transformation unit 901 is configured to perform a short-time Fourier transform on the fundamental frequency sequence segment in the fundamental frequency sequence to obtain the power spectrum of the fundamental frequency sequence segment.
  • the regularization unit 902 is configured to regularize the power spectrum to obtain a regularization function.
  • the power calculation unit 903 is configured to calculate the integration of the regularization function in the preset Hertz frequency domain to obtain the power of the fundamental frequency sequence segment.
  • the slope change calculation unit 904 is configured to calculate the slope change of the regularization function in the fundamental frequency sequence segment.
  • the probability calculation unit 905 is configured to determine the vibrato probability according to the calculated power and slope changes of the fundamental frequency sequence segment.
  • the probability judging unit 906 is used to judge whether the vibrato probability exceeds a preset value.
  • The second vibrato determining unit 907 is configured to determine that the fundamental frequency sequence segment is a vibrato segment if the vibrato probability exceeds a preset value, and to determine that it is a non-vibrato segment if the vibrato probability does not exceed the preset value.
  • the model building unit 705 is used to train the vibrato model based on the hidden Markov model, taking the linguistic features and music features of the music score as inputs, and using the acoustic features and vibrato features of the a cappella recording as outputs.
  • the device includes a unit corresponding to the above-mentioned vibrato modeling method.
  • The device 100 includes a song data acquisition unit 101, a music feature extraction unit 102, an acoustic feature extraction unit 103, a vibrato feature extraction unit 104, a model establishment unit 105, a model use unit 106, a vibrato addition unit 107, and a synthesis unit 108.
  • this embodiment is different from the embodiment shown in FIG. 7 in that a model use unit 106, a vibrato addition unit 107, and a synthesis unit 108 are added.
  • the difference between this embodiment and the embodiment of FIG. 7 will be described below.
  • the other units correspond to the units shown in the embodiment of FIG. 7 and will not be repeated here.
  • the music feature extraction unit 102 is also used to extract the linguistic features and music features of the music score of the song to be synthesized.
  • The model using unit 106 is configured to input the extracted linguistic features and musical features into the trained vibrato model, so as to obtain the acoustic features and vibrato features of the a cappella recording that matches the musical score of the song to be synthesized.
  • the vibrato adding unit 107 is used to add the obtained vibrato feature to the obtained acoustic feature.
  • the synthesizing unit 108 is used to input the acoustic feature added with the vibrato feature to the vocoder to synthesize the song.
  • the above device can be implemented in the form of a computer program, and the computer program can be run on the computer device shown in FIG. 11.
  • The device is a terminal device, such as a mobile terminal, a PC, or an iPad.
  • the device 110 includes a processor 112, a memory, and a network interface 113 connected through a system bus 111, where the memory may include a non-volatile storage medium 114 and an internal memory 115.
  • the non-volatile storage medium 114 may store an operating system 1141 and a computer program 1142.
  • the computer program 1142 stored in the non-volatile storage medium is executed by the processor 112, the vibrato modeling method described above can be implemented.
  • the processor 112 is used to provide computing and control capabilities to support the operation of the entire device.
  • The internal memory 115 provides an environment for running the computer program stored in the non-volatile storage medium.
  • When that computer program is executed by the processor 112, it can cause the processor 112 to perform the vibrato modeling method described above.
  • The network interface 113 is used for network communication. Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the device to which the solution is applied; a specific device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the processor 112 is used to run a computer program stored in a memory to implement any embodiment of the aforementioned vibrato modeling method.
  • The processor 112 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • a person of ordinary skill in the art may understand that all or part of the processes in the method for implementing the foregoing embodiments may be completed by instructing relevant hardware through a computer program.
  • the computer program may be stored in a storage medium, which may be a computer-readable storage medium.
  • the computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
  • the present application also provides a storage medium.
  • the storage medium may be a computer-readable storage medium, and the computer-readable storage medium includes a non-volatile computer-readable storage medium.
  • the storage medium stores a computer program that, when executed by the processor, implements any embodiment of the aforementioned vibrato modeling method.
  • the storage medium may be various computer-readable storage media that can store program codes, such as a U disk, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, or an optical disk.
  • the disclosed device, device, and method may be implemented in other ways.
  • the device embodiments described above are only schematic, and the division of the units is only a division of logical functions, and there may be other divisions in actual implementation.
  • the specific working processes of the devices, devices, and units described above can refer to the corresponding processes in the foregoing method embodiments, and details are not described herein again.

Abstract

A vibrato modeling method, apparatus, computer device, and storage medium. The method includes: acquiring song data of multiple songs, where the song data of each song includes a musical score marked with lyrics and an a cappella recording that matches the score (S101); extracting the linguistic features and musical features of the score (S102); extracting the acoustic features of the a cappella recording (S103); extracting the vibrato features of the a cappella recording according to the acoustic features (S104); and, based on a Hidden Markov Model, training a vibrato model with the linguistic and musical features of the score as input and the acoustic and vibrato features of the a cappella recording as output (S105).

Description

颤音建模方法、装置、计算机设备及存储介质
本申请要求于2019年1月4日提交中国专利局、申请号为201910008576.5、发明名称为“颤音建模方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其涉及一种颤音建模方法、装置、计算机设备及存储介质。
背景技术
近年来,基于隐马尔可夫模型进行参数合成的歌曲合成方法在业界非常受关注。使用隐马尔可夫模型合成歌曲的最大优点是,在不需要庞大的歌唱数据库的情况下,可以有效的模拟不同的声音特征,歌唱风格,甚至是情绪。而颤音,作为一种重要的歌唱技巧,对合成歌曲的自然度有很大的影响。颤音在声学特征上的具体体现为基频上的小幅震动,颤音的具体时间点和强度因歌手而异。然而普通的隐马尔可夫模型会在训练和合成时平滑基频上的小幅度的起伏,如此会平滑掉颤音,导致合成的歌唱中并没有颤音的效果。
发明内容
本申请实施例提供一种颤音建模方法、装置、计算机设备及存储介质,可保留颤音特征,以提高合成歌曲的自然度。
第一方面,本申请实施例提供了一种颤音建模方法,该方法包括:
获取多首歌的歌曲数据,其中,每一首歌的歌曲数据包括一篇标有歌词的乐谱和一段与所述乐谱相符的清唱录音;提取所述乐谱的语言学特征和音乐特征;提取所述清唱录音的声学特征;根据所述声学特征提取所述清唱录音的颤音特征;基于隐马尔可夫模型,以所述乐谱的语言学特征和音乐特征为输入,以所述清唱录音的声学特征和颤音特征为输出,训练得到颤音模型。
第二方面,本发明实施例提供了一种颤音建模装置,该颤音建模装置包括用于执行上述第一方面所述的方法对应的单元。
第三方面,本发明实施例提供了一种计算机设备,所述计算机设备包括存储器,以及与所述存储器相连的处理器;
所述存储器用于存储计算机程序,所述处理器用于运行所述存储器中存储的计算机程序,以执行上述第一方面所述的方法。
第四方面,本发明实施例提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时,实现上述第一方面所述的方法。
附图说明
为了更清楚地说明本发明实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的颤音建模方法的流程示意图;
图2是本申请实施例提供的一个音符所对应的标签对数据;
图3是本申请实施例提供的一段清唱录音所对应的基频的示意图;
图4是本申请实施例提供的颤音建模方法的子流程示意图;
图5是本申请另一实施例提供的颤音建模方法的子流程示意图;
图6是本申请另一实施例提供的颤音建模方法的流程示意图;
图7是本申请实施例提供的颤音建模装置的示意性框图;
图8是本申请实施例提供的颤音定位单元的示意性框图;
图9是本申请另一实施例提供的颤音定位单元的示意性框图;
图10是本申请另一实施例提供的颤音建模装置的示意性框图;
图11是本申请实施例提供的计算机设备的示意性框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清 楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
图1是本申请实施例提供的颤音建模方法的流程示意图。如图1所示,该方法包括S101-S105。
S101,获取多首歌的歌曲数据,其中,每一首歌的歌曲数据包括一篇标有歌词的乐谱和一段与所述乐谱相符的清唱录音。
其中,每一首歌的清唱录音是与该乐谱对应的清唱录音,即乐谱和清唱录音是对应的。其中,乐谱是以文本文件的形式存在。通常情况下,用一些软件如MuseScore制作电子乐谱时,可以直接将乐谱文件保存为一些预设的格式文件如musicxml格式。这类的格式文件其本质是有特殊格式的文本文件。
S102,提取所述乐谱的语言学特征和音乐特征。
从整体来看,一个乐谱文件的结构如下:
●音调
●谱号
●拍号
●速度
●(一些显示信息)
●第一小节:
·音符
·音符
·音符
……
●第二小节:
·音符
……
……
一个乐谱文件中通常包括音调、谱号、拍号、速度、显示信息(显示信息与音符在乐谱编辑软件中的视觉显示有关,与模型的训练无关,可忽略)、多个小节等信息。其中,每个小节中包括多个音符。每个音符中包括音高、时长、声部、音符类型、歌词等信息。而音高包括音阶和八度等信息,歌词包括音节、文本等信息。
乐谱文件中以标签对的形式来标识乐谱文件的各种信息,如<></>即为一个 标签对。图2是本申请实施例提供的一个音符所对应的标签对数据。需要注意的是,图2中的汉字所对应的是注解,以方便理解该音符所对应的标签对数据的含义。如图2所示,<note>和</note>这个标签之间包含了关于这个音符所有信息。而其中的<pitch>和</pitch>之间记录了该音符的音高信息,<duration>和</duration>之间记录了该音符的时长信息,<voice>和</voice>之间记录了该音符的声部信息,<type>和</type>之间记录了该音符的音符类型,<stem>和</stem>之间记录了该音符的显示信息(显示信息与音符在乐谱编辑软件中的视觉显示有关,与模型的训练无关,可忽略),<lyric>和</lyric>之间记录了该音符的歌词信息。其中,音高信息<pitch>和</pitch>中的<step>和</step>之间记录了音高信息中的音阶,<octave>和</octave>之间记录了音高信息中的八度;<lyric>和</lyric>中的<syllabic>和</syllabic>之间记录了歌词的音节,<text>和</text>之间记录了歌词文本。通过寻找标签对,可以读取到每个标签对中具体的关键字(如pitch、step、lyric等关键字)和关键字对应值的信息。
在一个乐谱文件中,“音调”、“谱号”、“拍号”、“速度”、每个音符的“音高”、“时长”、“声部”和“音符类型”都是音乐特征,“歌词”则是文本特征,对应语言学特征。具体地,步骤S102,即提取所述乐谱的语言学特征和音乐特征,包括:获取乐谱文件中的便签对;解析标签对,以提取标签对中所对应的音乐特征以及语言学特征的值。其中,语言学特征包括歌词的发音,以及上下前后文的关系等。如此通过获取并解析乐谱文件中的标签对,提取乐谱中的语言学特征和音乐特征,达到提取特征的目的。
S103,提取所述清唱录音的声学特征。
其中,声学特征包括基频和梅尔频谱系数等。每一帧对应一组特征。
基音的频率称为基频,而基音就是指发浊音时声带振动的周期性,基频是语音中频率最低的纯音,但振幅确是最大的,决定整个音的升高。基频作为语音的主要特征,广泛应用于语音编码、语音识别、语音合成等领域。
清唱录音的基频特征的提取方法有很多种,大致可以分为三类:一,时域分析算法,如自相关法(ACF)、短时平均幅度差法(AMDF)等;二,频域分析算法,如倒谱法(CEP)等;三,时频结合的分析算法,如小波分析算法等。还可以使用其他的基频提取方法来提取清唱录音的基频。其中,提取出的 清唱录音的基频为基频序列。
梅尔倒谱系数是在Mel标度频率域提取出来的倒谱参数,Mel标度描述了人耳频率的非线性特性。提取梅尔倒谱系数的方法主要包括:
对清唱录音的语音信号进行采样处理以得到清唱录音的数字语音信号;将清唱录音的数字语音信号进行预加重处理;将预加重处理后的数字语音信号进行分帧处理;将分帧处理后的数字语音信号进行加窗处理;将加窗处理后的数字语音信号进行快速傅里叶变换以得到频域的声音信号;通过三角形带通滤波器组对频域的声音信号进行滤波以使三角形带通滤波器中的每个滤波器分别输出滤波结果,其中,该三角形带通滤波器包括p个滤波器;将每个滤波器输出的滤波结果分别取对数以得到声音信号的p个对数能量;将所得的p个对数能量进行离散余弦变化得到梅尔频率倒谱系数的p阶分量。其中,p可以在22-26范围内取值,也可以在其他合适的范围内取值。
其中,预加载处理就是将清唱录音的数字语音信号通过一个高通滤波器,预加重的目的是提升高频部分,使得清唱录音的数字语音信号的频谱变得平坦,移除频谱倾斜,来补偿语音信号受到发音系统所抑制的高频部分;同时,也是为了消除发生过程中声带和嘴唇的效应。使用三角带通滤波器的目的是对频谱进行平滑化,并消除谐波的作用,此外还可以减少运算量。
S104,根据所述声学特征提取所述清唱录音的颤音特征。
在本实施例中,认为颤音是出现在基频序列上小幅的正弦的波动,颤音特征包括振幅和频率。根据所述声学特征提取所述清唱录音的颤音特征,包括:根据所述声学特征中的基频来提取所述清唱录音的颤音特征。
在一实施例中,步骤S104,包括:定位所述基频所对应的基频序列中的颤音片段;计算所述颤音片段的振幅和频率,将计算出的所述颤音片段的振幅和频率作为所述颤音片段中每一帧的颤音特征。
在一实施例中,还需将所述基频中的非颤音片段的振幅和频率设置为零,如此,以得到所述清唱录音的颤音特征。可以理解地,计算颤音片段的振幅和频率,将计算出的振幅和频率作为颤音特征;对于清唱录音中的非颤音片段,即没有颤音或者静音的片段,设置振幅和频率都为零。
图3是本申请实施例提供的一段清唱录音所对应的基频的示意图。图3中 横轴是时间,单位为0.5毫秒;纵轴为基频,单位为森特。椭圆圈的部分即为颤音片段,具体地定位基频所对应的基频序列中的颤音片段的方法下文将具体介绍。其中,可以看出颤音是出现在基频序列上小幅的正弦的波动。其他没有颤音或者静音的片段,设置振幅和频率都为零。
在一实施例中,如图4所示,定位所述基频所对应的基频序列中的颤音片段,包括S401-S405。
S401,检测所述基频序列中连续出现波谷或者波峰的次数是否达到预设次数。
其中,连续出现波谷或者波峰意味着该基频序列片段没有间断,即该基频序列片段是不间断的,在一些实施例中,也可以理解为,该基频序列片段中没有频率为0的频率。检测在一个不间断的基频序列片段中出现波谷或者波峰的次数是否达到预设次数。其中,连续出现波谷或者波峰的次数可以刚好是预设次数,也可以超过预设次数。刚好是预设次数,或者超过预设次数,都认为达到预设次数。其中,预设次数可以设置为5次,也可以设置为其他的次数。
若所述基频序列中连续出现波谷或者波峰的次数达到预设次数,执行步骤S402;否则,执行步骤S405。
S402,获取所对应的基频序列片段,并统计该基频序列片段中的平均频率。
基频序列片段中的平均频率通过获取该基频序列片段中每一帧的频率,将该基频序列片段中每一帧的频率之和,除以该基频序列片段中的总帧数,以得到该基频序列片段中的平均频率。
S403,检测预设次数中每次频率由波谷到波峰或者频率由波峰到波谷的过程中,波谷所对应的频率是否小于平均频率,且波峰所对应的频率是否大于平均频率。
若每次波谷所对应的频率小于平均频率,且波峰所对应的频率大于平均频率,执行步骤S404;否则,执行步骤S405。
S404,确定该基频序列片段为颤音片段。
S405,确定该基频序列片段为非颤音片段。
即基频序列中连续出现波谷或者波峰的次数未达到预设次数;或者达到预设次数,但是并非每次波谷所对应的频率小于平均频率、且波峰所对应的频率 大于平均频率,那么确定该频率序列片段为非颤音片段。
该实施例中通过判断基频序列中的基频是否连续预设次数或者预设次数以上由高到低,或由低到高的穿过该基频序列片段的平均基频。若是,则认为是该基频序列片段为颤音片段,否则,认为该基频序列片段为非颤音片段。如图3所示,所确定颤音片段的频率是连续预设次数或者预设次数以上由高到低穿过该基频序列片段的平均基频。
在一实施例中,预设次数的判断可通过统计该基频序列中出现频率连续由波谷到波峰或者频率连续由波峰到波谷的次数,即判断基频序列中出现频率连续由波谷到波峰或者频率连续由波峰到波谷的次数是否达到预设次数。
其中,连续波谷到波峰理解为基频的频率从一个波谷到达波峰(再达到波谷),紧接着又从波谷到达一个波峰(再达到波谷),如此连续的从波谷到达波峰的次数是否达到预设次数。需要注意的是,在该过程中,基频的频率并未间断。其中,预设次数可以设置为5次。基频连续由高到低的理解也是一致,只是统计的是基频从波峰到达波谷的次数。
由于颤音的频率范围是在一个预设赫兹频域范围内,如预设赫兹频域范围是5到8赫兹,因此如果某一基频序列片段是颤音片段,那么该基频序列片段的基频以一个5到8赫兹之间的频率f在振动。在一实施例中,如图5所示,定位所述基频所对应的基频序列中的颤音片段,包括S501-S508。
S501,对所述基频序列中的基频序列片段做短时距傅里叶变换,得到基频序列片段的功率谱。
将基频的功率谱表示为X(f,t),其中,f为基频的震动频率,t为时间。则X(f,t)可以理解为是在时间点t时,频率为f的波形(如图3所示的基频曲线)的振幅。
S502,将所述功率谱进行正则化以得到正则化函数。
即对X(f,t)进行正则化以得到正则化函数
Figure PCTCN2019091093-appb-000001
其中,正则化公式为:
Figure PCTCN2019091093-appb-000002
S503,计算正则化函数在预设赫兹频域范围上的积分,以得到所述基频序列片段的功率。
若预设赫兹频域范围为5到8赫兹,对
Figure PCTCN2019091093-appb-000003
做5到8赫兹频域上的积分, 得到该基频序列片段的功率ψ v(t),其中,计算功率的公式如下:
Figure PCTCN2019091093-appb-000004
其中,F L表示预设赫兹频域范围中最低的频率,F H表示预设赫兹频域范围中最高的频率。
S504,计算正则化函数在所述基频序列片段中的斜率变化。
其中,斜率变化越大代表函数值的尖峰越明显。计算斜率变化的计算公式如下:
Figure PCTCN2019091093-appb-000005
其中,F L表示预设赫兹频域范围中最低的频率,F H表示预设赫兹频域范围中最高的频率。
S505,根据计算出的所述基频序列片段的功率和斜率变化确定颤音概率。
其中,在时间t时,根据计算出的所述基频序列片段的功率和斜率变化确定颤音概率P v(t)的计算公式为:P v(t)=S v(t)ψ v(t)。
P v(t)的值越大,说明该时间点t属于颤音片段的可能性越大。
S506,判断所述颤音概率是否超过预设数值。
若所述颤音概率超过预设数值,则认为该时间点属于颤音片段中的点。若连续多个点判断出来都是属于颤音片段,那么该基频序列片段则属于颤音片段。
若所述颤音概率超过预设数值,执行步骤S507;若所述颤音概率未超过预设数值,执行步骤S508。
S507,确定所述基频序列片段为颤音片段。
S508,确定所述基频序列片段为非颤音片段。
其中,需要注意的是,计算斜率变化和计算功率的步骤的先后顺序不做具体限定,也可以先计算功率再计算斜率变化。
该实施例中,可以理解为,若某一个基频序列片段中预设赫兹频域范围段(如5到8赫兹这个频率段)的功率高于其他频率段;频率为f的波形振幅要明显的大于5到8赫兹之间其他波形的振幅,那么认为该基频序列片段是颤音片段。
在一实施例中,也可以将以上图4和图5两个实施例中的判断方法结合起来以用来定位基频序列中的颤音片段,那么可以理解地,基频连续预设次数以上由高到低,或由低到高的穿过该片段的平均基频,且计算出的颤音概率超过预设数值,那么确认该时间点所在的基频序列片段是颤音片段。
S105,基于隐马尔可夫模型,以所述乐谱的语言学特征和音乐特征为输入,以所述清唱录音的声学特征和颤音特征为输出,训练得到颤音模型。
可以理解地,将所述乐谱的语言学特征和音乐特征作为输入,输入到隐马尔科夫模型中,训练该隐马尔科夫模型,以使得输出为所述清唱录音的声学特征和颤音特征。将训练得到的隐马尔科夫模型作为颤音模型。
以上方法实施例通过提取清唱录音的颤音特征,基于隐马尔可夫模型,将歌曲的乐谱的语言学特征和音乐特征作为输入,将该歌曲的清唱录音的声学特征和颤音特征作为输出,如此,得到输入为所述乐谱的语言学特征和音乐特征,输出为所述清唱录音的声学特征和颤音特征的颤音模型。以上方法实施例实现了颤音模型的训练。该颤音模型可有效的保留颤音特征,以在合成歌曲时提高合成歌曲的自然度。
图6是本申请另一实施例提供的颤音建模方法的示意流程图。如图6所示,该实施例包括步骤S601-S609。其中,该实施例与图1实施例的区别在于:增加了步骤S606-S609,其他步骤S601-S605与图1实施例的步骤S101-S105分别对应。下面将详细介绍该实施例与图1实施例的区别之处。
S606,提取待合成歌曲的乐谱的语言学特征和音乐特征。
其中,提取待合成歌曲的乐谱的语言学特征和音乐特征的方法请参看图1实施例中的提取乐谱的语言学特征和音乐特征的描述,在此不再赘述。
S607,将提取的语言学特征和音乐特征输入到训练好的颤音模型中,以得到与所述待合成歌曲的乐谱相符的清唱录音的声学特征和颤音特征。
S608,在所得到的声学特征中加入所得到的颤音特征。
首先判断待合成歌曲中的每一帧是否为静音;如果是静音,那么基频不变;如果不为静音,那么就在基频上加入所得到的颤音特征。如此,在基频上加入了颤音特征。具体地,在基频上加入所得到的颤音特征,包括:在基频上加入与所得到的颤音特征的振幅和频率对应的正弦波来模拟颤音。该处的对应理解为相同,即在基频上加入与所得到的颤音特征的振幅和频率相同的正弦波来模拟颤音。
S609,将加入了颤音特征的声学特征输入声码器,以合成歌曲。
输入到声码器中的声学特征包括:加入了颤音特征的基频和梅尔频谱系数。
声码器(vocoder),是语音信号某种模型的语音分析合成系统。在传输中只利用模型中的特征参数,在编译码时利用模型中的特征参数估计和语音合成的语音信号编译码器,一种对话音进行分析和合成的编、译码器。
该实施例进一步通过训练好的颤音模型来得到待合成歌曲的乐谱相符的清唱录音的声学特征和颤音特征,并将得到的颤音特征加入到声学特征中的基频中,如此,声学特征(包括基频和梅尔频谱系数)中可以表达颤音。在将加入了颤音特征的声学特征合成歌曲时,歌曲中可以很好的体现颤音,如此,使生成的歌曲的自然度显著提升。
图7是本申请实施例提供的颤音建模装置的示意性框图。如图7所示,该装置包括用于执行上述颤音建模方法所对应的单元。具体地,如图7所示,该装置70包括歌曲数据获取单元701、音乐特征提取单元702、声学特征提取单元703、颤音特征提取单元704以及模型建立单元705。
歌曲数据获取单元701,用于获取多首歌的歌曲数据,其中,每一首歌的歌曲数据包括一篇标有歌词的乐谱和一段与所述乐谱相符的清唱录音。
音乐特征提取单元702,用于提取所述乐谱的语言学特征和音乐特征。具体地,音乐特征提取单元702,包括标签对获取单元、解析单元。其中,标签对获取单元,用于获取所述乐谱所对应的乐谱文件中的标签对。解析单元,用于解析所述标签对,以提取所述标签对中所对应的音乐特征以及语言学特征的值。
声学特征提取单元703,用于提取所述清唱录音的声学特征。其中,所述声学特征包括基频和梅尔频谱倒数。
颤音特征提取单元704,用于根据所述声学特征提取所述清唱录音的颤音特征。具体地,根据所述声学特征中的基频提取所述清唱录音的颤音特征。颤音特征包括振幅和频率。
在一实施例中,颤音特征提取单元704,包括:颤音定位单元、颤音特征确定单元。其中,颤音定位单元,用于定位所述基频所对应的基频序列中的颤音片段。颤音特征确定单元,用于计算所述颤音片段的振幅和频率,将计算出的所述颤音片段的振幅和频率作为所述颤音片段中每一帧的颤音特征。
在一实施例中,颤音特征提取单元704还包括设置单元。所述设置单元, 用于将所述基频中的非颤音片段的振幅和频率设置为零。如此,以得到所述清唱录音的颤音特征。
在一实施例中,如图8所示,颤音定位单元80包括次数检测单元801、统计单元802、频率检测单元803以及第一颤音确定单元804。其中,次数检测单元801,用于检测所述基频序列中连续出现波谷或者波峰的次数是否达到预设次数。统计单元802,用于若达到预设次数,获取所对应的基频序列片段,并统计该基频序列片段中的平均频率。频率检测单元803,用于检测预设次数中每次频率由波谷到波峰或者频率由波峰到波谷的过程中,波谷所对应的频率是否小于平均频率,且波峰所对应的频率是否大于平均频率。第一颤音确定单元804,用于若每次波谷所对应的频率小于平均频率,且波峰所对应的频率大于平均频率,确定该基频序列片段为颤音片段。第一颤音确定单元804,还用于若基频序列中连续出现波谷或者波峰的次数未达到预设次数;或者达到预设次数,但是并非每次波谷所对应的频率小于平均频率、且波峰所对应的频率大于平均频率,确定该基频序列片段为非颤音片段。
在一实施例中,如图9所示,颤音定位单元90包括变换单元901、正则化单元902、功率计算单元903、斜率变化计算单元904、概率计算单元905、概率判断单元906以及第二颤音确定单元907。其中,变换单元901,用于对所述基频序列中的基频序列片段做短时距傅里叶变换,得到基频序列片段的功率谱。正则化单元902,用于将所述功率谱进行正则化以得到正则化函数。功率计算单元903,用于计算正则化函数在预设赫兹频域范围上的积分,以得到所述基频序列片段的功率。斜率变化计算单元904,用于计算正则化函数在所述基频序列片段中的斜率变化。概率计算单元905,用于根据计算出的所述基频序列片段的功率和斜率变化确定颤音概率。概率判断单元906,用于判断所述颤音概率是否超过预设数值。第二颤音确定单元907,用于若所述颤音概率超过预设数值,确定所述基频序列片段为颤音片段;若所述颤音概率未超过预设数值,确定所述基频序列片段为非颤音片段。
模型建立单元705,用于基于隐马尔可夫模型,以所述乐谱的语言学特征和音乐特征为输入,以所述清唱录音的声学特征和颤音特征为输出,训练得到颤音模型。
图10本申请实施例提供的颤音建模装置的示意性框图。如图10所示,该装置包括用于执行上述颤音建模方法所对应的单元。具体地,如图10所示,该装置100包括歌曲数据获取单元101、音乐特征提取单元102、声学特征提取单元103、颤音特征提取单元104、模型建立单元105、模型使用单元106、颤音加入单元107以及合成单元108。其中,该实施例与图7所示的实施例的不同之处在于:增加了模型使用单元106、颤音加入单元107以及合成单元108。下面将描述该实施例与图7实施例的不同之处,其他的单元与图7实施例所示的单元分别对应,在此不再赘述。
音乐特征提取单元102,还用于提取待合成歌曲的乐谱的语言学特征和音乐特征。
模型使用单元106,用于将提取的语言学特征和音乐特征输入到训练好的颤音模型中,以得到与所述待合成歌曲的乐谱相符的清唱录音的声学特征和颤音特征。
颤音加入单元107,用于在所得到的声学特征中加入所得到的颤音特征。
合成单元108,用于将加入了颤音特征的声学特征输入声码器,以合成歌曲。
需要说明的是,所属领域的技术人员可以清楚地了解到,上述装置和各单元的具体实现过程,可以参考前述方法实施例中的相应描述,为了描述的方便和简洁,在此不再赘述。
上述装置可以实现为一种计算机程序的形式,计算机程序可以在如图11所示的计算机设备上运行。
图11为本申请实施例提供的一种计算机设备的示意性框图。该设备为终端等设备,如移动终端、PC终端、IPad等。该设备110包括通过系统总线111连接的处理器112、存储器和网络接口113,其中,存储器可以包括非易失性存储介质114和内存储器115。
该非易失性存储介质114可存储操作系统1141和计算机程序1142。该非易失性存储介质中所存储的计算机程序1142被处理器112执行时,可实现上述所述的颤音建模方法。该处理器112用于提供计算和控制能力,支撑整个设备的运行。该内存储器115为非易失性存储介质中的计算机程序的运行提供环境, 该计算机程序被处理器112执行时,可使得处理器112执行上述所述的颤音建模方法。该网络接口113用于进行网络通信。本领域技术人员可以理解,图11中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的设备的限定,具体的设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
其中,所述处理器112用于运行存储在存储器中的计算机程序,以实现前述的颤音建模方法的任一实施例。
应当理解,在本申请实施例中,所称处理器112可以是中央处理单元(Central Processing Unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(应用程序lication Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
本领域普通技术人员可以理解的是实现上述实施例的方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成。该计算机程序可存储于一存储介质中,该存储介质可以为计算机可读存储介质。该计算机程序被该计算机系统中的至少一个处理器执行,以实现上述方法的实施例的流程步骤。
因此,本申请还提供了一种存储介质。该存储介质可以为计算机可读存储介质,该计算机可读存储介质包括非易失性计算机可读存储介质。该存储介质存储有计算机程序,该计算机程序当被处理器执行时实现前述的颤音建模方法的任一实施例。
所述存储介质可以是U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的计算机可读存储介质。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置、设备和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的装置、设备和单元的具体工作过程,可以参考前述方法实施例中的对 应过程,在此不再赘述。以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种颤音建模方法,包括:
    获取多首歌的歌曲数据,其中,每一首歌的歌曲数据包括一篇标有歌词的乐谱和一段与所述乐谱相符的清唱录音;
    提取所述乐谱的语言学特征和音乐特征;
    提取所述清唱录音的声学特征;
    根据所述声学特征提取所述清唱录音的颤音特征;
    基于隐马尔可夫模型,以所述乐谱的语言学特征和音乐特征为输入,以所述清唱录音的声学特征和颤音特征为输出,训练得到颤音模型。
  2. 根据权利要求1所述的方法,其中,所述方法还包括:
    提取待合成歌曲的乐谱的语言学特征和音乐特征;
    将提取的语言学特征和音乐特征输入到训练好的颤音模型中,以得到与所述待合成歌曲的乐谱相符的清唱录音的声学特征和颤音特征;
    在所得到的声学特征中加入所得到的颤音特征;
    将加入了颤音特征的声学特征输入声码器,以合成歌曲。
  3. 根据权利要求1所述的方法,其中,所述声学特征包括基频,所述根据所述声学特征提取所述清唱录音的颤音特征,包括:
    定位所述基频所对应的基频序列中的颤音片段;
    计算所述颤音片段的振幅和频率,将计算出的所述颤音片段的振幅和频率作为所述颤音片段中每一帧的颤音特征。
  4. 根据权利要求3所述的方法,其中,所述定位所述基频所对应的基频序列中的颤音片段,包括:
    检测所述基频序列中连续出现波谷或者波峰的次数是否达到预设次数;
    若达到预设次数,获取所对应的基频序列片段,并统计该基频序列片段中的平均频率;
    检测预设次数中每次频率由波谷到波峰或者频率由波峰到波谷的过程中,波谷所对应的频率是否小于平均频率,且波峰所对应的频率是否大于平均频率;
    若每次波谷所对应的频率小于平均频率,且波峰所对应的频率大于平均频率,确定该基频序列片段为颤音片段;否则,确定该基频序列片段为非颤音片 段。
  5. 根据权利要求3所述的方法,其中,所述定位所述基频所对应的基频序列中的颤音片段,包括:
    对所述基频序列中的基频序列片段做短时距傅里叶变换,得到基频序列片段的功率谱;
    将所述功率谱进行正则化以得到正则化函数;
    计算正则化函数在预设赫兹频域范围上的积分,以得到所述基频序列片段的功率;
    计算正则化函数在所述基频序列片段中的斜率变化;
    根据计算出的所述基频序列片段的功率和斜率变化确定颤音概率;
    判断所述颤音概率是否超过预设数值;
    若所述颤音概率超过预设数值,确定所述基频序列片段为颤音片段;否则,确定所述基频序列片段为非颤音片段。
  6. 根据权利要求1所述的方法,其中,所述提取所述乐谱的语言学特征和音乐特征,包括:
    获取所述乐谱所对应的乐谱文件中的标签对;
    解析所述标签对,以提取所述标签对中所对应的音乐特征以及语言学特征的值。
  7. 根据权利要求1所述的方法,其中,所述声学特征包括梅尔频谱系数,所述提取所述清唱录音的声学特征,包括:
    对清唱录音的语音信号进行采样处理以得到清唱录音的数字语音信号;
    将清唱录音的数字语音信号进行预加重处理;
    将预加重处理后的数字语音信号进行分帧处理;
    将分帧处理后的数字语音信号进行加窗处理;
    将加窗处理后的数字语音信号进行快速傅里叶变换以得到频域的声音信号;
    通过三角形带通滤波器组对频域的声音信号进行滤波以使三角形带通滤波器中的每个滤波器分别输出滤波结果,其中,该三角形带通滤波器包括p个滤波器;
    将每个滤波器输出的滤波结果分别取对数以得到声音信号的p个对数能量;
    将所得的p个对数能量进行离散余弦变化得到梅尔频率倒谱系数的p阶分量。
  8. 根据权利要求2所述的方法,其中,所得到的声学特征中包括基频,所述在所得到的声学特征中加入所得到的颤音特征,包括:
    判断所述待合成歌曲中的每一帧是否为静音;
    若为静音,则保持所得到的声学特征中的基频不变;
    若不为静音,在所得到的声学特征中的基频上加入所对应的颤音特征。
  9. 根据权利要求8所述的方法,其中,所述在所得到的声学特征中的基频上加入所得到的颤音特征,包括:
    在所得到的声学特征中的基频上加入与所述颤音特征的振幅和频率对应的正弦波。
  10. 一种颤音建模装置,其中,所述颤音建模装置包括:
    歌曲数据获取单元,用于获取多首歌的歌曲数据,其中,每一首歌的歌曲数据包括一篇标有歌词的乐谱和一段与所述乐谱相符的清唱录音;
    音乐特征提取单元,用于提取所述乐谱的语言学特征和音乐特征;
    声学特征提取单元,用于提取所述清唱录音的声学特征;
    颤音特征提取单元,用于根据所述声学特征提取所述清唱录音的颤音特征;
    模型建立单元,用于基于隐马尔可夫模型,以所述乐谱的语言学特征和音乐特征为输入,以所述清唱录音的声学特征和颤音特征为输出,训练得到颤音模型。
  11. 一种计算机设备,所述计算机设备包括存储器,以及与所述存储器相连的处理器;其中,所述存储器用于存储计算机程序;所述处理器用于运行所述存储器中存储的计算机程序,以执行如下步骤:
    获取多首歌的歌曲数据,其中,每一首歌的歌曲数据包括一篇标有歌词的乐谱和一段与所述乐谱相符的清唱录音;
    提取所述乐谱的语言学特征和音乐特征;
    提取所述清唱录音的声学特征;
    根据所述声学特征提取所述清唱录音的颤音特征;
    基于隐马尔可夫模型,以所述乐谱的语言学特征和音乐特征为输入,以所述清唱录音的声学特征和颤音特征为输出,训练得到颤音模型。
  12. 根据权利要求11所述的计算机设备,其中,所述处理器还执行如下步骤:
    提取待合成歌曲的乐谱的语言学特征和音乐特征;
    将提取的语言学特征和音乐特征输入到训练好的颤音模型中,以得到与所述待合成歌曲的乐谱相符的清唱录音的声学特征和颤音特征;
    在所得到的声学特征中加入所得到的颤音特征;
    将加入了颤音特征的声学特征输入声码器,以合成歌曲。
  13. 根据权利要求11所述的计算机设备,其中,所述声学特征包括基频,所述处理器在执行所述根据所述声学特征提取所述清唱录音的颤音特征的步骤时,具体执行如下步骤:
    定位所述基频所对应的基频序列中的颤音片段;
    计算所述颤音片段的振幅和频率,将计算出的所述颤音片段的振幅和频率作为所述颤音片段中每一帧的颤音特征。
  14. 根据权利要求13所述的计算机设备,其中,所述处理器在执行所述定位所述基频所对应的基频序列中的颤音片段的步骤时,具体执行如下步骤:
    检测所述基频序列中连续出现波谷或者波峰的次数是否达到预设次数;
    若达到预设次数,获取所对应的基频序列片段,并统计该基频序列片段中的平均频率;
    检测预设次数中每次频率由波谷到波峰或者频率由波峰到波谷的过程中,波谷所对应的频率是否小于平均频率,且波峰所对应的频率是否大于平均频率;
    若每次波谷所对应的频率小于平均频率,且波峰所对应的频率大于平均频率,确定该基频序列片段为颤音片段;否则,确定该基频序列片段为非颤音片段。
  15. 根据权利要求13所述的计算机设备,其中,所述处理器在执行所述定位所述基频所对应的基频序列中的颤音片段的步骤时,具体执行如下步骤:
    对所述基频序列中的基频序列片段做短时距傅里叶变换,得到基频序列片 段的功率谱;
    将所述功率谱进行正则化以得到正则化函数;
    计算正则化函数在预设赫兹频域范围上的积分,以得到所述基频序列片段的功率;
    计算正则化函数在所述基频序列片段中的斜率变化;
    根据计算出的所述基频序列片段的功率和斜率变化确定颤音概率;
    判断所述颤音概率是否超过预设数值;
    若所述颤音概率超过预设数值,确定所述基频序列片段为颤音片段;否则,确定所述基频序列片段为非颤音片段。
  16. 根据权利要求11所述的计算机设备,其中,所述处理器在执行所述提取所述乐谱的语言学特征和音乐特征的步骤时,具体执行如下步骤:
    获取所述乐谱所对应的乐谱文件中的标签对;
    解析所述标签对,以提取所述标签对中所对应的音乐特征以及语言学特征的值。
  17. 根据权利要求11所述的方法,其中,所述声学特征包括梅尔频谱系数,所述处理器在执行所述提取所述清唱录音的声学特征的步骤时,具体执行如下步骤:
    对清唱录音的语音信号进行采样处理以得到清唱录音的数字语音信号;
    将清唱录音的数字语音信号进行预加重处理;
    将预加重处理后的数字语音信号进行分帧处理;
    将分帧处理后的数字语音信号进行加窗处理;
    将加窗处理后的数字语音信号进行快速傅里叶变换以得到频域的声音信号;
    通过三角形带通滤波器组对频域的声音信号进行滤波以使三角形带通滤波器中的每个滤波器分别输出滤波结果,其中,该三角形带通滤波器包括p个滤波器;
    将每个滤波器输出的滤波结果分别取对数以得到声音信号的p个对数能量;
    将所得的p个对数能量进行离散余弦变化得到梅尔频率倒谱系数的p阶分 量。
  18. 根据权利要求12所述的计算机设备,其中,所得到的声学特征中包括基频,所述处理器在执行所述在所得到的声学特征中加入所得到的颤音特征的步骤时,具体执行如下步骤:
    判断所述待合成歌曲中的每一帧是否为静音;
    若为静音,则保持所得到的声学特征中的基频不变;
    若不为静音,在所得到的声学特征中的基频上加入所对应的颤音特征。
  19. 根据权利要求18所述的计算机设备,其中,所述处理器在执行所述在所得到的声学特征中的基频上加入所得到的颤音特征的步骤时,具体执行如下步骤:
    在所得到的声学特征中的基频上加入与所述颤音特征的振幅和频率对应的正弦波。
  20. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时,实现如权利要求1-9任一项所述的方法。
PCT/CN2019/091093 2019-01-04 2019-06-13 颤音建模方法、装置、计算机设备及存储介质 WO2020140390A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910008576.5A CN109817191B (zh) 2019-01-04 2019-01-04 颤音建模方法、装置、计算机设备及存储介质
CN201910008576.5 2019-01-04

Publications (1)

Publication Number Publication Date
WO2020140390A1 true WO2020140390A1 (zh) 2020-07-09

Family

ID=66604030

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091093 WO2020140390A1 (zh) 2019-01-04 2019-06-13 颤音建模方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN109817191B (zh)
WO (1) WO2020140390A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817191B (zh) * 2019-01-04 2023-06-06 平安科技(深圳)有限公司 颤音建模方法、装置、计算机设备及存储介质
CN109979429A (zh) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 一种tts的方法及系统
CN110364140B (zh) * 2019-06-11 2024-02-06 平安科技(深圳)有限公司 歌声合成模型的训练方法、装置、计算机设备以及存储介质
CN110738980A (zh) * 2019-09-16 2020-01-31 平安科技(深圳)有限公司 歌声合成模型的训练方法、系统及歌声合成方法
CN110867194B (zh) * 2019-11-05 2022-05-17 腾讯音乐娱乐科技(深圳)有限公司 音频的评分方法、装置、设备及存储介质
CN113780811B (zh) * 2021-09-10 2023-12-26 平安科技(深圳)有限公司 乐器演奏评估方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090306987A1 (en) * 2008-05-28 2009-12-10 National Institute Of Advanced Industrial Science And Technology Singing synthesis parameter data estimation system
US20090314155A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Synthesized singing voice waveform generator
CN106898340A (zh) * 2017-03-30 2017-06-27 腾讯音乐娱乐(深圳)有限公司 一种歌曲的合成方法及终端
CN106971703A (zh) * 2017-03-17 2017-07-21 西北师范大学 一种基于hmm的歌曲合成方法及装置
CN109817191A (zh) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 颤音建模方法、装置、计算机设备及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6304846B1 (en) * 1997-10-22 2001-10-16 Texas Instruments Incorporated Singing voice synthesis
TWI377557B (en) * 2008-12-12 2012-11-21 Univ Nat Taiwan Science Tech Apparatus and method for correcting a singing voice
CN105825868B (zh) * 2016-05-30 2019-11-12 福州大学 一种演唱者有效音域的提取方法
JP2017107228A (ja) * 2017-02-20 2017-06-15 株式会社テクノスピーチ 歌声合成装置および歌声合成方法
CN107680582B (zh) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 声学模型训练方法、语音识别方法、装置、设备及介质
CN108492817B (zh) * 2018-02-11 2020-11-10 北京光年无限科技有限公司 一种基于虚拟偶像的歌曲数据处理方法及演唱交互系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090306987A1 (en) * 2008-05-28 2009-12-10 National Institute Of Advanced Industrial Science And Technology Singing synthesis parameter data estimation system
US20090314155A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Synthesized singing voice waveform generator
CN106971703A (zh) * 2017-03-17 2017-07-21 西北师范大学 一种基于hmm的歌曲合成方法及装置
CN106898340A (zh) * 2017-03-30 2017-06-27 腾讯音乐娱乐(深圳)有限公司 一种歌曲的合成方法及终端
CN109817191A (zh) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 颤音建模方法、装置、计算机设备及存储介质

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HONO YUKIYA ET AL: "Recent Development of the DNN-based Singing Voice Synthesis System — Sinsy", 2018 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 15 November 2018 (2018-11-15), pages 1003 - 1009, XP033526148, DOI: 10.23919/APSIPA.2018.8659797 *
LI, XIAN: "Statistical Model Based Mandarin Chinese Singing Voice Synthesis", CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, 15 March 2016 (2016-03-15), XP055716400 *
XIAN LI ET AL: "A HMM-based mandarin chinese singing voice synthesis system", IEEE/CAA JOURNAL OF AUTOMATICA SINICA, vol. 3, no. 2, 10 April 2016 (2016-04-10), pages 192 - 202, XP011606125, ISSN: 2329-9266, DOI: 10.1109/JAS.2016.7451107 *

Also Published As

Publication number Publication date
CN109817191B (zh) 2023-06-06
CN109817191A (zh) 2019-05-28

Similar Documents

Publication Publication Date Title
WO2020140390A1 (zh) 颤音建模方法、装置、计算机设备及存储介质
WO2021218138A1 (zh) 歌曲合成方法、装置、设备及存储介质
US8193436B2 (en) Segmenting a humming signal into musical notes
Yang et al. BaNa: A noise resilient fundamental frequency detection algorithm for speech and music
Zhang et al. An overview of speech endpoint detection algorithms
Al-Shoshan Speech and music classification and separation: a review
Nwe et al. Singing voice detection in popular music
Toh et al. Multiple-Feature Fusion Based Onset Detection for Solo Singing Voice.
Dubuisson et al. On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination
Yang Computational modelling and analysis of vibrato and portamento in expressive music performance
Stowell Making music through real-time voice timbre analysis: machine learning and timbral control
CN106970950B (zh) 相似音频数据的查找方法及装置
Ryynänen Singing transcription
Shenoy et al. Singing voice detection for karaoke application
Parlak et al. Harmonic differences method for robust fundamental frequency detection in wideband and narrowband speech signals
CN115050387A (zh) 一种艺术测评中多维度唱奏分析测评方法及系统
Sahoo et al. Analyzing the vocal tract characteristics for out-of-breath speech
CN111681674B (zh) 一种基于朴素贝叶斯模型的乐器种类识别方法和系统
Loscos et al. The Wahwactor: A Voice Controlled Wah-Wah Pedal.
CN113129923A (zh) 一种艺术测评中多维度唱奏分析测评方法及系统
Wang et al. Beijing opera synthesis based on straight algorithm and deep learning
Danayi et al. A novel algorithm based on time-frequency analysis for extracting melody from human whistling
Ingale et al. Singing voice separation using mono-channel mask
Jensen Perceptual and physical aspects of musical sounds
Marxer et al. Modelling and separation of singing voice breathiness in polyphonic mixtures

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19907962

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19907962

Country of ref document: EP

Kind code of ref document: A1