WO2014101168A1 - 一种实现语音歌唱化的方法和装置 - Google Patents

一种实现语音歌唱化的方法和装置 (Method and apparatus for converting speech into singing)

Info

Publication number
WO2014101168A1
WO2014101168A1 (PCT/CN2012/087999)
Authority
WO
WIPO (PCT)
Prior art keywords
unit
basic
speech
fundamental frequency
segment
Prior art date
Application number
PCT/CN2012/087999
Other languages
English (en)
French (fr)
Inventor
孙见青
凌震华
江源
何婷婷
胡国平
胡郁
刘庆峰
Original Assignee
安徽科大讯飞信息科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 安徽科大讯飞信息科技股份有限公司 filed Critical 安徽科大讯飞信息科技股份有限公司
Publication of WO2014101168A1 publication Critical patent/WO2014101168A1/zh

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing

Definitions

  • The present invention relates to the field of speech signal processing, and in particular to a method and apparatus for converting speech into singing.
  • In recent years, singing synthesis systems, which convert text data input by users into singing voice, have been widely studied and applied.
  • Building a singing synthesis system first requires recording a large amount of song data, including voice data and musical-score data, to provide the speech segments needed by the synthesis system or to train reliable model parameters.
  • Because recording song data is expensive, a singing synthesis system usually records data from only one particular speaker, so the synthesized singing is limited to that speaker's timbre; this is unsuitable for personalized customization and cannot reproduce a specific timbre, in particular the user's own voice.
  • To address this problem, a singing synthesis method was developed in the prior art in which a device receives voice data spoken by the user in a normal speaking style, and the system optimizes the voice data according to a preset musical score to synthesize a song. This approach preserves the timbre of the user's voice and enables personalized synthesis.
  • The specific operations include: (1) the system receives the user's spoken-style input of the lyrics; (2) the speech signal is divided by manual segmentation into independent phoneme-based speech segments; (3) the correspondence between each phoneme unit and the notes of the score is determined according to the score annotation; (4) the system extracts acoustic spectrum features, fundamental frequency features, and so on from the speech segment of each phoneme unit;
  • (5) the system determines the fundamental frequency (F0) parameters and the duration of the target song according to the score annotation, and adjusts the fundamental frequency and duration of each phoneme unit accordingly; (6) the system synthesizes the singing voice output from the acoustic spectrum features and the prosodic features (such as fundamental frequency and duration) of each phoneme unit.
  • Although this prior art converts a spoken-style speech signal into a singing style, it has the following problems:
  • First, the scheme can only convert spoken input of the lyrics that correspond to the score. That is, the user can only speak the lyrics of the specified song; spoken input of arbitrary length and arbitrary content cannot be converted into a song, which limits how the method can be applied and reduces its entertainment value.
  • Second, the scheme relies on manual segmentation to split the continuous spoken-style speech signal and to align it with the notes of the score.
  • The demand for manual labor is high, the method is restricted by language, and it cannot be promoted universally.
  • Moreover, the scheme uses parametric synthesis: the speech signal is first converted into acoustic features, the features are then optimized at the feature level according to the singing target, and finally a continuous speech signal is re-synthesized from the optimized features. Signal is clearly lost both in the conversion from speech signal to feature parameters and in the synthesis from feature parameters back to a speech signal, so the sound quality degrades noticeably.
  • Embodiments of the present invention provide a method and apparatus for converting speech into singing, which can automatically segment a speech signal and can convert speech of arbitrary length and arbitrary content into a song desired by the user.
  • An embodiment of the present invention provides a method for converting speech into singing, the method comprising: receiving a speech signal input by a user; segmenting the speech signal to obtain a speech segment for each basic unit; determining, according to a preset musical score, the correspondence between each note in the score and each basic unit; determining, according to the pitch of each note in the score and the correspondence, the target fundamental frequency value of the corresponding basic unit; determining, according to the number of beats of each note in the score and the correspondence, the target duration of the corresponding basic unit; and
  • adjusting the speech segment of each basic unit according to the target fundamental frequency value and the target duration, so that the fundamental frequency of the adjusted speech segment is the target fundamental frequency value and the duration of the adjusted speech segment is the target duration.
  • An embodiment of the present invention further provides an apparatus for converting speech into singing, the apparatus comprising: a receiving unit, a segmentation unit, a correspondence acquiring unit, a fundamental frequency acquiring unit, a duration acquiring unit, and an adjusting unit;
  • the receiving unit is configured to receive a speech signal input by a user;
  • the segmentation unit is configured to segment the speech signal to obtain a speech segment for each basic unit;
  • the correspondence acquiring unit is configured to determine the correspondence between each note in the score and each basic unit;
  • the fundamental frequency acquiring unit is configured to determine, according to the pitch of each note in the score and the correspondence, the target fundamental frequency value of the corresponding basic unit;
  • the duration acquiring unit is configured to determine, according to the number of beats of each note in the score and the correspondence, the target duration of the corresponding basic unit;
  • the adjusting unit is configured to adjust the speech segment of each basic unit according to the target fundamental frequency value and the target duration, so that the fundamental frequency of the adjusted speech segment is the target fundamental frequency value and the duration of the adjusted speech segment is the target duration.
  • It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages: the waveform of the input speech signal is adjusted directly, and by optimizing the waveform directly, the loss caused by multiple signal conversions is avoided;
  • furthermore, the technical solution can convert spoken speech of any length and any content into the singing voice of any song; that is, the present invention is not limited to input of the lyrics of a specific song, but allows the user to input arbitrary content and realizes conversion into an arbitrary song.
  • FIG. 1 is a schematic flowchart of a method for converting speech into singing according to an embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of another method for converting speech into singing according to an embodiment of the present invention;
  • FIG. 3 is a schematic flowchart of segmenting a speech signal into the speech segments of the basic units according to an embodiment of the present invention;
  • FIG. 4 is an example of a pre-defined search network;
  • FIG. 5 is a schematic flowchart of obtaining the correspondence between the notes in the score and the basic units according to an embodiment of the present invention;
  • FIG. 6 is a schematic flowchart of optimizing the obtained target fundamental frequency values according to the vocal-range characteristics of different speakers, according to an embodiment of the present invention;
  • FIG. 7a is a schematic flowchart of obtaining the target duration of each basic unit according to an embodiment of the present invention;
  • FIG. 7b shows an example of obtaining the number of beats of a note;
  • FIG. 8 is a schematic diagram of an apparatus for converting speech into singing according to an embodiment of the present invention;
  • FIG. 9 is a schematic diagram of the segmentation unit according to an embodiment of the present invention;
  • FIG. 10 is a schematic diagram of the correspondence acquiring unit according to an embodiment of the present invention;
  • FIG. 11 is a schematic diagram of the key adjusting unit according to an embodiment of the present invention;
  • FIG. 12 is a schematic diagram of the duration acquiring unit according to an embodiment of the present invention.
  • Embodiment 1: FIG. 1 is a schematic flowchart of a method for converting speech into singing according to an embodiment of the present invention.
  • Step 101: Receive a speech signal input by a user.
  • Step 102: Segment the speech signal to obtain the speech segment of each basic unit, where a basic unit is the smallest pronunciation unit corresponding to a single note, such as a character in a Chinese song or a syllable in an English song.
  • Step 103: Determine, according to a preset musical score, the correspondence between each note in the score and each basic unit. Step 104: Determine, according to the pitch of each note in the score and the correspondence, the target fundamental frequency value of the corresponding basic unit. Step 105: Determine, according to the number of beats of each note in the score and the correspondence, the target duration of the corresponding basic unit. Step 106: Adjust the speech segment of each basic unit according to the target fundamental frequency value and the target duration,
  • so that the fundamental frequency of the adjusted speech segment is the target fundamental frequency value, and the duration of the adjusted speech segment is the target duration.
  • With the method for converting speech into singing provided by this embodiment of the present invention, after the correspondence between the notes in the score and the basic units is determined, the target fundamental frequency value of each basic unit and the target duration of each basic unit can be determined from the pitch of each note in the score and the number of beats of each note in the score; the speech segment corresponding to each basic unit is then adjusted so that the fundamental frequency of the adjusted speech is the determined target fundamental frequency value and the duration of the adjusted speech is the determined target duration. The method therefore adjusts the waveform of the input speech signal directly, which avoids the loss caused by multiple signal conversions; moreover, the technical solution provided by this embodiment can convert user speech input of any length and any content into the singing voice of any song;
  • that is, the present invention is not limited to input of the lyrics of a specific song, but allows the user to input arbitrary content and realizes conversion into an arbitrary song.
  • Embodiment 2: FIG. 2 is a schematic flowchart of another method for converting speech into singing according to an embodiment of the present invention.
  • Step S10: Receive a speech signal input by a user.
  • Step S11: Segment the speech signal into the speech segments of the basic units.
  • In this embodiment of the present invention, the speech signal is segmented into the speech segments of the basic units as shown in FIG. 3, including: Step S111: Pre-process the speech signal. The pre-processing may specifically be noise reduction of the speech signal;
  • for example, the speech segment is enhanced by techniques such as Wiener filtering to improve the ability of the subsequent system to process the signal.
  • Step S112: Extract speech acoustic feature vectors from the speech signal frame by frame to generate an acoustic feature vector sequence.
  • Specifically, extracting the acoustic feature vectors frame by frame may be extracting the Mel Frequency Cepstrum Coefficient (MFCC) features of the speech:
  • short-time analysis is performed on each frame of speech data with a window length of 25 ms and a frame shift of 10 ms to obtain the MFCC parameters and their first- and second-order differences, 39 dimensions in total.
  • The speech segment in the device buffer is therefore represented as a sequence of 39-dimensional feature vectors. A minimal sketch of this feature extraction is given below.
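The following is a minimal, illustrative sketch of this 39-dimensional feature extraction using the librosa library; the function name extract_mfcc39 and the exact FFT size are assumptions made for illustration, not part of the patent.

```python
import librosa
import numpy as np

def extract_mfcc39(y: np.ndarray, sr: int) -> np.ndarray:
    """13 MFCCs plus first- and second-order deltas (39 dims per frame),
    using a 25 ms analysis window and a 10 ms frame shift as described above."""
    win_length = int(0.025 * sr)                 # 25 ms window
    hop_length = int(0.010 * sr)                 # 10 ms frame shift
    n_fft = 1 << (win_length - 1).bit_length()   # next power of two >= window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                win_length=win_length, hop_length=hop_length)
    delta1 = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta1, delta2]).T   # shape (num_frames, 39)

# Example usage with an assumed input file:
# y, sr = librosa.load("spoken_input.wav", sr=16000)
# feats = extract_mfcc39(y, sr)
```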
  • Step S113: Perform speech recognition on the acoustic feature vector sequence, and determine a sequence of basic speech recognition unit models and the speech segment corresponding to each basic speech recognition model.
  • The basic speech recognition models may include three kinds: a silence recognition model, a voiced recognition model, and an unvoiced recognition model.
  • It should be understood that the human pronunciation process can be regarded as a doubly stochastic process:
  • the speech signal itself is an observable time-varying sequence, namely a stream of parameters of the phonemes produced by the brain (an unobservable state) according to grammatical knowledge and linguistic needs.
  • In the prior art, this process can be reasonably simulated by a Hidden Markov Model (HMM), which describes well the overall non-stationarity and local stationarity of speech signals and is an ideal speech signal model.
  • In this embodiment of the present invention, HMMs are used to model the pronunciation characteristics of silence segments, voiced segments and unvoiced segments. For each model, a left-to-right HMM with N non-skippable states is defined (N = 3 may be used in this scheme), and the number of Gaussian components per state may be fixed to K (K = 8).
  • To accurately model the pronunciation characteristics of silence, voiced and unvoiced segments, the system collects speech data in advance and trains the model parameters. Specifically, training data sets for silence, voiced and unvoiced speech are determined by manual segmentation and labeling of a training speech data set; acoustic features, such as MFCC features, are then extracted from each corresponding training data set; the system then trains the model parameters of the silence, voiced and unvoiced segments under a preset training criterion such as Maximum Likelihood Estimation (MLE).
  • After the acoustic feature vectors (specifically, the MFCC parameters) are extracted from the speech signal in step S112, a model sequence of silence segments, voiced segments and unvoiced segments can be recognized according to the MFCC parameters and the preset HMM models, and
  • the speech signal is sliced into silence segments, voiced segments and unvoiced segments.
  • An example of the pre-defined search network is shown in FIG. 4, where each path represents one possible combination of silence segments, voiced segments and unvoiced segments.
  • Preferably, to obtain a better segmentation result, the speech signal may be segmented in two passes in this embodiment of the present invention:
  • the speech segments determined by the segmentation in step S113 are used as adaptation data to update the corresponding model parameters and obtain new models; step S113 is then performed again with the new models, so that the speech signal is segmented into speech segments. A minimal decoding sketch for step S113 follows.
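As a rough illustration of step S113, the sketch below decodes each frame of the 39-dimensional feature sequence into silence / unvoiced / voiced labels with a small Viterbi search. For brevity each class is collapsed to a single diagonal-Gaussian state rather than the 3-state, 8-component HMMs described above, and all model parameters (means, variances, transition penalty) are assumed to have been estimated beforehand from labeled training data.

```python
import numpy as np

LABELS = ["sil", "unvoiced", "voiced"]

def log_gauss(x, mean, var):
    """Log-density of a diagonal Gaussian, evaluated per frame."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def viterbi_segment(feats, means, variances, stay_logprob=np.log(0.9)):
    """feats: (T, 39) MFCC+delta features; means/variances: (3, 39) per-class
    Gaussian parameters. Returns a list of (label, start_frame, end_frame)."""
    T, n_cls = feats.shape[0], len(LABELS)
    switch_logprob = np.log((1.0 - np.exp(stay_logprob)) / (n_cls - 1))
    emit = np.stack([log_gauss(feats, means[c], variances[c])
                     for c in range(n_cls)], axis=1)
    delta = np.full((T, n_cls), -np.inf)
    back = np.zeros((T, n_cls), dtype=int)
    delta[0] = emit[0] - np.log(n_cls)
    for t in range(1, T):
        for c in range(n_cls):
            trans = delta[t - 1] + np.where(np.arange(n_cls) == c,
                                            stay_logprob, switch_logprob)
            back[t, c] = int(np.argmax(trans))
            delta[t, c] = trans[back[t, c]] + emit[t, c]
    # Backtrace the best state sequence, then collapse runs into segments.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            segments.append((LABELS[path[start]], start, t))
            start = t
    return segments
```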
  • Step S114: Merge the speech segments corresponding to the basic speech recognition units to obtain the speech segments of the basic units.
  • When the basic speech recognition models include the silence recognition model, the voiced recognition model and the unvoiced recognition model, merging the speech segments corresponding to the basic speech recognition units to obtain the speech segments of the basic units specifically includes: merging voiced segments and unvoiced segments to form the speech segments of the basic units.
  • Because the speech segments determined in step S113 are often too small relative to a note, they cannot be well aligned with the notes of the score. This embodiment of the present invention therefore also considers merging the model speech segments according to actual needs to form the basic units.
  • A specific operation may be: merge each voiced segment with the unvoiced segment immediately before it to form a new basic unit. For example, the pronunciation of "本" ("ben") can be divided into the unvoiced segment "b" and the voiced segment "en", and the character "本" can then serve as one basic unit.
  • Alternatively, the basic speech recognition models include phoneme recognition models or syllable recognition models; in that case, merging the speech segments corresponding to the basic speech recognition units to obtain the speech segments of the basic units includes: merging adjacent phoneme unit segments into syllable-based basic units.
  • The operations of steps S111 to S114 above constitute one specific implementation of segmenting the speech signal into basic units; a minimal merging sketch follows.
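Continuing the sketch above, the helper below merges each unvoiced segment with the voiced segment that follows it (dropping silence) to form the basic units; it is an illustrative assumption of how the merging rule in step S114 could be coded, not the patent's exact procedure.

```python
def merge_to_basic_units(segments):
    """segments: list of (label, start_frame, end_frame) from viterbi_segment().
    Returns basic units as (start_frame, end_frame) spans, one per voiced nucleus,
    each absorbing the unvoiced segment that directly precedes it."""
    units = []
    pending_unvoiced = None
    for label, start, end in segments:
        if label == "sil":
            pending_unvoiced = None          # silence breaks any pending onset
        elif label == "unvoiced":
            pending_unvoiced = (start, end)  # keep as onset of the next voiced part
        else:  # a voiced segment closes a basic unit
            unit_start = pending_unvoiced[0] if pending_unvoiced else start
            units.append((unit_start, end))
            pending_unvoiced = None
    return units
```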
  • Step S12: Determine, according to the preset musical score, the correspondence between the notes in the score and the basic units.
  • One specific implementation of step S12 is shown in FIG. 5: Step S121: Obtain the number K of basic units corresponding to the speech signal input by the user. Step S122: Obtain the sequence of score sub-segments.
  • When the song library is built, the system divides the score of each song in advance into a plurality of score sub-segments according to the lyrics of the original song, where each sub-segment expresses a complete lyric meaning; for example, each line of lyrics in the song "爱你一万年" ("Love You Ten Thousand Years") is taken as one sub-segment.
  • The sub-segments can be pre-partitioned and stored in the device. Step S123: Count, in turn, the number M of notes in each sub-segment.
  • Step S124: Determine whether the number M of notes in the current sub-segment is greater than the number K of basic units. Step S125: If M is greater than K, the parameter r is obtained according to formula (1), i.e. the ratio of M to K rounded down: r = ⌊M / K⌋ (1).
  • Step S126: Concatenate r copies of the basic unit sequence in order; the total number of basic units after copying is rK, which satisfies rK <= M.
  • Step S127: Linearly align the rK copied basic units with the M notes in the score sub-segment with reference to formula (2): NoteIdx_j = [j * rK / M] (2), where NoteIdx_j denotes the index of the basic unit corresponding to the j-th note in the combined score sub-segment, and [·] denotes rounding to the nearest integer. If it is determined in step S124 that the number M of notes in the current note sub-segment is less than the number K of basic units, i.e. M < K, step S128 is performed to determine whether the score has ended; if the score has not ended, step S129 is performed to join the next sub-segment of the score with the current sub-segment and align the combination with the basic unit sequence, using the same correspondence method as steps S124 to S127.
  • By performing steps S128 and S129, when the number of notes in a score sub-segment is smaller than the number of basic units, the notes of the next sub-segment are merged in, so that the number of notes in the merged sub-segment becomes larger than the number of basic units before the correspondence is established.
  • If it is determined in step S128 that the score has ended while the number of notes in the sub-segment is still less than the number of basic units, step S130 is performed: the notes in the current note sub-segment are matched one-to-one with the basic units, and the basic units left without a corresponding note are deleted.
  • For a whole song, the device can repeat the above steps S121 to S130, taking the sub-segments of the score as units, to align the notes of the entire song with the basic units. A small alignment sketch follows.
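The sketch below illustrates the M > K branch of the alignment above (replicate the unit sequence r = ⌊M/K⌋ times, then map each note to a unit with formula (2)); the function name and 0-based data layout are assumptions for illustration only.

```python
def align_notes_to_units(num_units_k: int, num_notes_m: int):
    """Return note_to_unit[j] = index into the r-times-replicated unit sequence
    for the j-th note of a score sub-segment, assuming M >= K (formulas (1), (2))."""
    r = num_notes_m // num_units_k            # formula (1): r = floor(M / K)
    rk = r * num_units_k                      # rK <= M copied basic units
    note_to_unit = []
    for j in range(num_notes_m):
        idx = round(j * rk / num_notes_m)     # formula (2), rounded to nearest
        note_to_unit.append(min(idx, rk - 1)) # clamp to a valid unit index
    return note_to_unit

# Example: 4 basic units sung over a 9-note sub-segment.
print(align_notes_to_units(4, 9))   # -> [0, 1, 2, 3, 4, 4, 5, 6, 7]
```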
  • Step S13: Determine the target fundamental frequency value of each basic unit according to the pitch of the notes in the score and the correspondence between the notes in the score and the basic units determined in step S12.
  • The specific operation of determining the target fundamental frequency value of each basic unit may refer to formula (1): F0_mle = 440 * 2^((p - 69) / 12) (1),
  • where F0_mle is the target fundamental frequency value,
  • 440 is the frequency in Hz of the A above middle C (A4),
  • and p is the pitch of the note corresponding to the basic unit, expressed in semitones such that p - 69 is its distance from the A above middle C.
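As a one-line numeric illustration of formula (1), assuming p is given as a MIDI-style pitch number (69 = A4 = 440 Hz):

```python
def target_f0_hz(p: float) -> float:
    """Formula (1): target fundamental frequency of a note with pitch number p,
    where p - 69 is the distance in semitones from A4 (440 Hz)."""
    return 440.0 * 2.0 ** ((p - 69) / 12)

print(target_f0_hz(69))  # A4 -> 440.0 Hz
print(target_f0_hz(60))  # middle C -> ~261.6 Hz
```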
  • Preferably, considering that different speakers differ in vocal range and often choose different keys when singing the same song, directly optimizing the basic units toward the target fundamental frequency values may distort the voice and harm the synthesis result. This embodiment of the present invention therefore further provides the following operations, which optimize the determined target fundamental frequency values according to the vocal-range characteristics of different speakers so that they adapt to the speaker's pronunciation characteristics.
  • Step S14: Adjust the target fundamental frequency values of the basic units according to the vocal-range characteristics of the speaker.
  • One specific implementation of step S14 is shown in FIG. 6: Step S141: Apply key raising and lowering (transposition) to the determined target fundamental frequency value of each basic unit to obtain the corresponding fundamental frequency values under different keys. The purpose of step S141 is to obtain fundamental frequency sequences over a wider range of keys.
  • The specific key raising/lowering may include: traverse the keys bt from -N to +N (in semitones) and, combining the previously generated F0_mle, obtain the new fundamental frequency F0_new_bt with reference to formula (2): F0_new_bt = F0_mle * 2^(bt / 12) (2).
  • After the transposition, each basic unit therefore has 2N + 1 adjusted fundamental frequency values, where bt takes values from -N to +N.
  • Considering the computation cost and the effect, the preferred setting of the parameter N in this embodiment is 15, but this should not be construed as limiting the embodiments of the present invention.
  • Step S142: Obtain the sequences of adjusted fundamental frequency values of the basic unit sequence under the different keys. Step S143: Extract the fundamental frequency feature sequence of the speech segment of each basic unit and compute its average to generate the fundamental frequency feature value F0_nat. Step S144: Obtain the sequence of fundamental frequency feature values of the speech segments of the basic unit sequence. Step S145: Compute the difference between the sequence of adjusted fundamental frequency values of the basic unit sequence under each key and the extracted sequence of fundamental frequency feature values of the speech segments of the basic unit sequence, with reference to formula (3): RMSE_bt = Σ_{i=1..K} (F0_new_bt,i - F0_nat,i)^2 (3),
  • where RMSE_bt denotes the difference between the adjusted fundamental frequency value sequence and the fundamental frequency feature value sequence under the key bt, K denotes the number of basic units, F0_new_bt,i is the adjusted fundamental frequency value of the i-th basic unit,
  • and F0_nat,i is the fundamental frequency feature value of the speech segment of the i-th basic unit.
  • bt takes values from -N to +N.
  • Step S146: According to the differences computed in step S145, select the adjusted fundamental frequency values of the basic units under the key that minimizes the difference as the corresponding optimized target fundamental frequency values, denoted F0_use. Steps S141 to S146 allow the method to adapt the target fundamental frequency values to the vocal-range characteristics of the speaker and thereby provide a better user experience. A small sketch of this key selection follows.
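The sketch below illustrates steps S141 to S146 with numpy: it tries every key shift bt in [-N, +N], compares the transposed target F0 values with the F0 values measured from the user's own speech segments, and keeps the shift with the smallest difference. The array names and the squared-error form of formula (3) are assumptions made for this illustration.

```python
import numpy as np

def select_key_and_f0(f0_mle: np.ndarray, f0_nat: np.ndarray, n_semitones: int = 15):
    """f0_mle: per-unit target F0 from the score (formula (1)), shape (K,).
    f0_nat: per-unit average F0 measured from the user's speech, shape (K,).
    Returns (best_shift_bt, f0_use) where f0_use is the optimized target F0."""
    best_bt, best_cost, f0_use = None, np.inf, None
    for bt in range(-n_semitones, n_semitones + 1):   # bt in [-N, +N]
        f0_new = f0_mle * 2.0 ** (bt / 12.0)          # formula (2)
        cost = np.sum((f0_new - f0_nat) ** 2)         # formula (3)
        if cost < best_cost:
            best_bt, best_cost, f0_use = bt, cost, f0_new
    return best_bt, f0_use

# Example: a melody well above the speaker's natural pitch is shifted down.
bt, f0_use = select_key_and_f0(np.array([440.0, 494.0, 523.0]),
                               np.array([180.0, 200.0, 210.0]))
```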
  • Step S15: Determine the target duration of each basic unit according to the number of beats of the notes in the score and the correspondence between the notes in the score and the basic units determined in step S12.
  • The specific operation of step S15, shown in FIG. 7a, may include: Step S151: Obtain the number of beats corresponding to each basic unit according to the number of beats of the notes in the score and the correspondence between the notes in the score and the basic units obtained in step S12.
  • It should be understood that the number of beats corresponding to each basic unit can be accumulated from the correspondence between the basic units and the notes of the score together with the number of beats of each note. As shown in FIG. 7b, for example, if the syllable "雪" ("snow") corresponds to the note "2", the number of beats corresponding to "雪" is 1/2 beat. Step S152: Obtain the target duration of each basic unit according to the determined number of beats corresponding to each basic unit and the tempo described in the score.
  • The target duration of each basic unit may be computed with reference to formula (4): d_use = 60 / tempo * d_note (4),
  • where d_use is the target duration of the basic unit in seconds,
  • tempo is the tempo described in the score, i.e. the number of beats per minute,
  • and d_note is the number of beats corresponding to the basic unit obtained in step S151. A minimal duration sketch follows.
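A small illustration of formula (4), assuming tempo is given in beats per minute and d_note in beats:

```python
def target_duration_seconds(tempo_bpm: float, d_note_beats: float) -> float:
    """Formula (4): duration of a basic unit = (60 / tempo) * number of beats."""
    return 60.0 / tempo_bpm * d_note_beats

print(target_duration_seconds(120, 0.5))  # half a beat at 120 BPM -> 0.25 s
```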
  • Step S16: Adjust the input speech so that the fundamental frequency of the adjusted speech is the obtained target fundamental frequency and the duration of the adjusted speech is the target duration.
  • The specific operation of step S16 may be to adjust the duration and fundamental frequency of the input speech using the PSOLA algorithm, so that the speech segment of each basic unit satisfies its corresponding target duration d_use and target fundamental frequency F0_use. If the obtained target fundamental frequency values were not optimized, the unoptimized target fundamental frequency values may also be used as the adjustment target. An illustrative adjustment sketch is given below.
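The patent specifies PSOLA for this step; the sketch below instead uses librosa's phase-vocoder based time stretching and resampling-based pitch shifting as a readily available stand-in, purely to illustrate the per-unit adjustment toward d_use and F0_use. The segment boundaries, the estimate of each unit's natural F0 (f0_nat), and the function names are assumptions for illustration.

```python
import librosa
import numpy as np

def adjust_unit(y: np.ndarray, sr: int, f0_nat: float, f0_use: float,
                d_use: float) -> np.ndarray:
    """Stretch one basic unit's waveform to d_use seconds and shift its pitch
    from f0_nat to f0_use. Stand-in for the PSOLA adjustment of step S16."""
    n_steps = 12.0 * np.log2(f0_use / f0_nat)           # semitone shift
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    current_dur = len(y_shifted) / sr
    rate = current_dur / d_use                           # rate > 1 shortens
    return librosa.effects.time_stretch(y_shifted, rate=rate)

def sing(units, sr: int) -> np.ndarray:
    """units: iterable of (waveform, f0_nat, f0_use, d_use) per basic unit."""
    return np.concatenate([adjust_unit(y, sr, f0_nat, f0_use, d_use)
                           for (y, f0_nat, f0_use, d_use) in units])
```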
  • With the method for converting speech into singing provided by this embodiment of the present invention, after the correspondence between the notes in the score and the basic units is determined, the target fundamental frequency value and the target duration of each basic unit can be determined from the pitch and the number of beats of each note in the score; the speech segment corresponding to each basic unit is then adjusted so that the fundamental frequency of the adjusted speech is the determined target fundamental frequency value and the duration of the adjusted speech is the determined target duration. The method therefore adjusts the waveform of the input speech signal directly and avoids the loss caused by multiple signal conversions.
  • Furthermore, the technical solution provided by this embodiment of the present invention can convert spoken speech of any length and any content into the singing voice of any song; that is, the present invention is not limited to input of the lyrics of a specific song, but allows the user to input arbitrary content and realizes conversion into an arbitrary song.
  • Embodiment 3: FIG. 8 is a schematic diagram of an apparatus for converting speech into singing. The apparatus may include: a receiving unit 801, a segmentation unit 802, a correspondence acquiring unit 803, a fundamental frequency acquiring unit 804, a duration acquiring unit 805, and an adjusting unit 806. The receiving unit 801 is configured to receive a speech signal input by a user;
  • the segmentation unit 802 is configured to segment the speech signal to obtain the speech segment of each basic unit;
  • the correspondence acquiring unit 803 is configured to determine the correspondence between each note in the score and the basic units;
  • the fundamental frequency acquiring unit 804 is configured to determine, according to the pitch of each note in the score and the correspondence, the target fundamental frequency value of the corresponding basic unit;
  • the duration acquiring unit 805 is configured to determine, according to the number of beats of each note in the score and the correspondence, the target duration of the corresponding basic unit;
  • the adjusting unit 806 is configured to adjust the speech segment of each basic unit according to the target fundamental frequency value and the target duration, so that the fundamental frequency of the adjusted speech segment is the target fundamental frequency value and the duration of the adjusted speech segment is the target duration. An illustrative composition of these units is sketched after this list.
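As a rough illustration of how these units could be composed in software, the class below wires together the helper functions sketched in the earlier steps (extract_mfcc39, viterbi_segment, merge_to_basic_units, align_notes_to_units, target_f0_hz, target_duration_seconds, adjust_unit); all of those names, the use of librosa.pyin as the F0 estimator, and the shape of the score data are assumptions introduced in this document's sketches rather than part of the patent.

```python
import librosa
import numpy as np

class SpeechToSinging:
    """Minimal pipeline mirroring units 801-806: segment the input speech into
    basic units, align them to the score, derive a target F0 and duration for
    each unit, and adjust each unit's waveform accordingly."""

    def __init__(self, gaussian_means, gaussian_vars, sr: int = 16000):
        self.means, self.vars, self.sr = gaussian_means, gaussian_vars, sr

    def convert(self, y: np.ndarray, score_notes, tempo_bpm: float) -> np.ndarray:
        """score_notes: list of (pitch_number, beats), one entry per note of a
        score sub-segment; assumes there are at least as many notes as units."""
        feats = extract_mfcc39(y, self.sr)                       # segmentation unit 802
        segments = viterbi_segment(feats, self.means, self.vars)
        units = merge_to_basic_units(segments)                   # basic-unit frame spans
        mapping = align_notes_to_units(len(units), len(score_notes))   # unit 803
        hop = int(0.010 * self.sr)                               # 10 ms frame shift
        pieces = []
        for j, (pitch, beats) in enumerate(score_notes):
            start, end = units[mapping[j] % len(units)]          # unit for the j-th note
            seg = y[start * hop:end * hop]
            f0_track, _, _ = librosa.pyin(seg, fmin=60, fmax=500, sr=self.sr)
            f0_nat = float(np.nanmean(f0_track))                 # unit's natural F0
            f0_use = target_f0_hz(pitch)                         # unit 804 (unoptimized)
            d_use = target_duration_seconds(tempo_bpm, beats)    # unit 805
            pieces.append(adjust_unit(seg, self.sr, f0_nat, f0_use, d_use))  # unit 806
        return np.concatenate(pieces)
```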
  • With the apparatus for converting speech into singing provided by this embodiment of the present invention, after the correspondence between the notes in the score and the basic units is determined, the target fundamental frequency value and the target duration of each basic unit can be determined from the pitch and the number of beats of each note in the score; the speech segment corresponding to each basic unit is then adjusted so that the fundamental frequency of the adjusted speech is the determined target fundamental frequency value and the duration of the adjusted speech is the determined target duration. The apparatus therefore adjusts the waveform of the input speech signal directly, avoiding the loss caused by multiple signal conversions, and the technical solution provided by this embodiment can convert user speech input of any length and any content into the singing voice of any song; that is, the present invention is not limited to input of the lyrics of a specific song, but allows the user to input arbitrary content and realizes conversion into an arbitrary song. Further, as shown in FIG. 9, the segmentation unit 802 may further include:
  • an extracting unit 8021, a determining unit 8022, and a merging unit 8023. The extracting unit 8021 is configured to extract speech acoustic feature vectors from the speech signal frame by frame to generate an acoustic feature vector sequence.
  • Specifically, extracting the acoustic feature vectors frame by frame may be extracting the MFCC features of the speech, with a 25 ms window and a 10 ms frame shift, yielding the MFCC parameters and their first- and second-order differences, 39 dimensions in total, so that the speech segment in the device buffer is represented as a sequence of 39-dimensional feature vectors.
  • The determining unit 8022 is configured to perform speech recognition on the acoustic feature vector sequence, and to determine the sequence of basic speech recognition unit models and the speech segment corresponding to each basic speech recognition model.
  • It should be understood that the human pronunciation process can be regarded as a doubly stochastic process:
  • the speech signal itself is an observable time-varying sequence, namely a stream of parameters of the phonemes produced by the brain (an unobservable state) according to grammatical knowledge and linguistic needs.
  • In the prior art, this process can be reasonably simulated by a Hidden Markov Model (HMM), which describes well the overall non-stationarity and local stationarity of speech signals and is an ideal speech signal model.
  • In this embodiment of the present invention, HMMs are used to model the pronunciation characteristics of silence segments, voiced segments and unvoiced segments.
  • To accurately model these pronunciation characteristics, the system collects speech data in advance and trains the model parameters: training data sets for silence, voiced and unvoiced speech are determined by manual segmentation and labeling of a training speech data set; acoustic features such as MFCC features are then extracted from each corresponding training data set; and the system trains the model parameters of the silence, voiced and unvoiced segments under a preset training criterion such as Maximum Likelihood Estimation (MLE).
  • After the acoustic feature vectors (specifically, the MFCC parameters) are extracted from the speech signal, the model sequence of silence, voiced and unvoiced segments can be recognized according to the MFCC parameters and the preset HMM models, and the speech signal is sliced into silence segments, voiced segments and unvoiced segments.
  • An example of the pre-defined search network is shown in FIG. 4, where each path represents one possible combination of silence segments, voiced segments and unvoiced segments.
  • The merging unit 8023 is configured to merge the speech segments corresponding to the basic speech recognition units to obtain the speech segments of the basic units.
  • When the basic speech recognition models include the silence recognition model, the voiced recognition model and the unvoiced recognition model, merging the speech segments corresponding to the basic speech recognition units to obtain the speech segments of the basic units specifically includes: merging voiced segments and unvoiced segments to form the speech segments of the basic units.
  • Because the determined speech segments are often too small relative to a note, they cannot be well aligned with the notes of the score. This embodiment of the present invention therefore also considers merging the model speech segments according to actual needs to form the basic units.
  • A specific operation may be: merge each voiced segment with the unvoiced segment immediately before it to form a new basic unit.
  • For example, the pronunciation of "本" ("ben") can be divided into the unvoiced segment "b" and the voiced segment "en", and the character "本" can serve as one basic unit.
  • Alternatively, the basic speech recognition models include phoneme recognition models or syllable recognition models; in that case, merging the speech segments corresponding to the basic speech recognition units to obtain the speech segments of the basic units includes: merging adjacent phoneme unit segments into syllable-based basic units.
  • Further, as shown in FIG. 10, the correspondence acquiring unit 803 specifically includes: a first counting unit 8031, a first obtaining unit 8032, a second counting unit 8033, a first judging unit 8034, a second obtaining unit 8035, a copying unit 8036, and an aligning unit 8037. The first counting unit 8031 is configured to obtain the number K of basic units corresponding to the speech signal input by the user;
  • the first obtaining unit 8032 is configured to obtain the sequence of score sub-segments;
  • the second counting unit 8033 is configured to count, in turn, the number M of notes in each sub-segment; the first judging unit 8034 is configured to determine whether the number M of notes in the current sub-segment is greater than the number K of basic units;
  • the second obtaining unit 8035 is configured to obtain, if M is greater than K, the parameter r according to the formula r = ⌊M / K⌋; the copying unit 8036 is configured to concatenate r copies of the basic unit sequence in order, the total number of basic units after copying being rK, which satisfies rK <= M;
  • the aligning unit 8037 is configured to linearly align the rK copied basic units with the M notes in the score sub-segment.
  • Preferably, the aligning unit 8037 is specifically configured to linearly align the rK copied basic units with the M notes in the score sub-segment according to the formula NoteIdx_j = [j * rK / M], where NoteIdx_j denotes the index of the basic unit corresponding to the j-th note in the score sub-segment.
  • Preferably, the apparatus further includes a second judging unit,
  • configured to determine whether the score has ended. The aligning unit 8037 is specifically configured to: if the score has not ended, join the next sub-segment of the score with the current sub-segment and align the combination with the basic units; if it is determined that the score has ended, match the notes in the current note sub-segment one-to-one with the basic units and delete the basic units left without a corresponding note.
  • Preferably, the fundamental frequency acquiring unit is specifically configured to compute, according to the formula F0_mle = 440 * 2^((p - 69) / 12), the target fundamental frequency value corresponding to the pitch of a note as the target fundamental frequency value of the corresponding basic unit, where F0_mle is the target fundamental frequency value, 440 is the frequency in Hz of the A above middle C, and p is the pitch annotated in the current score, expressed in semitones relative to the A above middle C.
  • Preferably, as shown in FIG. 11, the apparatus further includes a key adjusting unit 807, configured to adjust the obtained target fundamental frequency values according to the vocal-range characteristics of the speaker.
  • The key adjusting unit 807 specifically includes: a third obtaining unit 8071, a fourth obtaining unit 8072, a generating unit 8073, a fifth obtaining unit 8074, a sixth obtaining unit 8075, and a selecting unit 8076. The third obtaining unit 8071 is configured to apply key raising and lowering to the target fundamental frequency value of each basic unit to obtain the adjusted fundamental frequency values under different keys; the fourth obtaining unit 8072 is configured to obtain the sequences of adjusted fundamental frequency values of the basic unit sequence under the different keys;
  • the generating unit 8073 is configured to extract the fundamental frequency feature sequence of the speech segment of each basic unit and compute its average to generate the fundamental frequency feature value;
  • the fifth obtaining unit 8074 is configured to obtain the sequence of fundamental frequency feature values of the speech segments of the basic unit sequence;
  • the sixth obtaining unit 8075 is configured to compute the difference between the sequence of adjusted fundamental frequency values of the basic unit sequence under each key and the extracted sequence of fundamental frequency feature values of the speech segments of the basic unit sequence;
  • the selecting unit 8076 is configured to select the adjusted fundamental frequency values of the basic units under the key that minimizes the difference as the corresponding optimized target fundamental frequency values.
  • Preferably, as shown in FIG. 12, the duration acquiring unit 805 specifically includes: a beat-count acquiring unit 8051 and a target acquiring unit 8052.
  • The beat-count acquiring unit 8051 is configured to obtain the number of beats corresponding to each basic unit according to the number of beats of the notes in the score and the correspondence between the notes in the score and the basic units;
  • the target acquiring unit 8052 is configured to obtain the target duration of each basic unit according to the obtained number of beats corresponding to each basic unit and the tempo described in the score.
  • In addition, the technical solution provided by the embodiments of the present invention realizes automatic speech segmentation, which avoids the burden of traditional manual segmentation, is not restricted by language, and offers more general entertainment value.
  • A person skilled in the art may understand that all or part of the steps of the foregoing embodiments may be completed by a program instructing the relevant hardware.
  • The program may be stored in a computer-readable storage medium, and the storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A method and apparatus for converting speech into singing. The method includes: receiving a speech signal input by a user; segmenting the speech signal to obtain a speech segment for each basic unit; determining, according to a preset numbered musical score, the correspondence between each note in the score and each basic unit; determining, according to the pitch of each note in the score and the correspondence, the target fundamental frequency value of the corresponding basic unit; determining, according to the number of beats of each note in the score and the correspondence, the target duration of the corresponding basic unit; and adjusting the speech segment of each basic unit according to the target fundamental frequency value and the target duration, so that the fundamental frequency of the adjusted speech segment is the target fundamental frequency value and the duration of the adjusted speech segment is the target duration. The method avoids the loss caused by multiple signal conversions and realizes conversion of speech of any length and any content into the singing voice of any song.

Description

一种实现语音歌唱化的方法和装置
技术领域
本发明涉及语音信号处理领域,具体涉及一种实现语音歌唱化的方法和装 置。
背景技术
近年来,歌唱合成系统, 即将用户输入的文本数据转换为歌唱语音的方法 以及得到了广泛的研究和应用。歌唱合成系统的实现首先要求录制大量的歌曲 数据, 包括语音数据和筒谱数据等, 以提供合成系统所需的语音片段或训练可 靠的模型参数。 然后, 由于歌曲数据录制的代价较大, 歌唱合成系统通常只能 选择录制某个特定发音人的数据,相应的提供的歌唱合成效果限定为特定发音 人的音色, 不适合个性化定制, 无法实现到特定音色的演绎, 特别是用户自身 音色的重现。
针对上述问题,现有技术中开发了一种歌唱合成方法, 允许设备接收用户 以说话风格方式输入的语音数据,系统按照预设的筒谱对语音数据进行优化实 现歌曲合成。 这种方式保留了用户语音数据的音色, 实现个性化合成。 具体操 作包括: (1 )系统接收用户说话风格的歌词语音输入; (2 )通过人工切分的方 式将语音信号切分为各个独立的基于音素单元的语音片段; ( 3 )并根据筒谱标 注确定各音素单元和筒谱音符的对应关系; (4 )系统从各音素单元的语音片段 中提取声学频谱特征, 基频特征等; (5 )系统根据筒谱标注信息确定目标歌曲 的基频 F0特征参数和时长特征, 并据此调整各音素单元的基频特征和时长; ( 6 ) 系统根据各音素单元的声学频谱特征, 以及韵律特征(如: 基频特征及 时长特征等), 合成歌唱语音输出。
该现有技术虽然实现了从说话风格语音信号到歌唱风格的转换,但具有如 下问题:
一方面, 该方案只能实现筒谱对应的歌词的说话风格语音输入的转换。也 就是说用户只能输入指定歌曲的歌词, 无法实现对任意长度的,任意内容的歌 曲合成效果转换, 应用方法受限, 同时也降低了娱乐效果;
进一步, 该方案通过人工切分方式, 实现了说话风格的连续语音信号的切 分, 以及筒谱音符的对应。对人工要求较高, 受到语种的限制,无法普适推广。 而且,该方案采用的是参数合成方式,即首先将语音信号转换为声学特征, 随后在特征层面上按照歌唱标准进行优化,最后按照合成方式从优化特征中合 成得到连续语音信号。显然从语音信号到特征参数的转换, 以及特征参数到语 音信号的合成中均存在信号的损失, 音质有明显的下降。
发明内容
本发明实施例提供了一种实现语音歌唱化的方法和装置,能够自动对语音 进行切分, 而且可以将任意长度和任意内容的说话语音转换为用户需要的歌 曲。
本发明实施例提供了一种实现语音歌唱化的方法, 所述方法包括: 接收用户输入的语音信号;
将所述语音信号切分获得各基本考察单元的语音片断;
根据预置的筒谱, 确定筒谱中的各音符与所述各基本考察单元的对应关 系;
根据筒谱中各音符的音高, 和所述对应关系, 分别确定其所对应的基本考 察单元的目标基频值;
根据筒谱中各音符的节拍数, 和所述对应关系, 分别确定其所对应的基本 考察单元的目标时长;
根据所述目标基频值和目标时长调整各基本考察单元的语音片断,使得调 整后的语音片段的基频为所述目标基频值,调整后的语音片段的时长为所述目 标时长。
本发明实施例还提供了一种实现语音歌唱化的装置, 该装置包括: 接收单 元, 切分单元, 获取对应关系单元, 获取基频单元, 获取时长单元, 和调整单 元;
所述接收单元, 用于接收用户输入的语音信号;
所述切分单元, 用于将所述语音信号切分获得各基本考察单元的语音片 断;
所述获取对应关系单元,用于确定简谱中的各音符与所述各基本考察单元 的对应关系;
所述获取基频单元, 用于根据筒谱中各音符的音高, 和所述对应关系, 分 别确定其所对应的基本考察单元的目标基频值;
所述获取时长单元, 用于根据筒谱中各音符的节拍数, 和所述对应关系, 分别确定其所对应的基本考察单元的目标时长;
所述调整单元,用于根据所述目标基频值和目标时长调整各基本考察单元 的语音片断,使得调整后的语音片段的基频为所述目标基频值,调整后的语音 片段的时长为所述目标时长。
从以上技术方案可以看出, 本发明实施例具有以下优点: 可以将输入的语 音信号波形直接进行调整,通过对波形的直接优化,避免了多次信号转换的损 失; 且本发明实施例提供的技术方案, 可以对任意长度及任意内容的说话语音 向任意歌曲的唱歌语音转换也就是说本案不局限于对特定歌曲的歌词输入,而 是允许用户输入任意内容, 实现任意歌曲的转换。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施 例或现有技术描述中所需要使用的附图作筒单地介绍,显而易见地, 下面描述 中的附图仅仅是本发明的一些实施例, 对于本领域普通技术人员来讲,在不付 出创造性劳动性的前提下, 还可以根据这些附图获得其他的附图。
图 1为本发明实施例提供的一种实现语音歌唱化的方法流程示意筒图; 图 2 为本发明实施例提供的另一种实现语音歌唱化的方法流程示意筒 图;
图 3为本发明实施例中将语音信号切分为基本考察单元的语音片段的流 程示意筒图;
图 4为预先定义的搜索网络示例; 图 5 为本发明实施例中获取筒谱中的音符与基本考察单元的对应关系流 程示意筒图;
图 6为本发明实施例中实现可根据不同发音人的音域特点对获取的目标 基频值进行优化的操作流程示意筒图; 图 7a 为本发明实施例中获取每个基本考察单元的目标时长操作流程示 意筒图; 图 7b所示获取音符的节拍数的举例; 图 8为本发明实施例提供的一种实现语音歌唱化的装置示意筒图; 图 9为本发明实施例提供的切分单元示意筒图; 图 10为本发明实施例提供的获取对应关系单元示意筒图; 图 11为本发明实施例提供的调整基调单元示意筒图; 图 12为本发明实施例提供的获取时长单元示意筒图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清 楚、 完整地描述, 显然, 所描述的实施例仅仅是本发明一部分实施例, 而不是 全部的实施例。基于本发明中的实施例, 本领域普通技术人员在没有作出创造 性劳动前提下所获得的所有其他实施例, 都属于本发明保护的范围。
本发明实施例提供了一种实现语音歌唱化的方法和装置, 可以对用户任意 语音输入, 通过对该语音输入自动切分为基本考察单元的语音片段, 再对划分 的各基本考察单元进行语音片断的歌唱化调整, 实现将用户语音转换为歌唱语 音输出。 实施例一 如图 1所示本发明实施例提供的一种实现语音歌唱化的方法流程示意图。 步骤 101 , 接收用户输入的语音信号; 步骤 102, 将所述语音信号切分获得各基本考察单元的语音片断; 其中, 所述基本考察单元为单个音符所对应的最小的发音单元, 如中文歌曲的字符, 英文歌曲的音节等。 步骤 103, 根据预置的筒谱, 确定筒谱中的各音符与所述各基本考察单元 的对应关系; 步骤 104, 根据筒谱中各音符的音高, 和所述对应关系, 分别确定其所对 应的基本考察单元的目标基频值; 步骤 105 , 根据筒谱中各音符的节拍数, 和所述对应关系, 分别确定其所 对应的基本考察单元的目标时长; 步骤 106, 根据所述目标基频值和目标时长调整各基本考察单元的语音片 断,使得调整后的语音片段的基频为所述目标基频值,调整后的语音片段的时 长为所述目标时长。 本发明实施例提供的一种实现语音歌唱化的方法,在确定筒谱中的音符与 基本考察单元的对应关系后, 可以根据筒谱中各音符的音高, 和筒谱中各音符 的节拍数,确定每个基本考察单元的目标基频值, 和每个基本考察单元的目标 时长;随后对每个基本考察单元的对应语音片断进行调整使得调整后的语音的 基频为确定的目标基频值, 调整后的语音的时长为确定的目标时长。 因此, 该 方法通过对输入的语音信号波形直接进行调整, 避免了多次信号转换的损失; 且本发明实施例提供的技术方案,可以对任意长度及任意内容的用户语音输入 向任意歌曲的唱歌语音转换; 也就是说本案不局限于对特定歌曲的歌词输入, 而是允许用户输入任意内容, 实现任意歌曲的转换。 实施例二 如图 2所示, 本发明实施例提供的一种实现语音歌唱化的方法流程示意 图。
步骤 S10, 接收用户输入的语音信号。 步骤 S11 , 将语音信号切分为基本考察单元的语音片断。 在本发明实施例中将语音信号切分为基本考察单元的语音片断, 具体的 操作如图 3所示, 包括: 步骤 S111 , 对语音信号进行预处理, 该预处理操作具体可以是对语音信 号进行降噪处理; 具体可以是通过维纳滤波等技术对语音片断进行语音增强, 提高后续系统对该信号的处理能力。 步骤 S112, 从语音信号中逐帧提取语音声学特征矢量, 生成声学特征矢 量序列;
其中, 语音信号中逐帧提取语音声学特征矢量具体可以: 是提取语音的 Mel频率倒谱系数(MFCC, Mel Frequency Cepstrum Coefficient )特征, 对窗 长 25ms帧移 10ms的每帧语音数据做短时分析得到 MFCC参数及其一阶二阶 差分,共计 39维度。 因此,在设备的緩沖区的语音段表征为一 39维特征序列。 步骤 S113 , 对所述声学特征矢量序列执行语音识别, 确定基本语音识别 单元模型序列及各基本语音识别模型对应的语音片段。 其中, 基本语音识别模型, 可以包括: 静音识别模型, 浊音识别模型和 清音识别模型三种。
需要理解的是, 人的发音过程可以看作是一个双重随机过程, 语音信号 本身是一个可观测的时变序列,是由大脑根据语法知识和语言需要(不可观测 的状态)发出的音素的参数流。 现有技术中, 通过隐马尔可夫模型 (HMM, Hidden Markov Model )可以合理地模仿了这一过程, 4艮好地描述了语音信号 的整体非平稳性和局部平稳性,是一种理想的语音信号模型。在本发明实施例 采用 HMM来模拟静音片段, 浊音片段以及清音片段的发音特点。对每个模型 分别定义从左到右不可跳转的 N (本方案中可以采用 N=3 )状态 HMM模型, 且可以确定每个状态的高斯分量为确定的 K个 (K=8)。 为了准确模拟静音片断, 浊音片段以及清音片段的发音特点, 系统预先 收集语音数据并对模型参数进行训练。 具体可以是: 通过对训练语音数据集的 人工切分和标注, 确定静音(silence ), 浊音 ( voiced )和清音 ( unvoiced ) 的 训练数据集; 随后从所述各对应训练数据集中提取声学特征, 如 MFCC特征; 接着系统在预设的如最大似然估计( MLE , Maximum Likelihood Estimation ) 训练准则下训练得到静音片断, 浊音片断以及清音片断的模型参数。 当在步骤 S112 中从语音信号中提取声学特征矢量之后, 具体可以是 MFCC参数, 根据所述 MFCC参数和预设 HMM模型可以识别得到静音片段 段, 浊音片段以及清音片段的模型序列, 并且, 将所述语音信号切片为: 静音 片段, 浊音片段, 和清音片段。 如图 4所示预先定义的搜索网络示例, 其中, 每条路径都表示一种可能 的静音片段, 浊音片断, 清音片断的组合方式。 优选的, 为了得到更好的切分效果, 本发明实施例中可以采用对语音信 号切分两遍, 即: 将上述步骤 S113中切分确定的语音片段作为自适应数据, 更新其相应的模型参数得到新的模型; 根据新的模型再次执行步骤 S113 , 从 而将语音信号切分为语音片段。 步骤 S114, 合并基本语音识别单元对应的语音片段得到基本考察单元的 语音片段。 当基本语音识别模型包括: 静音识别模型, 浊音识别模型和清音识 别模型三种;则合并基本语音识别单元对应的语音片段得到基本考察单元的语 音片断,具体包括:将浊音片断和清音片断合并构成基本考察单元的语音片断。 由于步骤 S113中确定的语音片段的单元相应于音符往往过小, 因而不能 和筒谱的音符很好的对应。本发明实施例还考虑根据实际需要对模型语音片断 进行合并, 构成基本考察单元。 具体操作可以是: 将每个浊音片断和其之前的 清音片断合并构成新的基本考察单元。 例如: "本" 的发音 "ben" , 可以划分为清音片段 "b" 和浊音片段 "en" , "本" 字可以作为基本考察单元。 或者, 基本语音识别模型, 包括: 各音素识别模型或音节识别模型; 因此, 合并基本语音识别单元对应的语音片段得到基本考察单元的语音 片断, 包括: 将相邻音素单元片断合并构成基于音节的基本考察单元的语音片 断。 通过执行上述步骤 S111~S114 实现了将语音信号切分为基本考察单元的 一种具体操作。 步骤 S12,根据预置的筒谱,确定筒谱中的音符与基本考察单元的对应关 系。 其中, 对步骤 S12—种具体实现方式, 如图 5所示: 步骤 S121 , 获取用户输入的语音信号所对应的基本考察单元的个数 K; 步骤 S122, 获得筒谱子片断序列;
系统预先在歌曲库制作时根据原歌曲的歌词将筒谱划分为多个筒谱子片 段, 每个子片段可以表达完整歌词意义, 例如, 将《爱你一万年》这首歌中的 每句歌词, 作为子片段。 该子片段可以是划分好存储在设备中。 步骤 S123, 依次统计每个子片段中音符的个数 M;
步骤 S 124 ,判断当前子片段中音符的个数 M是否大于基本考察单元的个 数 K, 步骤 S125, 如果 M大于 K, 具体可以是根据如下式子( 1 )获得参数 r, 即对 M与 K的比值下取整, 即
r = ⌊M / K⌋ (1) 步骤 S126, 将基本考察单元序列复制 r遍顺序拼接, 其中, 复制后的总的基本考察单元个数为 rK, 满足 rK<=M;
步骤 S127, 将复制后的 rK个基本考察单元, 与筒谱子片段中的 M个音 符的线性对齐方法可以参考如下式子 (2 ),
NotIdxj = [j * rK / M] (2) 其中, Notldx」表示筒谱子片段组合中第 j个音符所对应基本考察单元的 序号, 即 r /M四舍五入取整。 若步骤 S124中判断出当前音符子片段中音符个数 M是小于基本考察单 元个数 K, 即 M<K时, 执行步骤 S128, 判断该筒谱是否结束, 如果该筒语还 未结束, 则执行步骤 S129, 将筒谱中后一个子片段与当前的子片段联合, 与 基本考察单元序列进行对应。 具体的对应的方法与上述步骤 S124~S127相同。 通过执行步骤 S128与 S129,使得当筒谱子片段中的音符个数小于基本考 察单元的个数时, 考虑将下一个子片段中音符合并,使得合并后的子片段中音 符个数大于基本考察单元的个数, 进行对应。 若步骤 S128中判断出该筒谱结束,且此时子片段中的音符的个数小于基 本考察单元的个数, 执行步骤 S130, 将当前音符子片段中的音符与基本考察 单元——对应后, 删除未对应上的基本考察单元。 对于一整首歌, 设备可以以筒谱中的子片段为单位, 重复上述步骤 S121-S130将整首歌中的筒谱音符与基本考察单元进行对齐。 步骤 S13,根据筒谱中音符的音高,和步骤 S12中确定的筒谱中的音符与 基本考察单元的对应关系, 确定每个基本考察单元的目标基频值。 其中, 确定每个基本考察单元的目标基频值的具体操作可以是参考如下 式(1 ):
F0_mle = 440 * 2^((p-69)/12) (1)
其中, F0_mle为目标基频值, 440表示中央 C上 A音符发出的频率(单 位为 HZ ), p为基本考察单元所对应的音符的音高与中央 C上 A音符的距离, 单位为半音。 优选的, 考虑到不同发音人音域上存在差异, 在演唱相同歌曲时选择的 基调也往往并不一致,如果直接根据目标基频值对基本考察单元进行优化, 容 易导致发音变声等现象,影响合成效果。因此,本发明实施例还提供如下操作, 可以根据不同发音人的音域特点对确定的目标基频值进行优化,使其自适应于 发音人的发音特点。
步骤 S14, 根据发音人的音域特点,对所述基本考察单元的目标基频值进 行调整。
其中, 对步骤 S14—种具体实现方式, 如图 6所示: 步骤 S141 , 对确定的每个基本考察单元的目标基频值进行升降调处理, 获取在不同基调下的对应基频值; 其中, 步骤 S141中对确定的每个基本考察单元的目标基频值进行升降调 处理, 是为了获取更广音域的基频序列。 具体的升降调处理可以包括: 遍历 -N-+N (单位为半音)基调, 结合之前生成的 F0_mle, 参考如下式(2 ), 得 到新的基频 F0_newbt:
F0_new_bt = F0_mle * 2^(bt/12) (2)
因此,进行升降调处理后的每个基本考察单元都得到了 2N+1个调整基频 值, 其中, bt的取值为 (-N~+N )。 考虑计算量和计算效果, 本实施例中优选的设置参数 N为 15, 但是不应 该理解为对本发明实施例的限制。 步骤 S142, 获取不同基调下的基本考察单元序列的调整基频值序列; 步骤 S143, 提取每个基本考察单元的语音片断的基频特征序列, 并计算 平均, 生成基频特征值 F0_ nat; 步骤 S144, 获取基本考察单元序列的语音片段的基频特征值序列; 步骤 S145, 计算不同基调下的基本考察单元序列的调整基频值序列, 与 提取的基本考察单元序列的语音片断的基频特征值序列之间的差值;即参考式 (3 )所示,
RMSE_bt = Σ_{i=1..K} (F0_new_bt,i - F0_nat,i)^2 (3)
^¾^^表示在确定基调 bt 下的调整基频值序列和基频特征值序列的差 值, 其中 K表示基本考察单元的个数, F0_newbt, ,是第 i个基本考察单元的调 整基频值, i是第 i个基本考察单元的语音片段的基频特征值 。 bt的 取值为 (-N~+N)。 步骤 S146,根据步骤 S145中计算出的差值,选择使得差值最小的基调下 的各基本考察单元的调整基频值作为相应优化的目标基频值, 记为 F0_use。 通过执行上述步骤 S141至步骤 S146,使得本发明实施例提供的方法可以 根据不同发音人的音域特点对确定的目标基频值进行优化,使其自适应于发音 人的发音特点, 从而提供更好的用户体验。 步骤 S15,根据筒谱中音符的节拍数,和步骤 S12中确定的筒谱中的音符 与基本考察单元的对应关系, 确定每个基本考察单元的目标时长。 其中, 步骤 S15的具体操作参考图 7a所示, 可以包括: 步骤 S151, 根据筒谱中音符的节拍数, 和步骤 S12中获取的筒谱中的音 符与基本考察单元的对应关系, 获得每个基本考察单元对应的节拍数。 需要理解的是, 计算每个基本考察单元对应的节拍数, 可以是根据基本 考察单元和筒谱中音符的对应关系, 和筒谱中音符的节拍数, 统计获得每个基 本考察单元对应的节拍数。如图 7b所示,例如:假设 "雪"音节对应音符 "2", 则 "雪" 对应的节拍数为 1/2拍。 步骤 S152, 根据确定的每个基本考察单元对应的节拍数, 和筒谱中描述 的节奏, 获取每个基本考察单元的目标时长。
其中, 获取每个基本考察单元的目标时长的具体操作, 可以参考式(4 ) 所示, 计算获得。 d _ use = 60/ tempo * d _ note ( 4 ) 其中, d_use为基本考察单元的目标时长, 单位为秒, tempo为筒谱中描 述的节奏, 即每分钟含有的拍数, d_note为步一统计得到的所述基本考察单元 对应的节拍数。 步骤 S16,对输入的语音进行调整,使得调整后的语音的基频为获取的目 标基频, 调整后的语音的时长为目标时长。 其中,步骤 S16的具体操作可以是采用 PSOLA算法对输入的语音进行时 长和基频的调整,使各基本考察单元的语音片段均满足各自对应的所述的目标 时长 d_use和目标基频 F0 _use的调整目标。 若未对获取的目标基频值进行优 化, 也可将未优化的目标基频值作为调整的标准。
本发明实施例提供的一种实现语音歌唱化的方法, 在确定筒谱中的音符 与基本考察单元的对应关系后, 可以根据筒谱中各音符的音高, 和筒谱中各音 符的节拍数,确定每个基本考察单元的目标基频值, 和每个基本考察单元的目 标时长;随后对每个基本考察单元的对应语音片断进行调整使得调整后的语音 的基频为确定的目标基频值, 调整后的语音的时长为确定的目标时长。 因此, 该方法通过对输入的语音信号波形直接进行调整, 避免了多次信号转换的损 失; 且本发明实施例提供的技术方案, 可以对任意长度及任意内容的用户语音 输入向任意歌曲的唱歌语音转换;也就是说本案不局限于对特定歌曲的歌词输 入, 而是允许用户输入任意内容, 实现任意歌曲的转换。 进一步, 本发明实施例提供的技术方案, 可以对任意长度及任意内容的 说话语音向任意歌曲的唱歌语音转换也就是说本案不局限于对特定歌曲的歌 词输入, 而是允许用户输入任意内容, 实现任意歌曲的转换。
再次, 本发明实施例提供的技术方案, 可以实现自动语音切分, 避免了 传统人工切分的负担, 不受语种的限制, 具有更普遍的娱乐效果。 实施例三 如图 8所示, 一种实现语音歌唱化的装置示意筒图, 该装置可以包括: 接 收单元 801 , 切分单元 802, 获取对应关系单元 803 , 获取基频单元 804, 获取 时长单元 805 , 和调整单元 806; 接收单元 801 , 用于接收用户输入的语音信号;
所述切分单元 802, 用于将所述语音信号切分获得各基本考察单元的语音 片断;
所述获取对应关系单元 803 , 用于确定筒谱中的各音符与所述各基本考察 单元的对应关系;
所述获取基频单元 804,用于根据筒谱中各音符的音高,和所述对应关系, 分别确定其所对应的基本考察单元的目标基频值;
所述获取时长单元 805 , 用于根据筒谱中各音符的节拍数, 和所述对应关 系, 分别确定其所对应的基本考察单元的目标时长;
所述调整单元 806, 用于根据所述目标基频值和目标时长调整各基本考察 单元的语音片断,使得调整后的语音片段的基频为所述目标基频值,调整后的 语音片段的时长为所述目标时长。
本发明实施例提供的一种实现语音歌唱化的装置,在确定筒谱中的音符与 基本考察单元的对应关系后, 可以根据筒谱中各音符的音高, 和筒谱中各音符 的节拍数,确定每个基本考察单元的目标基频值, 和每个基本考察单元的目标 时长;随后对每个基本考察单元的对应语音片断进行调整使得调整后的语音的 基频为确定的目标基频值, 调整后的语音的时长为确定的目标时长。 因此, 该 方法通过对输入的语音信号波形直接进行调整, 避免了多次信号转换的损失; 且本发明实施例提供的技术方案,可以对任意长度及任意内容的用户语音输入 向任意歌曲的唱歌语音转换; 也就是说本案不局限于对特定歌曲的歌词输入, 而是允许用户输入任意内容, 实现任意歌曲的转换。 进一步, 如图 9所示, 所述切分单元 802还可以包括:
提取单元 8021 , 确定单元 8022, 和合并单元 8023;
所述提取单元 8021 , 用于从语音信号中逐帧提取语音声学特征矢量, 生 成声学特征矢量序列;
其中, 语音信号中逐帧提取语音声学特征矢量具体可以: 是提取语音的
Mel频率倒谱系数(MFCC, Mel Frequency Cepstrum Coefficient )特征, 对窗 长 25ms帧移 10ms的每帧语音数据做短时分析得到 MFCC参数及其一阶二阶 差分,共计 39维度。 因此,在设备的緩沖区的语音段表征为一 39维特征序列。 所述确定单元 8022, 用于对所述声学特征矢量序列执行语音识别, 确定 基本语音识别单元模型序列及各基本语音识别模型对应的语音片段;
需要理解的是,人的发音过程可以看作是一个双重随机过程,语音信号本 身是一个可观测的时变序列,是由大脑根据语法知识和语言需要(不可观测的 状态)发出的音素的参数流。现有技术中,通过隐马尔可夫模型( HMM, Hidden Markov Model )可以合理地模仿了这一过程,很好地描述了语音信号的整体非 平稳性和局部平稳性, 是一种理想的语音信号模型。 在本发明实施例采用 HMM来模拟静音片段, 浊音片段以及清音片段的发音特点。 对每个模型分别 定义从左到右不可跳转的 N (本方案中可以采用 N=3 )状态 HMM模型, 且可 以确定每个状态的高斯分量为确定的 K个 (K=8)。 为了准确模拟静音片断, 浊音片段以及清音片段的发音特点, 系统预先收 集语音数据并对模型参数进行训练。具体可以是: 通过对训练语音数据集的人 工切分和标注, 确定静音(silence ), 浊音(voiced )和清音(unvoiced ) 的训 练数据集; 随后从所述各对应训练数据集中提取声学特征, 如 MFCC特征; 接着系统在预设的如最大似然估计( MLE , Maximum Likelihood Estimation ) 训练准则下训练得到静音片断, 浊音片断以及清音片断的模型参数。 当在从语音信号中提取声学特征矢量之后, 具体可以是 MFCC参数, 根 据所述 MFCC参数和预设 HMM模型可以识别得到静音片段段, 浊音片段以 及清音片段的模型序列,并且,将所述语音信号切片为:静音片段, 浊音片段, 和清音片段。
如图 4所示预先定义的搜索网络示例, 其中,每条路径都表示一种可能的 静音片段, 浊音片断, 清音片断的组合方式。
所述合并单元 8023; 用于合并所述基本语音识别单元对应的语音片段得 到基本考察单元的语音片段。
当基本语音识别模型包括: 静音识别模型, 浊音识别模型和清音识别模型 三种; 则合并基本语音识别单元对应的语音片段得到基本考察单元的语音片 断, 具体包括: 将浊音片断和清音片断合并构成基本考察单元的语音片断。
由于确定的语音片段的单元相应于音符往往过小,因而不能和筒谱的音符 很好的对应。本发明实施例还考虑根据实际需要对模型语音片断进行合并, 构 成基本考察单元。具体操作可以是: 将每个浊音片断和其之前的清音片断合并 构成新的基本考察单元。
例如: "本"的发音 "ben", 可以划分为清音片段 "b"和浊音片段" en", "本" 字可以作为基本考察单元。 或者, 基本语音识别模型, 包括: 各音素识别模型或音节识别模型; 因此,合并基本语音识别单元对应的语音片段得到基本考察单元的语音片 断,包括:将相邻音素单元片断合并构成基于音节的基本考察单元的语音片断。
进一步, 如图 10所示, 所述获取对应关系单元 803具体包括: 第一统计 单元 8031 , 第一获取单元 8032, 第二统计单元 8033 , 第一判断单元 8034, 第 二获取单元 8035 , 复制单元 8036, 对齐单元 8037; 所述第一统计单元 8031 , 用于获取用户输入的语音信号所对应的基本考 察单元的个数 K;
所述第一获取单元 8032, 用于获得筒谱子片断序列;
所述第二统计单元 8033 , 用于依次统计每个子片段中音符的个数 M; 所述第一判断单元 8034 , 用于判断当前子片段中音符的个数 M是否大于 所述基本考察单元个数 K;
所述第二获取单元 8035 , 用于如果 M大于 K, 根据如下式子获取参数 r,
r = ⌊M / K⌋ 所述复制单元 8036, 用于将基本考察单元序列复制 r遍顺序拼接, 其中, 复制后的总的基本考察单元个数为 rK, 满足 rK<=M;
所述对齐单元 8037, 用于将所述复制后的 rK个基本考察单元, 与所述筒 谱子片段中的 M个音符进行线性对齐。 优选的, 所述对齐单元 8037, 具体用于根据公式: NotIdXj =、j K IM , 将 所述复制后的 rK个基本考察单元, 与所述筒谱子片段中的 M个音符, 进行线 性对齐;
所述 NotldX j表示筒谱子片段中第 j个音符所对应基本考察单元的序号。 优选的, 所述装置还包括: 第二判断单元,
所述第二判断单元, 用于判断所述筒谱是否结束; 所述对齐单元 8037 , 具体用于若所述筒谱未结束, 将所示筒谱中后一个 子片段与当前的子片段联合,与基本考察单元进行对应;若判断所述筒谱结束, 将当前音符子片段中的音符与基本考察单元——对应后删除未对应上的基本 考察单元。 优 选 的 , 所 述 获取基 频 单 元 : 具 体 用 于 根据 公 式
F0_mle = 440 * 2^((p-69)/12) , 计算所述音符音高对应的目标基频值, 作为对应的基本考察单元的目标基频值; 其中 F0_mle为目标基频值, 440表示中央 C上 A音符发出的频率, p为当前筒谱中标注的音高与中央 C上 A音符的距离。 优选的, 如图 11所示, 所述装置还包括: 调整基调单元 807, 用于根据发音人的音域特点, 对所述获取的目标基频值进行调整;
所述调整基调单元 807具体包括:第三获取单元 8071 ,第四获取单元 8072, 生成单元 8073 , 第五获取单元 8074, 第六获取单元 8075 , 选择单元 8076; 第三获取单元 8071 , 用于对每个基本考察单元的目标基频值进行升降调 处理, 获取在不同基调下的调整基频值; 第四获取单元 8072, 用于获取在不同基调下的基本考察单元序列的调整 基频值序列; 生成单元 8073 , 用于提取每个基本考察单元的语音片断的基频特征序列, 并计算平均, 生成基频特征值; 第五获取单元 8074, 用于获取基本考察单元序列的语音片段的基频特征 值序列; 第六获取单元 8075 , 用于计算不同基调下的基本考察单元序列的调整基 频值序列,与提取的基本考察单元序列的语音片断的基频特征值序列之间的差 值;
选择单元 8076, 用于选择使得差值最小的基调下的各基本考察单元的调整基 频值作为相应优化的目标基频值。
优选的, 如图 12所示, 所述获取时长单元 805具体包括: 获取节拍数单 元 8051 , 和获取目标单元 8052,
所述获取节拍数单元 8051 , 用于根据筒谱中音符的节拍数, 和所述筒谱 中的音符与基本考察单元的对应关系, 获得每个基本考察单元对应的节拍数, 所述获取目标单元 8052, 用于根据获取的每个基本考察单元对应的节拍 数, 和所述筒谱中描述的节奏, 获取每个基本考察单元的目标时长。
再次, 本发明实施例提供的技术方案, 可以实现自动语音切分, 避免了传 统人工切分的负担, 不受语种的限制, 具有更普遍的娱乐效果。 本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步 骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读 存储介质中, 存储介质可以包括: ROM、 RAM, 磁盘或光盘等。
以上对本发明实施例所提供的种实现语音歌唱化的方法和装置,进行了详 实施例的说明只是用于帮助理解本发明的方法及其核心思想; 同时,对于本领 域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有 改变之处, 综上所述, 本说明书内容不应理解为对本发明的限制。

Claims

权 利 要 求
1、 一种实现语音歌唱化的方法, 其特征在于, 所述方法包括:
接收用户输入的语音信号;
将所述语音信号切分获得各基本考察单元的语音片断;
根据预置的筒谱, 确定筒谱中的各音符与所述各基本考察单元的对应关 系;
根据筒谱中各音符的音高, 和所述对应关系, 分别确定其所对应的基本考 察单元的目标基频值;
根据筒谱中各音符的节拍数, 和所述对应关系, 分别确定其所对应的基本 考察单元的目标时长;
根据所述目标基频值和目标时长调整各基本考察单元的语音片断,使得调 整后的语音片段的基频为所述目标基频值,调整后的语音片段的时长为所述目 标时长。
2、 根据权利要求 1所述的方法, 其特征在于, 所述将所述语音信号切分 获得各基本考察单元的语音片断, 具体包括:
从语音信号中逐帧提取语音声学特征矢量, 生成声学特征矢量序列; 对所述声学特征矢量序列执行语音识别,确定基本语音识别单元模型序列 及各基本语音识别模型对应的语音片段;
合并所述基本语音识别单元对应的语音片段得到基本考察单元的语音片 段。
3、 根据权利要求 2所述的方法, 其特征在于,
所述基本语音识别模型, 包括: 静音识别模型, 浊音识别模型和清音识别 模型三种;
所述合并基本语音识别单元对应的语音片段得到基本考察单元的语音片 断, 包括: 将浊音片断和清音片断合并构成基本考察单元的语音片断。
4、 根据权利要求 2所述的方法, 其特征在于,
所述基本语音识别模型, 包括: 各音素识别模型或音节识别模型; 所述合并基本语音识别单元对应的语音片段得到基本考察单元的语音片 断,包括:将相邻音素单元片断合并构成基于音节的基本考察单元的语音片断。
5、 根据权利要求 1所述的方法, 其特征在于, 所述根据预置的筒谱, 确 定筒谱中的音符与基本考察单元的对应关系, 具体包括:
获取用户输入的语音信号所对应的基本考察单元的个数 K;
获得筒谱子片断序列;
依次统计每个子片段中音符的个数 M;
判断当前子片段中音符的个数 M是否大于所述基本考察单元个数 K, 如 果 M大于 K, 根据如下式子获取参数 r,
r = ⌊M / K⌋
将基本考察单元序列复制 r遍顺序拼接, 其中, 复制后的总的基本考察单 元个数为 rK, 满足 rK<=M;
将所述复制后的 rK个基本考察单元,与所述筒谱子片段中的 M个音符进 行线性对齐。
6、根据权利要求 5所述的方法, 其特征在于, 所述将所述复制后的 rK个基本考察单元, 与所述筒谱子片段中的 M个音符, 进行线性对齐, 具体包括: 根据公式: NotIdx_j = [j * rK / M] , 将所述复制后的 rK个基本考察单元, 与所述筒谱子片段中的 M个音符, 进行线性对齐;
所述 NotldX j表示筒谱子片段中第 j个音符所对应基本考察单元的序号。
7、 根据权利要求 5所述的方法, 其特征在于, 当判断当前音符子片段中 音符总个数 M小于基本考察单元个数 K, 即 M<K时, 所述方法还包括:
判断所述筒谱是否结束,如果未结束,将所示筒谱中后一个子片段与当前 的子片段联合, 与基本考察单元进行对应;
若判断所述筒谱结束,将当前音符子片段中的音符与基本考察单元一一对 应后删除未对应上的基本考察单元。
8、 根据权利要求 1所述的方法, 其特征在于, 所述根据筒谱中各音符的音高, 和所述对应关系, 确定其所对应的基本考察单元的目标基频值, 包括: 根据公式 F0_mle = 440 * 2^((p-69)/12) , 计算所述音符音高对应的目标基频值, 作为对应的基本考察单元的目标基频值; 其中 F0_mle为目标基频值, 440表示中央 C上 A音符发出的频率, p为当前筒谱中标注的音高与中央 C上 A音符的距离。
9、 根据权利要 8所述的方法, 其特征在于, 在获取基本考察单元的目标 基频值后, 还包括:
根据发音人的音域特点, 对所述基本考察单元的目标基频值进行调整; 在获取基本考察单元的目标基频值后,还根据发音人的音域特点,对所述 基本考察单元的目标基频值进行优化, 具体包括:
对每个基本考察单元的目标基频值进行升降调处理,获取在不同基调下的 调整基频值; 获取在不同基调下的基本考察单元序列的调整基频值序列;
提取每个基本考察单元的语音片断的基频特征序列, 并计算平均, 生成基 频特征值;
获取基本考察单元序列的语音片段的基频特征值序列; 计算不同基调下的基本考察单元序列的调整基频值序列,与提取的基本考 察单元序列的语音片断的基频特征值序列之间的差值; 选择使得差值最小的基调下的各基本考察单元的调整基频值作为相应优化的 目标基频值。
10、 根据权利要求 1所述的方法, 其特征在于, 所述根据筒谱中音符的节 拍数,和所述对应关系,确定其所对应的基本考察单元的目标时长,具体包括: 根据筒谱中音符的节拍数,和所述筒谱中的音符与基本考察单元的对应关 系, 获得每个基本考察单元对应的节拍数, 根据获取的每个基本考察单元对应的节拍数, 和所述筒谱中描述的节奏, 获取每个基本考察单元的目标时长。
11、 一种实现语音歌唱化的装置, 其特征在于, 该装置包括: 接收单元, 切分单元, 获取对应关系单元, 获取基频单元, 获取时长单元, 和调整单元; 所述接收单元, 用于接收用户输入的语音信号;
所述切分单元, 用于将所述语音信号切分获得各基本考察单元的语音片 断; 所述获取对应关系单元,用于确定筒谱中的各音符与所述各基本考察单元 的对应关系;
所述获取基频单元, 用于根据筒谱中各音符的音高, 和所述对应关系, 分 别确定其所对应的基本考察单元的目标基频值;
所述获取时长单元, 用于根据筒谱中各音符的节拍数, 和所述对应关系, 分别确定其所对应的基本考察单元的目标时长;
所述调整单元,用于根据所述目标基频值和目标时长调整各基本考察单元 的语音片断,使得调整后的语音片段的基频为所述目标基频值,调整后的语音 片段的时长为所述目标时长。
12、 根据权利要求 11所述的装置, 其特征在于, 所述切分单元包括: 提取单元, 确定单元, 和合并单元;
所述提取单元, 用于从语音信号中逐帧提取语音声学特征矢量, 生成声学 特征矢量序列;
所述确定单元, 用于对所述声学特征矢量序列执行语音识别,确定基本语 音识别单元模型序列及各基本语音识别模型对应的语音片段;
所述合并单元;用于合并所述基本语音识别单元对应的语音片段得到基本 考察单元的语音片段。
13、 根据权利要求 12所述装置, 其特征在于,
所述基本语音识别模型, 包括: 静音识别模型, 浊音识别模型和清音识别 模型三种;
所述合并基本语音识别单元对应的语音片段得到基本考察单元的语音片 断, 包括: 将浊音片断和清音片断合并构成基本考察单元的语音片断。
14、 根据权利要求 12所述装置, 其特征在于,
所述基本语音识别模型, 包括: 各音素识别模型或音节识别模型; 所述合并基本语音识别单元对应的语音片段得到基本考察单元的语音片 断,包括:将相邻音素单元片断合并构成基于音节的基本考察单元的语音片断。
15、 根据权利要求 11所述的装置, 其特征在于, 所述获取对应关系单元 具体包括: 第一统计单元, 第一获取单元, 第二统计单元, 第一判断单元, 第 二获取单元, 复制单元, 对齐单元; 所述第一统计单元,用于获取用户输入的语音信号所对应的基本考察单元 的个数 K;
所述第一获取单元, 用于获得筒谱子片断序列;
所述第二统计单元, 用于依次统计每个子片段中音符的个数 M;
所述第一判断单元, 用于判断当前子片段中音符的个数 M是否大于所述 基本考察单元个数 K;
所述第二获取单元, 用于如果 M大于 K, 根据如下式子获取参数 r, r = ⌊M / K⌋
所述复制单元, 用于将基本考察单元序列复制 r遍顺序拼接, 其中, 复制 后的总的基本考察单元个数为 rK, 满足 rK<=M;
所述对齐单元, 用于将所述复制后的 rK个基本考察单元, 与所述筒谱子 片段中的 M个音符进行线性对齐。
16、 根据权利要求 15所述装置, 其特征在于, 所述对齐单元, 具体用于根据公式: NotIdx_j = [j * rK / M] , 将所述复制后的 rK个基本考察单元, 与所述筒谱子片段中的 M个音符, 进行线性对齐;
所述 NotldX j表示筒谱子片段中第 j个音符所对应基本考察单元的序号。
17、 根据权利要求 15所述装置, 其特征在于, 所述装置还包括: 第二判 断单元,
所述第二判断单元, 用于判断所述筒谱是否结束; 所述对齐单元, 具体用于若所述筒谱未结束,将所示筒谱中后一个子片段 与当前的子片段联合, 与基本考察单元进行对应; 若判断所述筒谱结束, 将当 前音符子片段中的音符与基本考察单元——对应后删除未对应上的基本考察 单元。
18、 根据权利要求 11所述装置, 其特征在于, 所述获取基频单元: 具体用于根据公式 F0_mle = 440 * 2^((p-69)/12) , 计算所述音符音高对应的目标基频值, 作为对应的基本考察单元的目标基频值; 其中 F0_mle为目标基频值, 440表示中央 C上 A音符发出的频率, p为当前筒谱中标注的音高与中央 C上 A音符的距离。
19、 根据权利要求 18所述装置, 其特征在于, 所述装置还包括: 调整基 调单元, 用于根据发音人的音域特点, 对所述获取的目标基频值进行调整; 所述调整基调单元具体包括: 第三获取单元, 第四获取单元, 生成单元, 第五获取单元, 第六获取单元, 选择单元;
第三获取单元, 用于对每个基本考察单元的目标基频值进行升降调处理, 获取在不同基调下的调整基频值;
第四获取单元, 用于获取在不同基调下的基本考察单元序列的调整基频值 序列;
生成单元, 用于提取每个基本考察单元的语音片断的基频特征序列, 并计 算平均, 生成基频特征值;
第五获取单元, 用于获取基本考察单元序列的语音片段的基频特征值序 列;
第六获取单元, 用于计算不同基调下的基本考察单元序列的调整基频值序 列, 与提取的基本考察单元序列的语音片断的基频特征值序列之间的差值; 选择单元, 用于选择使得差值最小的基调下的各基本考察单元的调整基频值作 为相应优化的目标基频值。
20、 根据权利要求 11所述的装置, 其特征在于, 所述获取时长单元具体 包括: 获取节拍数单元, 和获取目标单元,
所述获取节拍数单元, 用于根据筒谱中音符的节拍数, 和所述筒谱中的音 符与基本考察单元的对应关系, 获得每个基本考察单元对应的节拍数,
所述获取目标单元, 用于根据获取的每个基本考察单元对应的节拍数, 和 所述筒谱中描述的节奏, 获取每个基本考察单元的目标时长。
PCT/CN2012/087999 2012-12-31 2012-12-31 一种实现语音歌唱化的方法和装置 WO2014101168A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210591777.0A CN103915093B (zh) 2012-12-31 2012-12-31 一种实现语音歌唱化的方法和装置
CN201210591777.0 2012-12-31

Publications (1)

Publication Number Publication Date
WO2014101168A1 true WO2014101168A1 (zh) 2014-07-03

Family

ID=51019775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/087999 WO2014101168A1 (zh) 2012-12-31 2012-12-31 一种实现语音歌唱化的方法和装置

Country Status (2)

Country Link
CN (1) CN103915093B (zh)
WO (1) WO2014101168A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420008A (zh) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 录制歌曲的方法、装置、电子设备及存储介质
WO2021158613A1 (en) * 2020-02-06 2021-08-12 Tencent America LLC Learning singing from speech

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107248406B (zh) * 2017-06-29 2020-11-13 义乌市美杰包装制品有限公司 一种自动生成鬼畜类歌曲的方法
CN107749301B (zh) * 2017-09-18 2021-03-09 得理电子(上海)有限公司 一种音色样本重构方法及系统、存储介质及终端设备
CN107818792A (zh) * 2017-10-25 2018-03-20 北京奇虎科技有限公司 音频转换方法及装置
CN108053814B (zh) * 2017-11-06 2023-10-13 芋头科技(杭州)有限公司 一种模拟用户歌声的语音合成系统及方法
CN110838286B (zh) * 2019-11-19 2024-05-03 腾讯科技(深圳)有限公司 一种模型训练的方法、语种识别的方法、装置及设备
CN112951198B (zh) * 2019-11-22 2024-08-06 微软技术许可有限责任公司 歌声合成
CN111429877B (zh) * 2020-03-03 2023-04-07 云知声智能科技股份有限公司 歌曲处理方法及装置
CN111445892B (zh) * 2020-03-23 2023-04-14 北京字节跳动网络技术有限公司 歌曲生成方法、装置、可读介质及电子设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308652A (zh) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 一种个性化歌唱语音的合成方法
US20090314155A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Synthesized singing voice waveform generator
CN102568457A (zh) * 2011-12-23 2012-07-11 深圳市万兴软件有限公司 一种基于哼唱输入的乐曲合成方法及装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4483188B2 (ja) * 2003-03-20 2010-06-16 ソニー株式会社 歌声合成方法、歌声合成装置、プログラム及び記録媒体並びにロボット装置
CN1246825C (zh) * 2003-08-04 2006-03-22 扬智科技股份有限公司 预估语音信号的语调估测值的方法和装置
DE102004049457B3 (de) * 2004-10-11 2006-07-06 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Verfahren und Vorrichtung zur Extraktion einer einem Audiosignal zu Grunde liegenden Melodie
CN100347741C (zh) * 2005-09-02 2007-11-07 清华大学 移动语音合成方法
CN101399036B (zh) * 2007-09-30 2013-05-29 三星电子株式会社 将语音转换为说唱音乐的设备和方法
CN101923861A (zh) * 2009-06-12 2010-12-22 傅可庭 可转换语音为歌曲的音频合成装置
CN101901598A (zh) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 一种哼唱合成方法和系统
CN102682760B (zh) * 2011-03-07 2014-06-25 株式会社理光 重叠语音检测方法和系统
CN102664016B (zh) * 2012-04-23 2014-05-14 安徽科大讯飞信息科技股份有限公司 唱歌评测方法及系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090314155A1 (en) * 2008-06-20 2009-12-24 Microsoft Corporation Synthesized singing voice waveform generator
CN101308652A (zh) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 一种个性化歌唱语音的合成方法
CN102568457A (zh) * 2011-12-23 2012-07-11 深圳市万兴软件有限公司 一种基于哼唱输入的乐曲合成方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIA, JIA ET AL., A SPEECH MODIFICATION BASED SINGING VOICE SYNTHESIS SYSTEM, NCMMSC, vol. 09, 20 August 2009 (2009-08-20), pages 446 - 450 *
QI, FENGYAN ET AL.: "A Method for Voiced/Unvoiced/Silence Classification of Speech with Noise Using SVM", CHINESE JOURNAL OF ELECTRONICS, vol. 34, no. 4, April 2006 (2006-04-01), pages 605 - 611 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112420008A (zh) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 录制歌曲的方法、装置、电子设备及存储介质
WO2021158613A1 (en) * 2020-02-06 2021-08-12 Tencent America LLC Learning singing from speech
US11430431B2 (en) 2020-02-06 2022-08-30 Tencent America LLC Learning singing from speech

Also Published As

Publication number Publication date
CN103915093B (zh) 2019-07-30
CN103915093A (zh) 2014-07-09

Similar Documents

Publication Publication Date Title
WO2014101168A1 (zh) 一种实现语音歌唱化的方法和装置
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
US8005666B2 (en) Automatic system for temporal alignment of music audio signal with lyrics
JP5024711B2 (ja) 歌声合成パラメータデータ推定システム
JP4246792B2 (ja) 声質変換装置および声質変換方法
US8880409B2 (en) System and method for automatic temporal alignment between music audio signal and lyrics
US9852721B2 (en) Musical analysis platform
CN110600055B (zh) 一种使用旋律提取与语音合成技术的歌声分离方法
US9804818B2 (en) Musical analysis platform
Sharma et al. NHSS: A speech and singing parallel database
JP6561499B2 (ja) 音声合成装置および音声合成方法
Devaney et al. A Study of Intonation in Three-Part Singing using the Automatic Music Performance Analysis and Comparison Toolkit (AMPACT).
JP7380809B2 (ja) 電子機器、電子楽器、方法及びプログラム
US11842720B2 (en) Audio processing method and audio processing system
JP2015068897A (ja) 発話の評価方法及び装置、発話を評価するためのコンピュータプログラム
Lux et al. The IMS Toucan System for the Blizzard Challenge 2023
Cen et al. Template-based personalized singing voice synthesis
Tsai et al. Singer identification based on spoken data in voice characterization
He et al. Turning a Monolingual Speaker into Multilingual for a Mixed-language TTS.
CN111837184A (zh) 声音处理方法、声音处理装置及程序
Nakano et al. A drum pattern retrieval method by voice percussion
Lee et al. A comparative study of spectral transformation techniques for singing voice synthesis.
JP5131904B2 (ja) 音楽音響信号と歌詞の時間的対応付けを自動で行うシステム及び方法
Turk et al. Application of voice conversion for cross-language rap singing transformation
Zhou et al. Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12890785

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12890785

Country of ref document: EP

Kind code of ref document: A1