CN109712634A - A kind of automatic sound conversion method - Google Patents

A kind of automatic sound conversion method

Info

Publication number
CN109712634A
CN109712634A
Authority
CN
China
Prior art keywords
voice
source
time
sound
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811583082.1A
Other languages
Chinese (zh)
Inventor
栾峰
杜中强
张镇荣
黄楚均
潘步年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811583082.1A priority Critical patent/CN109712634A/en
Publication of CN109712634A publication Critical patent/CN109712634A/en
Pending legal-status Critical Current


Landscapes

  • Auxiliary Devices For Music (AREA)

Abstract

The present invention discloses an automatic voice conversion method comprising the following steps: 1) smoothly align the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics; 2) apply time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time; 3) modify the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm. The invention achieves fully automatic voice conversion: it requires neither manual correction nor extra information, and modifies only the expressive elements of the singing while preserving the singer's timbre. It is of great use not only in the field of singing, but also in fields such as lecturing, teaching, and entertainment.

Description

A kind of automatic sound conversion method
Technical field
The present invention relates to voice processing technology, and specifically to an automatic voice conversion method.
Background technique
With continuously improving living standards, people's cultural life has become richer and richer, and singing (e.g., karaoke) is one of their common forms of entertainment. Depending on the singer's skill, a song can be rendered by voice-processing software either as moving music or as unpleasant noise. Among conversions such as singing-voice deformation, gesture-to-speech synthesis, speech-to-singing and singing-to-speech conversion, and voice timbre conversion, it is common practice for the conversion method to take a reference recording as the target from which the singing-expression parameters are obtained.
Commercial pitch-correction tools such as Autotune, VariAudio and Melodyne focus primarily on changing the pitch of a song; some of them also manipulate note onset times or other musical expression by editing MIDI notes transcribed from the recording. Although they provide a degree of automatic control, the correction process is usually tedious and repetitive before a satisfactory result is obtained.
Some previous work attempts to minimize the manual modification of musical expression in audio signals. Bryan et al. proposed a variable-rate time-stretching method that lets a user easily modify the stretch ratio: given a user-guided stiffness curve, the method automatically computes a time-dependent stretch rate through a constrained optimization program. Roebel et al. proposed an algorithm for removing vibrato expression, based entirely on smoothing the spectral envelope without manipulating individual partial parameters. Although these methods make processing singing signals more convenient, they still require user guidance or parameter control to some extent.
Summary of the invention
To overcome the deficiencies of prior-art voice conversion, namely a tedious correction process and the need for user guidance or parameter control, the problem to be solved by the present invention is to provide an automatic voice conversion method that converts a voice into a specified voice without manual correction.
In order to solve the above technical problem, the technical solution adopted by the present invention is as follows:
The automatic voice conversion processing method of the present invention comprises the following steps:
1) smoothly align the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics;
2) apply time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time;
3) modify the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
In step 2), time alignment of the source and target voices means extracting features from both voices and then aligning those features using dynamic time warping.
The extracted features are two: the max-filtered constant-Q transform, and the phoneme scores extracted from a phoneme classifier.
In step 2), the time-stretch ratio is obtained by applying a third-order Savitzky-Golay filter to the piecewise-linear alignment path, using the sgolayfilt function in MATLAB; the smoothed result is compared with the original alignment path, and the time-stretch rate is computed from the local slope of the filtered path.
In step 2), time-scale modification means applying a time-scale modification (TSM) algorithm with the smoothed per-frame time-stretch ratio, so as to align the voices in time.
In step 3), pitch modification by the pitch-synchronous overlap-add algorithm aligns the pitch; the pitch ratio required by the algorithm is computed as follows (the formula image is not reproduced in this text; the expression below is reconstructed from the definitions given):

β(i) = f0T(i) / f0ST(i) for strongly periodic frames (low asT(i)), and β(i) = 1 otherwise    (1)

where β(i) is the pitch ratio, f0T(i) and f0ST(i) respectively denote the frame-level pitch sequences of the target voice and the time-aligned source voice, and asT(i) is the aperiodicity obtained from the source after time alignment.
In step 3), the amplitude-envelope matching algorithm aligns the volume by computing the frame-level amplitude gain between the two voices and multiplying it into the source sound: an envelope is extracted from each voice using root-mean-square values, and the amplitude gain is obtained from the ratio of the two amplitude envelopes.
The invention has the following beneficial effects and advantages:
1. the present invention realizes full automatic voice conversion, manual correction is not needed, additional information, such as symbol are not needed Music notation and the lyrics etc., this method only modify the expressive element in singing while keeping song tone color.
2. the method for the present invention not only has great purposes in field of singing, but also is giving a lecture, impart knowledge to students, the fields such as amusement tool There is great purposes.
Detailed description of the invention
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2A shows the similarity matrix and alignment path obtained by DTW when simply using the spectra of the two singing voices as features;
Fig. 2B shows the similarity matrix and alignment path obtained by DTW when using the max-filtered constant-Q transform;
Fig. 2C shows the similarity matrix and alignment path obtained by DTW when using the phoneme scores extracted from a phoneme classifier;
Fig. 2D shows the similarity matrix and alignment path obtained by DTW when using both the max-filtered constant-Q transform and the phoneme scores extracted from a phoneme classifier;
Fig. 3 is an enlarged view of the alignment path (dotted line) and the Savitzky-Golay filtered path (solid line);
Fig. 4 lists the data used to evaluate the invention;
Fig. 5 is a histogram of the time-alignment results;
Fig. 6 shows the mean pitch difference between the source voice and the target voice after conversion;
Fig. 7 shows the mean volume difference (expressed as RMS) between the source voice and the target voice.
Specific embodiment
The present invention is further elaborated below with reference to the accompanying drawings.
Because different people utter the same sentence with considerable differences in rhythm, pitch, loudness, and so on, the present invention aligns the two sounds using features common to both: the melody (via the max-filtered constant-Q transform) and the speech characteristics (phoneme scores extracted from a phoneme classifier). According to the smooth-alignment result, the source voice is then time-scale modified with a certain time-stretch ratio. Once the two voices are aligned, the method modifies the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
As shown in Fig. 1, the automatic voice conversion method of the present invention comprises the following steps:
1) smoothly align the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics;
2) apply time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time;
3) modify the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
The method of the present invention converts a voice from the source voice into a specified voice.
In step 1), smooth alignment of the voices means extracting features from the source voice and the target voice, and then aligning those features using dynamic time warping (DTW).
Feature extraction is first performed on the source voice and the target voice, and involves mainly two features. The first, which handles the melody aspect, is the max-filtered constant-Q transform. Specifically, a constant-Q transform based on a bank of 88 band-pass filters is used, each filter designed to cover one note with semitone resolution. Maximum filtering is then applied to further mitigate tonal variation, particularly where the pitch difference between the two singing voices exceeds one semitone, for example because of a wrongly sung note or an excessive pitch bend. The similarity matrix and alignment path in Fig. 2B show that the detours in segments with strong vibrato become more diagonal. Fig. 2A, in contrast, simply uses the spectra of the two singing voices as the sound features: although the alignment path returned by the DTW algorithm finds note onsets and offsets quite successfully, it often fails to find the correct alignment path when one voice has vibrato or pitch bends.
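The max-filtering step described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the 88-bin constant-Q matrix is synthetic, and realizing the filter as a running maximum over a ±1-semitone window with `scipy.ndimage.maximum_filter1d` is an assumption consistent with the text.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

def max_filter_cqt(cqt, semitone_radius=1):
    """Apply a running maximum across the pitch axis of a constant-Q
    spectrogram (one bin per semitone assumed), so that vibrato or pitch
    bends within +/- `semitone_radius` semitones do not move the dominant
    bin. `cqt` has shape (n_bins, n_frames)."""
    size = 2 * semitone_radius + 1
    return maximum_filter1d(cqt, size=size, axis=0, mode="nearest")

# Toy example: a melody whose energy wobbles between two adjacent bins
# (simulating vibrato). After max filtering, bin 40 peaks in every frame.
cqt = np.zeros((88, 4))
cqt[40, 0] = cqt[41, 1] = cqt[40, 2] = cqt[41, 3] = 1.0
filtered = max_filter_cqt(cqt, semitone_radius=1)
print(filtered[40])  # -> [1. 1. 1. 1.]
```

Because the wobble stays within the filter window, the similarity matrix built from these features no longer sees vibrato as a mismatch, which is the effect Fig. 2B illustrates.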
The other feature is the phoneme score extracted from a phoneme classifier. This captures the speech information in the sound while eliminating the timbre difference between the two voices. An open-source phoneme classifier is used to predict frame-level phoneme probability distributions. It takes 39 mel-frequency cepstral coefficients (MFCCs) with delta and double-delta as input features, and is trained with the HTK speech recognition toolkit to predict the distribution over 39 phonemes as output. This output is used as the lyric feature vector for time alignment. The similarity matrix and alignment path in Fig. 2C show that the phonetic features also help mitigate the detour problem. Fig. 2D shows the result when both the melody and lyric features are used: the alignment path is similar to that in Fig. 2C, but becomes smoother.
Both of the above features are used together as the input to DTW for alignment.
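The DTW alignment over the combined feature sequences can be illustrated with a minimal implementation. This is a textbook dynamic-time-warping sketch in Python, not the patent's code: it uses plain Euclidean frame distance and omits the step weights and windowing a production aligner would add.

```python
import numpy as np

def dtw_path(X, Y):
    """Minimal DTW between feature sequences X (n, d) and Y (m, d).
    Returns the optimal alignment path as a list of (i, j) index pairs."""
    n, m = len(X), len(Y)
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    # Backtrack from the end of both sequences.
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return path[::-1]

# Two sequences tracing the same contour at different speeds.
X = np.array([[0.], [1.], [2.], [3.]])
Y = np.array([[0.], [0.], [1.], [2.], [2.], [3.]])
path = dtw_path(X, Y)
print(path[0], path[-1])  # (0, 0) (3, 5)
```

In the method itself, each frame of `X` and `Y` would be the concatenation of the max-filtered constant-Q bins and the 39 phoneme scores, and the returned path is the piecewise-linear alignment path smoothed in the next step.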
In step 2), time alignment of the source and target voices means extracting features from both voices and then aligning those features using dynamic time warping. The extracted features are the max-filtered constant-Q transform, which handles the melody aspect, and the phoneme scores extracted from a phoneme classifier.
In step 2), the time-stretch ratio is obtained by applying a third-order Savitzky-Golay filter to the piecewise-linear alignment path, using the sgolayfilt function in MATLAB; the smoothed result is compared with the original alignment path, and the time-stretch rate is computed from the local slope of the filtered path.
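As a sketch of this smoothing step, the SciPy equivalent of MATLAB's sgolayfilt can be applied to a toy piecewise-linear alignment path; the 11-frame window length is an assumed choice, since the patent does not specify it.

```python
import numpy as np
from scipy.signal import savgol_filter

# Toy piecewise-linear alignment path (target frame index per source frame),
# as DTW might produce: stretched 2x at first, then running at normal speed.
path = np.concatenate([np.arange(0, 20, 0.5), np.arange(20, 40, 1.0)])

# Third-order Savitzky-Golay filter, mirroring sgolayfilt(path, 3, 11);
# the window length of 11 frames is an assumed parameter.
smoothed = savgol_filter(path, window_length=11, polyorder=3)

# Per-frame time-stretch ratio = local slope of the smoothed path.
alpha = np.gradient(smoothed)
print(round(float(alpha[5]), 2), round(float(alpha[-5]), 2))  # -> 0.5 1.0
```

Away from the breakpoint the filter reproduces the linear segments exactly, so the slope recovers the 0.5x and 1.0x stretch rates, while the corner between them is rounded off rather than producing an abrupt rate change.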
In step 2), time-scale modification means applying a time-scale modification (TSM) algorithm with the smoothed per-frame time-stretch ratio, so as to align the sounds in time.
To smooth the time-stretch ratio, a Savitzky-Golay filter is used; this is an approximation method that fits successive subsets of the values with a low-order polynomial in a convolution-like manner. Specifically, a third-order Savitzky-Golay filter is applied to the piecewise-linear alignment path using the sgolayfilt function in MATLAB. The smoothed result is compared with the alignment path in Fig. 3. To compute the time-stretch rate α, the local slope of the filtered path is simply used. Once the per-frame time-stretch ratio is obtained, it is applied to a time-scale modification (TSM) algorithm to align the sounds in time; specifically, the waveform-similarity-based overlap-and-add (WSOLA) from the TSM toolbox is used.
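The time-scale modification step can be illustrated with a plain overlap-add stretcher. The patent uses WSOLA from the TSM toolbox; the sketch below deliberately omits WSOLA's waveform-similarity search and a per-frame varying ratio, and is only meant to show how a stretch ratio α maps to analysis and synthesis hop sizes.

```python
import numpy as np

def ola_stretch(x, alpha, frame=1024, hop_out=256):
    """Naive overlap-add TSM: analysis hop = alpha * synthesis hop, so
    alpha > 1 shortens the signal. WSOLA would additionally search a small
    tolerance region around each analysis frame for the best waveform
    match; that search is omitted in this sketch."""
    hop_in = int(round(alpha * hop_out))
    win = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // hop_in + 1)
    out = np.zeros(hop_out * (n_frames - 1) + frame)
    norm = np.zeros_like(out)
    for k in range(n_frames):
        seg = x[k * hop_in : k * hop_in + frame]
        if len(seg) < frame:
            break
        out[k * hop_out : k * hop_out + frame] += win * seg
        norm[k * hop_out : k * hop_out + frame] += win
    norm[norm < 1e-8] = 1.0          # avoid division by zero at the edges
    return out / norm

sr = 8000
t = np.arange(sr) / sr               # 1 second of a 220 Hz tone
x = np.sin(2 * np.pi * 220 * t)
y = ola_stretch(x, alpha=2.0)        # alpha = 2 -> roughly half the duration
print(len(x), len(y))
```

In the method, α would instead vary per frame, following the smoothed local slope of the alignment path from the previous step.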
In step 3), pitch modification by the pitch-synchronous overlap-add algorithm aligns the pitch; the pitch ratio required by the algorithm is computed as follows (the formula image is not reproduced in this text; the expression below is reconstructed from the definitions given):

β(i) = f0T(i) / f0ST(i) for strongly periodic frames (low asT(i)), and β(i) = 1 otherwise    (1)

where β(i) is the pitch ratio, f0T(i) and f0ST(i) respectively denote the frame-level pitch sequences of the target voice and the time-aligned source voice, and asT(i) is the aperiodicity obtained from the source after time alignment.
As shown in Equation 1, the method applies pitch modification only to strongly periodic segments. The YIN algorithm is used to extract the pitch of each voice; it returns the aperiodicity as a by-product. Harmonic-percussive source separation (HPSS) with median filtering [15] is also used to separate the harmonic signal from each sound before applying it to the pitch detector.
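The gated pitch-ratio computation of Equation 1 can be sketched as follows. The reconstruction (ratio of target pitch to time-aligned source pitch, applied only to strongly periodic frames) follows the surrounding text; the 0.2 aperiodicity threshold is taken from the evaluation section, and gating on the source aperiodicity is an assumption of this sketch.

```python
import numpy as np

def pitch_ratio(f0_target, f0_source_aligned, aperiodicity, thresh=0.2):
    """Frame-level pitch ratio beta(i) = f0_T(i) / f0_ST(i), applied only
    where the time-aligned source frame is strongly periodic (aperiodicity
    below `thresh`); elsewhere beta(i) = 1, i.e. no pitch modification."""
    f0_t = np.asarray(f0_target, dtype=float)
    f0_s = np.asarray(f0_source_aligned, dtype=float)
    ap = np.asarray(aperiodicity, dtype=float)
    beta = np.ones_like(f0_s)
    voiced = (ap < thresh) & (f0_s > 0) & (f0_t > 0)
    beta[voiced] = f0_t[voiced] / f0_s[voiced]
    return beta

f0_t = [220.0, 220.0, 0.0, 330.0]
f0_s = [200.0, 220.0, 0.0, 300.0]
ap   = [0.05, 0.05, 0.9, 0.5]       # last two frames too aperiodic to modify
print(pitch_ratio(f0_t, f0_s, ap))  # beta = [1.1, 1.0, 1.0, 1.0]
```

Each β(i) would then drive the pitch-synchronous overlap-add modification of the corresponding source frame.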
In step 3), the amplitude-envelope matching algorithm aligns the volume by computing the frame-level amplitude gain between the two voices and multiplying it into the source sound: an envelope is extracted from each voice using root-mean-square values, and the amplitude gain is obtained from the ratio of the two amplitude envelopes.
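A minimal sketch of the amplitude-envelope matching: frame-level RMS envelopes are extracted from both voices and the source is multiplied by their ratio. Interpolating the frame gains linearly back to sample level is an assumed detail, since the patent only specifies the frame-level ratio.

```python
import numpy as np

def rms_envelope(x, frame=1024, hop=256):
    """Frame-level root-mean-square amplitude envelope."""
    n = max(1, (len(x) - frame) // hop + 1)
    return np.array([np.sqrt(np.mean(x[k * hop:k * hop + frame] ** 2))
                     for k in range(n)])

def match_volume(source, target, frame=1024, hop=256, eps=1e-8):
    """Scale each source frame by the ratio of the two RMS envelopes so
    the source follows the target's dynamics."""
    env_s = rms_envelope(source, frame, hop)
    env_t = rms_envelope(target, frame, hop)
    m = min(len(env_s), len(env_t))
    gain = env_t[:m] / (env_s[:m] + eps)
    centres = np.arange(m) * hop + frame // 2
    # Assumed detail: linear interpolation of frame gains to sample level.
    per_sample = np.interp(np.arange(len(source)), centres, gain)
    return source * per_sample

rng = np.random.default_rng(0)
src = 0.1 * rng.standard_normal(8000)   # quiet source
tgt = 0.5 * rng.standard_normal(8000)   # louder target
out = match_volume(src, tgt)
# The matched source's overall RMS should now be close to the target's.
```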
For this embodiment, four songs of differing styles were collected (16 recordings in total, from different singers). For each song, one recording is the target singing voice, from a professional or a person with skilled singing technique, and the rest come from amateur singers. Since a common singing voice is modified by taking the musical expression from the target, 12 song pairs were selected (3 pairs per song). The singers sang while watching the displayed lyrics. Each excerpt is about 10 to 20 seconds long, taken from the chorus of the original song. Fig. 4 summarizes the dataset, including the characteristics and quantity of the songs used in the evaluation.
Evaluation of time alignment: to assess the performance of time alignment, this embodiment aligns the modified source voice STPE in Fig. 1 with the target voice using DTW on spectrograms, and computes the standard deviation of the local slopes along the DTW path (the slope is constant when the two are perfectly aligned). In addition, rather than computing the standard deviation directly on the local slopes, each slope is transformed with the arctangent function, θ = arctan(s), where s is the local slope of the path; values are thereby mapped from an unbounded range (0 to infinity) to a finite one (0 to π/2 radians).
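The arctangent mapping used in this evaluation can be shown in a few lines:

```python
import numpy as np

# Local slopes along a DTW path range over (0, inf), so a raw standard
# deviation is dominated by the occasional near-vertical step. Mapping
# each slope s to theta = arctan(s) compresses (0, inf) into (0, pi/2)
# before taking the standard deviation, as described in the evaluation.
slopes = np.array([0.5, 1.0, 1.0, 2.0, 10.0])  # illustrative values
theta = np.arctan(slopes)
spread = float(np.std(theta))
print(round(spread, 3))
```

A perfectly aligned pair would give a constant slope of 1 everywhere, hence a spread of zero; larger spreads indicate more detours in the alignment path.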
Fig. 5 compares the standard deviation of the local slopes for the different audio features.
Overall, using the lyric feature from the phoneme classifier is the most reliable across all examples. This may be because the singers performed the songs with the lyrics displayed, so the phonetic features are very accurate. The melody feature using the max-filtered constant-Q transform also helps improve the alignment, but sometimes fails for low-pitched songs (for example, songs 2-1 to 2-3), because the pitch resolution of the constant-Q transform is not high enough in the low pitch range. Combining the two features does not necessarily improve the result: for half of the examples it achieves the best result, but for the other half it produces results even worse than the melody feature alone.
Evaluation of pitch and volume alignment: for pitch, the mean pitch difference between source and target is compared before and after pitch alignment. Pitch is measured with the YIN algorithm, and only strongly periodic segments are counted (i.e., where the aperiodicity is below 0.2). Fig. 6 shows that after pitch alignment the mean pitch difference is reduced by 78.8% overall. For volume alignment, the mean difference of the amplitude envelopes is computed, specifically using root-mean-square (RMS) values. Fig. 7 shows that after dynamics alignment the mean dynamics difference is reduced by 86.4%.

Claims (7)

1. An automatic voice conversion method, characterized by comprising the following steps:
1) smoothly aligning the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics;
2) applying time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time;
3) modifying the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
2. The automatic voice conversion processing method according to claim 1, characterized in that: in step 2), time alignment of the source and target voices means extracting features from both voices and then aligning those features using dynamic time warping.
3. The automatic voice conversion processing method according to claim 2, characterized in that: the extracted features are the max-filtered constant-Q transform, which handles the melody aspect, and the phoneme scores extracted from a phoneme classifier.
4. The automatic voice conversion processing method according to claim 1, characterized in that: in step 2), the time-stretch ratio is obtained by applying a third-order Savitzky-Golay filter to the piecewise-linear alignment path using the sgolayfilt function in MATLAB; the smoothed result is compared with the original alignment path, and the time-stretch rate is computed from the local slope of the filtered path.
5. The automatic voice conversion processing method according to claim 1, characterized in that: in step 2), time-scale modification means applying a time-scale modification (TSM) algorithm with the smoothed per-frame time-stretch ratio, so as to align the voices in time.
6. The automatic voice conversion processing method according to claim 1, characterized in that: in step 3), pitch modification by the pitch-synchronous overlap-add algorithm aligns the pitch, and the required pitch ratio is computed as follows (the formula image is not reproduced in this text; the expression is reconstructed from the definitions given): β(i) = f0T(i) / f0ST(i) for strongly periodic frames (low asT(i)), and β(i) = 1 otherwise, where β(i) is the pitch ratio, f0T(i) and f0ST(i) respectively denote the frame-level pitch sequences of the target voice and the time-aligned source voice, and asT(i) is the aperiodicity obtained from the source after time alignment.
7. The automatic voice conversion processing method according to claim 1, characterized in that: in step 3), the amplitude-envelope matching algorithm aligns the volume by computing the frame-level amplitude gain between the two voices and multiplying it into the source sound: an envelope is extracted from each voice using root-mean-square values, and the amplitude gain is obtained from the ratio of the two amplitude envelopes.
CN201811583082.1A 2018-12-24 2018-12-24 A kind of automatic sound conversion method Pending CN109712634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811583082.1A CN109712634A (en) 2018-12-24 2018-12-24 A kind of automatic sound conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811583082.1A CN109712634A (en) 2018-12-24 2018-12-24 A kind of automatic sound conversion method

Publications (1)

Publication Number Publication Date
CN109712634A true CN109712634A (en) 2019-05-03

Family

ID=66256120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811583082.1A Pending CN109712634A (en) 2018-12-24 2018-12-24 A kind of automatic sound conversion method

Country Status (1)

Country Link
CN (1) CN109712634A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680187A (en) * 2020-05-26 2020-09-18 平安科技(深圳)有限公司 Method and device for determining music score following path, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1118493A (en) * 1994-08-01 1996-03-13 中国科学院声学研究所 Language and speech converting system with synchronous fundamental tone waves
KR20030031936A (en) * 2003-02-13 2003-04-23 배명진 Mutiple Speech Synthesizer using Pitch Alteration Method
CN1682281A (en) * 2002-09-17 2005-10-12 皇家飞利浦电子股份有限公司 Method for controlling duration in speech synthesis
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN102568476A (en) * 2012-02-21 2012-07-11 南京邮电大学 Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN102664003A (en) * 2012-04-24 2012-09-12 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN103021418A (en) * 2012-12-13 2013-04-03 南京邮电大学 Voice conversion method facing to multi-time scale prosodic features
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104885153A (en) * 2012-12-20 2015-09-02 三星电子株式会社 Apparatus and method for correcting audio data


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUNBO ZHU et al.: "A Chinese Text to Speech System Based on TD-PSOLA", Proceedings of IEEE TENCON'02 *
YAN QIN et al.: "Speech Signal Processing and Recognition", National Defense Industry Press, 31 December 2015 *
LI QINGHUA: "Research and Implementation of Voice Conversion Technology", China Master's Theses Full-text Database, Information Science and Technology *
YUAN XIAOYONG: "A Voice Conversion System Based on the LPAC-PSOLA Synthesis Algorithm", China Master's Theses Full-text Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680187A (en) * 2020-05-26 2020-09-18 平安科技(深圳)有限公司 Method and device for determining music score following path, electronic equipment and storage medium
CN111680187B (en) * 2020-05-26 2023-11-24 平安科技(深圳)有限公司 Music score following path determining method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021218138A1 (en) Song synthesis method, apparatus and device, and storage medium
Muller et al. Signal processing for music analysis
Marolt A connectionist approach to automatic transcription of polyphonic piano music
CN104272382B (en) Personalized singing synthetic method based on template and system
US20080115656A1 (en) Tempo detection apparatus, chord-name detection apparatus, and programs therefor
CN109979488B (en) System for converting human voice into music score based on stress analysis
CN112382257B (en) Audio processing method, device, equipment and medium
CN110136730B (en) Deep learning-based piano and acoustic automatic configuration system and method
Marolt SONIC: Transcription of polyphonic piano music with neural networks
CN103915093A (en) Method and device for realizing voice singing
New et al. Voice conversion: From spoken vowels to singing vowels
CN109903778A (en) The method and system of real-time singing marking
Lerch Software-based extraction of objective parameters from music performances
Bonada et al. Singing voice synthesis combining excitation plus resonance and sinusoidal plus residual models
CN109712634A (en) A kind of automatic sound conversion method
JP2876861B2 (en) Automatic transcription device
WO2008037115A1 (en) An automatic pitch following method and system for a musical accompaniment apparatus
Shenoy et al. Singing voice detection for karaoke application
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
Cwitkowitz Jr End-to-end music transcription using fine-tuned variable-Q filterbanks
Traube et al. Phonetic gestures underlying guitar timbre description
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
JP5810947B2 (en) Speech segment specifying device, speech parameter generating device, and program
Salamon et al. A chroma-based salience function for melody and bass line estimation from music audio signals
CN111681674A (en) Method and system for identifying musical instrument types based on naive Bayes model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190503
