CN109712634A - A kind of automatic sound conversion method - Google Patents

A kind of automatic sound conversion method

Info

Publication number
CN109712634A
CN109712634A
Authority
CN
China
Prior art keywords
voice
source
time
sound
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811583082.1A
Other languages
Chinese (zh)
Inventor
栾峰
杜中强
张镇荣
黄楚均
潘步年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN201811583082.1A priority Critical patent/CN109712634A/en
Publication of CN109712634A publication Critical patent/CN109712634A/en
Pending legal-status Critical Current


Landscapes

  • Auxiliary Devices For Music (AREA)

Abstract

The present invention discloses an automatic voice conversion method comprising the following steps: 1) smoothly align the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics; 2) apply time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time; 3) modify the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm. The invention achieves fully automatic voice conversion: it requires neither manual correction nor extra information, and modifies only the expressive elements of the singing while preserving the singer's timbre. It is of great use not only in the field of singing, but also in fields such as lecturing, teaching, and entertainment.

Description

A kind of automatic sound conversion method
Technical field
The present invention relates to voice processing technology, and specifically to an automatic voice conversion method.
Background technique
With continuously improving living standards, people's cultural life has become richer and richer, and singing (e.g., karaoke) is one of their common forms of entertainment. Depending on the singer's skill, a song can be rendered by voice-processing software either as moving music or as unpleasant noise. Among conversions such as singing-voice deformation, gesture-to-speech synthesis, speech-to-singing and singing-to-speech conversion, and voice timbre conversion, it is common practice for the conversion method to take a reference recording as the target from which the singing-expression parameters are obtained.
Commercial pitch-correction tools such as Autotune, VariAudio and Melodyne focus primarily on changing the pitch of a song; some of them also manipulate note onset times or other musical expression by editing MIDI notes transcribed from the recording. Although they provide a degree of automatic control, the correction process is usually tedious and repetitive before a satisfactory result is obtained.
Some previous work attempts to minimize the manual modification of musical expression in audio signals. Bryan et al. proposed a variable-rate time-stretching method that lets a user easily modify the stretch ratio: given a user-guided stiffness curve, the method automatically computes a time-dependent stretch rate through a constrained optimization program. Roebel et al. proposed an algorithm for removing vibrato expression, based entirely on smoothing the spectral envelope without manipulating individual partial parameters. Although these methods make processing singing signals more convenient, they still require user guidance or parameter control to some extent.
Summary of the invention
To overcome the deficiencies of prior-art voice conversion, namely a tedious correction process and the need for user guidance or parameter control, the problem to be solved by the present invention is to provide an automatic voice conversion method that converts a voice into a specified voice without manual correction.
In order to solve the above technical problem, the technical solution adopted by the present invention is as follows:
The automatic voice conversion processing method of the present invention comprises the following steps:
1) smoothly align the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics;
2) apply time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time;
3) modify the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
In step 2), time alignment of the source and target voices means extracting features from both voices and then aligning those features using dynamic time warping.
The extracted features are two: the max-filtered constant-Q transform, and the phoneme scores extracted from a phoneme classifier.
In step 2), the time-stretch ratio is obtained by applying a third-order Savitzky-Golay filter to the piecewise-linear alignment path, using the sgolayfilt function in MATLAB; the smoothed result is compared with the original alignment path, and the time-stretch rate is computed from the local slope of the filtered path.
In step 2), time-scale modification means applying a time-scale modification (TSM) algorithm with the smoothed per-frame time-stretch ratio, so as to align the voices in time.
In step 3), pitch modification by the pitch-synchronous overlap-add algorithm aligns the pitch; the pitch ratio required by the algorithm is computed as follows (the formula image is not reproduced in this text; the expression below is reconstructed from the definitions given):

β(i) = f0T(i) / f0ST(i) for strongly periodic frames (low asT(i)), and β(i) = 1 otherwise    (1)

where β(i) is the pitch ratio, f0T(i) and f0ST(i) respectively denote the frame-level pitch sequences of the target voice and the time-aligned source voice, and asT(i) is the aperiodicity obtained from the source after time alignment.
In step 3), the amplitude-envelope matching algorithm aligns the volume by computing the frame-level amplitude gain between the two voices and multiplying it into the source sound: an envelope is extracted from each voice using root-mean-square values, and the amplitude gain is obtained from the ratio of the two amplitude envelopes.
The invention has the following beneficial effects and advantages:
1. the present invention realizes full automatic voice conversion, manual correction is not needed, additional information, such as symbol are not needed Music notation and the lyrics etc., this method only modify the expressive element in singing while keeping song tone color.
2. the method for the present invention not only has great purposes in field of singing, but also is giving a lecture, impart knowledge to students, the fields such as amusement tool There is great purposes.
Detailed description of the invention
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2A shows the similarity matrix and alignment path obtained by DTW when simply using the spectra of the two singing voices as features;
Fig. 2B shows the similarity matrix and alignment path obtained by DTW when using the max-filtered constant-Q transform;
Fig. 2C shows the similarity matrix and alignment path obtained by DTW when using the phoneme scores extracted from a phoneme classifier;
Fig. 2D shows the similarity matrix and alignment path obtained by DTW when using both the max-filtered constant-Q transform and the phoneme scores extracted from a phoneme classifier;
Fig. 3 is an enlarged view of the alignment path (dotted line) and the Savitzky-Golay filtered path (solid line);
Fig. 4 lists the data used to evaluate the invention;
Fig. 5 is a histogram of the time-alignment results;
Fig. 6 shows the mean pitch difference between the source voice and the target voice after conversion;
Fig. 7 shows the mean volume difference (expressed as RMS) between the source voice and the target voice.
Specific embodiment
The present invention is further elaborated below with reference to the accompanying drawings.
Because different people utter the same sentence with considerable differences in rhythm, pitch, loudness, and so on, the present invention aligns the two sounds using features common to both: the melody (via the max-filtered constant-Q transform) and the speech characteristics (phoneme scores extracted from a phoneme classifier). According to the smooth-alignment result, the source voice is then time-scale modified with a certain time-stretch ratio. Once the two voices are aligned, the method modifies the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
As shown in Fig. 1, the automatic voice conversion method of the present invention comprises the following steps:
1) smoothly align the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics;
2) apply time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time;
3) modify the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
The method of the present invention converts a voice from the source voice into a specified voice.
In step 1), smooth alignment of the voices means extracting features from the source voice and the target voice, and then aligning those features using dynamic time warping (DTW).
Feature extraction is first performed on the source voice and the target voice, and involves mainly two features. The first, which handles the melody aspect, is the max-filtered constant-Q transform. Specifically, a constant-Q transform based on a bank of 88 band-pass filters is used, each filter designed to cover one note with semitone resolution. Maximum filtering is then applied to further mitigate tonal variation, particularly where the pitch difference between the two singing voices exceeds one semitone, for example because of a wrongly sung note or an excessive pitch bend. The similarity matrix and alignment path in Fig. 2B show that the detours in segments with strong vibrato become more diagonal. Fig. 2A, in contrast, simply uses the spectra of the two singing voices as the sound features: although the alignment path returned by the DTW algorithm finds note onsets and offsets quite successfully, it often fails to find the correct alignment path when one voice has vibrato or pitch bends.
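The max-filtering step described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the 88-bin constant-Q matrix is synthetic, and realizing the filter as a running maximum over a ±1-semitone window with `scipy.ndimage.maximum_filter1d` is an assumption consistent with the text.

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

def max_filter_cqt(cqt, semitone_radius=1):
    """Apply a running maximum across the pitch axis of a constant-Q
    spectrogram (one bin per semitone assumed), so that vibrato or pitch
    bends within +/- `semitone_radius` semitones do not move the dominant
    bin. `cqt` has shape (n_bins, n_frames)."""
    size = 2 * semitone_radius + 1
    return maximum_filter1d(cqt, size=size, axis=0, mode="nearest")

# Toy example: a melody whose energy wobbles between two adjacent bins
# (simulating vibrato). After max filtering, bin 40 peaks in every frame.
cqt = np.zeros((88, 4))
cqt[40, 0] = cqt[41, 1] = cqt[40, 2] = cqt[41, 3] = 1.0
filtered = max_filter_cqt(cqt, semitone_radius=1)
print(filtered[40])  # -> [1. 1. 1. 1.]
```

Because the wobble stays within the filter window, the similarity matrix built from these features no longer sees vibrato as a mismatch, which is the effect Fig. 2B illustrates.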
The other feature is the phoneme score extracted from a phoneme classifier. This captures the speech information in the sound while eliminating the timbre difference between the two voices. An open-source phoneme classifier is used to predict frame-level phoneme probability distributions. It takes 39 mel-frequency cepstral coefficients (MFCCs) with delta and double-delta as input features, and is trained with the HTK speech recognition toolkit to predict the distribution over 39 phonemes as output. This output is used as the lyric feature vector for time alignment. The similarity matrix and alignment path in Fig. 2C show that the phonetic features also help mitigate the detour problem. Fig. 2D shows the result when both the melody and lyric features are used: the alignment path is similar to that in Fig. 2C, but becomes smoother.
Both of the above features are used together as the input to DTW for alignment.
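The DTW alignment over the combined feature sequences can be illustrated with a minimal implementation. This is a textbook dynamic-time-warping sketch in Python, not the patent's code: it uses plain Euclidean frame distance and omits the step weights and windowing a production aligner would add.

```python
import numpy as np

def dtw_path(X, Y):
    """Minimal DTW between feature sequences X (n, d) and Y (m, d).
    Returns the optimal alignment path as a list of (i, j) index pairs."""
    n, m = len(X), len(Y)
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    # Backtrack from the end of both sequences.
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return path[::-1]

# Two sequences tracing the same contour at different speeds.
X = np.array([[0.], [1.], [2.], [3.]])
Y = np.array([[0.], [0.], [1.], [2.], [2.], [3.]])
path = dtw_path(X, Y)
print(path[0], path[-1])  # (0, 0) (3, 5)
```

In the method itself, each frame of `X` and `Y` would be the concatenation of the max-filtered constant-Q bins and the 39 phoneme scores, and the returned path is the piecewise-linear alignment path smoothed in the next step.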
In step 2), time alignment of the source and target voices means extracting features from both voices and then aligning those features using dynamic time warping. The extracted features are the max-filtered constant-Q transform, which handles the melody aspect, and the phoneme scores extracted from a phoneme classifier.
In step 2), the time-stretch ratio is obtained by applying a third-order Savitzky-Golay filter to the piecewise-linear alignment path, using the sgolayfilt function in MATLAB; the smoothed result is compared with the original alignment path, and the time-stretch rate is computed from the local slope of the filtered path.
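As a sketch of this smoothing step, the SciPy equivalent of MATLAB's sgolayfilt can be applied to a toy piecewise-linear alignment path; the 11-frame window length is an assumed choice, since the patent does not specify it.

```python
import numpy as np
from scipy.signal import savgol_filter

# Toy piecewise-linear alignment path (target frame index per source frame),
# as DTW might produce: stretched 2x at first, then running at normal speed.
path = np.concatenate([np.arange(0, 20, 0.5), np.arange(20, 40, 1.0)])

# Third-order Savitzky-Golay filter, mirroring sgolayfilt(path, 3, 11);
# the window length of 11 frames is an assumed parameter.
smoothed = savgol_filter(path, window_length=11, polyorder=3)

# Per-frame time-stretch ratio = local slope of the smoothed path.
alpha = np.gradient(smoothed)
print(round(float(alpha[5]), 2), round(float(alpha[-5]), 2))  # -> 0.5 1.0
```

Away from the breakpoint the filter reproduces the linear segments exactly, so the slope recovers the 0.5x and 1.0x stretch rates, while the corner between them is rounded off rather than producing an abrupt rate change.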
In step 2), time-scale modification means applying a time-scale modification (TSM) algorithm with the smoothed per-frame time-stretch ratio, so as to align the sounds in time.
To smooth the time-stretch ratio, a Savitzky-Golay filter is used; this is an approximation method that fits successive subsets of the values with a low-order polynomial in a convolution-like manner. Specifically, a third-order Savitzky-Golay filter is applied to the piecewise-linear alignment path using the sgolayfilt function in MATLAB. The smoothed result is compared with the alignment path in Fig. 3. To compute the time-stretch rate α, the local slope of the filtered path is simply used. Once the per-frame time-stretch ratio is obtained, it is applied to a time-scale modification (TSM) algorithm to align the sounds in time; specifically, the waveform-similarity-based overlap-and-add (WSOLA) from the TSM toolbox is used.
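The time-scale modification step can be illustrated with a plain overlap-add stretcher. The patent uses WSOLA from the TSM toolbox; the sketch below deliberately omits WSOLA's waveform-similarity search and a per-frame varying ratio, and is only meant to show how a stretch ratio α maps to analysis and synthesis hop sizes.

```python
import numpy as np

def ola_stretch(x, alpha, frame=1024, hop_out=256):
    """Naive overlap-add TSM: analysis hop = alpha * synthesis hop, so
    alpha > 1 shortens the signal. WSOLA would additionally search a small
    tolerance region around each analysis frame for the best waveform
    match; that search is omitted in this sketch."""
    hop_in = int(round(alpha * hop_out))
    win = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // hop_in + 1)
    out = np.zeros(hop_out * (n_frames - 1) + frame)
    norm = np.zeros_like(out)
    for k in range(n_frames):
        seg = x[k * hop_in : k * hop_in + frame]
        if len(seg) < frame:
            break
        out[k * hop_out : k * hop_out + frame] += win * seg
        norm[k * hop_out : k * hop_out + frame] += win
    norm[norm < 1e-8] = 1.0          # avoid division by zero at the edges
    return out / norm

sr = 8000
t = np.arange(sr) / sr               # 1 second of a 220 Hz tone
x = np.sin(2 * np.pi * 220 * t)
y = ola_stretch(x, alpha=2.0)        # alpha = 2 -> roughly half the duration
print(len(x), len(y))
```

In the method, α would instead vary per frame, following the smoothed local slope of the alignment path from the previous step.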
In step 3), pitch modification by the pitch-synchronous overlap-add algorithm aligns the pitch; the pitch ratio required by the algorithm is computed as follows (the formula image is not reproduced in this text; the expression below is reconstructed from the definitions given):

β(i) = f0T(i) / f0ST(i) for strongly periodic frames (low asT(i)), and β(i) = 1 otherwise    (1)

where β(i) is the pitch ratio, f0T(i) and f0ST(i) respectively denote the frame-level pitch sequences of the target voice and the time-aligned source voice, and asT(i) is the aperiodicity obtained from the source after time alignment.
As shown in Equation 1, the method applies pitch modification only to strongly periodic segments. The YIN algorithm is used to extract the pitch of each voice; it returns the aperiodicity as a by-product. Harmonic-percussive source separation (HPSS) with median filtering [15] is also used to separate the harmonic signal from each sound before applying it to the pitch detector.
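The gated pitch-ratio computation of Equation 1 can be sketched as follows. The reconstruction (ratio of target pitch to time-aligned source pitch, applied only to strongly periodic frames) follows the surrounding text; the 0.2 aperiodicity threshold is taken from the evaluation section, and gating on the source aperiodicity is an assumption of this sketch.

```python
import numpy as np

def pitch_ratio(f0_target, f0_source_aligned, aperiodicity, thresh=0.2):
    """Frame-level pitch ratio beta(i) = f0_T(i) / f0_ST(i), applied only
    where the time-aligned source frame is strongly periodic (aperiodicity
    below `thresh`); elsewhere beta(i) = 1, i.e. no pitch modification."""
    f0_t = np.asarray(f0_target, dtype=float)
    f0_s = np.asarray(f0_source_aligned, dtype=float)
    ap = np.asarray(aperiodicity, dtype=float)
    beta = np.ones_like(f0_s)
    voiced = (ap < thresh) & (f0_s > 0) & (f0_t > 0)
    beta[voiced] = f0_t[voiced] / f0_s[voiced]
    return beta

f0_t = [220.0, 220.0, 0.0, 330.0]
f0_s = [200.0, 220.0, 0.0, 300.0]
ap   = [0.05, 0.05, 0.9, 0.5]       # last two frames too aperiodic to modify
print(pitch_ratio(f0_t, f0_s, ap))  # beta = [1.1, 1.0, 1.0, 1.0]
```

Each β(i) would then drive the pitch-synchronous overlap-add modification of the corresponding source frame.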
In step 3), the amplitude-envelope matching algorithm aligns the volume by computing the frame-level amplitude gain between the two voices and multiplying it into the source sound: an envelope is extracted from each voice using root-mean-square values, and the amplitude gain is obtained from the ratio of the two amplitude envelopes.
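A minimal sketch of the amplitude-envelope matching: frame-level RMS envelopes are extracted from both voices and the source is multiplied by their ratio. Interpolating the frame gains linearly back to sample level is an assumed detail, since the patent only specifies the frame-level ratio.

```python
import numpy as np

def rms_envelope(x, frame=1024, hop=256):
    """Frame-level root-mean-square amplitude envelope."""
    n = max(1, (len(x) - frame) // hop + 1)
    return np.array([np.sqrt(np.mean(x[k * hop:k * hop + frame] ** 2))
                     for k in range(n)])

def match_volume(source, target, frame=1024, hop=256, eps=1e-8):
    """Scale each source frame by the ratio of the two RMS envelopes so
    the source follows the target's dynamics."""
    env_s = rms_envelope(source, frame, hop)
    env_t = rms_envelope(target, frame, hop)
    m = min(len(env_s), len(env_t))
    gain = env_t[:m] / (env_s[:m] + eps)
    centres = np.arange(m) * hop + frame // 2
    # Assumed detail: linear interpolation of frame gains to sample level.
    per_sample = np.interp(np.arange(len(source)), centres, gain)
    return source * per_sample

rng = np.random.default_rng(0)
src = 0.1 * rng.standard_normal(8000)   # quiet source
tgt = 0.5 * rng.standard_normal(8000)   # louder target
out = match_volume(src, tgt)
# The matched source's overall RMS should now be close to the target's.
```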
For this embodiment, four songs of differing styles were collected (16 recordings in total, from different singers). For each song, one recording is the target singing voice, from a professional or a person with skilled singing technique, and the rest come from amateur singers. Since a common singing voice is modified by taking the musical expression from the target, 12 song pairs were selected (3 pairs per song). The singers sang while watching the displayed lyrics. Each excerpt is about 10 to 20 seconds long, taken from the chorus of the original song. Fig. 4 summarizes the dataset, including the characteristics and quantity of the songs used in the evaluation.
Evaluation of time alignment: to assess the performance of time alignment, this embodiment aligns the modified source voice STPE in Fig. 1 with the target voice using DTW on spectrograms, and computes the standard deviation of the local slopes along the DTW path (the slope is constant when the two are perfectly aligned). In addition, rather than computing the standard deviation directly on the local slopes, each slope is transformed with the arctangent function, θ = arctan(s), where s is the local slope of the path; values are thereby mapped from an unbounded range (0 to infinity) to a finite one (0 to π/2 radians).
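The arctangent mapping used in this evaluation can be shown in a few lines:

```python
import numpy as np

# Local slopes along a DTW path range over (0, inf), so a raw standard
# deviation is dominated by the occasional near-vertical step. Mapping
# each slope s to theta = arctan(s) compresses (0, inf) into (0, pi/2)
# before taking the standard deviation, as described in the evaluation.
slopes = np.array([0.5, 1.0, 1.0, 2.0, 10.0])  # illustrative values
theta = np.arctan(slopes)
spread = float(np.std(theta))
print(round(spread, 3))
```

A perfectly aligned pair would give a constant slope of 1 everywhere, hence a spread of zero; larger spreads indicate more detours in the alignment path.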
Fig. 5 compares the standard deviation of the local slopes for the different audio features.
Overall, using the lyric feature from the phoneme classifier is the most reliable across all examples. This may be because the singers performed the songs with the lyrics displayed, so the phonetic features are very accurate. The melody feature using the max-filtered constant-Q transform also helps improve the alignment, but sometimes fails for low-pitched songs (for example, songs 2-1 to 2-3), because the pitch resolution of the constant-Q transform is not high enough in the low pitch range. Combining the two features does not necessarily improve the result: for half of the examples it achieves the best result, but for the other half it produces results even worse than the melody feature alone.
Evaluation of pitch and volume alignment: for pitch, the mean pitch difference between source and target is compared before and after pitch alignment. Pitch is measured with the YIN algorithm, and only strongly periodic segments are counted (i.e., where the aperiodicity is below 0.2). Fig. 6 shows that after pitch alignment the mean pitch difference is reduced by 78.8% overall. For volume alignment, the mean difference of the amplitude envelopes is computed, specifically using root-mean-square (RMS) values. Fig. 7 shows that after dynamics alignment the mean dynamics difference is reduced by 86.4%.

Claims (7)

1. An automatic voice conversion method, characterized by comprising the following steps:
1) smoothly aligning the source voice and the target voice using features common to both sounds, namely the melody and the speech characteristics;
2) applying time-scale modification to the source voice according to the smooth-alignment result and the time-stretch ratio, so that the source voice and the target voice are aligned in time;
3) modifying the pitch and volume of the source voice frame by frame using a pitch-synchronous overlap-add algorithm and a simple amplitude-envelope matching algorithm.
2. The automatic voice conversion processing method according to claim 1, characterized in that: in step 2), time alignment of the source and target voices means extracting features from both voices and then aligning those features using dynamic time warping.
3. The automatic voice conversion processing method according to claim 2, characterized in that: the extracted features are the max-filtered constant-Q transform, which handles the melody aspect, and the phoneme scores extracted from a phoneme classifier.
4. The automatic voice conversion processing method according to claim 1, characterized in that: in step 2), the time-stretch ratio is obtained by applying a third-order Savitzky-Golay filter to the piecewise-linear alignment path using the sgolayfilt function in MATLAB; the smoothed result is compared with the original alignment path, and the time-stretch rate is computed from the local slope of the filtered path.
5. The automatic voice conversion processing method according to claim 1, characterized in that: in step 2), time-scale modification means applying a time-scale modification (TSM) algorithm with the smoothed per-frame time-stretch ratio, so as to align the voices in time.
6. The automatic voice conversion processing method according to claim 1, characterized in that: in step 3), pitch modification by the pitch-synchronous overlap-add algorithm aligns the pitch, and the required pitch ratio is computed as follows (the formula image is not reproduced in this text; the expression is reconstructed from the definitions given): β(i) = f0T(i) / f0ST(i) for strongly periodic frames (low asT(i)), and β(i) = 1 otherwise, where β(i) is the pitch ratio, f0T(i) and f0ST(i) respectively denote the frame-level pitch sequences of the target voice and the time-aligned source voice, and asT(i) is the aperiodicity obtained from the source after time alignment.
7. The automatic voice conversion processing method according to claim 1, characterized in that: in step 3), the amplitude-envelope matching algorithm aligns the volume by computing the frame-level amplitude gain between the two voices and multiplying it into the source sound: an envelope is extracted from each voice using root-mean-square values, and the amplitude gain is obtained from the ratio of the two amplitude envelopes.
CN201811583082.1A 2018-12-24 2018-12-24 A kind of automatic sound conversion method Pending CN109712634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811583082.1A CN109712634A (en) 2018-12-24 2018-12-24 A kind of automatic sound conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811583082.1A CN109712634A (en) 2018-12-24 2018-12-24 A kind of automatic sound conversion method

Publications (1)

Publication Number Publication Date
CN109712634A true CN109712634A (en) 2019-05-03

Family

ID=66256120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811583082.1A Pending CN109712634A (en) 2018-12-24 2018-12-24 A kind of automatic sound conversion method

Country Status (1)

Country Link
CN (1) CN109712634A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680187A (en) * 2020-05-26 2020-09-18 平安科技(深圳)有限公司 Method and device for determining music score following path, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1118493A (en) * 1994-08-01 1996-03-13 中国科学院声学研究所 Language and speech converting system with synchronous fundamental tone waves
KR20030031936A (en) * 2003-02-13 2003-04-23 배명진 Mutiple Speech Synthesizer using Pitch Alteration Method
CN1682281A (en) * 2002-09-17 2005-10-12 皇家飞利浦电子股份有限公司 Method for controlling duration in speech synthesis
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN102568476A (en) * 2012-02-21 2012-07-11 南京邮电大学 Voice conversion method based on self-organizing feature map network cluster and radial basis network
CN102664003A (en) * 2012-04-24 2012-09-12 南京邮电大学 Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
CN103021418A (en) * 2012-12-13 2013-04-03 南京邮电大学 Voice conversion method facing to multi-time scale prosodic features
CN104392717A (en) * 2014-12-08 2015-03-04 常州工学院 Sound track spectrum Gaussian mixture model based rapid voice conversion system and method
CN104885153A (en) * 2012-12-20 2015-09-02 三星电子株式会社 Apparatus and method for correcting audio data


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUNBO ZHU et al.: "A Chinese Text to Speech System Based on TD-PSOLA", Proceedings of IEEE TENCON'02 *
YAN QIN et al.: "Speech Signal Processing and Recognition", National Defense Industry Press, 31 December 2015 *
LI QINGHUA: "Research and Implementation of Voice Conversion Technology", China Master's Theses Full-text Database, Information Science and Technology *
YUAN XIAOYONG: "A Voice Conversion System Based on the LPAC-PSOLA Synthesis Algorithm", China Master's Theses Full-text Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680187A (en) * 2020-05-26 2020-09-18 平安科技(深圳)有限公司 Method and device for determining music score following path, electronic equipment and storage medium
CN111680187B (en) * 2020-05-26 2023-11-24 平安科技(深圳)有限公司 Music score following path determining method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021218138A1 (en) Song synthesis method, apparatus and device, and storage medium
Muller et al. Signal processing for music analysis
Marolt A connectionist approach to automatic transcription of polyphonic piano music
CN104272382B (en) Personalized singing synthetic method based on template and system
US20080115656A1 (en) Tempo detection apparatus, chord-name detection apparatus, and programs therefor
CN109979488B (en) System for converting human voice into music score based on stress analysis
CN112382257B (en) Audio processing method, device, equipment and medium
CN110136730B (en) Deep learning-based piano and acoustic automatic configuration system and method
Marolt SONIC: Transcription of polyphonic piano music with neural networks
CN103915093A (en) Method and device for realizing voice singing
New et al. Voice conversion: From spoken vowels to singing vowels
CN109903778A (en) The method and system of real-time singing marking
Lerch Software-based extraction of objective parameters from music performances
Bonada et al. Singing voice synthesis combining excitation plus resonance and sinusoidal plus residual models
CN109712634A (en) A kind of automatic sound conversion method
JP2876861B2 (en) Automatic transcription device
WO2008037115A1 (en) An automatic pitch following method and system for a musical accompaniment apparatus
Shenoy et al. Singing voice detection for karaoke application
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
Cwitkowitz Jr End-to-end music transcription using fine-tuned variable-Q filterbanks
Traube et al. Phonetic gestures underlying guitar timbre description
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
JP5810947B2 (en) Speech segment specifying device, speech parameter generating device, and program
Salamon et al. A chroma-based salience function for melody and bass line estimation from music audio signals
CN111681674A (en) Method and system for identifying musical instrument types based on naive Bayes model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190503
