CN107221321A - Voice conversion method between an arbitrary source and a target voice - Google Patents
Voice conversion method between an arbitrary source and a target voice
- Publication number
- CN107221321A CN107221321A CN201710186569.5A CN201710186569A CN107221321A CN 107221321 A CN107221321 A CN 107221321A CN 201710186569 A CN201710186569 A CN 201710186569A CN 107221321 A CN107221321 A CN 107221321A
- Authority
- CN
- China
- Prior art keywords
- voice
- dictionary
- target voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
- G10L2015/0633—Creating reference templates; Clustering using lexical or orthographic knowledge sources
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0635—Training updating or merging of old and new templates; Mean values; Weighting
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a voice conversion method between an arbitrary source and a target voice, comprising the following steps: step 1, establishing a tensor dictionary of at least one target voice; step 2, constructing the voice dictionary of the target voice and the voice dictionary of the arbitrary source; step 3, reconstructing the speech content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice. Using a single tensor dictionary, the invention constructs the respective voice dictionaries from the speech signal of the arbitrary source and the tensor dictionary corresponding to the target voice to be converted, and thereby realizes voice conversion. No large parallel corpus of the specific source and target speakers needs to be collected and trained before each conversion, so practicality and conversion efficiency are high.
Description
Technical field
The invention belongs to the field of voice conversion technology, and in particular relates to a voice conversion method between an arbitrary source and a target voice.
Background art
A speech signal is a signal carrying language: it bears linguistic meaning and contains rich information, such as the identity of the speaker, the speaker's affective state, and the speech content.
Voice conversion is a technique that replaces the identity information of a source speaker with that of a target speaker while keeping the speech content unchanged. Voice conversion serves many important applications: emotion recognition and conversion, text-to-speech (TTS) systems that turn text information into speech, spectrum restoration, audio bandwidth extension, and speech reconstruction for people with dysphonia.
At present there are many voice conversion methods; the two most common classical families are statistical methods and sparse-representation methods.
Among the voice conversion methods based on statistical parameters, Gaussian mixture models are the most widely used. In a Gaussian-mixture-model algorithm, a transfer function is needed to realize a weighted average, and the parameters of the transfer function are estimated under the minimum mean-square error criterion (Minimum Mean-Square Error, MMSE) or the maximum-likelihood criterion (maximum likelihood, ML). Although this conversion method is simple and intuitive and performs well, it has two shortcomings: first, it needs a large parallel corpus for training, otherwise over-fitting occurs; second, the converted speech spectrum is over-smoothed and not natural enough.
Among the voice conversion methods based on sparse representation, exemplar-based conversion has developed rapidly because sparse representation is widely used in signal processing. In 2001, D. Seung et al. proposed non-negative matrix factorization (Non-negative matrix factorization, NMF); the NMF voice conversion algorithm first represents the source speaker's speech sparsely, that is, as the product of a voice dictionary and an excitation matrix. In the conversion stage, the target speaker's voice dictionary replaces the source speaker's voice dictionary to realize the conversion. NMF-based methods can effectively alleviate the over-fitting of statistical-parameter methods, produce more natural speech, and have good noise robustness. However, they also have the following shortcoming: before each conversion, enough parallel speech of the specific source and target speakers must be collected for the dictionary-training stage, so once the source speaker's identity changes in the conversion stage, the conversion of the source speaker's voice to the target speaker's voice can no longer be completed. In practical applications it is impossible to collect a large parallel corpus for every source and target speaker pair to train the process; therefore NMF-based voice conversion algorithms are limited and cannot efficiently and quickly realize conversion between an arbitrary source and a target voice.
Content of the invention
The purpose of the invention is to solve the above problems by providing a voice conversion method between an arbitrary source and a target speaker, in which the dictionary-training process and the conversion process are separated, so that the voice dictionary does not need to be retrained when the identities of the source voice and the target voice change. On the basis of the NMF method, the invention introduces the concept of a tensor: one or more target voices are chosen from a corpus as the basic speech of a voice tensor dictionary, the parallel speech segments of these target voices are aligned by a multi-sequence dynamic time warping algorithm, and a tensor dictionary composed of one or more two-dimensional basic dictionaries is established. In the conversion stage, the source and target speaker voices can each be expressed as a linear combination of the basic dictionaries in the tensor dictionary, from which the respective voice dictionaries are constructed and the voice conversion is realized.
To achieve the above purpose, the invention adopts the following technical scheme: a voice conversion method between an arbitrary source and a target voice, comprising the following steps:
Step 1, establishing the tensor dictionary of at least one basic speaker's voice;
Step 2, constructing the voice dictionary of the target voice and the voice dictionary of the arbitrary source;
Step 3, reconstructing the speech content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice.
Further, in step 2 the voice dictionary of the target voice and the voice dictionary of the arbitrary source are constructed from the tensor dictionary using a tensor algorithm.
Further, the process of establishing the tensor dictionary of at least one target voice in step 1 is as follows:
1) randomly select $N$ target voices from the corpus as the basic speech of the tensor dictionary, and from these $N$ target voices randomly select speech signals $x_1, x_2, \ldots, x_N$ ($N \ge 1$) with identical semantic content, where $x$ denotes a speech signal;
2) extract the characteristic parameter vector sequence $S_1, S_2, \ldots, S_N$ of each speech signal;
3) align the above characteristic parameter vector sequences $S_1, S_2, \ldots, S_N$ using the multi-sequence dynamic time warping algorithm; the aligned sequences are $S'_1, S'_2, \ldots, S'_N$;
4) from $S'_1, S'_2, \ldots, S'_N$, randomly select the speech-frame feature vectors $S''_1, S''_2, \ldots, S''_N$ at the same positions as part of the dictionary of each target voice;
5) while the size of each target voice's dictionary is smaller than its preset value, repeat steps 2) to 4); when it equals the preset value, stop, forming $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$;
6) stack the $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$ together to form the tensor dictionary $A$.
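Steps 1) to 6) above can be sketched with NumPy. This is only an illustration under assumptions: alignment is taken as already done, and the repeat-until-full loop of step 5) is collapsed into one random draw of frame positions shared by all speakers:

```python
import numpy as np

def build_tensor_dictionary(aligned_feats, dict_frames, rng=None):
    """Stack N aligned feature sequences into a tensor dictionary A.

    aligned_feats: list of N arrays of shape (dim, T), already aligned by
                   multi-sequence DTW so that frame t corresponds across speakers.
    dict_frames:   preset dictionary size (number of frames to keep).
    Returns A of shape (N, dim, dict_frames): N stacked 2-D basic dictionaries.
    """
    rng = rng or np.random.default_rng(0)
    dim, T = aligned_feats[0].shape
    # pick the SAME frame positions from every speaker (step 4 of the method)
    idx = rng.choice(T, size=dict_frames, replace=dict_frames > T)
    basic_dicts = [S[:, idx] for S in aligned_feats]   # dictionaries A_1 .. A_N
    return np.stack(basic_dicts, axis=0)               # tensor dictionary A

# toy example: 3 "speakers", 5-dim features, 40 aligned frames, 12-frame dictionary
feats = [np.abs(np.random.rand(5, 40)) for _ in range(3)]
A = build_tensor_dictionary(feats, dict_frames=12)
print(A.shape)   # (3, 5, 12)
```

Picking identical frame positions across speakers is what keeps the $N$ basic dictionaries parallel, which the conversion stage relies on.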
Further, the multi-sequence dynamic time warping algorithm in step 3) refers to an alternating, synchronized application of dynamic time warping in which each characteristic parameter vector sequence in $S_1, S_2, \ldots, S_N$ participates in at most two multi-sequence dynamic time warps.
Further, constructing the corresponding voice dictionary of the target voice in step 2 includes the following process:
1) for the tensor dictionary $A$, compute the weight coefficient of each target voice in $A$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]$, $0 < \alpha_n < 1$, $N \ge 1$;
2) form the weighted linear combination of the basic dictionaries of the target voices, $\sum_{n=1}^{N} \alpha_n A_n$;
3) multiply the weighted linear combination by the excitation matrix $H$ to obtain the speech representation $S$ of the target voice,
$$S = \Big(\sum_{n=1}^{N} \alpha_n A_n\Big) H,$$
where $A_n$ denotes the $n$-th basic dictionary in the tensor dictionary $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) the weight coefficient of the $n$-th target voice's dictionary;
4) compute the cost function of the weight coefficients $\alpha$ and the excitation matrix $H$:
$$\min_{\alpha, H}\; D\Big(S \,\Big\|\, \big(\textstyle\sum_{n=1}^{N} \alpha_n A_n\big) H\Big) + \lambda \lVert H \rVert_1,$$
where $D(\cdot\|\cdot)$ is the KL divergence, $\lVert \cdot \rVert_1$ denotes the 1-norm, and $\lambda$ ($0 \le \lambda \le 1$) is the sparsity penalty factor ensuring the sparsity of the excitation matrix, subject to $H \ge 0$, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]$ and $\sum_{n=1}^{N} \alpha_n = 1$;
5) keep the value of $A$ in step 4) unchanged and, using the multiplicative update rule of the non-negative matrix factorization algorithm, iterate the weight coefficients $\alpha$ and the excitation matrix $H$ until the cost function reaches its minimum, obtaining the iterative formulas (writing $W = \sum_{n=1}^{N} \alpha_n A_n$):
$$H \leftarrow H \otimes \frac{W^{T}\big(S/(W H)\big)}{W^{T}\mathbf{1} + \lambda}, \qquad \alpha_k \leftarrow \alpha_k \cdot \frac{\sum \big((A_k H) \otimes (S/(W H))\big)}{\sum (A_k H)},$$
where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division, $T$ denotes matrix transposition, $\lambda$ is the sparsity penalty factor, $\mathbf{1}$ is the all-ones matrix, and $k$ indexes the $k$-th target voice;
6) using formulas (1), (3) and (4), compute the sparse representation of the target voice's feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:
$$S_{tgt} \approx \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_2,$$
where $\alpha_{tgt}$ denotes the tensor-dictionary weight coefficients of $S_{tgt}$ and $H_2$ the excitation matrix of $S_{tgt}$;
7) using formulas (1), (3) and (4), compute the sparse representation of the arbitrary source's feature vector sequence $S_{src}$ under the tensor dictionary $A$:
$$S_{src} \approx \Big(\sum_{n=1}^{N} \alpha_{src,n} A_n\Big) H_1,$$
where $\alpha_{src}$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ the excitation matrix of $S_{src}$.
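The sparse-coding iteration above can be sketched as follows. The original update formulas were figures and are not reproduced verbatim; this is an assumed reconstruction using the standard Lee-Seung KL-divergence multiplicative updates with an l1 penalty on the excitation matrix, and all function names are illustrative:

```python
import numpy as np

def sparse_code(S, A, lam=0.1, n_iter=100, eps=1e-9):
    """Estimate alpha and H so that S ≈ (sum_n alpha_n * A_n) @ H.

    S: (dim, T) non-negative feature sequence; A: (N, dim, K) tensor dictionary.
    """
    N, dim, K = A.shape
    T = S.shape[1]
    rng = np.random.default_rng(0)
    alpha = np.full(N, 1.0 / N)
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        W = np.tensordot(alpha, A, axes=1)               # weighted dictionary, (dim, K)
        R = S / (W @ H + eps)                            # element-wise ratio S / (WH)
        H *= (W.T @ R) / (W.T @ np.ones_like(S) + lam)   # KL update with l1 penalty
        WH_n = [A[n] @ H for n in range(N)]
        R = S / (np.tensordot(alpha, A, axes=1) @ H + eps)
        # multiplicative weight update, then renormalise so sum(alpha) == 1
        alpha *= np.array([(WH_n[n] * R).sum() / (WH_n[n].sum() + eps)
                           for n in range(N)])
        alpha /= alpha.sum()
    return alpha, H

# toy check on random non-negative data
rng = np.random.default_rng(1)
A = rng.random((3, 6, 8)) + 0.1
S = rng.random((6, 20))
alpha, H = sparse_code(S, A)
print(round(alpha.sum(), 6))   # 1.0
```

Running it once for the target voice gives $(\alpha_{tgt}, H_2)$ and once for the source gives $(\alpha_{src}, H_1)$.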
Further, in the step of computing the sparse representations of the arbitrary source's feature vector sequence $S_{src}$ and the target voice's feature vector sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from the same speech content uttered by the source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal. For two different utterances of the same speaker, the tensor-dictionary weight coefficients of their sparse representations under $A$ are likewise approximately equal. This is easy to understand: the original purpose of the weight coefficients under the tensor dictionary is to select the voice dictionary belonging to the speaker, and when the speaker's identity is unchanged, the selected dictionary naturally stays the same.
Further, in step 3 the excitation matrix $H_1$ of the arbitrary source feature vector sequence $S_{src}$ to be converted is multiplied by the target voice dictionary, reconstructing the feature vector sequence of the converted target voice and thus the converted target voice:
$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_1,$$
where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
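The conversion step is a single matrix product; a minimal sketch, assuming the tensor dictionary `A`, the target weights and the source excitation have already been estimated as above:

```python
import numpy as np

def convert(A, alpha_tgt, H_src):
    """Step 3: reconstruct target-voice features from the source excitation.

    The target voice dictionary is the alpha_tgt-weighted combination of the
    basic dictionaries in A; multiplying it by the source excitation H_src
    keeps the spoken content while swapping in the target speaker's identity.
    """
    W_tgt = np.tensordot(alpha_tgt, A, axes=1)   # (dim, K) target voice dictionary
    return W_tgt @ H_src                          # (dim, T) converted feature sequence

A = np.abs(np.random.rand(3, 5, 12))
alpha_tgt = np.array([0.5, 0.3, 0.2])
H_src = np.abs(np.random.rand(12, 20))
S_conv = convert(A, alpha_tgt, H_src)
print(S_conv.shape)   # (5, 20)
```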
Further, the tensor dictionary in step 1 is established by aligning the parallel speech segments of at least two target voices with the multi-sequence dynamic time warping algorithm, forming a tensor dictionary composed of at least two two-dimensional basic dictionaries.
Further, step 1 includes a constraint on the size of the established tensor dictionary, which is: number of basic speakers × feature dimension × number of speech frames. The choice of the number of basic speakers and of dictionary frames affects both the quality of the converted speech and the complexity of the conversion. If there are too few basic speakers, the voice dictionary of the target speaker represented by the linear combination deviates considerably from the actual voice dictionary; if there are too many, the computational complexity rises. Experiments show that choosing 10-20 basic speakers reconstructs the target voice dictionary well without incurring high computational complexity. The choice of the number of speech frames faces a similar trade-off; experiments show that with 3000-3500 frames, good conversion results are obtained without high computational complexity.
Further, step 3 specifically multiplies the excitation matrix of the arbitrary source by the voice dictionary of the target voice, obtained as a linear weighting of the basic dictionaries, realizing the conversion from the arbitrary source to the target voice.
Compared with the prior art, the beneficial effects of the invention are: 1. using a single tensor dictionary, the respective voice dictionaries are constructed from the speech signal of the arbitrary source and the tensor dictionary corresponding to the target voice to be converted, realizing voice conversion; 2. no large parallel corpus of the specific source and target speakers needs to be collected and trained before each conversion, so practicality and conversion efficiency are high: on the basis of the tensor dictionary, only the feature vector sequences of the arbitrary source and the target voice need to be computed before reconstructing the source's speech content and realizing the conversion; 3. no parallel speech of source and target speakers is needed to train the dictionary, so the training stage can be independent of the conversion stage; 4. when the identities of the source voice and the target voice to be converted change, the trained tensor dictionary stays unchanged, effectively realizing voice conversion between any source and target speakers.
Brief description of the drawings
Fig. 1 is a frame diagram of the tensor dictionary generation process in the training stage of the invention;
Fig. 2 is the first of the multi-sequence DTW schematic diagrams in the tensor dictionary generation process;
Fig. 3 is the second of the multi-sequence DTW schematic diagrams in the tensor dictionary generation process;
Fig. 4 is the third of the multi-sequence DTW schematic diagrams in the tensor dictionary generation process;
Fig. 5 is a schematic diagram of the conversion process in the voice conversion stage.
Detailed description of embodiments
The technical scheme is further described below through specific embodiments.
Embodiment 1
This embodiment discloses a voice conversion method between an arbitrary source and a target voice; the arbitrary source in this scheme refers to the different voices uttered by the different speech objects to be converted. The method comprises the following steps:
Step 1, establishing the tensor dictionary of at least one target voice; with reference to Fig. 1, specifically:
1) randomly select $N$ target voices from the corpus as the basic speech of the tensor dictionary, and from these $N$ target voices randomly select speech signals $x_1, x_2, \ldots, x_N$ ($N \ge 1$) with identical semantic content, where $x$ denotes a speech signal;
2) extract the characteristic parameter vector sequence $S_1, S_2, \ldots, S_N$ of each speech signal;
3) align the sequences $S_1, S_2, \ldots, S_N$ using the multi-sequence dynamic time warping algorithm; the aligned sequences are $S'_1, S'_2, \ldots, S'_N$;
4) from $S'_1, S'_2, \ldots, S'_N$, randomly select the speech-frame feature vectors $S''_1, S''_2, \ldots, S''_N$ at the same positions as part of the dictionary of each target voice;
5) while the size of each target voice's dictionary is smaller than its preset value, repeat steps 2) to 4); when it equals the preset value, stop, forming $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$;
6) stack the $N$ target voice dictionaries $A_1, A_2, \ldots, A_N$ together to form the tensor dictionary $A$.
Step 2, constructing the voice dictionary of the target voice on the basis of the tensor dictionary; specifically:
1) for the tensor dictionary $A$, compute the weight coefficients $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]$, $0 < \alpha_n < 1$;
2) form the weighted linear combination of the basic dictionaries, $\sum_{n=1}^{N} \alpha_n A_n$;
3) multiply the weighted linear combination by the excitation matrix $H$ to obtain the speech representation $S$ of the target voice,
$$S = \Big(\sum_{n=1}^{N} \alpha_n A_n\Big) H,$$
where $A_n$ denotes the $n$-th basic dictionary in $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) the weight coefficient of the $n$-th target voice's dictionary;
4) compute the cost function of $\alpha$ and $H$:
$$\min_{\alpha, H}\; D\Big(S \,\Big\|\, \big(\textstyle\sum_{n=1}^{N} \alpha_n A_n\big) H\Big) + \lambda \lVert H \rVert_1,$$
where $D(\cdot\|\cdot)$ is the KL divergence, $\lVert \cdot \rVert_1$ the 1-norm, $\lambda$ the sparsity penalty factor ensuring the sparsity of the excitation matrix, $H \ge 0$, and $\sum_{n=1}^{N} \alpha_n = 1$;
5) keep $A$ unchanged and vary $\alpha$ and $H$ until the cost function reaches its minimum, obtaining the iterative formulas (writing $W = \sum_{n=1}^{N} \alpha_n A_n$):
$$H \leftarrow H \otimes \frac{W^{T}\big(S/(W H)\big)}{W^{T}\mathbf{1} + \lambda}, \qquad \alpha_k \leftarrow \alpha_k \cdot \frac{\sum \big((A_k H) \otimes (S/(W H))\big)}{\sum (A_k H)},$$
where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division, $T$ denotes matrix transposition, $\lambda$ is the sparsity penalty factor, $\mathbf{1}$ is the all-ones matrix, and $k$ indexes the $k$-th target voice;
6) using the above steps, compute the sparse representation of the target voice's feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:
$$S_{tgt} \approx \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_2,$$
where $\alpha_{tgt}$ denotes the tensor-dictionary weight coefficients of $S_{tgt}$ and $H_2$ the excitation matrix of $S_{tgt}$;
7) construct the voice dictionary of the arbitrary source on the basis of the tensor dictionary; using formulas (1), (3) and (4), compute the sparse representation of the source feature vector sequence $S_{src}$ under $A$:
$$S_{src} \approx \Big(\sum_{n=1}^{N} \alpha_{src,n} A_n\Big) H_1,$$
where $\alpha_{src}$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ the excitation matrix of $S_{src}$;
Step 3, reconstructing the speech content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice: multiply the excitation matrix $H_1$ of the selected source sequence $S_{src}$ by the target voice dictionary, obtaining the converted target voice:
$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N} \alpha_{tgt,n} A_n\Big) H_1,$$
where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
Steps 2 and 3 above are shown in Fig. 5.
The purpose of the voice conversion method of the invention is to convert, effectively and quickly, the voice of any person into a specific target voice. The principle of the conversion is:
In the dictionary-training stage, $N$ target voices are chosen as the basic speech of the tensor dictionary, and parallel speech segments are extracted from these $N$ target voices as the basic corpus for generating the tensor dictionary;
the feature vectors of each speech segment are extracted, and the characteristic parameter vectors are aligned using the multi-sequence dynamic time warping algorithm (Multi-Dynamic Time Warping, Multi-DTW);
enough frames of aligned feature vectors are collected and assembled into $N$ dictionaries, which are stacked together to constitute the tensor dictionary;
in the conversion stage, a voice conversion algorithm based on a unified tensor dictionary (Unified Tensor Dictionary, UTD) is used; the UTD algorithm automatically constructs the voice dictionaries of the source and target speakers from the same tensor dictionary, realizing voice conversion between any source and target speakers.
In the above method, the tensor dictionary in step 1 is established by aligning the parallel speech segments of the $N$ target voices with the multi-sequence dynamic time warping algorithm, forming a tensor dictionary composed of $N$ two-dimensional basic dictionaries.
In the above method, in the step of computing the sparse representations of the source sequence $S_{src}$ and the target sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from the same speech content uttered by the source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal. For two different utterances of the same speaker, the tensor-dictionary weight coefficients of their sparse representations under $A$ are likewise approximately equal. This is easy to understand: the original purpose of the weight coefficients under the tensor dictionary is to select the voice dictionary belonging to the speaker, and when the speaker's identity is unchanged, the selected dictionary naturally stays the same.
The tensor dictionary is established by aligning the parallel speech segments of at least two target voices with the multi-sequence dynamic time warping algorithm, forming a tensor dictionary composed of at least two two-dimensional basic dictionaries. The establishment of the tensor dictionary includes a constraint on the dictionary size, which is: number of basic speakers × feature dimension × number of speech frames. The choice of the number of basic speakers and of dictionary frames affects the quality of the converted speech and the complexity of the conversion. With too few basic speakers, the voice dictionary of the target speaker represented by the linear combination deviates considerably from the actual voice dictionary; with too many, the computational complexity rises. Experiments show that 10-20 basic speakers reconstruct the target voice dictionary well without incurring high computational complexity. The choice of the number of speech frames faces a similar trade-off; experiments show that with 3000-3500 frames, good conversion results are obtained without high computational complexity.
Embodiment 2
Differently from Embodiment 1, in the tensor dictionary generation process the characteristic parameter vectors are aligned using the multi-sequence dynamic time warping algorithm (Multi-Dynamic Time Warping, Multi-DTW); see Fig. 2, Fig. 3 and Fig. 4, the multi-sequence DTW schematic diagrams of the tensor dictionary generation process, which show the first step, the second step and the final step of aligning the $N$ sequences. Multi-sequence DTW is a dynamic time warping algorithm for aligning multiple parallel utterances. The familiar DTW between two speech sequences aligns two parallel utterances: assume that the two speech parameter vector sequences $S_1$ and $S_2$ are warped by DTW through the change functions $f_1(\cdot)$ and $f_2(\cdot)$. The speech characteristic parameter vectors after the change are then:
$$S'_1 = f_1(S_1) \quad (8)$$
$$S'_2 = f_2(S_2) \quad (9)$$
When multiple parallel speech vector sequences $S_1, S_2, \ldots, S_N$ need to be aligned, the multi-sequence DTW algorithm proposed here applies DTW in an alternating, synchronized fashion, with each characteristic parameter vector sequence participating in at most two DTWs. This avoids the over-lengthening and deformation of the synchronized sequences that occur when one randomly chosen sequence $S_i$ is used as the template against which the other $N-1$ sequences are synchronized one by one. The detailed process of the multi-sequence DTW algorithm is as follows:
(1) First apply DTW to $S_1, S_2$; assume the change functions after each DTW are denoted $f_1(\cdot)$ and $f_2(\cdot)$, so the vector sequences change as in formulas (8) and (9). As shown in Fig. 2, $S_1, S_2$ are unaligned before the change and aligned after it.
(2) Next, when DTW is applied to $S_2, S_3$, the speech parameter vector sequences change as:
$$S'_1 = f_1(S'_1),\; S'_2 = f_1(S'_2) \quad (10)$$
$$S'_3 = f_2(S_3) \quad (11)$$
As shown in Fig. 3, before DTW is applied to $S_2, S_3$, the sequences $S_1, S_2$ are already aligned, so when $S_2, S_3$ are warped, $S_1$ must undergo the same change as $S_2$. After the DTW change, $S_1, S_2, S_3$ are all aligned.
(3) Then apply DTW in order to the pairs $(S_3, S_4), (S_4, S_5), \ldots, (S_{N-1}, S_N)$. As in Fig. 3, the sequences that are already aligned must keep following the same subsequent changes.
(4) Finally, DTW is applied to $S_{N-1}, S_N$, and the speech parameter vector sequences change as:
$$S'_1 = f_1(S_1),\; S'_2 = f_1(S_2),\; \ldots,\; S'_{N-1} = f_1(S_{N-1}) \quad (12)$$
$$S'_N = f_2(S_N) \quad (13)$$
As shown in Fig. 4, before DTW is applied to $S_{N-1}, S_N$, the sequences $S_1, S_2, \ldots, S_{N-1}$ are already aligned, so they must follow the same change rule. Once the DTW between $S_{N-1}$ and $S_N$ is done, all the sequences $S_1, S_2, \ldots, S_N$ are aligned, and they have become the aligned speech characteristic vector sequences $S'_1, S'_2, \ldots, S'_N$.
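The chained alignment above can be sketched in a few lines. This is a simplified illustration, assuming Euclidean frame distance and a plain DTW; the warp of each new pairwise alignment is re-applied to every already-aligned sequence, so each sequence takes part in at most two DTWs:

```python
import numpy as np

def dtw_path(X, Y):
    """Classic DTW between (dim, Tx) and (dim, Ty); returns aligned index lists."""
    Tx, Ty = X.shape[1], Y.shape[1]
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[:, i - 1] - Y[:, j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    i, j, path = Tx, Ty, []                       # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    path.reverse()
    return [p[0] for p in path], [p[1] for p in path]

def multi_dtw(seqs):
    """Chain pairwise DTW over a list of (dim, T_n) sequences.

    After aligning (S1, S2), the warp applied to S2 when aligning (S2, S3)
    is re-applied to S1, and so on, mirroring steps (1)-(4) above.
    """
    aligned = [seqs[0]]
    for nxt in seqs[1:]:
        ix, iy = dtw_path(aligned[-1], nxt)
        aligned = [S[:, ix] for S in aligned]     # f1 applied to all aligned so far
        aligned.append(nxt[:, iy])                # f2 applied to the newcomer
    return aligned

rng = np.random.default_rng(0)
seqs = [rng.random((2, t)) for t in (5, 7, 6)]
out = multi_dtw(seqs)
print(len({S.shape[1] for S in out}))   # 1 -- all sequences share one length
```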
The above describes how to implement the invention and its embodiments; each step is exemplary, and those of ordinary skill in the art can determine the actual steps to be used according to the actual situation. Each step has a variety of implementations, all of which should fall within the scope of the invention; the scope of the invention should not be limited to this description. Those skilled in the art should appreciate that any modification or partial replacement that does not depart from the scope of the invention belongs to the scope defined by the claims of the invention.
Claims (10)
1. A voice conversion method between an arbitrary source and a target voice, comprising the following steps:
Step 1: establishing a tensor dictionary of at least one basic speaker's voice;
Step 2: constructing the voice dictionary of the target voice and the voice dictionary of the arbitrary source;
Step 3: reconstructing the voice content of the arbitrary source, realizing the conversion from the arbitrary source to the target voice.
2. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 2 uses a tensor algorithm to construct, within the tensor dictionary, the voice dictionary of the target voice and the voice dictionary of the arbitrary source.
3. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 1 is specifically as follows:
1) randomly selecting N basic speakers' voices from a corpus to form the basic speech of the tensor dictionary, and randomly selecting from these N basic speakers' voices the voice signals x1, x2, …, xN having identical semantic content, N ≥ 1, where x denotes a voice signal;
2) extracting the feature parameter vector sequence S1, S2, …, SN from each voice signal;
3) aligning the above feature parameter vector sequences S1, S2, …, SN by the multi-sequence dynamic time warping algorithm, the aligned speech feature parameter vector sequences being S′1, S′2, …, S′N;
4) randomly selecting, from the feature parameter vector sequences S′1, S′2, …, S′N, speech frame feature vectors S″1, S″2, …, S″N at the same position, as a part of the tensor dictionary of each target voice;
5) when the size of the tensor dictionary of each target voice is smaller than the preset tensor dictionary value of that target voice, repeating steps 2) to 4); when the size of the tensor dictionary of each target voice equals the preset value of that target voice, stopping steps 2) to 4), thereby forming the N basic speaker voice dictionaries A1, A2, …, AN;
6) stacking the N basic speaker voice dictionaries A1, A2, …, AN together to compose the tensor dictionary A.
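Steps 4) to 6) above can be sketched as follows. This is a hedged NumPy illustration, not the patent's code: the function name, array shapes and the frame-sampling policy are assumptions. The key point it shows is that the same frame positions are taken from every aligned sequence, so the per-speaker dictionaries stay parallel when stacked into the tensor A.

```python
import numpy as np

def build_tensor_dictionary(aligned_seqs, dict_size, rng=None):
    """Build a tensor dictionary A of shape (N speakers, feature dim, dict_size)
    by sampling identical frame positions from N frame-aligned sequences.
    aligned_seqs: list of N arrays, each (frames, feature_dim)."""
    rng = np.random.default_rng(rng)
    n_frames = aligned_seqs[0].shape[0]
    # Same sampled positions for every speaker: dictionary atoms stay parallel.
    pos = rng.choice(n_frames, size=dict_size, replace=dict_size > n_frames)
    dictionaries = [seq[pos].T for seq in aligned_seqs]  # each A_n: (feature_dim, dict_size)
    return np.stack(dictionaries, axis=0)                # tensor dictionary A
```

The `dict_size` argument plays the role of the preset tensor dictionary size of step 5).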
4. The voice conversion method between an arbitrary source and a target voice according to claim 3, characterized in that the multi-sequence dynamic time warping algorithm in said step 3) refers to an alternating multi-sequence dynamic time warping synchronization method, in which each feature parameter vector sequence in the target voice feature parameter vector sequences S1, S2, …, SN participates in multi-sequence dynamic time warping at most twice.
5. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that constructing the voice dictionary corresponding to the target voice in said Step 2 comprises the following process:
1) for the tensor dictionary A, calculating the weight coefficient α of each target voice in the tensor dictionary A, the weight coefficient α = [α1, α2, …, αN], 0 < αn < 1;
2) performing a weighted linear combination over the tensor dictionary of each target voice;
3) the weighted linear combination yields the voice dictionary of each target voice, which is then multiplied by the excitation matrix H to obtain the target voice S:
$$S \approx \left(\sum_{n=1}^{N} \alpha_n A_n\right) H \qquad (1)$$
where An denotes the voice dictionary of the n-th target voice in the tensor dictionary A, and αn (0 ≤ αn ≤ 1) denotes the weight coefficient of the tensor dictionary of the n-th target voice;
4) calculating the cost function of the weight coefficient α and the excitation matrix H:
$$d\!\left(S,\ \left(\sum_{n=1}^{N} \alpha_n A_n\right) H\right) + \lambda \left\lVert H \right\rVert_1 \qquad (2)$$
where d(·) is the KL divergence, ‖·‖1 denotes the 1-norm, and λ is the sparsity penalty factor, 0 ≤ λ ≤ 1, which ensures the sparsity of the excitation matrix, with H ≥ 0 and the weight coefficient α = [α1, α2, …, αN];
5) keeping the value of A from step 4) unchanged, iterating the multiplicative update rule of the non-negative matrix factorization algorithm to update the weight coefficient α and the excitation matrix H until the cost function value of the algorithm reaches a minimum, the iterative formulas being:
$$\alpha_n = \frac{\alpha_n}{\sum_{d,l} (A_n H)_{dl}} \sum_{d,l} \left( \frac{S \otimes (A_n H)}{\sum_{n} \alpha_n (A_n H)} \right)_{dl} \qquad (3)$$

$$H = H \otimes \left( \left(\sum_{n=1}^{N} \alpha_n A_n\right)^{T} \left( \frac{S}{\sum_{n=1}^{N} \alpha_n A_n H} \right) \right) \Bigg/ \left( \left(\sum_{n=1}^{N} \alpha_n A_n\right)^{T} \mathbf{1} + \lambda \otimes \mathbf{1} \right) \qquad (4)$$
where ⊗ denotes element-wise multiplication of matrices, / denotes element-wise division of matrices, T denotes matrix transposition, the division signs in the formulas all denote element-wise division, λ is the sparsity penalty factor, 1 denotes the all-ones matrix, and k denotes the k-th target voice;
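One round of the multiplicative updates (3) and (4) can be sketched in NumPy as follows. This is a minimal illustration under the assumption that A is stored as an (N, D, L) array; the epsilon guard against division by zero and the function name are additions of this sketch, not part of the patent.

```python
import numpy as np

def update_alpha_H(S, A, alpha, H, lam=0.1, eps=1e-12):
    """One multiplicative update of the dictionary weights alpha (eq. 3) and
    the excitation matrix H (eq. 4) for KL-divergence NMF with an L1 sparsity
    penalty lam on H.  A: (N, D, L) tensor dictionary; S: (D, T) features;
    alpha: (N,); H: (L, T); all entries non-negative."""
    AH = np.einsum('ndl,lt->ndt', A, H)          # per-speaker products A_n H
    recon = np.einsum('n,ndt->dt', alpha, AH)    # sum_n alpha_n (A_n H)
    # eq. (3): alpha_n <- alpha_n / sum(A_n H) * sum(S * (A_n H) / recon)
    ratio = S / (recon + eps)
    alpha = alpha / (AH.sum(axis=(1, 2)) + eps) * (ratio[None] * AH).sum(axis=(1, 2))
    # eq. (4): H <- H * (W^T (S / (W H))) / (W^T 1 + lam), W = sum_n alpha_n A_n
    W = np.einsum('n,ndl->dl', alpha, A)
    WH = W @ H
    H = H * (W.T @ (S / (WH + eps))) / (W.T @ np.ones_like(S) + lam + eps)
    return alpha, H
```

Because every factor is non-negative, the updates keep α and H non-negative, which is what allows the sparse representation to be read as additive dictionary activations.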
6) using formulas (1), (3) and (4), calculating the sparse representation of the target voice feature vector sequence Stgt under the tensor dictionary A:
$$S_{tgt} \approx \left(\sum_{n=1}^{N} \alpha_n^{tgt} A_n\right) H_2 \qquad (5)$$
where αtgt denotes the tensor dictionary weight coefficients of the target voice feature vector sequence Stgt, and H2 denotes the excitation matrix of the target voice feature vector sequence Stgt;
7) using formulas (1), (3) and (4), calculating the sparse representation of the feature vector sequence Ssrc of the arbitrary source under the tensor dictionary A:
$$S_{src} \approx \left(\sum_{n=1}^{N} \alpha_n^{src} A_n\right) H_1 \qquad (6)$$
where αsrc denotes the tensor dictionary weight coefficients of the arbitrary source, and H1 denotes the excitation matrix of the feature vector sequence Ssrc of the arbitrary source.
6. The voice conversion method between an arbitrary source and a target voice according to claim 5, characterized in that, in the step of calculating the sparse representations of the arbitrary-source feature vector sequence Ssrc and the arbitrary target voice feature vector sequence Stgt under the tensor dictionary A: since the feature vector sequences Ssrc and Stgt are features extracted from utterances of the same voice content spoken by the arbitrary source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices H1 and H2 of the sparse representations of the two sequences under the same tensor dictionary A are approximately equal; likewise, for two different speech segments of the same speaker, the tensor dictionary weight coefficients of the sparse representations under the same tensor dictionary A are also approximately equal.
7. The voice conversion method between an arbitrary source and a target voice according to claim 6, characterized in that said Step 3 selects the excitation matrix H1 of the arbitrary-source speech feature vector sequence Ssrc to be converted and multiplies it by the target voice dictionary, reconstructing the feature vector sequence of the converted target voice and thereby obtaining the converted target voice:
$$\hat{S}_{tgt} = \left(\sum_{n=1}^{N} \alpha_n^{tgt} A_n\right) H_1 \qquad (7)$$
where Ŝtgt denotes the feature vector sequence of the converted target voice.
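Equation (7) amounts to pairing the target speaker's weighted dictionary with the excitation estimated from the source utterance. A minimal sketch, with function name and array shapes assumed:

```python
import numpy as np

def convert_voice(A, alpha_tgt, H_src):
    """Eq. (7): reconstruct converted target features by combining the target
    speaker's dictionary weights alpha_tgt with the excitation matrix H_src
    estimated from the source utterance.
    A: (N, D, L) tensor dictionary; alpha_tgt: (N,); H_src: (L, T)."""
    W_tgt = np.einsum('n,ndl->dl', alpha_tgt, A)  # target voice dictionary
    return W_tgt @ H_src                          # converted feature sequence
```

This relies on the claim-6 observation that the excitation matrix carries the content (so H1 can stand in for H2) while the weight coefficients carry the speaker identity.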
8. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that the tensor dictionary in said Step 1 refers to aligning parallel speech segments of at least two target voices by the multi-sequence dynamic time warping algorithm, and establishing a tensor dictionary composed of at least two two-dimensional basic dictionaries.
9. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 1 includes a constraint on the size of the established tensor dictionary, the tensor dictionary size being: number of basic speakers × feature dimension × number of speech frames.
10. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that said Step 3 specifically consists of linearly weighting the voice dictionaries with the weight coefficients obtained from the feature vector sequence of the target voice, and then multiplying the result by the excitation matrix of the arbitrary source, realizing the conversion from the arbitrary source to the target voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710186569.5A CN107221321A (en) | 2017-03-27 | 2017-03-27 | A kind of phonetics transfer method being used between any source and target voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107221321A true CN107221321A (en) | 2017-09-29 |
Family
ID=59928387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710186569.5A Pending CN107221321A (en) | 2017-03-27 | 2017-03-27 | A kind of phonetics transfer method being used between any source and target voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107221321A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102306492A (en) * | 2011-09-09 | 2012-01-04 | 中国人民解放军理工大学 | Voice conversion method based on convolutive nonnegative matrix factorization |
CN103345923A (en) * | 2013-07-26 | 2013-10-09 | 电子科技大学 | Sparse representation based short-voice speaker recognition method |
Non-Patent Citations (2)
Title |
---|
RYO AIHARA等: ""MANY-TO-ONE VOICE CONVERSION USING EXEMPLAR-BASED SPARSE REPRESENTATION"", 《2015 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS》 * |
RYO AIHARA等: ""Multiple Non-Negative Matrix Factorization for Many-to-Many Voi", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108510995A (en) * | 2018-02-06 | 2018-09-07 | 杭州电子科技大学 | Identity information hidden method towards voice communication |
CN108510995B (en) * | 2018-02-06 | 2021-06-08 | 杭州电子科技大学 | Identity information hiding method facing voice communication |
CN108766450A (en) * | 2018-04-16 | 2018-11-06 | 杭州电子科技大学 | A kind of phonetics transfer method decomposed based on harmonic wave impulse |
CN108766450B (en) * | 2018-04-16 | 2023-02-17 | 杭州电子科技大学 | Voice conversion method based on harmonic impulse decomposition |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB03 | Change of inventor or designer information | Inventor after: Jian Zhihua; Gu Dong. Inventor before: Jian Zhihua |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170929 |