CN107221321A - A voice conversion method between an arbitrary source and a target voice - Google Patents

A voice conversion method between an arbitrary source and a target voice

Info

Publication number
CN107221321A
Authority
CN
China
Prior art keywords
voice
dictionary
target voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710186569.5A
Other languages
Chinese (zh)
Inventor
Jian Zhihua (简志华)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Hangzhou Electronic Science and Technology University
Original Assignee
Hangzhou Electronic Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Electronic Science and Technology University
Priority to CN201710186569.5A
Publication of CN107221321A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/0633 Creating reference templates; Clustering using lexical or orthographic knowledge sources
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a voice conversion method between an arbitrary source and a target voice, comprising the following steps: step 1, build a tensor dictionary of at least one target voice; step 2, construct the voice dictionary corresponding to the target voice and the voice dictionary of the arbitrary source; step 3, reconstruct the speech content of the arbitrary source, realizing conversion from the arbitrary source to the target voice. Within a single tensor dictionary, the invention constructs the respective voice dictionaries of the arbitrary-source speech signal and of the target voice to be converted, realizing voice conversion; no large amount of parallel speech from the arbitrary source and target speakers needs to be collected for training before each conversion, so practicality and conversion efficiency are high.

Description

A voice conversion method between an arbitrary source and a target voice
Technical field
The invention belongs to the field of voice conversion techniques, and in particular relates to a voice conversion method between an arbitrary source and a target voice.
Background technology
A speech signal is the acoustic signal of language; it carries linguistic meaning and contains a great deal of information, such as the speaker's identity, affective state, and speech content.
Voice conversion is a technique that replaces the identity information of a source speaker with the identity information of a target speaker while keeping the speech content unchanged. Voice conversion is useful in many important applications: emotion recognition and conversion, text-to-speech (TTS) systems that convert textual information into speech, spectrum restoration methods, audio bandwidth extension, and reconstructing speech for people with dysphonia.
At present there are many voice conversion methods; the two most commonly used classical families are the following: one class is based on statistics, the other on sparse representation.
Among the voice conversion methods based on statistical parameters, Gaussian mixture models (GMMs) are the most widely used. The GMM algorithm realizes conversion with a weighted-average transfer function whose parameters are estimated under the minimum mean-square error criterion (Minimum Mean-Square Error, MMSE) or the maximum-likelihood criterion (maximum likelihood, ML). Although this conversion method is simple, intuitive, and works well, it has two shortcomings: first, it needs a large amount of parallel corpora for training, otherwise over-fitting occurs; second, the converted speech spectrum is over-smoothed and not natural enough.
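For context only (this form is standard in the GMM literature and is not part of the claimed method), the weighted-average transfer function mentioned above is usually written as

$$F(x)=\sum_{i=1}^{M}p_i(x)\left[\mu_i^{y}+\Sigma_i^{yx}\big(\Sigma_i^{xx}\big)^{-1}\big(x-\mu_i^{x}\big)\right]$$

where $p_i(x)$ is the posterior probability of mixture component $i$ given the source frame $x$, $\mu_i^{x},\mu_i^{y}$ are the source and target means of component $i$, and $\Sigma_i^{xx},\Sigma_i^{yx}$ are covariance blocks of the joint source-target model.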
Among the voice conversion methods based on sparse representation: because sparse representation is widely used in signal processing, exemplar-based voice conversion has also developed considerably. In 2001, D. Seung et al. proposed non-negative matrix factorization (NMF), on which an NMF voice conversion algorithm is built: the source speaker's speech is first sparsely represented, i.e., expressed as the product of a voice dictionary and an excitation matrix; in the conversion stage, the target speaker's voice dictionary replaces the source speaker's voice dictionary to realize conversion. NMF-based methods can effectively alleviate the over-fitting problem of statistical-parameter methods, produce more natural speech, and are also robust to noise. However, this approach has the following shortcoming: before each voice conversion, enough parallel speech of the source and target speakers must be collected for the dictionary-generation training stage; hence, once the source speaker's identity changes in the conversion stage, the conversion from source speech to target speech cannot be completed. In practical applications it is impossible to collect a large amount of parallel speech for every source and target speaker pair for training; therefore the NMF-based voice conversion algorithm is limited and cannot effectively and quickly realize voice conversion between an arbitrary source and a target voice.
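To make the exemplar-based NMF pipeline above concrete, here is a minimal sketch in Python. It is illustrative only, not the patented method: the function names and shapes are assumptions, the dictionaries A_src and A_tgt are assumed to be built from time-aligned parallel frames, and the update rule is the standard KL-divergence multiplicative update with the dictionary held fixed.

```python
import numpy as np

def nmf_activations(S, A, n_iter=200, eps=1e-12):
    """Estimate H >= 0 with S ~= A @ H under the KL divergence,
    keeping the dictionary A fixed (standard multiplicative update)."""
    H = np.abs(np.random.rand(A.shape[1], S.shape[1]))
    ones = np.ones_like(S)
    for _ in range(n_iter):
        H *= (A.T @ (S / (A @ H + eps))) / (A.T @ ones + eps)
    return H

def nmf_convert(S_src, A_src, A_tgt):
    """Exemplar-based conversion: represent the source frames on the source
    dictionary, then rebuild them with the frame-paired target dictionary."""
    H = nmf_activations(S_src, A_src)  # sparse-ish excitation matrix
    return A_tgt @ H                   # converted spectral features
```

This is exactly the weakness the patent targets: A_src and A_tgt must come from parallel source-target recordings, so they have to be retrained whenever the source speaker changes.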
Summary of the invention
The purpose of the invention is to solve the above problems by providing a voice conversion method between an arbitrary source and a target speaker in which the training process of dictionary generation and the voice conversion process are separated, so that during conversion the voice dictionary need not be retrained whenever the identities of the source voice and the target voice change. On the basis of the NMF method, the invention introduces the concept of a tensor: one, two or more target voices are chosen from a corpus as the basic speech of the voice tensor dictionary, and the parallel speech segments of these target voices are aligned by a multi-sequence dynamic time warping algorithm, so as to build a tensor dictionary composed of one, two or more two-dimensional basic dictionaries. In the voice conversion stage, the voice dictionaries of the source and target speakers can each be constructed as linear combinations of the basic dictionaries in the tensor dictionary, realizing voice conversion.
In order to achieve the above purpose, the invention adopts the following technical scheme: a voice conversion method between an arbitrary source and a target voice, comprising the following steps:
Step 1: build a tensor dictionary of at least one basic speaker's voice;
Step 2: construct the voice dictionary corresponding to the target voice and the voice dictionary of the arbitrary source;
Step 3: reconstruct the speech content of the arbitrary source, realizing conversion from the arbitrary source to the target voice.
Further, in step 2 the voice dictionary of the target voice and the voice dictionary of the arbitrary source are constructed from the tensor dictionary using a tensor algorithm.
Further, the process by which step 1 builds the tensor dictionary of at least one target voice is as follows:
1) randomly select N target voices from a corpus as the basic speech of the tensor dictionary, and from these N target voices randomly select speech signals $x_1, x_2, \dots, x_N$ with identical semantic content, $N \ge 1$, where $x$ denotes a speech signal;
2) extract the feature parameter vector sequences $S_1, S_2, \dots, S_N$ from the respective speech signals;
3) align the above feature parameter vector sequences $S_1, S_2, \dots, S_N$ with the multi-sequence dynamic time warping algorithm; the aligned speech feature parameter vector sequences are $S'_1, S'_2, \dots, S'_N$;
4) from the aligned sequences $S'_1, S'_2, \dots, S'_N$, randomly select frame feature vectors at the same positions, $S''_1, S''_2, \dots, S''_N$, as part of each target voice's tensor dictionary;
5) while the size of each target voice's tensor dictionary is less than its preset value, repeat steps (2) to (4); when the size equals the preset value, stop, forming N target voice dictionaries $A_1, A_2, \dots, A_N$;
6) stack the N target voice dictionaries $A_1, A_2, \dots, A_N$ together to form the tensor dictionary $A$ (a construction sketch in code follows this list).
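As referenced at the end of the list, a minimal construction sketch in Python. It assumes the Multi-DTW alignment of step 3) has already produced equal-length sequences; the function name, the frame count, and the random frame sampling are illustrative (the patent recommends 3000-3500 frames), not the patented implementation.

```python
import numpy as np

def build_tensor_dictionary(aligned_feats, n_frames=3200, seed=None):
    """Stack N aligned two-dimensional dictionaries into a tensor dictionary.

    aligned_feats : list of N arrays, each (d, T) -- the aligned feature
                    parameter sequences S'_1..S'_N (equal T after Multi-DTW)
    Returns an (N, d, n_frames) array: the tensor dictionary A."""
    rng = np.random.default_rng(seed)
    T = aligned_feats[0].shape[1]
    # Draw the SAME frame positions from every aligned sequence so that the
    # N basic dictionaries stay column-aligned across the basic speakers.
    idx = rng.choice(T, size=n_frames, replace=T < n_frames)
    return np.stack([S[:, idx] for S in aligned_feats], axis=0)
```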
Further, using the multi-sequence dynamic time warping algorithm in step 3) means alternately applying the synchronization method of multi-sequence dynamic time warping; each feature parameter vector sequence among the target voice feature parameter vector sequences $S_1, S_2, \dots, S_N$ participates in multi-sequence dynamic time warping at most twice.
Further, step 2 constructs the voice dictionary corresponding to the target voice through the following process:
1) for the tensor dictionary $A$, compute the weight coefficients $\alpha$ of each target voice in $A$, $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]$, $0 < \alpha_n < 1$, $N \ge 1$;
2) form the weighted linear combination $\sum_{n=1}^{N}\alpha_n A_n$ of the basic dictionaries of the tensor dictionary;
3) multiply this weighted linear combination by the excitation matrix $H$; the weighted combination is the voice dictionary of the target voice and the product approximates the target speech feature matrix $S$:

$$S \approx \Big(\sum_{n=1}^{N}\alpha_n A_n\Big)H \quad (1)$$

where $A_n$ denotes the n-th basic dictionary in the tensor dictionary $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) denotes the weight coefficient of the n-th target voice's dictionary;
4) compute the cost function of the weight coefficients $\alpha$ and the excitation matrix $H$:

$$d\Big(S,\ \big(\sum_{n=1}^{N}\alpha_n A_n\big)H\Big) + \lambda\,\lVert H\rVert_1 \quad (2)$$

where $d(\cdot,\cdot)$ is the KL divergence, $\lVert\cdot\rVert_1$ denotes the 1-norm, and $\lambda$ is the sparsity penalty factor with $0 \le \lambda \le 1$, guaranteeing the sparsity of the excitation matrix; $H \ge 0$, and the weight coefficients $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]$ must satisfy $\sum_{n=1}^{N}\alpha_n = 1$;
5) keep the value of $A$ in step (4) fixed, and iteratively update the weight coefficients $\alpha$ and the excitation matrix $H$ with the multiplicative update rules of the non-negative matrix factorization algorithm until the cost function reaches its minimum; the update formulas are (a code sketch of these updates follows this list):

$$\alpha_n = \frac{\alpha_n}{\sum_{d,l}\big(A_nH\big)_{dl}}\sum_{d,l}\left(\frac{S\otimes\big(A_nH\big)}{\sum_{n}\alpha_n\big(A_nH\big)}\right)_{dl} \quad (3)$$

$$H = H\otimes\left(\Big(\sum_{n=1}^{N}\alpha_nA_n\Big)^{T}\left(\frac{S}{\sum_{n=1}^{N}\alpha_nA_nH}\right)\right)\Big/\left(\Big(\sum_{n=1}^{N}\alpha_nA_n\Big)^{T}\mathbf{1}+\lambda\mathbf{1}\right) \quad (4)$$

where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division between matrices, $T$ denotes matrix transposition, all division signs in the formulas denote element-wise division, $\lambda$ is the sparsity penalty factor, and $\mathbf{1}$ denotes the all-ones matrix;
6) using formulas (1), (3) and (4), compute the sparse representation of the target voice feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:

$$S_{tgt} \approx \Big(\sum_{n=1}^{N}\alpha_n^{tgt}A_n\Big)H_2 \quad (5)$$

where $\alpha^{tgt} = [\alpha_1^{tgt}, \dots, \alpha_N^{tgt}]$ denotes the tensor-dictionary weight coefficients of the target voice feature vector sequence $S_{tgt}$ and $H_2$ denotes the excitation matrix of $S_{tgt}$;
7) using formulas (1), (3) and (4), compute the sparse representation of the arbitrary-source feature vector sequence $S_{src}$ under the tensor dictionary $A$:

$$S_{src} \approx \Big(\sum_{n=1}^{N}\alpha_n^{src}A_n\Big)H_1 \quad (6)$$

where $\alpha^{src} = [\alpha_1^{src}, \dots, \alpha_N^{src}]$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ denotes the excitation matrix of the arbitrary-source feature vector sequence $S_{src}$.
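The list above referred forward to this sketch: a minimal Python rendering of the multiplicative updates (3) and (4) with the tensor dictionary $A$ held fixed. It is a sketch under stated assumptions, not the patented implementation: shapes and iteration count are illustrative, and the constraint on the weights is maintained here by simple renormalization (an assumption; the patent states the constraint but not how it is enforced).

```python
import numpy as np

def fit_weights_and_activations(S, A, lam=0.1, n_iter=200, eps=1e-12):
    """Estimate the dictionary weights alpha and the excitation matrix H
    for a fixed tensor dictionary, following updates (3) and (4).

    S   : (d, t) feature matrix of the input speech
    A   : (N, d, m) tensor dictionary (N stacked basic dictionaries A_n)
    lam : sparsity penalty on H (0 <= lam <= 1 in the patent)"""
    N, d, m = A.shape
    alpha = np.full(N, 1.0 / N)               # uniform starting weights
    H = np.abs(np.random.rand(m, S.shape[1]))
    ones = np.ones_like(S)
    for _ in range(n_iter):
        AnH = np.stack([A[n] @ H for n in range(N)])   # (N, d, t)
        mix = np.tensordot(alpha, AnH, axes=1) + eps   # sum_n alpha_n A_n H
        # update (3): rescale each weight by how well A_n H explains S
        alpha = np.array([alpha[n] / (AnH[n].sum() + eps)
                          * ((S * AnH[n]) / mix).sum() for n in range(N)])
        alpha /= alpha.sum()                  # keep the weights summing to 1
        # update (4): sparse KL-NMF update with the mixed dictionary
        Amix = np.tensordot(alpha, A, axes=1)          # (d, m)
        H *= (Amix.T @ (S / (Amix @ H + eps))) / (Amix.T @ ones + lam + eps)
    return alpha, H
```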
Further, in the step of computing the sparse representations of the arbitrary-source feature vector sequence $S_{src}$ and the arbitrary target voice feature vector sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from speech of the same content uttered by the source and target speakers, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal. For two different utterances of the same speaker (the same arbitrary source), e.g. target voice feature sequences $S_{tgt,1}$ and $S_{tgt,2}$, the tensor-dictionary weight coefficients $\alpha^{tgt,1}$ and $\alpha^{tgt,2}$ of the sparse representations under the same tensor dictionary $A$ are also approximately equal. This is easy to understand: the original purpose of the weight coefficients under the tensor dictionary is to select the voice dictionary belonging to the speaker; when the speaker's identity does not change, the selected dictionary naturally does not change either.
Further, step 3 selects the excitation matrix $H_1$ of the arbitrary-source speech feature vector sequence $S_{src}$ to be converted and multiplies it with the target voice dictionary, reconstructing the feature vector sequence of the converted target voice and obtaining the converted target voice:

$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N}\alpha_n^{tgt}A_n\Big)H_1 \quad (7)$$

where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
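Putting equations (5), (6) and (7) together, the conversion step itself reduces to a few lines. This usage sketch reuses the illustrative fit_weights_and_activations helper from the previous sketch; all names are hypothetical, not from the patent.

```python
import numpy as np

def convert_voice(S_src, S_tgt_ref, A, lam=0.1):
    """Convert arbitrary-source features toward the target voice.

    S_src     : (d, t) features of the source utterance to convert
    S_tgt_ref : (d, t') any utterance of the target speaker, used only
                to estimate the target weights alpha_tgt as in eq. (5)
    A         : (N, d, m) tensor dictionary"""
    alpha_tgt, _ = fit_weights_and_activations(S_tgt_ref, A, lam)  # eq. (5)
    _, H1 = fit_weights_and_activations(S_src, A, lam)             # eq. (6)
    A_tgt = np.tensordot(alpha_tgt, A, axes=1)  # target voice dictionary
    return A_tgt @ H1                           # eq. (7): converted features
```

Note that, consistent with the text, S_src and S_tgt_ref need not be parallel: only $H_1$ is taken from the source and only $\alpha^{tgt}$ from the target.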
Further, the tensor dictionary in step 1 means aligning the parallel speech segments of at least two target voices by the multi-sequence dynamic time warping algorithm and building a tensor dictionary composed of at least two two-dimensional basic dictionaries.
Further, step 1 includes a constraint on the size of the tensor dictionary built; the tensor dictionary size is: number of basic speakers × feature dimension × number of speech frames. The choice of the number of basic speakers and the number of dictionary speech frames affects both the quality of the converted speech and the complexity of the conversion. When there are too few basic speakers, the target-speaker voice dictionary represented by the linear combination $\sum_{n}\alpha_n A_n$ deviates considerably from the voice dictionary in practice; when there are too many basic speakers, the computational complexity rises. Experiments show that choosing 10-20 basic speakers reconstructs the target voice dictionary well without incurring high computational complexity. The choice of the number of speech frames faces a similar trade-off: experiments show that with 3000-3500 frames of speech, good conversion results are obtained without high computational complexity (for example, with 15 basic speakers, a feature dimension of d, and 3200 frames, the tensor dictionary has size 15 × d × 3200).
Further, step 3 specifically forms a linear weighting of the voice dictionaries in the tensor dictionary to obtain the voice dictionary of the target voice, and multiplies it with the excitation matrix of the arbitrary source, realizing conversion from the arbitrary source to the target voice.
Compared with the prior art, the beneficial effects of the invention are: 1. within a single tensor dictionary, the invention constructs the respective voice dictionaries of the arbitrary-source speech signal and of the target voice to be converted, realizing voice conversion; 2. no large amount of parallel speech from the arbitrary source and target speakers needs to be collected before each conversion for training, so practicality and conversion efficiency are high: on the basis of the tensor dictionary, only the feature vector sequences of the arbitrary source and of the target voice need to be computed, after which the speech content of the arbitrary source is reconstructed, realizing voice conversion; 3. no parallel speech of the source and target speakers is needed for training the dictionary, so the training stage of the experiment can be independent of the conversion stage; 4. when the identities of the source speech and of the target speech to be converted change, the trained tensor dictionary remains unchanged, and voice conversion between an arbitrary source and a target speaker can be realized effectively.
Brief description of the drawings
Fig. 1 is a block diagram of the tensor dictionary generation process in the training stage of the invention;
Fig. 2 is the first of the multi-vector-sequence DTW schematic diagrams of the tensor dictionary generation process;
Fig. 3 is the second of the multi-vector-sequence DTW schematic diagrams of the tensor dictionary generation process;
Fig. 4 is the third of the multi-vector-sequence DTW schematic diagrams of the tensor dictionary generation process;
Fig. 5 is a schematic diagram of the conversion process in the voice conversion stage.
Detailed description of the embodiments
The technical scheme of the invention is further described below through specific embodiments.
Embodiment 1
This embodiment discloses a voice conversion method between an arbitrary source and a target voice. The arbitrary source in this scheme refers to the different speech uttered by the different speech objects to be converted. The method comprises the following steps:
Step 1: build the tensor dictionary of at least one target voice; with reference to Fig. 1, specifically:
1) randomly select N target voices from a corpus as the basic speech of the tensor dictionary, and from these N target voices randomly select speech signals $x_1, x_2, \dots, x_N$ with identical semantic content, $N \ge 1$, where $x$ denotes a speech signal;
2) extract the feature parameter vector sequences $S_1, S_2, \dots, S_N$ from the respective speech signals;
3) align the above feature parameter vector sequences $S_1, S_2, \dots, S_N$ with the multi-sequence dynamic time warping algorithm; the aligned speech feature parameter vector sequences are $S'_1, S'_2, \dots, S'_N$;
4) from the aligned sequences $S'_1, S'_2, \dots, S'_N$, randomly select frame feature vectors at the same positions, $S''_1, S''_2, \dots, S''_N$, as part of each target voice's tensor dictionary;
5) while the size of each target voice's tensor dictionary is less than its preset value, repeat steps (2) to (4); when the size equals the preset value, stop, forming N target voice dictionaries $A_1, A_2, \dots, A_N$;
6) stack the N target voice dictionaries $A_1, A_2, \dots, A_N$ together to form the tensor dictionary $A$.
Step 2: construct the voice dictionary corresponding to the target voice on the basis of the tensor dictionary; specifically:
1) for the tensor dictionary $A$, compute the weight coefficients $\alpha$ of each target voice in $A$, $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]$, $0 < \alpha_n < 1$;
2) form the weighted linear combination $\sum_{n=1}^{N}\alpha_n A_n$ of the basic dictionaries of the tensor dictionary;
3) multiply the weighted linear combination by the excitation matrix $H$ to obtain the target speech feature matrix $S$:

$$S \approx \Big(\sum_{n=1}^{N}\alpha_n A_n\Big)H \quad (1)$$

where $A_n$ denotes the n-th basic dictionary in the tensor dictionary $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) denotes the weight coefficient of the n-th target voice's dictionary;
4) compute the cost function of the weight coefficients $\alpha$ and the excitation matrix $H$:

$$d\Big(S,\ \big(\sum_{n=1}^{N}\alpha_n A_n\big)H\Big) + \lambda\,\lVert H\rVert_1 \quad (2)$$

where $d(\cdot,\cdot)$ is the KL divergence, $\lVert\cdot\rVert_1$ denotes the 1-norm, $\lambda$ is the sparsity penalty factor guaranteeing the sparsity of the excitation matrix, $H \ge 0$, and $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]$ satisfies $\sum_{n=1}^{N}\alpha_n = 1$;
5) keep $A$ fixed and update the parameters $\alpha$ and $H$ so that the cost function reaches its minimum, obtaining the iterative formulas:

$$\alpha_n = \frac{\alpha_n}{\sum_{d,l}\big(A_nH\big)_{dl}}\sum_{d,l}\left(\frac{S\otimes\big(A_nH\big)}{\sum_{n}\alpha_n\big(A_nH\big)}\right)_{dl} \quad (3)$$

$$H = H\otimes\left(\Big(\sum_{n=1}^{N}\alpha_nA_n\Big)^{T}\left(\frac{S}{\sum_{n=1}^{N}\alpha_nA_nH}\right)\right)\Big/\left(\Big(\sum_{n=1}^{N}\alpha_nA_n\Big)^{T}\mathbf{1}+\lambda\mathbf{1}\right) \quad (4)$$

where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division between matrices, $T$ denotes matrix transposition, all division signs in the formulas denote element-wise division, $\lambda$ is the sparsity penalty factor, and $\mathbf{1}$ denotes the all-ones matrix;
6) using the above steps, compute the sparse representation of the target voice feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:

$$S_{tgt} \approx \Big(\sum_{n=1}^{N}\alpha_n^{tgt}A_n\Big)H_2 \quad (5)$$

where $\alpha^{tgt} = [\alpha_1^{tgt}, \dots, \alpha_N^{tgt}]$ denotes the tensor-dictionary weight coefficients of the target voice feature vector sequence $S_{tgt}$ and $H_2$ denotes the excitation matrix of $S_{tgt}$;
7) construct the voice dictionary of the arbitrary source on the basis of the tensor dictionary: using formulas (1), (3) and (4), compute the sparse representation of the arbitrary-source feature vector sequence $S_{src}$ under the tensor dictionary $A$:

$$S_{src} \approx \Big(\sum_{n=1}^{N}\alpha_n^{src}A_n\Big)H_1 \quad (6)$$

where $\alpha^{src} = [\alpha_1^{src}, \dots, \alpha_N^{src}]$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ denotes the excitation matrix of the arbitrary-source feature vector sequence $S_{src}$;
Step 3: reconstruct the speech content of the arbitrary source, realizing conversion from the arbitrary source to the target voice. Select the excitation matrix $H_1$ of the arbitrary-source speech to be converted and multiply it with the target voice dictionary, obtaining the converted target voice:

$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N}\alpha_n^{tgt}A_n\Big)H_1 \quad (7)$$

where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
Steps 2 and 3 above are illustrated in Fig. 5.
The purpose of the voice conversion method of the invention is to convert the speech uttered by an arbitrary person into some specific target voice effectively and rapidly. The principle of the conversion is as follows:
In the training stage of dictionary generation, N target voices are chosen as the basic speech of the tensor dictionary. Parallel speech segments are extracted from these N target voices as the basic corpus for generating the tensor dictionary;
the feature vectors of each speech segment are extracted, and the feature parameter vectors are aligned using the multi-sequence dynamic time warping algorithm (Multi-Dynamic Time Warping, Multi-DTW);
aligned feature vectors of a sufficient number of frames are collected and assembled into N dictionaries of the same size, which are stacked together to form the tensor dictionary;
in the voice conversion stage, the voice conversion algorithm based on a unified tensor dictionary (Unified Tensor Dictionary, UTD) is used: the UTD algorithm automatically constructs the voice dictionaries of the source and target speakers within the same tensor dictionary, realizing voice conversion between an arbitrary source and a target speaker.
In the above method, the tensor dictionary in step 1 means aligning the parallel speech segments of the N target voices by the multi-sequence dynamic time warping algorithm and building a tensor dictionary composed of N two-dimensional basic dictionaries.
In the above method, in the step of computing the sparse representations of the arbitrary-source feature vector sequence $S_{src}$ and the arbitrary target voice feature vector sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from speech of the same content uttered by the source and target speakers, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal. For two different utterances of the same speaker (the same arbitrary source), e.g. target voice feature sequences $S_{tgt,1}$ and $S_{tgt,2}$, the tensor-dictionary weight coefficients $\alpha^{tgt,1}$ and $\alpha^{tgt,2}$ of the sparse representations under the same tensor dictionary $A$ are also approximately equal. This is easy to understand: the original purpose of the weight coefficients under the tensor dictionary is to select the voice dictionary belonging to the speaker (the arbitrary source); when the speaker's identity does not change, the selected dictionary naturally does not change either.
The tensor dictionary means aligning the parallel speech segments of at least two target voices by the multi-sequence dynamic time warping algorithm and building a tensor dictionary composed of at least two two-dimensional basic dictionaries. Building the tensor dictionary includes a constraint on the size of the dictionary built; the tensor dictionary size is: number of basic speakers × feature dimension × number of speech frames. The choice of the number of basic speakers and the number of dictionary speech frames affects both the quality of the converted speech and the complexity of the conversion. When there are too few basic speakers, the target-speaker voice dictionary represented by the linear combination $\sum_{n}\alpha_n A_n$ deviates considerably from the voice dictionary in practice; when there are too many, the computational complexity rises. Experiments show that choosing 10-20 basic speakers reconstructs the target voice dictionary well without high computational complexity. The choice of the number of speech frames faces a similar trade-off: experiments show that with 3000-3500 frames of speech, good conversion results are obtained without high computational complexity.
Embodiment 2
Different from Embodiment 1, the generation of the tensor dictionary aligns the feature parameter vectors with the multi-sequence dynamic time warping algorithm (Multi-Dynamic Time Warping, Multi-DTW), as shown in Fig. 2, Fig. 3 and Fig. 4, which are schematic diagrams of multi-vector-sequence DTW in the tensor dictionary generation process, covering the first step, the second step and the final step of aligning N sequences. Multi-vector-sequence DTW is a dynamic time warping algorithm for aligning multiple parallel utterances. It is known that DTW between two speech sequences aligns two parallel utterances; suppose the warping functions of the two speech parameter vector sequences $S_1$ and $S_2$ after DTW are $f_1(\cdot)$ and $f_2(\cdot)$. The warped speech feature parameter vectors are then:

$$S'_1 = f_1(S_1) \quad (8)$$

$$S'_2 = f_2(S_2) \quad (9)$$
When multiple parallel speech vector sequences $S_1, S_2, \dots, S_N$ need to be aligned, the multi-vector-sequence DTW algorithm proposed here applies DTW alternately for synchronization, and each feature parameter vector sequence participates in DTW at most twice. This avoids the problems of over-long or deformed synchronized sequences that arise when a randomly selected feature parameter vector sequence $S_i$ is used as the template to synchronize the other N-1 sequences one by one. The detailed procedure of the multi-vector-sequence DTW algorithm is as follows:
(1) First apply DTW to $S_1, S_2$; suppose the warping functions after each DTW are denoted $f_1(\cdot)$ and $f_2(\cdot)$, so that the vector sequences change as in formulas (8) and (9). As shown in Fig. 2, before the warp $S_1, S_2$ are unaligned; after it they are in the aligned state.
(2) Next, when DTW is applied to $S_2, S_3$, the speech parameter vector sequences change as follows (here $f_1(\cdot)$ and $f_2(\cdot)$ denote the warping functions of the current DTW pass):

$$S'_1 = f_1(S_1),\quad S'_2 = f_1(S_2) \quad (10)$$

$$S'_3 = f_2(S_3) \quad (11)$$
As shown in Fig. 3, before DTW is applied to $S_2, S_3$, $S_1$ and $S_2$ are already aligned, so when $S_2, S_3$ are warped by DTW, $S_1$ must undergo the same warp as $S_2$. After the DTW warp, $S_1, S_2, S_3$ are all in the aligned state.
(3) Next, DTW is applied in turn between the feature parameter vector sequence pairs $(S_3,S_4), (S_4,S_5), \dots, (S_{N-1},S_N)$. As in Fig. 3, the already-aligned sequences must remain consistent with each subsequent warp.
(4) Finally, DTW is applied to $S_{N-1}, S_N$, and the speech parameter vector sequences change as follows:

$$S'_1 = f_1(S_1),\quad S'_2 = f_1(S_2),\quad \dots,\quad S'_{N-1} = f_1(S_{N-1}) \quad (12)$$

$$S'_N = f_2(S_N) \quad (13)$$
As shown in Fig. 4, before DTW is applied to $S_{N-1}, S_N$, the sequences $S_1, S_2, \dots, S_{N-1}$ are already aligned, so when $S_{N-1}, S_N$ undergo DTW, $S_1, S_2, \dots, S_{N-1}$ must follow the same warping rule. After this DTW, the sequences $S_1, S_2, \dots, S_N$ are all aligned, so the sequences have become the aligned speech feature vector sequences $S'_1, S'_2, \dots, S'_N$.
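A compact illustrative sketch of the chained Multi-DTW procedure above, assuming a Euclidean frame distance; the plain DTW routine is hand-rolled so the example stays self-contained, and none of this is the patented implementation.

```python
import numpy as np

def dtw_path(X, Y):
    """Plain DTW between feature sequences X (d, n) and Y (d, m);
    returns index lists (ix, iy) describing the optimal warping path."""
    n, m = X.shape[1], Y.shape[1]
    cost = np.linalg.norm(X[:, :, None] - Y[:, None, :], axis=0)  # (n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    ix, iy, i, j = [], [], n, m
    while i > 0 and j > 0:                     # trace the path back
        ix.append(i - 1); iy.append(j - 1)
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return ix[::-1], iy[::-1]

def multi_dtw(seqs):
    """Chained Multi-DTW: align S1..SN pairwise in order, re-applying each
    pass's warp to all previously aligned sequences, so each sequence enters
    at most two DTW computations (with its left and right neighbor)."""
    aligned = [seqs[0]]
    for S_next in seqs[1:]:
        ix, iy = dtw_path(aligned[-1], S_next)
        aligned = [S[:, ix] for S in aligned]  # warp the aligned block alike
        aligned.append(S_next[:, iy])          # warp the newcomer
    return aligned                             # S'_1..S'_N, equal lengths
```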
The above describes how to realize the invention and its embodiments; each step is an example, and those of ordinary skill in the art can determine the actual steps to use according to actual conditions. Each step admits a variety of implementations, all of which should fall within the scope of the invention; the scope of the invention should not be limited by this description. It should be appreciated by those skilled in the art that any modification or partial replacement that does not depart from the scope of the invention belongs to the scope defined by the claims of the invention.

Claims (10)

1. A voice conversion method between an arbitrary source and a target voice, comprising the following steps:
Step 1: build a tensor dictionary of at least one basic speaker's voice;
Step 2: construct the voice dictionary corresponding to the target voice and the voice dictionary of the arbitrary source;
Step 3: reconstruct the speech content of the arbitrary source, realizing conversion from the arbitrary source to the target voice.
2. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that step 2 constructs, from the tensor dictionary using a tensor algorithm, the voice dictionary of the target voice and the voice dictionary of the arbitrary source.
3. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that step 1 is specifically as follows:
1) randomly select N basic speakers' voices from a corpus as the basic speech of the tensor dictionary, and from these N basic speakers' voices randomly select speech signals $x_1, x_2, \dots, x_N$ with identical semantic content, $N \ge 1$, where $x$ denotes a speech signal;
2) extract the feature parameter vector sequences $S_1, S_2, \dots, S_N$ from the respective speech signals;
3) align the above feature parameter vector sequences $S_1, S_2, \dots, S_N$ with the multi-sequence dynamic time warping algorithm; the aligned speech feature parameter vector sequences are $S'_1, S'_2, \dots, S'_N$;
4) from the aligned sequences $S'_1, S'_2, \dots, S'_N$, randomly select frame feature vectors at the same positions, $S''_1, S''_2, \dots, S''_N$, as part of each target voice's tensor dictionary;
5) while the size of each target voice's tensor dictionary is less than its preset value, repeat steps (2) to (4); when the size equals the preset value, stop, forming N basic speaker voice dictionaries $A_1, A_2, \dots, A_N$;
6) stack the N basic speaker voice dictionaries $A_1, A_2, \dots, A_N$ together to form the tensor dictionary $A$.
4. The voice conversion method between an arbitrary source and a target voice according to claim 3, characterized in that
using the multi-sequence dynamic time warping algorithm in step 3) means alternately applying the synchronization method of multi-sequence dynamic time warping, each feature parameter vector sequence among the target voice feature parameter vector sequences $S_1, S_2, \dots, S_N$ participating in multi-sequence dynamic time warping at most twice.
5. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that
step 2 constructs the voice dictionary corresponding to the target voice through the following process:
1) for the tensor dictionary $A$, compute the weight coefficients $\alpha$ of each target voice in $A$, $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]$, $0 < \alpha_n < 1$;
2) form the weighted linear combination $\sum_{n=1}^{N}\alpha_n A_n$ of the basic dictionaries of the tensor dictionary; this weighted linear combination is the voice dictionary of the target voice;
3) multiply the weighted linear combination by the excitation matrix $H$ to obtain the target speech $S$:

$$S \approx \Big(\sum_{n=1}^{N}\alpha_n A_n\Big)H \quad (1)$$

where $A_n$ denotes the voice dictionary of the n-th target voice in the tensor dictionary $A$ and $\alpha_n$ ($0 \le \alpha_n \le 1$) denotes the weight coefficient of the n-th target voice's dictionary;
4) compute the cost function of the weight coefficients $\alpha$ and the excitation matrix $H$:

$$d\Big(S,\ \big(\sum_{n=1}^{N}\alpha_n A_n\big)H\Big) + \lambda\,\lVert H\rVert_1 \quad (2)$$

where $d(\cdot,\cdot)$ is the KL divergence, $\lVert\cdot\rVert_1$ denotes the 1-norm, and $\lambda$ is the sparsity penalty factor with $0 \le \lambda \le 1$, guaranteeing the sparsity of the excitation matrix; $H \ge 0$, and the weight coefficients $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]$ satisfy $\sum_{n=1}^{N}\alpha_n = 1$;
5) keep the value of $A$ in step (4) fixed, and iteratively update the weight coefficients $\alpha$ and the excitation matrix $H$ with the multiplicative update rules of the non-negative matrix factorization algorithm until the cost function reaches its minimum, the update formulas being:

$$\alpha_n = \frac{\alpha_n}{\sum_{d,l}\big(A_nH\big)_{dl}}\sum_{d,l}\left(\frac{S\otimes\big(A_nH\big)}{\sum_{n}\alpha_n\big(A_nH\big)}\right)_{dl} \quad (3)$$

$$H = H\otimes\left(\Big(\sum_{n=1}^{N}\alpha_nA_n\Big)^{T}\left(\frac{S}{\sum_{n=1}^{N}\alpha_nA_nH}\right)\right)\Big/\left(\Big(\sum_{n=1}^{N}\alpha_nA_n\Big)^{T}\mathbf{1}+\lambda\mathbf{1}\right) \quad (4)$$

where $\otimes$ denotes element-wise multiplication between matrices, $/$ denotes element-wise division between matrices, $T$ denotes matrix transposition, all division signs in the formulas denote element-wise division, $\lambda$ is the sparsity penalty factor, and $\mathbf{1}$ denotes the all-ones matrix;
6) using formulas (1), (3) and (4), compute the sparse representation of the target voice feature vector sequence $S_{tgt}$ under the tensor dictionary $A$:

$$S_{tgt} \approx \Big(\sum_{n=1}^{N}\alpha_n^{tgt}A_n\Big)H_2 \quad (5)$$

where $\alpha^{tgt} = [\alpha_1^{tgt}, \dots, \alpha_N^{tgt}]$ denotes the tensor-dictionary weight coefficients of the target voice feature vector sequence $S_{tgt}$ and $H_2$ denotes the excitation matrix of $S_{tgt}$;
7) using formulas (1), (3) and (4), compute the sparse representation of the arbitrary-source feature vector sequence $S_{src}$ under the tensor dictionary $A$:

$$S_{src} \approx \Big(\sum_{n=1}^{N}\alpha_n^{src}A_n\Big)H_1 \quad (6)$$

where $\alpha^{src} = [\alpha_1^{src}, \dots, \alpha_N^{src}]$ denotes the tensor-dictionary weight coefficients of the arbitrary source and $H_1$ denotes the excitation matrix of the arbitrary-source feature vector sequence $S_{src}$.
6. The voice conversion method between an arbitrary source and a target voice according to claim 5, characterized in that, in the step of computing the sparse representations of the arbitrary-source feature vector sequence $S_{src}$ and the arbitrary target voice feature vector sequence $S_{tgt}$ under the tensor dictionary $A$: when $S_{src}$ and $S_{tgt}$ are features extracted from speech of the same content uttered by the arbitrary source and the target speaker, it follows from the non-negative matrix factorization algorithm that the excitation matrices $H_1$ and $H_2$ of the two sparse representations under the same tensor dictionary $A$ are approximately equal; for two different utterances of the same arbitrary source, i.e. target voice feature sequences $S_{tgt,1}$ and $S_{tgt,2}$, the tensor-dictionary weight coefficients $\alpha^{tgt,1}$ and $\alpha^{tgt,2}$ of the sparse representations under the same tensor dictionary $A$ are also approximately equal.
7. The voice conversion method between an arbitrary source and a target voice according to claim 6, characterized in that step 3 selects the excitation matrix $H_1$ of the arbitrary-source speech feature vector sequence $S_{src}$ to be converted and multiplies it with the target voice dictionary, reconstructing the feature vector sequence of the converted target voice and obtaining the converted target voice:

$$\hat{S}_{tgt} = \Big(\sum_{n=1}^{N}\alpha_n^{tgt}A_n\Big)H_1 \quad (7)$$

where $\hat{S}_{tgt}$ denotes the feature vector sequence of the converted target voice.
8. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that the tensor dictionary in step 1 means aligning the parallel speech segments of at least two target voices by the multi-sequence dynamic time warping algorithm and building a tensor dictionary composed of at least two two-dimensional basic dictionaries.
9. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that step 1 includes a constraint on the size of the tensor dictionary built, the tensor dictionary size being: number of basic speakers × feature dimension × number of speech frames.
10. The voice conversion method between an arbitrary source and a target voice according to claim 1, characterized in that step 3 specifically forms a linear weighting of the voice dictionaries in the tensor dictionary to obtain the voice dictionary of the target voice, and multiplies it with the excitation matrix of the arbitrary source, realizing conversion from the arbitrary source to the target voice.
CN201710186569.5A 2017-03-27 2017-03-27 A voice conversion method between an arbitrary source and a target voice Pending CN107221321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710186569.5A CN107221321A (en) A voice conversion method between an arbitrary source and a target voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710186569.5A CN107221321A (en) A voice conversion method between an arbitrary source and a target voice

Publications (1)

Publication Number Publication Date
CN107221321A true CN107221321A (en) 2017-09-29

Family

ID=59928387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710186569.5A Pending CN107221321A (en) A voice conversion method between an arbitrary source and a target voice

Country Status (1)

Country Link
CN (1) CN107221321A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510995A (en) * 2018-02-06 2018-09-07 杭州电子科技大学 Identity information hidden method towards voice communication
CN108766450A (en) * 2018-04-16 2018-11-06 杭州电子科技大学 A kind of phonetics transfer method decomposed based on harmonic wave impulse

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102306492A (en) * 2011-09-09 2012-01-04 中国人民解放军理工大学 Voice conversion method based on convolutive nonnegative matrix factorization
CN103345923A (en) * 2013-07-26 2013-10-09 电子科技大学 Sparse representation based short-voice speaker recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RYO AIHARA et al.: "Many-to-One Voice Conversion Using Exemplar-Based Sparse Representation", 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics *
RYO AIHARA et al.: "Multiple Non-negative Matrix Factorization for Many-to-Many Voice Conversion", IEEE/ACM Transactions on Audio, Speech, and Language Processing *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510995A (en) * 2018-02-06 2018-09-07 杭州电子科技大学 Identity information hidden method towards voice communication
CN108510995B (en) * 2018-02-06 2021-06-08 杭州电子科技大学 Identity information hiding method facing voice communication
CN108766450A (en) * 2018-04-16 2018-11-06 杭州电子科技大学 A kind of phonetics transfer method decomposed based on harmonic wave impulse
CN108766450B (en) * 2018-04-16 2023-02-17 杭州电子科技大学 Voice conversion method based on harmonic impulse decomposition

Similar Documents

Publication Publication Date Title
Wu et al. Vqvc+: One-shot voice conversion by vector quantization and u-net architecture
CN109326283B (en) Many-to-many voice conversion method based on text encoder under non-parallel text condition
Uhlich et al. Improving music source separation based on deep neural networks through data augmentation and network blending
Mohammadi et al. Voice conversion using deep neural networks with speaker-independent pre-training
Sun et al. Voice conversion using deep bidirectional long short-term memory based recurrent neural networks
Saito et al. One-to-many voice conversion based on tensor representation of speaker space
CN108777140A (en) Phonetics transfer method based on VAE under a kind of training of non-parallel corpus
CN102306492B (en) Voice conversion method based on convolutive nonnegative matrix factorization
CN110060691B (en) Many-to-many voice conversion method based on i-vector and VARSGAN
JP2013205697A (en) Speech synthesizer, speech synthesis method, speech synthesis program and learning device
CN110060657A (en) Multi-to-multi voice conversion method based on SN
CN109584893A (en) Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition
Xue et al. Online streaming end-to-end neural diarization handling overlapping speech and flexible numbers of speakers
CN110047501A (en) Multi-to-multi phonetics transfer method based on beta-VAE
Ohtani et al. Non-parallel training for many-to-many eigenvoice conversion
CN107221321A (en) A voice conversion method between an arbitrary source and a target voice
Gao et al. Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling
Luong et al. Many-to-many voice conversion based feature disentanglement using variational autoencoder
CN103680491A (en) Speed dependent prosodic message generating device and speed dependent hierarchical prosodic module
Mansouri et al. Laughter synthesis: A comparison between Variational autoencoder and Autoencoder
Liu et al. Spectral conversion using deep neural networks trained with multi-source speakers
Grais et al. Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks
Chen et al. Lightgrad: Lightweight diffusion probabilistic model for text-to-speech
Wu et al. Denoising Recurrent Neural Network for Deep Bidirectional LSTM Based Voice Conversion.
Kang et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Jian Zhihua

Inventor after: Gu Dong

Inventor before: Jian Zhihua

CB03 Change of inventor or designer information
RJ01 Rejection of invention patent application after publication

Application publication date: 20170929

RJ01 Rejection of invention patent application after publication