CN109686382A - Speaker clustering method and device - Google Patents
Speaker clustering method and device
- Publication number
- CN109686382A (application CN201811639415.8A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- speech segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
Embodiments of the invention provide a speaker clustering method and device in the field of artificial intelligence. The method includes: obtaining 2N speech segments; combining the 2N speech segments into N same-speaker pairs; combining the 2N speech segments into M different-speaker pairs; inputting the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the network; cutting a target speech signal to obtain multiple speech segments; inputting these segments into the trained neural network; and receiving from the trained network the speech segments grouped into multiple classes, where the speech segments of each class correspond to the same speaker. The technical solution of the embodiments therefore solves the prior-art problem that speaker clustering needs a Gaussian mixture model and a very large projection matrix for factor analysis, which makes the computation heavy.
Description
[technical field]
The present invention relates to the field of artificial intelligence, and in particular to a speaker clustering method and device.
[background technique]
Speaker clustering groups speech segments according to their speaker; it answers the question "who spoke which words when", and is widely used in fields such as speech recognition and speaker identification.
A method used in the related art extracts an i-vector from each short speech segment and learns a PLDA (probabilistic linear discriminant analysis) scoring function to decide whether two i-vectors come from the same speaker. Extracting i-vectors, however, requires a GMM (Gaussian mixture model) and a very large projection matrix for factor analysis, which makes the computation heavy.
[summary of the invention]
In view of this, embodiments of the invention provide a speaker clustering method and device, to solve the prior-art problem that speaker clustering needs a Gaussian mixture model and a very large projection matrix for factor analysis, which makes the computation heavy.
In one aspect, an embodiment of the invention provides a speaker clustering method. The method includes: obtaining 2N speech segments from N speakers, every 2 of the segments coming from the same speaker, N being a natural number greater than or equal to 2; combining the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker; combining the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2; inputting the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the network, stopping training when the value of an objective function meets a preset condition; cutting a target speech signal to obtain multiple speech segments; inputting the multiple speech segments into the trained neural network; and receiving the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Further, the value of the objective function is computed according to the formula E = E1 + K × E2, where E is the objective function, K is a constant,

E1 = Σ_{(x,y)∈P_same} ln Pr(x, y),  E2 = Σ_{(x,y)∈P_diff} ln(1 − Pr(x, y)),

x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Further, Pr(x, y) is computed according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Further, L(x, y) is computed according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Further, the value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs.
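Under the definitions above, the scoring function and objective can be sketched in a few lines of Python. This is a minimal sketch, not the patent's implementation: numpy arrays stand in for segment embeddings, and S, b, and K are placeholder parameters.

```python
import numpy as np

def distance(x, y, S, b):
    # L(x, y) = x^T y - x^T S x - y^T S y + b
    return x @ y - x @ S @ x - y @ S @ y + b

def same_speaker_prob(x, y, S, b):
    # Pr(x, y) = 1 / (1 + e^{-L(x, y)}), a sigmoid of the distance
    return 1.0 / (1.0 + np.exp(-distance(x, y, S, b)))

def objective(same_pairs, diff_pairs, S, b, K):
    # E = E1 + K * E2: log-probabilities summed over the two pair sets
    e1 = sum(np.log(same_speaker_prob(x, y, S, b)) for x, y in same_pairs)
    e2 = sum(np.log(1.0 - same_speaker_prob(x, y, S, b)) for x, y in diff_pairs)
    return e1 + K * e2
```

Note that L, and hence Pr, is symmetric in x and y, so the order of the two segments in a pair does not matter.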
In one aspect, an embodiment of the invention provides a speaker clustering device. The device includes: an acquiring unit, configured to obtain 2N speech segments from N speakers, every 2 of the segments coming from the same speaker, N being a natural number greater than or equal to 2; a first combining unit, configured to combine the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker; a second combining unit, configured to combine the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2; a training unit, configured to input the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples, train the network, and stop training when the value of the objective function meets a preset condition; a cutting unit, configured to cut a target speech signal to obtain multiple speech segments; an input unit, configured to input the multiple speech segments into the trained neural network; and a receiving unit, configured to receive the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Further, the training unit includes a computing subunit configured to compute the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, E1 = Σ_{(x,y)∈P_same} ln Pr(x, y), E2 = Σ_{(x,y)∈P_diff} ln(1 − Pr(x, y)), x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Further, the computing subunit is also configured to compute Pr(x, y) according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Further, the computing subunit is also configured to compute L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Further, the value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs.
In one aspect, an embodiment of the invention provides a storage medium. The storage medium includes a stored program which, when run, controls the device on which the storage medium resides to execute the speaker clustering method described above.
In one aspect, an embodiment of the invention provides a computer device including a memory and a processor. The memory is configured to store information including program instructions, the processor is configured to control the execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement the steps of the speaker clustering method described above.
In embodiments of the invention, N same-speaker pairs and M different-speaker pairs are input into a neural network as training samples and the network is trained until the value of the objective function meets a preset condition; a target speech signal is then cut into multiple speech segments, which are input into the trained network to obtain speech segments grouped into multiple classes, the segments of each class corresponding to one speaker. Because no i-vector needs to be extracted, the prior-art need for a Gaussian mixture model and a very large projection matrix for factor analysis is avoided, and the computational cost of speaker clustering is reduced.
[Description of the drawings]
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described here are only some embodiments of the invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an optional speaker clustering method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of an optional neural network according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an optional speaker clustering device according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an optional computer device according to an embodiment of the present invention.
[Specific embodiments]
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the drawings.
It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are only for describing particular embodiments and are not intended to limit the present invention. The singular forms "a", "said", and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein only describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Fig. 1 is a flowchart of an optional speaker clustering method according to an embodiment of the present invention. As shown in Fig. 1, the method includes steps S102 to S114.
Step S102: obtain 2N speech segments, the 2N segments coming from N speakers, every 2 segments coming from the same speaker, N being a natural number greater than or equal to 2.
Step S104: combine the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker.
Step S106: combine the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2.
M can take many values; for example, M = N, M = 2N, or M = 3N. The maximum value M can take is C(2N, 2) − N = 2N(N − 1), the total number of different-speaker pairs among the 2N segments.
The process of combining the 2N speech segments into N same-speaker pairs and M different-speaker pairs is described in detail below for an example in which the value of N is 3000 and the value of M is 17994000.
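The example figures are consistent with simple counting: 2N segments yield C(2N, 2) unordered pairs, N of which are same-speaker pairs, leaving at most 2N(N − 1) different-speaker pairs. A quick check:

```python
from math import comb

N = 3000
total_pairs = comb(2 * N, 2)      # all unordered pairs of the 2N segments
max_diff_pairs = total_pairs - N  # subtract the N same-speaker pairs
print(max_diff_pairs)             # 17994000, matching the example value of M
```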
6000 speech segments are obtained from 3000 speakers, 2 segments per speaker. Suppose the 3000 speakers are speaker R1, speaker R2, speaker R3, ..., speaker R2999, speaker R3000, and that the 2 speech segments of speaker Ri are segments Si-1 and Si-2, where i is any natural number from 1 to 3000. That is, speaker R1's 2 segments are S1-1 and S1-2; speaker R2's 2 segments are S2-1 and S2-2; speaker R3's 2 segments are S3-1 and S3-2; ...; speaker R2999's 2 segments are S2999-1 and S2999-2; and speaker R3000's 2 segments are S3000-1 and S3000-2.
The 6000 speech segments are combined into 3000 same-speaker pairs, denoted DP1, DP2, DP3, ..., DP2999, DP3000:
Same-speaker pair DP1 contains 2 segments: S1-1 and S1-2;
Same-speaker pair DP2 contains 2 segments: S2-1 and S2-2;
Same-speaker pair DP3 contains 2 segments: S3-1 and S3-2;
...;
Same-speaker pair DP2999 contains 2 segments: S2999-1 and S2999-2;
Same-speaker pair DP3000 contains 2 segments: S3000-1 and S3000-2.
The 6000 speech segments are combined into 2 × 3000 × 2999 = 17994000 different-speaker pairs, denoted DQ1, DQ2, DQ3, ..., DQ17993999, DQ17994000:
Different-speaker pair DQ1 contains 2 segments: S1-1 and S2-1;
Different-speaker pair DQ2 contains 2 segments: S1-1 and S2-2;
Different-speaker pair DQ3 contains 2 segments: S1-1 and S3-1;
Different-speaker pair DQ4 contains 2 segments: S1-1 and S3-2;
...;
Different-speaker pair DQ5995 contains 2 segments: S1-1 and S2999-1;
Different-speaker pair DQ5996 contains 2 segments: S1-1 and S2999-2;
Different-speaker pair DQ5997 contains 2 segments: S1-1 and S3000-1;
Different-speaker pair DQ5998 contains 2 segments: S1-1 and S3000-2;
Different-speaker pair DQ5999 contains 2 segments: S1-2 and S2-1;
Different-speaker pair DQ6000 contains 2 segments: S1-2 and S2-2;
Different-speaker pair DQ6001 contains 2 segments: S1-2 and S3-1;
Different-speaker pair DQ6002 contains 2 segments: S1-2 and S3-2;
...;
Different-speaker pair DQ11993 contains 2 segments: S1-2 and S2999-1;
Different-speaker pair DQ11994 contains 2 segments: S1-2 and S2999-2;
Different-speaker pair DQ11995 contains 2 segments: S1-2 and S3000-1;
Different-speaker pair DQ11996 contains 2 segments: S1-2 and S3000-2;
....
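The enumeration above can be reproduced mechanically. A minimal sketch, assuming the segments are held as (speaker_id, segment_label) tuples; the labels are illustrative:

```python
from itertools import combinations

def build_pairs(segments):
    """segments: list of (speaker_id, segment_label) tuples, two per speaker."""
    same, diff = [], []
    for (spk_a, seg_a), (spk_b, seg_b) in combinations(segments, 2):
        if spk_a == spk_b:
            same.append((seg_a, seg_b))   # same-speaker pair (a DP)
        else:
            diff.append((seg_a, seg_b))   # different-speaker pair (a DQ)
    return same, diff
```

For N speakers this yields N same-speaker pairs and 2N(N − 1) different-speaker pairs, the maximum value of M used in the example.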
Step S108: input the N same-speaker pairs and the M different-speaker pairs into the neural network as training samples and train the network; when the value of the objective function meets the preset condition, stop training the network.
The value of the objective function is computed according to the formula E = E1 + K × E2, where E is the objective function, K is a constant,

E1 = Σ_{(x,y)∈P_same} ln Pr(x, y),  E2 = Σ_{(x,y)∈P_diff} ln(1 − Pr(x, y)),

x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker. The value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs. Pr(x, y) is computed according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y, and L(x, y) is computed according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Step S110: cut the target speech signal to obtain multiple speech segments.
Step S112: input the multiple speech segments into the trained neural network.
Step S114: receive the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
In a scenario where several speakers talk, their speech is recorded to obtain the target speech signal, whose length may be, for example, 20 minutes, 1 hour, or 2 hours. For example, the speech of speakers A, B, and C in a meeting is recorded to obtain a target speech signal 1 hour long. The target speech is cut into fixed-length segments (for example, 2 s), and each segment overlaps both its previous and its next segment: the first 500 ms of each segment overlap the previous segment, and the last 500 ms overlap the next segment. The segments obtained by cutting are input into the trained neural network, which outputs speech segments of 3 classes corresponding respectively to speakers A, B, and C.
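The cutting scheme described above, 2 s windows whose first and last 500 ms overlap the neighbouring segments, amounts to a window hop of 1.5 s. A minimal sketch in terms of millisecond offsets, with the actual audio I/O omitted:

```python
def segment_spans(total_ms, win_ms=2000, overlap_ms=500):
    # Each window shares its first overlap_ms with the previous window
    # and its last overlap_ms with the next one, so the hop is win - overlap.
    hop = win_ms - overlap_ms
    spans = []
    start = 0
    while start + win_ms <= total_ms:
        spans.append((start, start + win_ms))
        start += hop
    return spans
```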
As shown in Fig. 2, the neural network consists of 5 hidden layers, a temporal pooling layer, and a linear output layer. The first four layers add short-term temporal context using a time-delay structure. Let T be the number of frames in a segment and t (0 ≤ t < T) the index of a frame. The input layer splices together the features of frames [t − 1, t + 1]. At layers 2, 3, and 4, the activations at offsets [t − 2, t + 1], {t − 3, t, t + 3}, and {t − 3, t, t + 3} are spliced together, so by the fourth layer the total context is exactly [t − 9, t + 8]. Over the whole length of the input, the temporal pooling layer aggregates the outputs of the fourth layer, computes their mean, and passes it to the fifth layer. Finally the result is passed to the linear layer, which outputs the embedding x (400-dimensional). The symmetric matrix S and the offset b are also outputs of the network; these two parameters are learned together with the embeddings.
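The layer stack can be sketched at the shape level: frame-level hidden layers, mean pooling over all T frames, and a 400-dimensional linear output, so the embedding size is independent of the segment length. The widths and random weights below are illustrative placeholders rather than the patent's exact configuration, and a plain ReLU stands in for the NIN nonlinearity and the time-delay splicing:

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, HIDDEN, EMBED = 150, 500, 400

# Five frame-level hidden layers (time-delay splicing omitted for brevity).
Ws = [rng.standard_normal((FRAME_DIM, HIDDEN)) * 0.01] + \
     [rng.standard_normal((HIDDEN, HIDDEN)) * 0.01 for _ in range(4)]
W_out = rng.standard_normal((HIDDEN, EMBED)) * 0.01

def embed(frames):
    """frames: (T, FRAME_DIM) array -> fixed-size embedding."""
    h = frames
    for W in Ws:
        h = np.maximum(h @ W, 0.0)      # ReLU stand-in for the NIN nonlinearity
    pooled = h.mean(axis=0)             # temporal pooling: mean over all T frames
    return pooled @ W_out               # linear output layer -> 400-dim embedding
```

Whatever the value of T, the pooling step collapses the time axis, which is what lets segments of different lengths be compared by L(x, y).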
In the traditional i-vector/PLDA approach, a score is computed from a scoring formula that evaluates the log-likelihood ratio between two hypotheses (hypothesis 1: the two i-vectors come from the same speaker; hypothesis 2: the two i-vectors come from different speakers). Parameters corresponding to S and b appear inside that formula; in the conventional method S and b are derived analytically, whereas in embodiments of the present invention both parameters are obtained by training. The training process is as follows: the N same-speaker pairs and the M different-speaker pairs are input into the neural network as training samples and the network is trained until the value of the objective function reaches its maximum; training is then stopped and the values of the symmetric matrix S and the offset b are output.
The activation of the hidden layers is a network-in-network (NIN) nonlinearity, a mapping from a d_i-dimensional input to a d_o-dimensional output: n micro neural networks project the input into a d_h-dimensional space, and each micro network contains 3 ReLUs. The NIN configuration is {n = 50, d_i = 150, d_h = 1000, d_o = 500}, so the whole model has 460K parameters in total.
An embodiment of the invention also provides a speaker clustering device for executing the speaker clustering method described above. As shown in Fig. 3, the device includes: an acquiring unit 10, a first combining unit 20, a second combining unit 30, a training unit 40, a cutting unit 50, an input unit 60, and a receiving unit 70.
The acquiring unit 10 is configured to obtain 2N speech segments from N speakers, every 2 segments coming from the same speaker, N being a natural number greater than or equal to 2.
The first combining unit 20 is configured to combine the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker.
The second combining unit 30 is configured to combine the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2.
The training unit 40 is configured to input the N same-speaker pairs and the M different-speaker pairs into the neural network as training samples and train the network, stopping training when the value of the objective function meets the preset condition.
The cutting unit 50 is configured to cut the target speech signal to obtain multiple speech segments.
The input unit 60 is configured to input the multiple speech segments into the trained neural network.
The receiving unit 70 is configured to receive the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Optionally, the training unit 40 includes a computing subunit configured to compute the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Optionally, the computing subunit is also configured to compute Pr(x, y) according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Optionally, the computing subunit is also configured to compute L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Optionally, the value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs.
An embodiment of the invention provides a storage medium including a stored program which, when run, controls the device on which the storage medium resides to execute the following steps: obtaining 2N speech segments from N speakers, every 2 segments coming from the same speaker, N being a natural number greater than or equal to 2; combining the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker; combining the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2; inputting the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the network, stopping training when the value of the objective function meets a preset condition; cutting the target speech signal to obtain multiple speech segments; inputting the multiple speech segments into the trained neural network; and receiving the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to compute the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Optionally, when the program runs, the device is also controlled to compute Pr(x, y) according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Optionally, when the program runs, the device is also controlled to compute L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
An embodiment of the invention provides a computer device comprising a memory and a processor. The memory stores information including program instructions, and the processor controls the execution of those instructions. When the program instructions are loaded and executed by the processor, the following steps are implemented: obtaining 2N sound bites, wherein the 2N sound bites come from N speakers, every 2 sound bites come from the same speaker, and N is a natural number greater than or equal to 2; combining the 2N sound bites into N identical-speaker pairs, wherein each identical-speaker pair includes 2 sound bites and the 2 sound bites of each identical-speaker pair come from the same speaker; combining the 2N sound bites into M different-speaker pairs, wherein each different-speaker pair includes 2 sound bites, the 2 sound bites of each different-speaker pair come from different speakers, and M is a natural number greater than or equal to 2; inputting the N identical-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the neural network, stopping the training when the value of an objective function meets a preset condition; cutting a target voice to obtain multiple sound bites; inputting the multiple sound bites into the trained neural network; and receiving the sound bites of multiple classes output by the trained neural network, wherein the sound bites of each class correspond to the same speaker.
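The pair-construction steps above can be sketched as follows; the dictionary layout, speaker labels, and segment names are illustrative placeholders, not values from the patent:

```python
import random

def build_pairs(segments_by_speaker, m, seed=0):
    """Form training pairs from a dict {speaker_id: [segment_a, segment_b]}
    holding 2 sound bites for each of N speakers. Returns the N
    identical-speaker pairs and m randomly drawn different-speaker pairs."""
    rng = random.Random(seed)
    # Each speaker's two segments form one identical-speaker pair.
    same_pairs = [tuple(segs) for segs in segments_by_speaker.values()]
    speakers = list(segments_by_speaker)
    diff_pairs = []
    while len(diff_pairs) < m:
        s1, s2 = rng.sample(speakers, 2)  # two distinct speakers
        diff_pairs.append((rng.choice(segments_by_speaker[s1]),
                           rng.choice(segments_by_speaker[s2])))
    return same_pairs, diff_pairs

segments = {"spk1": ["s1a", "s1b"], "spk2": ["s2a", "s2b"], "spk3": ["s3a", "s3b"]}
same, diff = build_pairs(segments, m=4)
print(len(same), len(diff))  # 3 4
```

With N = 3 speakers this yields the N = 3 identical-speaker pairs required by the method, plus M = 4 different-speaker pairs sampled so that the two segments of each pair never share a speaker.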
Optionally, when the program instructions are loaded and executed by the processor, the following step is also implemented: calculating the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, E1 and E2 are loss terms computed over the two pair sets, x and y are sound bites, P_diff and P_same denote the different-speaker pairs and the identical-speaker pairs respectively, and Pr(x, y) is the probability that sound bite x and sound bite y belong to the same speaker.
Optionally, when the program instructions are loaded and executed by the processor, the following step is also implemented: calculating Pr(x, y) as a function of L(x, y), where L(x, y) is the distance between sound bite x and sound bite y.
Optionally, when the program instructions are loaded and executed by the processor, the following step is also implemented: calculating L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
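A hedged sketch of the objective E = E1 + K × E2; the patent gives the E1 and E2 expressions only as formula images that are not reproduced in this text, so they are assumed here to be the usual negative log-likelihood terms over the identical-speaker pairs P_same and the different-speaker pairs P_diff:

```python
import math

def objective(pr_same, pr_diff, K):
    """E = E1 + K * E2, with E1 and E2 taken, as an assumption, to be
    negative log-likelihood terms over the identical-speaker pair set
    P_same and the different-speaker pair set P_diff respectively."""
    E1 = -sum(math.log(p) for p in pr_same)        # same pairs: push Pr toward 1
    E2 = -sum(math.log(1.0 - p) for p in pr_diff)  # diff pairs: push Pr toward 0
    return E1 + K * E2

# Pr(x, y) values for 2 identical-speaker and 4 different-speaker pairs
E = objective([0.9, 0.8], [0.2, 0.1, 0.3, 0.2], K=0.5)
print(E > 0.0)  # True
```

Training would then stop once E meets the preset condition, e.g. once it falls below a chosen threshold or stops decreasing.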
Fig. 4 is a schematic diagram of a computer device provided by an embodiment of the present invention. As shown in Fig. 4, the computer device 50 of this embodiment includes a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and runnable on the processor 51. When the computer program 53 is executed by the processor 51, the speaker clustering method of the embodiments is implemented; to avoid repetition, the details are not restated here. Alternatively, when the computer program is executed by the processor 51, the functions of the models/units of the speaker clustering apparatus of the embodiments are implemented; to avoid repetition, the details are likewise not restated here.
The computer device 50 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device may include, but is not limited to, the processor 51 and the memory 52. Those skilled in the art will understand that Fig. 4 is merely an example of the computer device 50 and does not limit it; the device may include more or fewer components than illustrated, combine certain components, or use different components. For example, the computer device may also include input/output devices, network access devices, buses, and the like.
The processor 51 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 52 may be an internal storage unit of the computer device 50, such as its hard disk or internal memory. The memory 52 may also be an external storage device attached to the computer device 50, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card. Further, the memory 52 may include both an internal storage unit and an external storage device of the computer device 50. The memory 52 stores the computer program and the other programs and data needed by the computer device, and may also be used to temporarily store data that has been or will be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computing device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A speaker clustering method, characterized in that the method comprises:
obtaining 2N sound bites, wherein the 2N sound bites come from N speakers, every 2 sound bites come from the same speaker, and N is a natural number greater than or equal to 2;
combining the 2N sound bites into N identical-speaker pairs, wherein each identical-speaker pair includes 2 sound bites and the 2 sound bites of each identical-speaker pair come from the same speaker;
combining the 2N sound bites into M different-speaker pairs, wherein each different-speaker pair includes 2 sound bites, the 2 sound bites of each different-speaker pair come from different speakers, and M is a natural number greater than or equal to 2;
inputting the N identical-speaker pairs and the M different-speaker pairs into a neural network as training samples, training the neural network, and stopping training the neural network when the value of an objective function meets a preset condition;
cutting a target voice to obtain multiple sound bites;
inputting the multiple sound bites into the trained neural network; and
receiving the sound bites of multiple classes output by the trained neural network, wherein the sound bites of each class correspond to the same speaker.
2. The method according to claim 1, characterized in that the value of the objective function is calculated according to the formula E = E1 + K × E2, wherein E is the objective function, K is a constant, E1 and E2 are loss terms computed over the two pair sets, x and y are sound bites, P_diff and P_same denote the different-speaker pairs and the identical-speaker pairs respectively, and Pr(x, y) is the probability that sound bite x and sound bite y belong to the same speaker.
3. The method according to claim 2, characterized in that Pr(x, y) is calculated as a function of L(x, y), wherein L(x, y) is the distance between sound bite x and sound bite y.
4. The method according to claim 3, characterized in that L(x, y) is calculated according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, wherein S is a symmetric matrix and b is an offset.
5. The method according to any one of claims 2 to 4, characterized in that the value of K is determined by the proportional relationship between the number of identical-speaker pairs and the number of different-speaker pairs.
6. A speaker clustering apparatus, characterized in that the apparatus comprises:
an acquiring unit for obtaining 2N sound bites, wherein the 2N sound bites come from N speakers, every 2 sound bites come from the same speaker, and N is a natural number greater than or equal to 2;
a first combining unit for combining the 2N sound bites into N identical-speaker pairs, wherein each identical-speaker pair includes 2 sound bites and the 2 sound bites of each identical-speaker pair come from the same speaker;
a second combining unit for combining the 2N sound bites into M different-speaker pairs, wherein each different-speaker pair includes 2 sound bites, the 2 sound bites of each different-speaker pair come from different speakers, and M is a natural number greater than or equal to 2;
a training unit for inputting the N identical-speaker pairs and the M different-speaker pairs into a neural network as training samples, training the neural network, and stopping training the neural network when the value of an objective function meets a preset condition;
a cutting unit for cutting a target voice to obtain multiple sound bites;
an input unit for inputting the multiple sound bites into the trained neural network; and
a receiving unit for receiving the sound bites of multiple classes output by the trained neural network, wherein the sound bites of each class correspond to the same speaker.
7. The apparatus according to claim 6, characterized in that the training unit comprises:
a computing subunit for calculating the value of the objective function according to the formula E = E1 + K × E2, wherein E is the objective function, K is a constant, E1 and E2 are loss terms computed over the two pair sets, x and y are sound bites, P_diff and P_same denote the different-speaker pairs and the identical-speaker pairs respectively, and Pr(x, y) is the probability that sound bite x and sound bite y belong to the same speaker.
8. The apparatus according to claim 7, characterized in that the computing subunit is further configured to calculate Pr(x, y) as a function of L(x, y), wherein L(x, y) is the distance between sound bite x and sound bite y.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the speaker clustering method according to any one of claims 1 to 5.
10. A computer device, comprising a memory and a processor, the memory being configured to store information including program instructions and the processor being configured to control the execution of the program instructions, characterized in that, when the program instructions are loaded and executed by the processor, the steps of the speaker clustering method according to any one of claims 1 to 5 are implemented.
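Claim 5 ties K to the ratio between the number of identical-speaker pairs and the number of different-speaker pairs without fixing the exact dependence. One plausible reading, shown purely as an assumption, sets K to the ratio N/M so that the two loss terms of claim 2 carry balanced total weight when the pair counts are unequal:

```python
def balance_constant(n_same, n_diff):
    """Assumed reading of claim 5: K = N / M, i.e. the constant equals
    the ratio of identical-speaker pairs to different-speaker pairs."""
    return n_same / n_diff

print(balance_constant(100, 400))  # 0.25
```

With 100 identical-speaker pairs and 400 different-speaker pairs, K = 0.25 down-weights the more numerous different-speaker term so neither pair set dominates the objective.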
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811639415.8A CN109686382A (en) | 2018-12-29 | 2018-12-29 | A kind of speaker clustering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811639415.8A CN109686382A (en) | 2018-12-29 | 2018-12-29 | A kind of speaker clustering method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109686382A true CN109686382A (en) | 2019-04-26 |
Family
ID=66191331
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811639415.8A Pending CN109686382A (en) | 2018-12-29 | 2018-12-29 | A kind of speaker clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109686382A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162176A (en) * | 2019-05-20 | 2019-08-23 | 北京百度网讯科技有限公司 | The method for digging and device terminal, computer-readable medium of phonetic order |
CN110853666A (en) * | 2019-12-17 | 2020-02-28 | 科大讯飞股份有限公司 | Speaker separation method, device, equipment and storage medium |
CN111710331A (en) * | 2020-08-24 | 2020-09-25 | 城云科技(中国)有限公司 | Voice scheme setting method and device based on multi-slice deep neural network |
CN112017685A (en) * | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112669855A (en) * | 2020-12-17 | 2021-04-16 | 北京沃东天骏信息技术有限公司 | Voice processing method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102543080A (en) * | 2010-12-24 | 2012-07-04 | 索尼公司 | Audio editing system and audio editing method |
CN107358945A (en) * | 2017-07-26 | 2017-11-17 | 谢兵 | A kind of more people's conversation audio recognition methods and system based on machine learning |
CN107527620A (en) * | 2017-07-25 | 2017-12-29 | 平安科技(深圳)有限公司 | Electronic installation, the method for authentication and computer-readable recording medium |
CN107785015A (en) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method and device |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker's disjunctive model training method, two speaker's separation methods and relevant device |
CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
- 2018-12-29: CN CN201811639415.8A patent/CN109686382A/en (active, Pending)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102543080A (en) * | 2010-12-24 | 2012-07-04 | 索尼公司 | Audio editing system and audio editing method |
CN107785015A (en) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method and device |
CN107527620A (en) * | 2017-07-25 | 2017-12-29 | 平安科技(深圳)有限公司 | Electronic installation, the method for authentication and computer-readable recording medium |
CN107358945A (en) * | 2017-07-26 | 2017-11-17 | 谢兵 | A kind of more people's conversation audio recognition methods and system based on machine learning |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker's disjunctive model training method, two speaker's separation methods and relevant device |
CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
Non-Patent Citations (1)
Title |
---|
YANICK LUKIC等: "Speaker Identification and clustering using convolutional neural networks", 《2016 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162176A (en) * | 2019-05-20 | 2019-08-23 | 北京百度网讯科技有限公司 | The method for digging and device terminal, computer-readable medium of phonetic order |
CN110162176B (en) * | 2019-05-20 | 2022-04-26 | 北京百度网讯科技有限公司 | Voice instruction mining method and device, terminal and computer readable medium |
CN110853666A (en) * | 2019-12-17 | 2020-02-28 | 科大讯飞股份有限公司 | Speaker separation method, device, equipment and storage medium |
CN111710331A (en) * | 2020-08-24 | 2020-09-25 | 城云科技(中国)有限公司 | Voice scheme setting method and device based on multi-slice deep neural network |
CN112017685A (en) * | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112017685B (en) * | 2020-08-27 | 2023-12-22 | 抖音视界有限公司 | Speech generation method, device, equipment and computer readable medium |
CN112669855A (en) * | 2020-12-17 | 2021-04-16 | 北京沃东天骏信息技术有限公司 | Voice processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109686382A (en) | A kind of speaker clustering method and device | |
CN107680600B (en) | Sound-groove model training method, audio recognition method, device, equipment and medium | |
US10275672B2 (en) | Method and apparatus for authenticating liveness face, and computer program product thereof | |
CN108986835B (en) | Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network | |
CN108962237A (en) | Mixing voice recognition methods, device and computer readable storage medium | |
CN112233698B (en) | Character emotion recognition method, device, terminal equipment and storage medium | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
CN107492379A (en) | A kind of voice-print creation and register method and device | |
CN109166586A (en) | A kind of method and terminal identifying speaker | |
CN109461437A (en) | The verifying content generating method and relevant apparatus of lip reading identification | |
CN110363081A (en) | Face identification method, device, equipment and computer readable storage medium | |
CN110010125A (en) | A kind of control method of intelligent robot, device, terminal device and medium | |
CN108510982A (en) | Audio event detection method, device and computer readable storage medium | |
WO2022048239A1 (en) | Audio processing method and device | |
CN112837669B (en) | Speech synthesis method, device and server | |
CN110489659A (en) | Data matching method and device | |
CN111968678B (en) | Audio data processing method, device, equipment and readable storage medium | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN110111769A (en) | A kind of cochlear implant control method, device, readable storage medium storing program for executing and cochlear implant | |
CN109785846A (en) | The role recognition method and device of the voice data of monophonic | |
CN110348409A (en) | A kind of method and apparatus that facial image is generated based on vocal print | |
CN109859747A (en) | Voice interactive method, equipment and storage medium | |
CN110491409B (en) | Method and device for separating mixed voice signal, storage medium and electronic device | |
CN109448732A (en) | A kind of digit string processing method and processing device | |
Ng et al. | Teacher-student training for text-independent speaker recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190426 |
RJ01 | Rejection of invention patent application after publication |