CN109686382A - Speaker clustering method and device - Google Patents
Speaker clustering method and device
- Publication number
- CN109686382A (application CN201811639415.8A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- speech segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
Embodiments of the invention provide a speaker clustering method and device in the field of artificial intelligence. The method includes: obtaining 2N speech segments; combining the 2N speech segments into N same-speaker pairs; combining the 2N speech segments into M different-speaker pairs; inputting the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the network; cutting a target speech signal to obtain multiple speech segments; inputting these segments into the trained neural network; and receiving from the trained network the speech segments grouped into multiple classes, where the speech segments of each class correspond to the same speaker. The technical solution of the embodiments therefore solves the prior-art problem that speaker clustering needs a Gaussian mixture model and a very large projection matrix for factor analysis, which makes the computation heavy.
Description
[technical field]
The present invention relates to the field of artificial intelligence, and in particular to a speaker clustering method and device.
[background technique]
Speaker clustering groups speech segments according to their speaker; it answers the question "who spoke which words when", and is widely used in fields such as speech recognition and speaker identification.
A method used in the related art extracts an i-vector from each short speech segment and learns a PLDA (probabilistic linear discriminant analysis) scoring function to decide whether two i-vectors come from the same speaker. Extracting i-vectors, however, requires a GMM (Gaussian mixture model) and a very large projection matrix for factor analysis, which makes the computation heavy.
[summary of the invention]
In view of this, embodiments of the invention provide a speaker clustering method and device, to solve the prior-art problem that speaker clustering needs a Gaussian mixture model and a very large projection matrix for factor analysis, which makes the computation heavy.
In one aspect, an embodiment of the invention provides a speaker clustering method. The method includes: obtaining 2N speech segments from N speakers, every 2 of the segments coming from the same speaker, N being a natural number greater than or equal to 2; combining the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker; combining the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2; inputting the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the network, stopping training when the value of an objective function meets a preset condition; cutting a target speech signal to obtain multiple speech segments; inputting the multiple speech segments into the trained neural network; and receiving the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Further, the value of the objective function is computed according to the formula E = E1 + K × E2, where E is the objective function, K is a constant,

E1 = Σ_{(x,y)∈P_same} ln Pr(x, y),  E2 = Σ_{(x,y)∈P_diff} ln(1 − Pr(x, y)),

x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Further, Pr(x, y) is computed according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Further, L(x, y) is computed according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Further, the value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs.
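Under the definitions above, the scoring function and objective can be sketched in a few lines of Python. This is a minimal sketch, not the patent's implementation: numpy arrays stand in for segment embeddings, and S, b, and K are placeholder parameters.

```python
import numpy as np

def distance(x, y, S, b):
    # L(x, y) = x^T y - x^T S x - y^T S y + b
    return x @ y - x @ S @ x - y @ S @ y + b

def same_speaker_prob(x, y, S, b):
    # Pr(x, y) = 1 / (1 + e^{-L(x, y)}), a sigmoid of the distance
    return 1.0 / (1.0 + np.exp(-distance(x, y, S, b)))

def objective(same_pairs, diff_pairs, S, b, K):
    # E = E1 + K * E2: log-probabilities summed over the two pair sets
    e1 = sum(np.log(same_speaker_prob(x, y, S, b)) for x, y in same_pairs)
    e2 = sum(np.log(1.0 - same_speaker_prob(x, y, S, b)) for x, y in diff_pairs)
    return e1 + K * e2
```

Note that L, and hence Pr, is symmetric in x and y, so the order of the two segments in a pair does not matter.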
In one aspect, an embodiment of the invention provides a speaker clustering device. The device includes: an acquiring unit, configured to obtain 2N speech segments from N speakers, every 2 of the segments coming from the same speaker, N being a natural number greater than or equal to 2; a first combining unit, configured to combine the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker; a second combining unit, configured to combine the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2; a training unit, configured to input the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples, train the network, and stop training when the value of the objective function meets a preset condition; a cutting unit, configured to cut a target speech signal to obtain multiple speech segments; an input unit, configured to input the multiple speech segments into the trained neural network; and a receiving unit, configured to receive the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Further, the training unit includes a computing subunit configured to compute the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, E1 = Σ_{(x,y)∈P_same} ln Pr(x, y), E2 = Σ_{(x,y)∈P_diff} ln(1 − Pr(x, y)), x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Further, the computing subunit is also configured to compute Pr(x, y) according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Further, the computing subunit is also configured to compute L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Further, the value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs.
In one aspect, an embodiment of the invention provides a storage medium. The storage medium includes a stored program which, when run, controls the device on which the storage medium resides to execute the speaker clustering method described above.
In one aspect, an embodiment of the invention provides a computer device including a memory and a processor. The memory is configured to store information including program instructions, the processor is configured to control the execution of the program instructions, and the program instructions, when loaded and executed by the processor, implement the steps of the speaker clustering method described above.
In embodiments of the invention, N same-speaker pairs and M different-speaker pairs are input into a neural network as training samples and the network is trained until the value of the objective function meets a preset condition; a target speech signal is then cut into multiple speech segments, which are input into the trained network to obtain speech segments grouped into multiple classes, the segments of each class corresponding to one speaker. Because no i-vector needs to be extracted, the prior-art need for a Gaussian mixture model and a very large projection matrix for factor analysis is avoided, and the computational cost of speaker clustering is reduced.
[Description of the drawings]
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described here are only some embodiments of the invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an optional speaker clustering method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of an optional neural network according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an optional speaker clustering device according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an optional computer device according to an embodiment of the present invention.
[Specific embodiments]
For a better understanding of the technical solutions of the present invention, the embodiments of the present invention are described in detail below with reference to the drawings.
It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are only for describing particular embodiments and are not intended to limit the present invention. The singular forms "a", "said", and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein only describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Fig. 1 is a flowchart of an optional speaker clustering method according to an embodiment of the present invention. As shown in Fig. 1, the method includes steps S102 to S114.
Step S102: obtain 2N speech segments, the 2N segments coming from N speakers, every 2 segments coming from the same speaker, N being a natural number greater than or equal to 2.
Step S104: combine the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker.
Step S106: combine the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2.
M can take many values; for example, M = N, M = 2N, or M = 3N. The maximum value M can take is C(2N, 2) − N = 2N(N − 1), the total number of different-speaker pairs among the 2N segments.
The process of combining the 2N speech segments into N same-speaker pairs and M different-speaker pairs is described in detail below for an example in which the value of N is 3000 and the value of M is 17994000.
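The example figures are consistent with simple counting: 2N segments yield C(2N, 2) unordered pairs, N of which are same-speaker pairs, leaving at most 2N(N − 1) different-speaker pairs. A quick check:

```python
from math import comb

N = 3000
total_pairs = comb(2 * N, 2)      # all unordered pairs of the 2N segments
max_diff_pairs = total_pairs - N  # subtract the N same-speaker pairs
print(max_diff_pairs)             # 17994000, matching the example value of M
```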
6000 speech segments are obtained from 3000 speakers, 2 segments per speaker. Suppose the 3000 speakers are speaker R1, speaker R2, speaker R3, ..., speaker R2999, speaker R3000, and that the 2 speech segments of speaker Ri are segments Si-1 and Si-2, where i is any natural number from 1 to 3000. That is, speaker R1's 2 segments are S1-1 and S1-2; speaker R2's 2 segments are S2-1 and S2-2; speaker R3's 2 segments are S3-1 and S3-2; ...; speaker R2999's 2 segments are S2999-1 and S2999-2; and speaker R3000's 2 segments are S3000-1 and S3000-2.
The 6000 speech segments are combined into 3000 same-speaker pairs, denoted DP1, DP2, DP3, ..., DP2999, DP3000:
Same-speaker pair DP1 contains 2 segments: S1-1 and S1-2;
Same-speaker pair DP2 contains 2 segments: S2-1 and S2-2;
Same-speaker pair DP3 contains 2 segments: S3-1 and S3-2;
...;
Same-speaker pair DP2999 contains 2 segments: S2999-1 and S2999-2;
Same-speaker pair DP3000 contains 2 segments: S3000-1 and S3000-2.
The 6000 speech segments are combined into 2 × 3000 × 2999 = 17994000 different-speaker pairs, denoted DQ1, DQ2, DQ3, ..., DQ17993999, DQ17994000:
Different-speaker pair DQ1 contains 2 segments: S1-1 and S2-1;
Different-speaker pair DQ2 contains 2 segments: S1-1 and S2-2;
Different-speaker pair DQ3 contains 2 segments: S1-1 and S3-1;
Different-speaker pair DQ4 contains 2 segments: S1-1 and S3-2;
...;
Different-speaker pair DQ5995 contains 2 segments: S1-1 and S2999-1;
Different-speaker pair DQ5996 contains 2 segments: S1-1 and S2999-2;
Different-speaker pair DQ5997 contains 2 segments: S1-1 and S3000-1;
Different-speaker pair DQ5998 contains 2 segments: S1-1 and S3000-2;
Different-speaker pair DQ5999 contains 2 segments: S1-2 and S2-1;
Different-speaker pair DQ6000 contains 2 segments: S1-2 and S2-2;
Different-speaker pair DQ6001 contains 2 segments: S1-2 and S3-1;
Different-speaker pair DQ6002 contains 2 segments: S1-2 and S3-2;
...;
Different-speaker pair DQ11993 contains 2 segments: S1-2 and S2999-1;
Different-speaker pair DQ11994 contains 2 segments: S1-2 and S2999-2;
Different-speaker pair DQ11995 contains 2 segments: S1-2 and S3000-1;
Different-speaker pair DQ11996 contains 2 segments: S1-2 and S3000-2;
....
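The enumeration above can be reproduced mechanically. A minimal sketch, assuming the segments are held as (speaker_id, segment_label) tuples; the labels are illustrative:

```python
from itertools import combinations

def build_pairs(segments):
    """segments: list of (speaker_id, segment_label) tuples, two per speaker."""
    same, diff = [], []
    for (spk_a, seg_a), (spk_b, seg_b) in combinations(segments, 2):
        if spk_a == spk_b:
            same.append((seg_a, seg_b))   # same-speaker pair (a DP)
        else:
            diff.append((seg_a, seg_b))   # different-speaker pair (a DQ)
    return same, diff
```

For N speakers this yields N same-speaker pairs and 2N(N − 1) different-speaker pairs, the maximum value of M used in the example.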
Step S108: input the N same-speaker pairs and the M different-speaker pairs into the neural network as training samples and train the network; when the value of the objective function meets the preset condition, stop training the network.
The value of the objective function is computed according to the formula E = E1 + K × E2, where E is the objective function, K is a constant,

E1 = Σ_{(x,y)∈P_same} ln Pr(x, y),  E2 = Σ_{(x,y)∈P_diff} ln(1 − Pr(x, y)),

x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker. The value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs. Pr(x, y) is computed according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y, and L(x, y) is computed according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Step S110: cut the target speech signal to obtain multiple speech segments.
Step S112: input the multiple speech segments into the trained neural network.
Step S114: receive the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
In a scenario where several speakers talk, their speech is recorded to obtain the target speech signal, whose length may be, for example, 20 minutes, 1 hour, or 2 hours. For example, the speech of speakers A, B, and C in a meeting is recorded to obtain a target speech signal 1 hour long. The target speech is cut into fixed-length segments (for example, 2 s), and each segment overlaps both its previous and its next segment: the first 500 ms of each segment overlap the previous segment, and the last 500 ms overlap the next segment. The segments obtained by cutting are input into the trained neural network, which outputs speech segments of 3 classes corresponding respectively to speakers A, B, and C.
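The cutting scheme described above, 2 s windows whose first and last 500 ms overlap the neighbouring segments, amounts to a window hop of 1.5 s. A minimal sketch in terms of millisecond offsets, with the actual audio I/O omitted:

```python
def segment_spans(total_ms, win_ms=2000, overlap_ms=500):
    # Each window shares its first overlap_ms with the previous window
    # and its last overlap_ms with the next one, so the hop is win - overlap.
    hop = win_ms - overlap_ms
    spans = []
    start = 0
    while start + win_ms <= total_ms:
        spans.append((start, start + win_ms))
        start += hop
    return spans
```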
As shown in Fig. 2, the neural network consists of 5 hidden layers, a temporal pooling layer, and a linear output layer. The first four layers add short-term temporal context using a time-delay structure. Let T be the number of frames in a segment and t (0 ≤ t < T) the index of a frame. The input layer splices together the features of frames [t − 1, t + 1]. At layers 2, 3, and 4, the activations at offsets [t − 2, t + 1], {t − 3, t, t + 3}, and {t − 3, t, t + 3} are spliced together, so by the fourth layer the total context is exactly [t − 9, t + 8]. Over the whole length of the input, the temporal pooling layer aggregates the outputs of the fourth layer, computes their mean, and passes it to the fifth layer. Finally the result is passed to the linear layer, which outputs the embedding x (400-dimensional). The symmetric matrix S and the offset b are also outputs of the network; these two parameters are learned together with the embeddings.
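The layer stack can be sketched at the shape level: frame-level hidden layers, mean pooling over all T frames, and a 400-dimensional linear output, so the embedding size is independent of the segment length. The widths and random weights below are illustrative placeholders rather than the patent's exact configuration, and a plain ReLU stands in for the NIN nonlinearity and the time-delay splicing:

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, HIDDEN, EMBED = 150, 500, 400

# Five frame-level hidden layers (time-delay splicing omitted for brevity).
Ws = [rng.standard_normal((FRAME_DIM, HIDDEN)) * 0.01] + \
     [rng.standard_normal((HIDDEN, HIDDEN)) * 0.01 for _ in range(4)]
W_out = rng.standard_normal((HIDDEN, EMBED)) * 0.01

def embed(frames):
    """frames: (T, FRAME_DIM) array -> fixed-size embedding."""
    h = frames
    for W in Ws:
        h = np.maximum(h @ W, 0.0)      # ReLU stand-in for the NIN nonlinearity
    pooled = h.mean(axis=0)             # temporal pooling: mean over all T frames
    return pooled @ W_out               # linear output layer -> 400-dim embedding
```

Whatever the value of T, the pooling step collapses the time axis, which is what lets segments of different lengths be compared by L(x, y).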
In the traditional i-vector/PLDA approach, a score is computed from a scoring formula that evaluates the log-likelihood ratio between two hypotheses (hypothesis 1: the two i-vectors come from the same speaker; hypothesis 2: the two i-vectors come from different speakers). Parameters corresponding to S and b appear inside that formula; in the conventional method S and b are derived analytically, whereas in embodiments of the present invention both parameters are obtained by training. The training process is as follows: the N same-speaker pairs and the M different-speaker pairs are input into the neural network as training samples and the network is trained until the value of the objective function reaches its maximum; training is then stopped and the values of the symmetric matrix S and the offset b are output.
The activation of the hidden layers is a network-in-network (NIN) nonlinearity, a mapping from a d_i-dimensional input to a d_o-dimensional output: n micro neural networks project the input into a d_h-dimensional space, and each micro network contains 3 ReLUs. The NIN configuration is {n = 50, d_i = 150, d_h = 1000, d_o = 500}, so the whole model has 460K parameters in total.
An embodiment of the invention also provides a speaker clustering device for executing the speaker clustering method described above. As shown in Fig. 3, the device includes: an acquiring unit 10, a first combining unit 20, a second combining unit 30, a training unit 40, a cutting unit 50, an input unit 60, and a receiving unit 70.
The acquiring unit 10 is configured to obtain 2N speech segments from N speakers, every 2 segments coming from the same speaker, N being a natural number greater than or equal to 2.
The first combining unit 20 is configured to combine the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker.
The second combining unit 30 is configured to combine the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2.
The training unit 40 is configured to input the N same-speaker pairs and the M different-speaker pairs into the neural network as training samples and train the network, stopping training when the value of the objective function meets the preset condition.
The cutting unit 50 is configured to cut the target speech signal to obtain multiple speech segments.
The input unit 60 is configured to input the multiple speech segments into the trained neural network.
The receiving unit 70 is configured to receive the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Optionally, the training unit 40 includes a computing subunit configured to compute the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Optionally, the computing subunit is also configured to compute Pr(x, y) according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Optionally, the computing subunit is also configured to compute L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
Optionally, the value of K is determined by the ratio between the number of same-speaker pairs and the number of different-speaker pairs.
An embodiment of the invention provides a storage medium including a stored program which, when run, controls the device on which the storage medium resides to execute the following steps: obtaining 2N speech segments from N speakers, every 2 segments coming from the same speaker, N being a natural number greater than or equal to 2; combining the 2N speech segments into N same-speaker pairs, each same-speaker pair containing 2 speech segments from the same speaker; combining the 2N speech segments into M different-speaker pairs, each different-speaker pair containing 2 speech segments from different speakers, M being a natural number greater than or equal to 2; inputting the N same-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the network, stopping training when the value of the objective function meets a preset condition; cutting the target speech signal to obtain multiple speech segments; inputting the multiple speech segments into the trained neural network; and receiving the speech segments of multiple classes output by the trained network, where the speech segments of each class correspond to the same speaker.
Optionally, when the program runs, the device on which the storage medium resides is also controlled to compute the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, x and y are speech segments, P_diff and P_same are respectively the sets of different-speaker pairs and same-speaker pairs, and Pr(x, y) is the probability that segment x and segment y belong to the same speaker.
Optionally, when the program runs, the device is also controlled to compute Pr(x, y) according to the formula Pr(x, y) = (1 + e^{−L(x,y)})^{−1}, where L(x, y) is the distance between segment x and segment y.
Optionally, when the program runs, the device is also controlled to compute L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
An embodiment of the invention provides a computer device comprising a memory and a processor. The memory stores information including program instructions, and the processor controls the execution of those instructions. When the program instructions are loaded and executed by the processor, the following steps are implemented: obtaining 2N sound bites, wherein the 2N sound bites come from N speakers, every 2 sound bites come from the same speaker, and N is a natural number greater than or equal to 2; combining the 2N sound bites into N identical-speaker pairs, wherein each identical-speaker pair includes 2 sound bites and the 2 sound bites of each identical-speaker pair come from the same speaker; combining the 2N sound bites into M different-speaker pairs, wherein each different-speaker pair includes 2 sound bites, the 2 sound bites of each different-speaker pair come from different speakers, and M is a natural number greater than or equal to 2; inputting the N identical-speaker pairs and the M different-speaker pairs into a neural network as training samples and training the neural network, stopping the training when the value of an objective function meets a preset condition; cutting a target voice to obtain multiple sound bites; inputting the multiple sound bites into the trained neural network; and receiving the sound bites of multiple classes output by the trained neural network, wherein the sound bites of each class correspond to the same speaker.
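The pair-construction steps above can be sketched as follows; the dictionary layout, speaker labels, and segment names are illustrative placeholders, not values from the patent:

```python
import random

def build_pairs(segments_by_speaker, m, seed=0):
    """Form training pairs from a dict {speaker_id: [segment_a, segment_b]}
    holding 2 sound bites for each of N speakers. Returns the N
    identical-speaker pairs and m randomly drawn different-speaker pairs."""
    rng = random.Random(seed)
    # Each speaker's two segments form one identical-speaker pair.
    same_pairs = [tuple(segs) for segs in segments_by_speaker.values()]
    speakers = list(segments_by_speaker)
    diff_pairs = []
    while len(diff_pairs) < m:
        s1, s2 = rng.sample(speakers, 2)  # two distinct speakers
        diff_pairs.append((rng.choice(segments_by_speaker[s1]),
                           rng.choice(segments_by_speaker[s2])))
    return same_pairs, diff_pairs

segments = {"spk1": ["s1a", "s1b"], "spk2": ["s2a", "s2b"], "spk3": ["s3a", "s3b"]}
same, diff = build_pairs(segments, m=4)
print(len(same), len(diff))  # 3 4
```

With N = 3 speakers this yields the N = 3 identical-speaker pairs required by the method, plus M = 4 different-speaker pairs sampled so that the two segments of each pair never share a speaker.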
Optionally, when the program instructions are loaded and executed by the processor, the following step is also implemented: calculating the value of the objective function according to the formula E = E1 + K × E2, where E is the objective function, K is a constant, E1 and E2 are loss terms computed over the two pair sets, x and y are sound bites, P_diff and P_same denote the different-speaker pairs and the identical-speaker pairs respectively, and Pr(x, y) is the probability that sound bite x and sound bite y belong to the same speaker.
Optionally, when the program instructions are loaded and executed by the processor, the following step is also implemented: calculating Pr(x, y) as a function of L(x, y), where L(x, y) is the distance between sound bite x and sound bite y.
Optionally, when the program instructions are loaded and executed by the processor, the following step is also implemented: calculating L(x, y) according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, where S is a symmetric matrix and b is an offset.
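A hedged sketch of the objective E = E1 + K × E2; the patent gives the E1 and E2 expressions only as formula images that are not reproduced in this text, so they are assumed here to be the usual negative log-likelihood terms over the identical-speaker pairs P_same and the different-speaker pairs P_diff:

```python
import math

def objective(pr_same, pr_diff, K):
    """E = E1 + K * E2, with E1 and E2 taken, as an assumption, to be
    negative log-likelihood terms over the identical-speaker pair set
    P_same and the different-speaker pair set P_diff respectively."""
    E1 = -sum(math.log(p) for p in pr_same)        # same pairs: push Pr toward 1
    E2 = -sum(math.log(1.0 - p) for p in pr_diff)  # diff pairs: push Pr toward 0
    return E1 + K * E2

# Pr(x, y) values for 2 identical-speaker and 4 different-speaker pairs
E = objective([0.9, 0.8], [0.2, 0.1, 0.3, 0.2], K=0.5)
print(E > 0.0)  # True
```

Training would then stop once E meets the preset condition, e.g. once it falls below a chosen threshold or stops decreasing.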
Fig. 4 is a schematic diagram of a computer device provided by an embodiment of the present invention. As shown in Fig. 4, the computer device 50 of this embodiment includes a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and runnable on the processor 51. When the computer program 53 is executed by the processor 51, the speaker clustering method of the embodiments is implemented; to avoid repetition, the details are not restated here. Alternatively, when the computer program is executed by the processor 51, the functions of the models/units of the speaker clustering apparatus of the embodiments are implemented; to avoid repetition, the details are likewise not restated here.
The computer device 50 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device may include, but is not limited to, the processor 51 and the memory 52. Those skilled in the art will understand that Fig. 4 is merely an example of the computer device 50 and does not limit it; the device may include more or fewer components than illustrated, combine certain components, or use different components. For example, the computer device may also include input/output devices, network access devices, buses, and the like.
The processor 51 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 52 may be an internal storage unit of the computer device 50, such as its hard disk or internal memory. The memory 52 may also be an external storage device attached to the computer device 50, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card. Further, the memory 52 may include both an internal storage unit and an external storage device of the computer device 50. The memory 52 stores the computer program and the other programs and data needed by the computer device, and may also be used to temporarily store data that has been or will be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computing device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A speaker clustering method, characterized in that the method comprises:
obtaining 2N sound bites, wherein the 2N sound bites come from N speakers, every 2 sound bites come from the same speaker, and N is a natural number greater than or equal to 2;
combining the 2N sound bites into N identical-speaker pairs, wherein each identical-speaker pair includes 2 sound bites and the 2 sound bites of each identical-speaker pair come from the same speaker;
combining the 2N sound bites into M different-speaker pairs, wherein each different-speaker pair includes 2 sound bites, the 2 sound bites of each different-speaker pair come from different speakers, and M is a natural number greater than or equal to 2;
inputting the N identical-speaker pairs and the M different-speaker pairs into a neural network as training samples, training the neural network, and stopping training the neural network when the value of an objective function meets a preset condition;
cutting a target voice to obtain multiple sound bites;
inputting the multiple sound bites into the trained neural network; and
receiving the sound bites of multiple classes output by the trained neural network, wherein the sound bites of each class correspond to the same speaker.
2. The method according to claim 1, characterized in that the value of the objective function is calculated according to the formula E = E1 + K × E2, wherein E is the objective function, K is a constant, E1 and E2 are loss terms computed over the two pair sets, x and y are sound bites, P_diff and P_same denote the different-speaker pairs and the identical-speaker pairs respectively, and Pr(x, y) is the probability that sound bite x and sound bite y belong to the same speaker.
3. The method according to claim 2, characterized in that Pr(x, y) is calculated as a function of L(x, y), wherein L(x, y) is the distance between sound bite x and sound bite y.
4. The method according to claim 3, characterized in that L(x, y) is calculated according to the formula L(x, y) = x^T y − x^T S x − y^T S y + b, wherein S is a symmetric matrix and b is an offset.
5. The method according to any one of claims 2 to 4, characterized in that the value of K is determined by the proportional relationship between the number of identical-speaker pairs and the number of different-speaker pairs.
6. A speaker clustering apparatus, characterized in that the apparatus comprises:
an acquiring unit for obtaining 2N sound bites, wherein the 2N sound bites come from N speakers, every 2 sound bites come from the same speaker, and N is a natural number greater than or equal to 2;
a first combining unit for combining the 2N sound bites into N identical-speaker pairs, wherein each identical-speaker pair includes 2 sound bites and the 2 sound bites of each identical-speaker pair come from the same speaker;
a second combining unit for combining the 2N sound bites into M different-speaker pairs, wherein each different-speaker pair includes 2 sound bites, the 2 sound bites of each different-speaker pair come from different speakers, and M is a natural number greater than or equal to 2;
a training unit for inputting the N identical-speaker pairs and the M different-speaker pairs into a neural network as training samples, training the neural network, and stopping training the neural network when the value of an objective function meets a preset condition;
a cutting unit for cutting a target voice to obtain multiple sound bites;
an input unit for inputting the multiple sound bites into the trained neural network; and
a receiving unit for receiving the sound bites of multiple classes output by the trained neural network, wherein the sound bites of each class correspond to the same speaker.
7. The apparatus according to claim 6, characterized in that the training unit comprises:
a computing subunit for calculating the value of the objective function according to the formula E = E1 + K × E2, wherein E is the objective function, K is a constant, E1 and E2 are loss terms computed over the two pair sets, x and y are sound bites, P_diff and P_same denote the different-speaker pairs and the identical-speaker pairs respectively, and Pr(x, y) is the probability that sound bite x and sound bite y belong to the same speaker.
8. The apparatus according to claim 7, characterized in that the computing subunit is further configured to calculate Pr(x, y) as a function of L(x, y), wherein L(x, y) is the distance between sound bite x and sound bite y.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the speaker clustering method according to any one of claims 1 to 5.
10. A computer device, comprising a memory and a processor, the memory being configured to store information including program instructions and the processor being configured to control the execution of the program instructions, characterized in that, when the program instructions are loaded and executed by the processor, the steps of the speaker clustering method according to any one of claims 1 to 5 are implemented.
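Claim 5 ties K to the ratio between the number of identical-speaker pairs and the number of different-speaker pairs without fixing the exact dependence. One plausible reading, shown purely as an assumption, sets K to the ratio N/M so that the two loss terms of claim 2 carry balanced total weight when the pair counts are unequal:

```python
def balance_constant(n_same, n_diff):
    """Assumed reading of claim 5: K = N / M, i.e. the constant equals
    the ratio of identical-speaker pairs to different-speaker pairs."""
    return n_same / n_diff

print(balance_constant(100, 400))  # 0.25
```

With 100 identical-speaker pairs and 400 different-speaker pairs, K = 0.25 down-weights the more numerous different-speaker term so neither pair set dominates the objective.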
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811639415.8A CN109686382A (en) | 2018-12-29 | 2018-12-29 | A kind of speaker clustering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811639415.8A CN109686382A (en) | 2018-12-29 | 2018-12-29 | A kind of speaker clustering method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109686382A true CN109686382A (en) | 2019-04-26 |
Family
ID=66191331
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811639415.8A Pending CN109686382A (en) | 2018-12-29 | 2018-12-29 | A kind of speaker clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109686382A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162176A (en) * | 2019-05-20 | 2019-08-23 | 北京百度网讯科技有限公司 | The method for digging and device terminal, computer-readable medium of phonetic order |
CN110853666A (en) * | 2019-12-17 | 2020-02-28 | 科大讯飞股份有限公司 | Speaker separation method, device, equipment and storage medium |
CN111710331A (en) * | 2020-08-24 | 2020-09-25 | 城云科技(中国)有限公司 | Voice scheme setting method and device based on multi-slice deep neural network |
CN112017685A (en) * | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112669855A (en) * | 2020-12-17 | 2021-04-16 | 北京沃东天骏信息技术有限公司 | Voice processing method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102543080A (en) * | 2010-12-24 | 2012-07-04 | 索尼公司 | Audio editing system and audio editing method |
CN107358945A (en) * | 2017-07-26 | 2017-11-17 | 谢兵 | A kind of more people's conversation audio recognition methods and system based on machine learning |
CN107527620A (en) * | 2017-07-25 | 2017-12-29 | 平安科技(深圳)有限公司 | Electronic installation, the method for authentication and computer-readable recording medium |
CN107785015A (en) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method and device |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker's disjunctive model training method, two speaker's separation methods and relevant device |
CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
- 2018-12-29: CN CN201811639415.8A patent/CN109686382A/en (active, Pending)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102543080A (en) * | 2010-12-24 | 2012-07-04 | 索尼公司 | Audio editing system and audio editing method |
CN107785015A (en) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | A kind of audio recognition method and device |
CN107527620A (en) * | 2017-07-25 | 2017-12-29 | 平安科技(深圳)有限公司 | Electronic installation, the method for authentication and computer-readable recording medium |
CN107358945A (en) * | 2017-07-26 | 2017-11-17 | 谢兵 | A kind of more people's conversation audio recognition methods and system based on machine learning |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker's disjunctive model training method, two speaker's separation methods and relevant device |
CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
Non-Patent Citations (1)
Title |
---|
YANICK LUKIC等: "Speaker Identification and clustering using convolutional neural networks", 《2016 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162176A (en) * | 2019-05-20 | 2019-08-23 | 北京百度网讯科技有限公司 | The method for digging and device terminal, computer-readable medium of phonetic order |
CN110162176B (en) * | 2019-05-20 | 2022-04-26 | 北京百度网讯科技有限公司 | Voice instruction mining method and device, terminal and computer readable medium |
CN110853666A (en) * | 2019-12-17 | 2020-02-28 | 科大讯飞股份有限公司 | Speaker separation method, device, equipment and storage medium |
CN111710331A (en) * | 2020-08-24 | 2020-09-25 | 城云科技(中国)有限公司 | Voice scheme setting method and device based on multi-slice deep neural network |
CN112017685A (en) * | 2020-08-27 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112017685B (en) * | 2020-08-27 | 2023-12-22 | 抖音视界有限公司 | Speech generation method, device, equipment and computer readable medium |
CN112669855A (en) * | 2020-12-17 | 2021-04-16 | 北京沃东天骏信息技术有限公司 | Voice processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109686382A (en) | A kind of speaker clustering method and device | |
CN107680600B (en) | Sound-groove model training method, audio recognition method, device, equipment and medium | |
US10275672B2 (en) | Method and apparatus for authenticating liveness face, and computer program product thereof | |
CN108986835B (en) | Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network | |
CN108962237A (en) | Mixing voice recognition methods, device and computer readable storage medium | |
CN112233698B (en) | Character emotion recognition method, device, terminal equipment and storage medium | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
CN107492379A (en) | A kind of voice-print creation and register method and device | |
CN109166586A (en) | A kind of method and terminal identifying speaker | |
CN109461437A (en) | The verifying content generating method and relevant apparatus of lip reading identification | |
CN110363081A (en) | Face identification method, device, equipment and computer readable storage medium | |
CN110010125A (en) | A kind of control method of intelligent robot, device, terminal device and medium | |
CN108510982A (en) | Audio event detection method, device and computer readable storage medium | |
WO2022048239A1 (en) | Audio processing method and device | |
CN112837669B (en) | Speech synthesis method, device and server | |
CN110489659A (en) | Data matching method and device | |
CN111968678B (en) | Audio data processing method, device, equipment and readable storage medium | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN110111769A (en) | A kind of cochlear implant control method, device, readable storage medium storing program for executing and cochlear implant | |
CN109785846A (en) | The role recognition method and device of the voice data of monophonic | |
CN110348409A (en) | A kind of method and apparatus that facial image is generated based on vocal print | |
CN109859747A (en) | Voice interactive method, equipment and storage medium | |
CN110491409B (en) | Method and device for separating mixed voice signal, storage medium and electronic device | |
CN109448732A (en) | A kind of digit string processing method and processing device | |
Ng et al. | Teacher-student training for text-independent speaker recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190426 |
RJ01 | Rejection of invention patent application after publication |