WO2020098083A1 - Call separation method and apparatus, computer device and storage medium - Google Patents

Call separation method and apparatus, computer device and storage medium

Info

Publication number
WO2020098083A1
WO2020098083A1 (PCT/CN2018/123553)
Authority
WO
WIPO (PCT)
Prior art keywords
call
segment
speaker
call segment
speakers
Prior art date
Application number
PCT/CN2018/123553
Other languages
French (fr)
Chinese (zh)
Inventor
刘博卿
贾雪丽
程宁
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020098083A1 publication Critical patent/WO2020098083A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • This application relates to the field of artificial intelligence, in particular to a call separation method, device, computer equipment and storage medium.
  • the embodiments of the present application provide a call separation method, device, computer equipment, and storage medium to solve the current problem of inaccurate call separation.
  • an embodiment of the present application provides a call separation method, including:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • an embodiment of the present application provides a call separation device, including:
  • An original call segment acquisition module for acquiring an original call segment, the original call segment includes at least two call segments of different speakers;
  • a first call segment acquisition module used to remove the mute segment in the original call segment using mute detection to obtain the first call segment
  • a second call segment acquisition module configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
  • the target model acquisition module is used to acquire the i-vector features of each of the second call segments, and to use the pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain a target model of each second call segment;
  • the unified labeling module is used to determine, based on the target model, the second call segments of the same speaker using the variational Bayes algorithm, and to mark the second call segments of the same speaker with a unified label.
  • In a third aspect, a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • an embodiment of the present application provides a computer non-volatile readable storage medium, including: computer readable instructions, which are used to execute the following steps when the computer readable instructions are executed:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
  • FIG. 1 is a flowchart of a call separation method in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a call separation device in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a computer device in an embodiment of the present application.
  • first, second, third, etc. may be used to describe the preset ranges and the like in the embodiments of the present application, these preset ranges should not be limited to these terms. These terms are only used to distinguish the preset ranges from each other.
  • the first preset range may also be called a second preset range, and similarly, the second preset range may also be called a first preset range.
  • Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining" or "in response to detecting".
  • Similarly, depending on the context, the phrases "if determined" or "if (the stated condition or event) is detected" can be interpreted as "when determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
  • FIG. 1 shows a flowchart of the call separation method in this embodiment.
  • the call separation method can be applied to a terminal device that performs call separation, and is used to realize the function of call separation. Specifically, it can be applied to a phone call separation system installed on a computer device.
  • the computer device is a device that can perform human-computer interaction with a user, including but not limited to computers, smart phones, and tablets.
  • the call separation method includes the following steps:
  • the original call segment includes at least two call segments of different speakers.
  • the original call segment may be a call segment obtained by a recording device and including at least two different speakers. In an embodiment, it may specifically be an original call segment composed of multiple speakers recorded by a recording device in a conference scene.
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment.
  • the mute detection refers to detecting the silent (no one speaking) parts of the original call segment.
  • it can be implemented with voice activity detection (VAD, also called voice endpoint detection), for example based on frame amplitude, frame energy, the short-time zero-crossing rate, or a deep neural network.
  • S30 Cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
  • the first call voice segment is continuous on the time axis, but the call voice segments of different speakers will alternately appear on the time axis. Therefore, the first call voice segment can be cut into call segments corresponding to different speakers, and these segments are the second call segments.
  • the obtained second call segments include at least three segments (with only two segments, call separation is unnecessary), and one speaker can correspond to one or more second call segments. For example, if there are 10 second call segments corresponding to a total of 4 speakers A, B, C, and D, then A may include 5 second call segments, B 2, C 1, and D 2.
  • step S30 the first call segment is cut to obtain at least three second call segments, specifically including:
  • the Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, then revises the probability of occurrence with the Bayes formula, and finally makes the optimal decision using the expected value and the revised probability.
  • Likelihood ratio (LR) is an indicator that reflects authenticity.
  • the specific time for changing the speaker in the first call segment can be determined, and the speaker's transition point in the first call segment can be detected.
  • S32 Cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  • cutting the first call segment according to the obtained transition point can achieve a preliminary call separation effect, and it can be determined that each obtained second call segment corresponds to a speaker.
  • In steps S31-S32, the first call segment is cut so that each second call segment obtained by cutting corresponds to one speaker, which provides an important technical premise for subsequently determining the second call segments of the same speaker.
  • S40 Obtain the i-vector features of each second call segment, and use a pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain the target model of each second call segment.
  • the i-vector feature refers to a more compact vector extracted from the Gaussian mixture model (GMM) mean supervector.
  • in addition to the speaker's identity, the i-vector feature also carries information about the vocal tract, the microphone, the speaking style, and the speech itself, and can therefore comprehensively reflect the voiceprint characteristics of the voice.
  • the double-covariance probability linear discriminant analysis model is used to extract speaker information from i-vector, which can be used to compare and distinguish voiceprint features.
  • the double-covariance probability linear discriminant analysis model assumes that each i-vector is generated from two latent variables: a speaker vector y (different speakers have different vectors) and a residual vector ε (different segments have different vectors).
  • the total number of speakers is S.
  • Let I = {i_1, ..., i_M} be a given set of indicator vectors of the second call segments.
  • the speaker-independent vector ε_m of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L^-1.
  • the two covariances in the double covariance probability linear discriminant analysis model thus come from y_k and ε_m, respectively. Understandably, the modeling process computes the representation of each second call segment in the double-covariance probability linear discriminant analysis model.
  • the variational Bayes algorithm (Variational Bayes, VB for short) is an approximate posterior inference method that provides a locally optimal but deterministic solution.
  • the problem of determining the second call segments of the same speaker can be reduced to computing the posterior probability that a speaker has spoken in a given second call segment, where the posterior probability of a random event or an uncertain assertion is its conditional probability after the relevant evidence or background has been given and taken into account. Because of the above assumptions, P(Y, I | Φ) is an intractable integral, so the variational Bayes algorithm is used to approximate P(Y | Φ) and P(I | Φ).
  • step S50 based on the target model, a variational Bayes algorithm is used to determine the second call segment of the same speaker, which specifically includes:
  • S512 Based on the target model and the variational Bayes algorithm, obtain the expression of the posterior probability of the speakers, Q(Y) = ∏_{s=1..S} N(y_s; μ_s, Σ_s), where s denotes a speaker, S is the total number of speakers, y_s is the speaker vector associated with speaker s, and Q(y_s) follows a Gaussian distribution with mean μ_s and covariance Σ_s.
  • S513 Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayesian algorithm.
  • the update procedure of the expectation-maximization (EM) algorithm is used in the calculation of the variational Bayes algorithm.
  • the EM algorithm includes an E-step and an M-step: the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers are updated in the variational E-step; in the M-step, each second call segment m is assigned to the speaker s with the largest q_ms.
  • Step S513 specifically includes the following.
  • A temperature parameter β can also be introduced, so that a deterministic-annealing variant of the variational Bayes algorithm updates the posterior probability of the segments and the posterior probability of the speakers.
  • the update sets q_ms = exp(β ln q̃_ms) / Σ_{s'} exp(β ln q̃_{ms'}), with ln q̃_ms = φ_m^T L μ_s - ½ tr(L(Σ_s + μ_s μ_s^T)) + ln π_s + const, where s' indexes the speakers in the normalization, β is the temperature parameter, T denotes matrix transposition, L is the inverse of the covariance L^-1, tr(·) is the matrix trace operation, and const collects the terms that do not depend on the speaker.
  • the update of the posterior probability of the speaker is expressed as Q(y_s) = N(y_s; μ_s, C_s^-1), where Λ is the inverse of the covariance Λ^-1, Σ_s = C_s^-1 is the covariance of the speaker posterior, and C_s is the inverse of that covariance.
  • S514 Determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  • the posterior probability that the speaker has spoken in a given second conversation segment can be obtained, thereby determining the second conversation segment of the same speaker.
  • before step S50, that is, before the variational Bayes algorithm is used to determine the second call segments of the same speaker based on the target model, the method further includes:
  • S521 Initialize the number of speakers in the posterior probability of the second call segments, and take every two different speakers in the posterior probability of the second call segments as a pair.
  • the number of speakers in the posterior probability of initializing the second call segment may specifically be initialized to 3 speakers.
  • S522 Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • cosine similarity and / or likelihood ratio score can be used as a criterion for measuring distance.
  • S523 Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking every two different speakers as a pair, calculating the distance between each pair of speakers, and obtaining the two speakers that are farthest apart; then take the two speakers that are farthest apart over all repetitions as the starting point of the variational Bayes calculation.
  • In this step, steps S521-S522 are repeated a preset number of times (for example, 10 times), and the two speakers that are farthest apart over all repetitions are used as the starting point of the variational Bayes calculation.
  • Steps S521-S523 optimize the initialization of the variational Bayes algorithm, which makes the result obtained when the variational Bayes algorithm iterates with the EM algorithm more accurate, and finally yields an accurate posterior probability that a speaker has spoken in a given second call segment, so that the second call speech can be better separated by speaker. A possible realisation is sketched below.
  • In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
  • the embodiments of the present application further provide device embodiments that implement the steps and methods in the above method embodiments.
  • FIG. 2 shows a functional block diagram of a call separation device corresponding to the call separation method in the embodiment.
  • the call separation device includes an original call segment acquisition module 10, a first call segment acquisition module 20, a second call segment acquisition module 30, a target model acquisition module 40 and a unified label module 50.
  • the implementation functions of the original call segment acquisition module 10, the first call segment acquisition module 20, the second call segment acquisition module 30, the target model acquisition module 40, and the unified label module 50 correspond to the steps of the call separation method in the embodiment one by one
  • this embodiment will not elaborate one by one.
  • the original call segment obtaining module 10 is used to obtain an original call segment, and the original call segment includes at least two call segments of different speakers.
  • the first call segment acquisition module 20 is used to remove the mute segment in the original call segment using mute detection to obtain the first call segment.
  • the second call segment acquisition module 30 is configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
  • the target model acquisition module 40 is used to acquire the i-vector features of each second call segment, and use the pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain each second The target model of the call segment.
  • the unified labeling module 50 is used to determine the second conversation segment of the same speaker based on the target model, and use the variational Bayes algorithm to mark the second conversation segment of the same speaker as a unified label.
  • the second call segment acquisition module 30 includes a transition point acquisition unit and a second call segment acquisition unit.
  • the transition point acquisition unit is used to detect and obtain the speaker's transition point in the first call segment based on the Bayesian information criterion and the likelihood ratio.
  • the second call segment acquisition unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  • the speaker-independent vector ε_m of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L^-1.
  • the unified labeling module 50 includes a second call segment posterior probability acquisition unit, a speaker posterior probability acquisition unit, an update unit, and a determination unit.
  • the speaker posterior probability acquisition unit is used to obtain, based on the target model and the variational Bayes algorithm, the expression of the posterior probability of the speakers, Q(Y) = ∏_{s=1..S} N(y_s; μ_s, Σ_s), where s denotes a speaker, S is the total number of speakers, y_s is the speaker vector associated with speaker s, and Q(y_s) follows a Gaussian distribution with mean μ_s and covariance Σ_s.
  • the updating unit is used to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm.
  • the determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  • the call separation device further includes an initialization unit, a distance unit, and a starting point determination unit.
  • the initialization unit is used for initializing the number of speakers in the posterior probability of the second conversation segment, and using each different speaker in the posterior probability of the second conversation segment as a pair.
  • the distance unit is used to calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • the starting point determining unit is used to repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking every two different speakers in the posterior probability of the second call segments as a pair, calculating the distance between each pair of speakers, and obtaining the two speakers that are farthest apart, and then to take the two speakers that are farthest apart over all repetitions as the starting point of the variational Bayes calculation.
  • the updating unit is configured to update q_ms in the posterior probability Q(I) of the second call segments to q_ms = exp(ln q̃_ms) / Σ_{s'} exp(ln q̃_{ms'}), with ln q̃_ms = φ_m^T L μ_s - ½ tr(L(Σ_s + μ_s μ_s^T)) + ln π_s + const, where s' indexes the speakers in the normalization and distinguishes them from the s in q_ms, T denotes matrix transposition, L is the inverse of the covariance L^-1, tr(·) is the matrix trace operation, and const collects the terms that do not depend on the speaker;
  • the posterior probability Q(Y) of the speakers is updated to Q(y_s) = N(y_s; μ_s, C_s^-1), where Λ is the inverse of the covariance Λ^-1, Σ_s = C_s^-1 is the covariance of the speaker posterior, and C_s is the inverse of that covariance.
  • In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
  • This embodiment provides a computer non-volatile readable storage medium.
  • the computer non-volatile readable storage medium stores computer readable instructions.
  • when the computer-readable instructions are executed by a processor, the call separation method in the embodiment is implemented. To avoid repetition, details are not repeated here.
  • the computer-readable instructions are executed by the processor, the functions of the modules / units in the call separation device in the embodiment are implemented. To avoid repetition, details are not described here one by one.
  • the computer device 60 of this embodiment includes: a processor 61, a memory 62, and computer-readable instructions 63 stored in the memory 62 and executable on the processor 61. When the computer-readable instructions 63 are executed by the processor 61,
  • the call separation method in the embodiment is implemented. To avoid repetition, details are not described here one by one.
  • the computer readable instructions are executed by the processor 61, the functions of each model / unit in the call separation device in the embodiment are implemented. To avoid repetition, they are not described here one by one.
  • the computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • Computer equipment may include, but is not limited to, a processor 61 and a memory 62.
  • FIG. 3 is only an example of the computer device 60 and does not constitute a limitation on the computer device 60, which may include more or fewer components than shown, or combine certain components, or have different components.
  • computer equipment may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 61 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or a memory of the computer device 60.
  • the memory 62 may also be an external storage device of the computer device 60, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 60, etc.
  • the memory 62 may also include both the internal storage unit of the computer device 60 and the external storage device.
  • the memory 62 is used to store computer readable instructions and other programs and data required by the computer device.
  • the memory 62 may also be used to temporarily store data that has been or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The present application discloses a call separation method and apparatus, a computer device and a storage medium, relating to the field of artificial intelligence. The call separation method comprises: acquiring original call segments; using mute detection to remove mute segments in the original call segments, to obtain a first call segment; segmenting the first call segment to obtain at least three second call segments, one speaker corresponding to one or more second call segments; acquiring i-vector features of each second call segment, and modeling each i-vector feature by using a pre-trained double-covariance probability linear discriminant analysis model, to obtain a target model of each second call segment; on the basis of the target models, using a variational Bayes algorithm to determine the second call segments of the same speaker, and marking the second call segments of the same speaker with a unified label. By using the call separation method, call segments corresponding to different speakers in a call can be precisely separated.

Description

Call separation method, apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201811347184.3, filed on November 13, 2018 and entitled "Call separation method, apparatus, computer device and storage medium".
[Technical Field]
This application relates to the field of artificial intelligence, and in particular to a call separation method, apparatus, computer device, and storage medium.
[Background Art]
At present, there is a lack of a reasonable design to guarantee the effect of call separation: without knowing the speaker information in advance, the call speech segments uttered by different speakers in the same call cannot be distinguished accurately, so the effect of call separation is still unsatisfactory.
[Summary of the Invention]
In view of this, the embodiments of the present application provide a call separation method, apparatus, computer device, and storage medium, to solve the current problem of inaccurate call separation.
In a first aspect, an embodiment of the present application provides a call separation method, including:
obtaining an original call segment, where the original call segment includes call segments of at least two different speakers;
removing the mute segments in the original call segment by mute detection to obtain a first call segment;
cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
obtaining the i-vector feature of each second call segment, and modeling each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
based on the target models, determining the second call segments of the same speaker using a variational Bayes algorithm, and marking the second call segments of the same speaker with a unified label.
In a second aspect, an embodiment of the present application provides a call separation apparatus, including:
an original call segment acquisition module, configured to obtain an original call segment, where the original call segment includes call segments of at least two different speakers;
a first call segment acquisition module, configured to remove the mute segments in the original call segment by mute detection to obtain a first call segment;
a second call segment acquisition module, configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
a target model acquisition module, configured to obtain the i-vector feature of each second call segment and model each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
a unified labeling module, configured to determine, based on the target models, the second call segments of the same speaker using a variational Bayes algorithm, and to mark the second call segments of the same speaker with a unified label.
In a third aspect, a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
obtaining an original call segment, where the original call segment includes call segments of at least two different speakers;
removing the mute segments in the original call segment by mute detection to obtain a first call segment;
cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
obtaining the i-vector feature of each second call segment, and modeling each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
based on the target models, determining the second call segments of the same speaker using a variational Bayes algorithm, and marking the second call segments of the same speaker with a unified label.
In a fourth aspect, an embodiment of the present application provides a computer non-volatile readable storage medium, including computer-readable instructions which, when executed, perform the following steps:
obtaining an original call segment, where the original call segment includes call segments of at least two different speakers;
removing the mute segments in the original call segment by mute detection to obtain a first call segment;
cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
obtaining the i-vector feature of each second call segment, and modeling each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
based on the target models, determining the second call segments of the same speaker using a variational Bayes algorithm, and marking the second call segments of the same speaker with a unified label.
One of the above technical solutions has the following beneficial effects:
In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
[Brief Description of the Drawings]
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a call separation method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a call separation apparatus in an embodiment of the present application;
FIG. 3 is a schematic diagram of a computer device in an embodiment of the present application.
[Detailed Description]
For a better understanding of the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be clear that the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe preset ranges and the like, these preset ranges should not be limited by these terms. These terms are only used to distinguish the preset ranges from each other. For example, without departing from the scope of the embodiments of the present application, a first preset range may also be called a second preset range, and similarly, a second preset range may also be called a first preset range.
Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrases "if determined" or "if (the stated condition or event) is detected" can be interpreted as "when determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
FIG. 1 shows a flowchart of the call separation method in this embodiment. The call separation method can be applied to a terminal device that performs call separation to realize the call separation function, and can specifically be applied to a telephone call separation system installed on a computer device. The computer device is a device that can interact with a user, including but not limited to computers, smartphones, and tablets. The call separation method includes the following steps:
S10: Obtain an original call segment, where the original call segment includes call segments of at least two different speakers.
The original call segment may be a call segment obtained by a recording device that includes at least two different speakers. In an embodiment, it may specifically be an original call segment consisting of multiple speakers recorded by a recording device in a conference scene.
S20: Remove the mute segments in the original call segment by mute detection to obtain a first call segment.
The mute detection refers to detecting the silent (no one speaking) parts of the original call segment. In an embodiment, it can be implemented with voice activity detection (VAD, also called voice endpoint detection), for example based on frame amplitude, frame energy, the short-time zero-crossing rate, or a deep neural network. By removing the silent parts of the original call segment, the speech uttered by the speakers is retained, so that the interference of the silent parts can be excluded in the subsequent call separation, effectively improving the efficiency and accuracy of call separation.
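As an illustration of this step, the minimal sketch below implements a frame-energy variant of mute detection. The frame length and the -40 dB threshold are assumptions of this sketch, not values specified by the application, which equally allows frame amplitude, the zero-crossing rate, or a deep neural network.

```python
import numpy as np

def remove_silence(signal, sample_rate, frame_ms=25, threshold_db=-40.0):
    """Energy-based mute detection: keep only frames above a threshold relative to the peak.

    Splits the signal into non-overlapping frames, measures each frame's energy in dB
    relative to the loudest frame, and concatenates the frames judged to contain speech
    (i.e. the "first call segment").
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return np.array([]), np.array([], dtype=bool)
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len).astype(np.float64)
    energy = (frames ** 2).sum(axis=1) + 1e-12
    energy_db = 10.0 * np.log10(energy / energy.max())
    voiced = energy_db > threshold_db              # True where someone is speaking
    first_call_segment = frames[voiced].reshape(-1)
    return first_call_segment, voiced
```

For a 16 kHz recording, remove_silence(audio, 16000) would return the concatenated speech-only samples together with the per-frame decisions.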
S30: Cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
Understandably, the first call speech segment is continuous on the time axis, but the call speech of different speakers appears alternately on the time axis. Therefore, the first call segment can be cut into the call segments corresponding to the different speakers; these segments are the second call segments. The obtained second call segments include at least three segments (with only two segments, call separation is unnecessary), and one speaker may correspond to one or more second call segments. For example, if there are 10 second call segments corresponding to a total of 4 speakers A, B, C, and D, then A may include 5 second call segments, B 2, C 1, and D 2.
Further, in step S30, cutting the first call segment to obtain at least three second call segments specifically includes:
S31: Detect and obtain the speaker transition points in the first call segment based on the Bayesian information criterion and the likelihood ratio.
The Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, then revises the probability of occurrence with the Bayes formula, and finally makes the optimal decision using the expected value and the revised probability. The likelihood ratio (LR) is an indicator that reflects authenticity. In an embodiment, by combining the Bayesian information criterion with the likelihood ratio, the specific times at which the speaker changes in the first call segment can be determined, that is, the speaker transition points in the first call segment can be detected.
S32: Cut the first call segment at the speaker transition points to obtain at least three second call segments.
In an embodiment, cutting the first call segment at the obtained transition points achieves a preliminary call separation effect, and each obtained second call segment can be taken to correspond to one speaker.
In steps S31-S32, the first call segment is cut so that each second call segment obtained by cutting corresponds to one speaker, which provides an important technical premise for subsequently determining the second call segments of the same speaker.
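To make the change-point search concrete, the sketch below scores candidate boundaries with the classic ΔBIC test over windows of acoustic feature vectors (e.g. MFCCs). The window size, the penalty weight, and the omission of the likelihood-ratio refinement mentioned above are simplifications of this illustration, not details fixed by the application.

```python
import numpy as np

def delta_bic(window, t, penalty_weight=1.0):
    """ΔBIC for splitting `window` (N x d feature matrix) at frame index t.

    A positive value suggests that two Gaussians (speaker change at t) model the
    window better than a single Gaussian (no change).
    """
    n, d = window.shape
    left, right = window[:t], window[t:]

    def logdet_cov(x):
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)   # regularised full covariance
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(window)
            - 0.5 * len(left) * logdet_cov(left)
            - 0.5 * len(right) * logdet_cov(right)
            - penalty_weight * penalty)

def detect_change_points(features, window=300, step=50, min_seg=100):
    """Scan the feature sequence and keep boundaries whose ΔBIC is positive."""
    changes = []
    for start in range(0, len(features) - window, step):
        block = features[start:start + window]
        scores = [(t, delta_bic(block, t)) for t in range(min_seg, window - min_seg)]
        t_best, best = max(scores, key=lambda p: p[1])
        if best > 0:
            changes.append(start + t_best)
    return changes
```

Cutting the first call segment at the returned frame indices then yields the second call segments used in the following steps.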
S40: Obtain the i-vector feature of each second call segment, and model each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment.
The i-vector feature is a compact vector extracted from the Gaussian mixture model (GMM) mean supervector. In addition to the speaker's identity, it carries information about the vocal tract, the microphone, the speaking style, and the speech itself, and can therefore comprehensively reflect the voiceprint characteristics of the voice. In voiceprint recognition, the double covariance probability linear discriminant analysis model is used to extract speaker information from i-vectors, and can be used to compare and distinguish voiceprint features. The double covariance probability linear discriminant analysis model assumes that each i-vector is generated from two latent variables: a speaker vector y (different speakers have different vectors) and a residual vector ε (different segments have different vectors). Modeling each i-vector feature with the pre-trained double covariance probability linear discriminant analysis model represents the characteristics of the second call segments more precisely, so that a more accurate distinction can be achieved when determining the second call segments of the same speaker.
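In practice the i-vector is usually the posterior mean of the latent factor in a total-variability model built on a GMM universal background model (UBM). The sketch below computes that standard point estimate from Baum-Welch statistics; the pre-trained total-variability matrix T and the UBM parameters are assumed inputs of this illustration and are not described further by the application.

```python
import numpy as np

def extract_ivector(n_c, f_c, ubm_means, ubm_vars, T):
    """Posterior-mean i-vector w = (I + T' S^-1 N T)^-1 T' S^-1 F~ from Baum-Welch stats.

    n_c:       (C,)    zeroth-order statistics (soft frame counts per UBM component)
    f_c:       (C, D)  first-order statistics per component
    ubm_means: (C, D)  UBM component means
    ubm_vars:  (C, D)  UBM diagonal covariances
    T:         (C*D, R) total-variability matrix, R = i-vector dimension
    """
    C, D = ubm_means.shape
    R = T.shape[1]
    f_centered = (f_c - n_c[:, None] * ubm_means).reshape(C * D)  # centred first-order stats
    prec = (1.0 / ubm_vars).reshape(C * D)                        # diagonal UBM precision
    n_rep = np.repeat(n_c, D)                                     # frame counts per feature dim
    weighted_T = T * (n_rep * prec)[:, None]
    L = np.eye(R) + T.T @ weighted_T                              # posterior precision of w
    b = T.T @ (prec * f_centered)
    return np.linalg.solve(L, b)                                  # posterior mean = i-vector
```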
Before modeling, the following prerequisites hold. In one conversation, the total number of speakers is S. The i-vectors extracted from all second call segments are denoted Φ = {φ_1, ..., φ_M}. For each second call segment m = 1, ..., M, an indicator vector i_m of dimension S×1 is defined: if speaker s speaks in the second call segment m, the element i_ms of i_m is 1; if speaker s does not speak in the second call segment m, i_ms = 0. Let I = {i_1, ..., i_M} be the set of indicator vectors of the second call segments. The event that speaker s speaks in a segment is assigned a prior probability π_s.
For each speaker s, the sample y_s ∈ N(y; μ, Λ^-1), that is, the sample of each speaker s follows a normal distribution with mean μ and covariance Λ^-1; for each second call segment, the indicator i_m is a sample from the multinomial distribution Mult(Π), where Π = (π_1, ..., π_S).
With the above modeling prerequisites, the expression of the target model is φ_m = y_k + ε_m, where φ_m is the i-vector feature extracted from the m-th second call segment, y is the vector associated with the speaker of the second call segment (to distinguish it from the s in y_s above, k is taken to be the index for which i_mk = 1), i_m is the indicator vector of the second call segment, and ε_m, the speaker-independent vector of the m-th second call segment, follows a Gaussian distribution with mean 0 and covariance L^-1, i.e., ε_m ∈ N(ε; 0, L^-1). The two covariances in the double covariance probability linear discriminant analysis model thus come from y_k and ε_m, respectively. Understandably, the modeling process computes the representation of each second call segment in the double covariance probability linear discriminant analysis model. By establishing a target model for each second call segment, the target models can later be used to determine the second call segments of the same speaker.
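Under these prerequisites, the pre-trained model essentially consists of the prior mean μ, the between-speaker precision Λ, and the within-segment (residual) precision L. The sketch below holds those parameters and evaluates the per-segment log-likelihood ln N(φ_m; y, L^-1) that the later variational updates rely on. The simple moment-based training routine on labelled development i-vectors (several speakers, several segments per speaker) is an assumption of this illustration, since the application only states that the model is pre-trained.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TwoCovPLDA:
    mu: np.ndarray           # prior mean of the speaker vectors y_s
    lambda_prec: np.ndarray  # Λ: between-speaker precision (inverse of Λ^-1)
    l_prec: np.ndarray       # L: within-segment (residual) precision (inverse of L^-1)

    def log_likelihood(self, phi, y):
        """ln N(phi; y, L^-1) up to an additive constant shared by all speakers."""
        diff = phi - y
        return -0.5 * diff @ self.l_prec @ diff

def train_two_cov_plda(ivectors, labels):
    """Very small moment-based estimate of the two covariances from labelled i-vectors."""
    ivectors, labels = np.asarray(ivectors), np.asarray(labels)
    speaker_means = {s: ivectors[labels == s].mean(axis=0) for s in np.unique(labels)}
    mu = np.mean(list(speaker_means.values()), axis=0)
    between = np.cov(np.stack(list(speaker_means.values())), rowvar=False)
    within = np.cov(np.stack([iv - speaker_means[s] for iv, s in zip(ivectors, labels)]),
                    rowvar=False)
    reg = 1e-6 * np.eye(ivectors.shape[1])        # keep the covariances invertible
    return TwoCovPLDA(mu=mu,
                      lambda_prec=np.linalg.inv(between + reg),
                      l_prec=np.linalg.inv(within + reg))
```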
S50: Based on the target models, determine the second call segments of the same speaker using a variational Bayes algorithm, and mark the second call segments of the same speaker with a unified label.
The variational Bayes algorithm (Variational Bayes, VB for short) is an approximate posterior inference method that provides a locally optimal but deterministic solution.
In one embodiment, Y = {y_1, ..., y_S} is the set of speaker vectors. With the target model, the problem of determining the second call segments of the same speaker can be reduced to computing the posterior probability that a speaker has spoken in a given second call segment, where the posterior probability of a random event or an uncertain assertion is its conditional probability after the relevant evidence or background has been given and taken into account. Because of the above assumptions, P(Y, I | Φ) is an intractable integral; in this embodiment, approximate inference with the variational Bayes algorithm is used to approximate P(Y | Φ) and P(I | Φ). For brevity, P(I | Φ) is written as Q(I) and P(Y | Φ) as Q(Y), and the mean-field variational Bayes method assumes that the joint posterior can be approximated as Q(Y, I) = Q(Y)Q(I). Through approximate inference, the posterior probability that a speaker has spoken in a given second call segment can be determined, so the second call segments of the same speaker can be determined and marked with a unified label, distinguishing the second call segments by the speaker to which they belong.
Further, in step S50, determining the second call segments of the same speaker using the variational Bayes algorithm based on the target models specifically includes:
S511: Based on the target models and the variational Bayes algorithm, obtain the expression of the posterior probability of the second call segments, Q(I) = ∏_{m=1..M} ∏_{s=1..S} q_ms^{i_ms}, where m denotes a second call segment, M is the total number of second call segments, s denotes a speaker, S is the total number of speakers, q_ms is the posterior probability that s speaks in the second call segment m, and i_ms is the indicator of speaker s in the second call segment m: i_ms = 1 when speaker s speaks in the second call segment m, and i_ms = 0 when speaker s does not speak in the second call segment m.
S512: Based on the target models and the variational Bayes algorithm, obtain the expression of the posterior probability of the speakers, Q(Y) = ∏_{s=1..S} Q(y_s) with Q(y_s) = N(y_s; μ_s, Σ_s), where s denotes a speaker, S is the total number of speakers, y_s is the speaker vector associated with each speaker s, and Q(y_s) follows a Gaussian distribution with mean μ_s and covariance Σ_s.
S513: Update the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers based on the variational Bayes algorithm.
The update procedure of the expectation-maximization algorithm (EM algorithm) is used in the calculation of the variational Bayes algorithm. The EM algorithm includes an E-step and an M-step: the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers are updated in the variational E-step; in the M-step, each second call segment m is assigned to the speaker s with the largest q_ms.
进一步地,在步骤S513中,具体包括:Further, in step S513, it specifically includes:
将第二通话片段的后验概率Q(I)中的q ms更新为
Figure PCTCN2018123553-appb-000009
其中,
Figure PCTCN2018123553-appb-000010
Figure PCTCN2018123553-appb-000011
s′用于区分q ms中的s,表示更新前的s,
Figure PCTCN2018123553-appb-000012
中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;说话人的后验概率Q(Y)的更新表示为
Figure PCTCN2018123553-appb-000013
Figure PCTCN2018123553-appb-000014
Λ为协方差Λ -1的逆,
Figure PCTCN2018123553-appb-000015
是说话人后验概率的协方差,C s是协方差的逆。需要说明的是,以上公式中出现的参数在上文中均已解释,在此不一一再进行解释,只对首次出现的参数进行解释。
Update q ms in the posterior probability Q (I) of the second call segment to
Figure PCTCN2018123553-appb-000009
where,
Figure PCTCN2018123553-appb-000010
Figure PCTCN2018123553-appb-000011
s′ is used to distinguish it from the s in q ms and denotes s before the update,
Figure PCTCN2018123553-appb-000012
Where T denotes the matrix transpose operation, L is the inverse of the covariance L -1 , tr(.) denotes the matrix trace operation, and const denotes the terms independent of the speaker; the update of the speaker posterior probability Q(Y) is expressed as
Figure PCTCN2018123553-appb-000013
Figure PCTCN2018123553-appb-000014
Λ is the inverse of the covariance Λ -1 ,
Figure PCTCN2018123553-appb-000015
is the covariance of the speaker posterior probability, and C s is the inverse of that covariance. It should be noted that the parameters appearing in the above formulas have all been explained above and are not explained again here; only parameters appearing for the first time are explained.
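The q ms and Q(Y) update expressions themselves are rendered as images in the filing. As a rough reconstruction only, the sketch below implements the standard mean-field updates for the model φ m = y + ∈ with ∈ ~ N(0, L -1), under the additional assumptions that each speaker vector has a zero-mean Gaussian prior with precision Lam and that log_prior holds per-speaker prior log-weights; the function and parameter names are illustrative and are not the patent's notation.

```python
import numpy as np

def update_speaker_posteriors(phi, q, L, Lam):
    """Q(y_s): Gaussian with precision C_s = Lam + (sum_m q_ms) * L and mean
    mu_s = C_s^{-1} L sum_m q_ms phi_m (standard mean-field result for
    phi_m = y_s + eps, eps ~ N(0, L^{-1}), y_s ~ N(0, Lam^{-1}); assumed form)."""
    S = q.shape[1]
    D = phi.shape[1]
    mu = np.zeros((S, D))
    cov = np.zeros((S, D, D))
    for s in range(S):
        n_s = q[:, s].sum()
        C_s = Lam + n_s * L                     # posterior precision of y_s
        cov[s] = np.linalg.inv(C_s)             # posterior covariance of y_s
        first_moment = q[:, s] @ phi            # sum_m q_ms * phi_m
        mu[s] = cov[s] @ (L @ first_moment)     # posterior mean of y_s
    return mu, cov

def update_responsibilities(phi, mu, cov, L, log_prior):
    """q_ms proportional to exp(E_{Q(y_s)}[log N(phi_m; y_s, L^{-1})]) * prior_s;
    terms that are constant across speakers are dropped before normalisation."""
    M, S = phi.shape[0], mu.shape[0]
    log_rho = np.zeros((M, S))
    for s in range(S):
        quad = -0.5 * (mu[s] @ L @ mu[s] + np.trace(L @ cov[s]))
        log_rho[:, s] = phi @ (L @ mu[s]) + quad + log_prior[s]
    log_rho -= log_rho.max(axis=1, keepdims=True)   # log-sum-exp stabilisation
    q = np.exp(log_rho)
    return q / q.sum(axis=1, keepdims=True)
```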
进一步地,在更新第二通话片段的后验概率Q(I)和说话人的后验概率Q(Y)时,还可以引入温度参数β,采用变分贝叶斯算法的确定性退火变种对片段的后验概率和说话人的后验概率进行更新。具体地,更新过程为:q ms更新为
Figure PCTCN2018123553-appb-000016
s′用于区分q ms中的s,表示更新前的s,
Figure PCTCN2018123553-appb-000017
β表示温度参数,
Figure PCTCN2018123553-appb-000018
中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;说话人后验概率的更新表示为
Figure PCTCN2018123553-appb-000019
Figure PCTCN2018123553-appb-000020
Λ为协方差Λ -1的逆,
Figure PCTCN2018123553-appb-000021
是说话人后验概率的协方差,C s是协方差的逆。采用变分贝叶斯算法的确定性退火变种对片段的后验概率和说话人的后验概率进行更新可以有效避免说话人后验概率达到局部最优值。
Further, when updating the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers, a temperature parameter β may also be introduced, and the deterministic annealing variant of the variational Bayes algorithm may be used to update the segment posterior probability and the speaker posterior probability. Specifically, the update process is as follows: q ms is updated to
Figure PCTCN2018123553-appb-000016
s′ is used to distinguish it from the s in q ms and denotes s before the update,
Figure PCTCN2018123553-appb-000017
β represents the temperature parameter,
Figure PCTCN2018123553-appb-000018
Where T denotes the matrix transpose operation, L is the inverse of the covariance L -1 , tr(.) denotes the matrix trace operation, and const denotes the terms independent of the speaker; the update of the speaker posterior probability is expressed as
Figure PCTCN2018123553-appb-000019
Figure PCTCN2018123553-appb-000020
Λ is the inverse of the covariance Λ -1 ,
Figure PCTCN2018123553-appb-000021
is the covariance of the speaker posterior probability, and C s is the inverse of that covariance. Updating the segment posterior probability and the speaker posterior probability with the deterministic annealing variant of the variational Bayes algorithm effectively prevents the speaker posterior probability from getting stuck in a local optimum.
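The exact annealed update is again given by image-rendered formulas. One common realisation of a temperature parameter in such a variational e-step, assumed here purely for illustration, scales the unnormalised log-responsibilities by β, so that a small β flattens q ms and β is gradually raised towards 1:

```python
import numpy as np

def annealed_responsibilities(log_rho, beta):
    """Temperature-scaled e-step: q_ms proportional to exp(beta * log_rho_ms).

    log_rho: (M, S) unnormalised log-responsibilities from the ordinary update.
    beta:    temperature parameter; beta < 1 smooths q and helps avoid poor
             local optima, while beta = 1 recovers the ordinary variational update.
    """
    scaled = beta * log_rho
    scaled -= scaled.max(axis=1, keepdims=True)   # log-sum-exp stabilisation
    q = np.exp(scaled)
    return q / q.sum(axis=1, keepdims=True)

# A simple (assumed) annealing schedule: raise beta towards 1 over the iterations.
betas = np.linspace(0.3, 1.0, num=10)
```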
S514:根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的第二通话片段。S514: Determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
得到更新后的Q(I)和更新后的Q(Y)即可得到说话人在一个给定的第二通话片段中说过话的后验概率,从而确定相同的说话人的第二通话片段。By obtaining the updated Q (I) and the updated Q (Y), the posterior probability that the speaker has spoken in a given second conversation segment can be obtained, thereby determining the second conversation segment of the same speaker.
进一步地,在步骤S50之前,即在采用变分贝叶斯算法在目标模型中确定相同的说话人的第二通话片段之前,还包括:Further, before step S50, that is, before the variational Bayes algorithm is used to determine the second call segment of the same speaker in the target model, the method further includes:
S521:初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对。S521: Initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a pair.
在一实施例中,初始化第二通话片段的后验概率中说话人的个数具体可以是初始化为3个说话人。In one embodiment, initializing the number of speakers in the posterior probability of the second call segments may specifically mean initializing it to 3 speakers.
S522:计算每一对说话人之间的距离,得到距离最远的两个说话人。S522: Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
其中,在双协方差概率线性判别分析模型中,可以采用余弦相似度和/或似然比分数作为衡量距离的标准。Here, in the double covariance probabilistic linear discriminant analysis model, cosine similarity and/or a likelihood ratio score can be used as the criterion for measuring distance.
S523:重复预设次数的初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对说话人之间的距离,得到距离最远的两个说话人的步骤,得到在预设次数的步骤中距离最远的两个说话人,并将在预设次数的步骤中距离最远的两个说话人作为变分贝叶斯计算的起点。S523: Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking each different speaker in that posterior probability as one of a pair, and calculating the distance between each pair of speakers to obtain the two speakers farthest apart; obtain the two speakers that are farthest apart over the preset number of repetitions, and use those two speakers as the starting point of the variational Bayes computation.
可以理解地,本步骤为重复预设次数(如10次)的步骤S521-S522,再将所有预设次数的步骤中距离最远的两个说话人作为变分贝叶斯计算的起点。It can be understood that this step repeats steps S521-S522 a preset number of times (for example, 10 times), and then uses the two speakers that are farthest apart over all repetitions as the starting point of the variational Bayes computation.
步骤S521-S523中是对变分贝叶斯算法的初始化进行的优化步骤,可以提高变分贝叶斯算法在采用最大期望算法进行迭代时得到的运算结果更加准确,并最终根据准确地得到说话人在一个给定的第二通话片段中说过话的后验概率,从而更好地对第二通话语音按说话人进行区分。Steps S521-S523 optimize the initialization of the variational Bayes algorithm; they make the results obtained when the variational Bayes algorithm iterates with the expectation-maximization algorithm more accurate, so that the posterior probability that a speaker has spoken in a given second call segment is finally obtained accurately and the second call speech is better distinguished by speaker.
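As an illustration of steps S521-S523, the sketch below repeats a random initialisation a preset number of times, scores every pair of provisional speakers with cosine distance (one of the distance criteria mentioned above), and keeps the farthest pair as the starting point. How the provisional speakers are drawn here, namely as i-vectors of randomly chosen second call segments, and the helper names are assumptions of the sketch.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; larger values mean the two vectors are farther apart."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def farthest_pair_init(phi, n_speakers=3, n_restarts=10, seed=0):
    """Repeat the initialisation n_restarts times and return the two provisional
    speakers that are farthest apart over all restarts (steps S521-S523)."""
    rng = np.random.default_rng(seed)
    best_pair, best_dist = None, -np.inf
    for _ in range(n_restarts):
        # S521: initialise n_speakers provisional speakers (assumed here to be the
        # i-vectors of randomly chosen second call segments).
        idx = rng.choice(phi.shape[0], size=n_speakers, replace=False)
        speakers = phi[idx]
        # S522: distance of every pair of speakers; remember the farthest pair.
        for a in range(n_speakers):
            for b in range(a + 1, n_speakers):
                d = cosine_distance(speakers[a], speakers[b])
                if d > best_dist:
                    best_dist, best_pair = d, (speakers[a].copy(), speakers[b].copy())
    # S523: the farthest pair over all restarts is the variational Bayes starting point.
    return best_pair
```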
本申请实施例的技术方案具有以下有益效果:The technical solutions of the embodiments of the present application have the following beneficial effects:
本申请实施例中,首先将原始通话语音进行静音检测,可以去除语音通话中无人发出声音的静音片段,有利于提高通话分离的效率和精确度。接着将第一通话片段进行切割,可以得到不同说话人的第二通话片段,为后续确定相同的说话人的第二通话片段提供重要的技术前提。然后采用预先训练好的双协方差概率线性判别分析模型进行建模,得到每个第二通话片段的目标模型,可以通过双协方差概率线性判别分析模型将第二通话片段的特征更精确地表示出来。最后通过变分贝叶斯算法确定相同的说话人的第二通话片段,采用变分贝叶斯算法可以将属于同一说话人的第二通话片段进行聚类,精确度高,能达到精确的通话分离效果。In the embodiments of the present application, silence detection is first performed on the original call speech, which removes the silent segments in which nobody is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, providing an important technical premise for subsequently determining the second call segments of the same speaker. A pre-trained double covariance probabilistic linear discriminant analysis model is then used for modeling to obtain the target model of each second call segment, so that the characteristics of the second call segments can be represented more accurately by this model. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm; the variational Bayes algorithm clusters the second call segments belonging to the same speaker with high accuracy, achieving an accurate call separation effect.
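Putting the stages of the method together, the flow described above can be wired as in the sketch below; the four stage implementations are passed in as callables because the filing specifies them only at the level of the steps (silence detection, cutting, i-vector extraction with double covariance PLDA modelling, variational Bayes clustering), so the parameter names and signatures here are assumptions.

```python
from typing import Callable, List

def separate_call(original_call,
                  remove_silence: Callable,        # silence detection stage
                  cut_at_change_points: Callable,  # speaker change-point cutting stage
                  extract_ivectors: Callable,      # i-vector extraction stage
                  vb_cluster: Callable) -> List[int]:
    """Return one speaker label per second call segment (same speaker, same label)."""
    first_call_segment = remove_silence(original_call)             # silent parts removed
    second_call_segments = cut_at_change_points(first_call_segment)
    ivectors = extract_ivectors(second_call_segments)              # one i-vector per segment
    labels = vb_cluster(ivectors)                                  # PLDA modelling + variational Bayes
    return labels
```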
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
基于实施例中所提供的通话分离方法,本申请实施例进一步给出实现上述方法实施例中各步骤及方法的装置实施例。Based on the call separation method provided in the embodiments, the embodiments of the present application further provide device embodiments that implement the steps and methods in the above method embodiments.
图2示出与实施例中通话分离方法一一对应的通话分离装置的原理框图。如图2所示,该通话分离装置包括原始通话片段获取模块10、第一通话片段获取模块20、第二通话片段获取模块30、目标模型获取模块40和统一标签模块50。其中,原始通话片段获取模块10、第一通话片段获取模块20、第二通话片段获取模块30、目标模型获取模块40和统一标签模块50的实现功能与实施例中通话分离方法对应的步骤一一对应,为避免赘述,本实施例不一一详述。FIG. 2 shows a functional block diagram of a call separation device corresponding to the call separation method in the embodiment. As shown in FIG. 2, the call separation device includes an original call segment acquisition module 10, a first call segment acquisition module 20, a second call segment acquisition module 30, a target model acquisition module 40 and a unified label module 50. Among them, the implementation functions of the original call segment acquisition module 10, the first call segment acquisition module 20, the second call segment acquisition module 30, the target model acquisition module 40, and the unified label module 50 correspond to the steps of the call separation method in the embodiment one by one Correspondingly, in order to avoid redundant description, this embodiment will not elaborate one by one.
原始通话片段获取模块10,用于获取原始通话片段,原始通话片段包括至少两个不同说话人的通话片段。The original call segment obtaining module 10 is used to obtain an original call segment, and the original call segment includes at least two call segments of different speakers.
第一通话片段获取模块20,用于采用静音检测去除原始通话片段中的静音片段,得到第一通话片段。The first call segment acquisition module 20 is used to remove the mute segment in the original call segment using mute detection to obtain the first call segment.
第二通话片段获取模块30,用于将第一通话片段进行切割,得到至少三个第二通话片段,其中,一个说话人对应一个或多个第二通话片段。The second call segment acquisition module 30 is configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
目标模型获取模块40,用于获取每个第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个i-vector特征进行建模,得到每个第二通话片段的目标模型。The target model acquisition module 40 is used to acquire the i-vector features of each second call segment, and use the pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain each second The target model of the call segment.
统一标签模块50,用于基于目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将相同的说话人的第二通话片段标记成统一的标签。The unified labeling module 50 is used to determine the second conversation segment of the same speaker based on the target model, and use the variational Bayes algorithm to mark the second conversation segment of the same speaker as a unified label.
可选地,第一通话片段获取模块20包括转变点获取单元和第二通话片段获取单元。Optionally, the first call segment acquisition module 20 includes a transition point acquisition unit and a second call segment acquisition unit.
转变点获取单元,用于基于贝叶斯信息准则和似然比,在第一通话片段中检测并得到说话人的转变点。The transition point acquisition unit is used to detect and obtain the speaker's transition point in the first call segment based on the Bayesian information criterion and the likelihood ratio.
第二通话片段获取单元,用于根据说话人的转变点将第一通话片段进行切割,得到至少三个第二通话片段。The second call segment acquisition unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
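The transition point acquisition unit relies on the Bayesian information criterion. A standard ΔBIC test for a single candidate boundary inside a window of acoustic feature frames is sketched below; the penalty weight lam, the covariance regularisation, and the use of full-covariance Gaussians are assumptions of this sketch rather than details taken from the filing.

```python
import numpy as np

def delta_bic(window, boundary, lam=1.0):
    """ΔBIC for splitting `window` (N x d feature frames) at frame `boundary`.

    Compares one Gaussian over the whole window against two Gaussians, one per
    side of the boundary; a positive value suggests a speaker transition point.
    Both sides should contain enough frames for a stable covariance estimate.
    """
    n, d = window.shape
    x1, x2 = window[:boundary], window[boundary:]

    def logdet_cov(x):
        cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # regularised
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(window)
            - 0.5 * len(x1) * logdet_cov(x1)
            - 0.5 * len(x2) * logdet_cov(x2)
            - penalty)

# A boundary is typically kept where ΔBIC is positive and locally maximal, e.g.
# candidates = [b for b in range(10, len(window) - 10) if delta_bic(window, b) > 0].
```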
可选地,目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个第二通话片段提取的i-vector特征,y表示第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与第二通话片段的指示向量,
Figure PCTCN2018123553-appb-000022
表示第m个第二通话片段的说 话人无关向量∈服从均值为0,协方差为L -1的高斯分布,统一标签模块50包括第二通话片段后验概率获取单元、说话人后验概率获取单元、更新单元和确定单元。
Optionally, the expression of the target model is φ m = y k + ∈ m , where φ m denotes the i-vector feature extracted from the m-th second call segment, y denotes the speaker-associated vector of the second call segment, k is the index for which i mk = 1, and i m denotes the indicator vector of the second call segment,
Figure PCTCN2018123553-appb-000022
The speaker-independent vector ∈ of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . The unified labeling module 50 includes a second call segment posterior probability acquisition unit, a speaker posterior probability acquisition unit, an update unit, and a determination unit.
第二通话片段后验概率获取单元,用于基于目标模型和变分贝叶斯算法获取第二通话片段的后验概率的表达式,
Figure PCTCN2018123553-appb-000023
其中,m表示第二通话片段,M表示第二通话片段的片段总数,s表示说话人,S表示说话人的总数,q ms是s在第二通话片段m中说话的后验概率,i ms为说话人s在第二通话片段m中的指示向量,当说话人s在第二通话片段m中说话时,i ms=1,当说话人s在第二通话片段m中没有说话时,i ms=0。
The second call segment posterior probability acquisition unit is used to obtain the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
Figure PCTCN2018123553-appb-000023
Where m denotes a second call segment, M the total number of second call segments, s a speaker, and S the total number of speakers; q ms is the posterior probability that speaker s speaks in the second call segment m, and i ms is the indicator vector of speaker s in the second call segment m: i ms = 1 when speaker s speaks in segment m, and i ms = 0 when speaker s does not speak in segment m.
说话人后验概率获取单元,用于基于目标模型和变分贝叶斯算法获取说话人的后验概率的表达式,
Figure PCTCN2018123553-appb-000024
其中,s表示说话人,S表示说话人的总数,y s表示每个说话人s的第二通话片段,Q(Y)服从均值是μ s,协方差为
Figure PCTCN2018123553-appb-000025
的高斯分布。
The speaker posterior probability acquisition unit is used to acquire the posterior probability expression of the speaker based on the target model and the variational Bayesian algorithm,
Figure PCTCN2018123553-appb-000024
Where s denotes a speaker, S the total number of speakers, and y s the second call segment of each speaker s; Q(Y) follows, with mean μ s and covariance
Figure PCTCN2018123553-appb-000025
a Gaussian distribution.
更新单元,用于基于变分贝叶斯算法对第二通话片段的后验概率Q(I)和说话人的后验概率Q(Y)进行更新。The updating unit is used to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm.
确定单元,用于根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的第二通话片段。The determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
可选地,通话分离装置还包括初始化单元、距离单元和起点确定单元。Optionally, the call separation device further includes an initialization unit, a distance unit, and a starting point determination unit.
初始化单元,用于初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对。The initialization unit is used for initializing the number of speakers in the posterior probability of the second conversation segment, and using each different speaker in the posterior probability of the second conversation segment as a pair.
距离单元,用于计算每一对说话人之间的距离,得到距离最远的两个说话人。The distance unit is used to calculate the distance between each pair of speakers to obtain the two speakers with the longest distance.
起点确定单元,用于重复预设次数的初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对说话人之间的距离,得到距离最远的两个说话人的步骤,得到在预设次数的步骤中距离最远的两个说话人,并将在预设次数的步骤中距离最远的两个说话人作为变分贝叶斯计算的起点。The starting point determination unit is configured to repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking each different speaker in that posterior probability as one of a pair, and calculating the distance between each pair of speakers to obtain the two speakers farthest apart; it obtains the two speakers that are farthest apart over the preset number of repetitions and uses those two speakers as the starting point of the variational Bayes computation.
可选地,更新单元包括:将第二通话片段的后验概率Q(I)中的q ms更新为
Figure PCTCN2018123553-appb-000026
Figure PCTCN2018123553-appb-000027
其中,
Figure PCTCN2018123553-appb-000028
s′用于区分q ms中的s,表示更新前的s,
Figure PCTCN2018123553-appb-000029
中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;说话人的后验概率Q(Y)的更新表示为
Figure PCTCN2018123553-appb-000030
Figure PCTCN2018123553-appb-000031
Λ为协方差Λ -1的逆,
Figure PCTCN2018123553-appb-000032
是说话人后验概率的协方差,C s是协方差的逆。
Optionally, the updating unit includes: updating q ms in the posterior probability Q (I) of the second call segment to
Figure PCTCN2018123553-appb-000026
Figure PCTCN2018123553-appb-000027
where,
Figure PCTCN2018123553-appb-000028
s′ is used to distinguish it from the s in q ms and denotes s before the update,
Figure PCTCN2018123553-appb-000029
Where T denotes the matrix transpose operation, L is the inverse of the covariance L -1 , tr(.) denotes the matrix trace operation, and const denotes the terms independent of the speaker; the update of the speaker posterior probability Q(Y) is expressed as
Figure PCTCN2018123553-appb-000030
Figure PCTCN2018123553-appb-000031
Λ is the inverse of the covariance Λ -1 ,
Figure PCTCN2018123553-appb-000032
is the covariance of the speaker posterior probability, and C s is the inverse of that covariance.
本申请实施例的技术方案具有以下有益效果:The technical solutions of the embodiments of the present application have the following beneficial effects:
本申请实施例中,首先将原始通话语音进行静音检测,可以去除语音通话中无人发出声音的静音片段,有利于提高通话分离的效率和精确度。接着将第一通话片段进行切割,可以得到不同说话人的第二通话片段,为后续确定相同的说话人的第二通话片段提供重要的技术前提。然后采用预先训练好的双协方差概率线性判别分析模型进行建模,得到每个第二通话片段的目标模型,可以通过双协方差概率线性判别分析模型将第二通话片段的特征更精确地表示出来。最后通过变分贝叶斯算法确定相同的说话人的第二通话片段,采用变分贝叶斯算法可以将属于同一说话人的第二通话片段进行聚类,精确度高,能达到精确的通话分离效果。In the embodiments of the present application, silence detection is first performed on the original call speech, which removes the silent segments in which nobody is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, providing an important technical premise for subsequently determining the second call segments of the same speaker. A pre-trained double covariance probabilistic linear discriminant analysis model is then used for modeling to obtain the target model of each second call segment, so that the characteristics of the second call segments can be represented more accurately by this model. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm; the variational Bayes algorithm clusters the second call segments belonging to the same speaker with high accuracy, achieving an accurate call separation effect.
本实施例提供一计算机非易失性可读存储介质,该计算机非易失性可读存储介质上存储有计算机可读指令,该计算机可读指令被处理器执行时实现实施例中通话分离方法,为避免重复,此处不一一赘述。或者,该计算机可读指令被处理器执行时实现实施例中通话分离装置中各模块/单元的功能,为避免重复,此处不一一赘述。This embodiment provides a computer non-volatile readable storage medium. The computer non-volatile readable storage medium stores computer readable instructions. When the computer readable instructions are executed by a processor, the call separation method in the embodiment is implemented. To avoid repetition, I will not repeat them here. Alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules / units in the call separation device in the embodiment are implemented. To avoid repetition, details are not described here one by one.
图3是本申请一实施例提供的计算机设备的示意图。如图3所示,该实施例的计算机设备60包括:处理器61、存储器62以及存储在存储器62中并可在处理器61上运行的计算机可读指令63,该计算机可读指令63被处理器61执行时实现实施例中的通话分离方法,为避免重复,此处不一一赘述。或者,该计算机可读指令被处理器61执行时实现实施例中通话分离装置中各模型/单元的功能,为避免重复,此处不一一赘述。3 is a schematic diagram of a computer device provided by an embodiment of the present application. As shown in FIG. 3, the computer device 60 of this embodiment includes: a processor 61, a memory 62, and computer readable instructions 63 stored in the memory 62 and executable on the processor 61, and the computer readable instructions 63 are processed When the device 61 is executed, the call separation method in the embodiment is implemented. To avoid repetition, details are not described here one by one. Alternatively, when the computer readable instructions are executed by the processor 61, the functions of each model / unit in the call separation device in the embodiment are implemented. To avoid repetition, they are not described here one by one.
计算机设备60可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可包括,但不仅限于,处理器61、存储器62。本领域技术人员可以理解,图3仅仅是计算机设备60的示例,并不构成对计算机设备60的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如计算机设备还可以包括输入输出设备、网络接入设备、总线等。The computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. Computer equipment may include, but is not limited to, a processor 61 and a memory 62. Those skilled in the art may understand that FIG. 3 is only an example of the computer device 60, and does not constitute a limitation on the computer device 60, and may include more or less components than shown, or combine certain components, or different components. For example, computer equipment may also include input and output devices, network access devices, buses, and so on.
所称处理器61可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 61 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
存储器62可以是计算机设备60的内部存储单元,例如计算机设备60的硬盘或内存。存储器62也可以是计算机设备60的外部存储设备,例如计算机设备60上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存 卡(Flash Card)等。进一步地,存储器62还可以既包括计算机设备60的内部存储单元也包括外部存储设备。存储器62用于存储计算机可读指令以及计算机设备所需的其他程序和数据。存储器62还可以用于暂时地存储已经输出或者将要输出的数据。The memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or a memory of the computer device 60. The memory 62 may also be an external storage device of the computer device 60, for example, a plug-in hard disk equipped on the computer device 60, a smart memory card (Smart) Card (SMC), a secure digital (SD) card, and a flash memory card (Flash Card) etc. Further, the memory 62 may also include both the internal storage unit of the computer device 60 and the external storage device. The memory 62 is used to store computer readable instructions and other programs and data required by the computer device. The memory 62 may also be used to temporarily store data that has been or will be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the above-mentioned division of each functional unit and module is used as an example for illustration. In practical applications, the above-mentioned functions may be allocated by different functional units, Module completion means that the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they can still apply the technical solutions of the foregoing embodiments. The recorded technical solutions are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application, and should be included in this application. Within the scope of protection.

Claims (20)

  1. 一种通话分离方法,其特征在于,所述方法包括:A call separation method, characterized in that the method includes:
    获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;Obtain an original call segment, where the original call segment includes at least two call segments of different speakers;
    采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
    将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;Cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;Acquiring i-vector features of each of the second call segments, and using a pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each of the second call segments Target model
    基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。Based on the target model, a variational Bayes algorithm is used to determine the second conversation segment of the same speaker, and the second conversation segment of the same speaker is marked as a unified label.
  2. 根据权利要求1所述的方法,其特征在于,所述将所述第一通话片段进行切割,得到至少三个第二通话片段,包括:The method according to claim 1, wherein the cutting the first call segment to obtain at least three second call segments includes:
    基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;Based on the Bayesian information criterion and likelihood ratio, detect and obtain the speaker's transition point in the first call segment;
    根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The first call segment is cut according to the speaker's transition point to obtain at least three second call segments.
  3. 根据权利要求1所述的方法,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100001
    Figure PCTCN2018123553-appb-100002
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方差为L -1的高斯分布,所述基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,包括:
    The method according to claim 1, wherein the expression φ m = y k + ∈ m of the target model, where φ m represents an i-vector feature extracted from the m-th second call segment, y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100001
    Figure PCTCN2018123553-appb-100002
    The speaker-independent vector ∈ representing the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . Based on the target model, the variational Bayesian algorithm is used to determine the same speech The second call segment of the person includes:
    基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100003
    Figure PCTCN2018123553-appb-100004
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    Obtaining the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100003
    Figure PCTCN2018123553-appb-100004
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100005
    Figure PCTCN2018123553-appb-100006
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100007
    的高斯分布;
    Obtain an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100005
    Figure PCTCN2018123553-appb-100006
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100007
    Gaussian distribution
    基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y) 进行更新;Updating the posterior probability Q (I) of the second conversation segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The second call segment of the same speaker is determined according to the updated Q (I) and the updated Q (Y).
  4. 根据权利要求3所述的方法,其特征在于,在所述采用变分贝叶斯算法在所述目标模型中确定相同的说话人的第二通话片段之前,还包括:The method according to claim 3, characterized in that, before the adopting the variational Bayes algorithm to determine the second conversation segment of the same speaker in the target model, further comprising:
    初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;Initializing the number of speakers in the posterior probability of the second call segment, and using each different speaker in the posterior probability of the second call segment as a pair;
    计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;Calculate the distance between each pair of the speakers to obtain the two farthest speakers;
    重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。Repeating a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, using each different speaker in the posterior probability of the second call segment as a pair and calculating each pair of The step of obtaining the two farthest speakers from the distance between the speakers, and the two farthest speakers from the step of the preset number of times The two farthest speakers in the step of times are used as the starting point of the variational Bayesian calculation.
  5. 根据权利要求3或4任一项所述的方法,其特征在于,所述采用变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新,包括:The method according to any one of claims 3 or 4, wherein the posterior probability Q (I) of the second conversation segment using the variational Bayes algorithm and the posterior of the speaker The probability Q (Y) is updated, including:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100008
    其中,
    Figure PCTCN2018123553-appb-100009
    Figure PCTCN2018123553-appb-100010
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100011
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100012
    Figure PCTCN2018123553-appb-100013
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100014
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100008
    among them,
    Figure PCTCN2018123553-appb-100009
    Figure PCTCN2018123553-appb-100010
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100011
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term to the speaker; the posterior probability Q (Y) of the speaker Is expressed as
    Figure PCTCN2018123553-appb-100012
    Figure PCTCN2018123553-appb-100013
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100014
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
  6. 一种通话分离装置,其特征在于,所述装置包括:A call separation device, characterized in that the device includes:
    原始通话片段获取模块,用于获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;An original call segment acquisition module, for acquiring an original call segment, the original call segment includes at least two call segments of different speakers;
    第一通话片段获取模块,用于采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;A first call segment acquisition module, used to remove the mute segment in the original call segment using mute detection to obtain the first call segment;
    第二通话片段获取模块,用于将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;A second call segment acquisition module, configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    目标模型获取模块,用于获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;The target model acquisition module is used to acquire the i-vector features of each of the second call segments, and use the pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each A target model of the second call segment;
    统一标签模块,用于基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。The unified labeling module is used to determine the second call segment of the same speaker based on the target model, and to use the variational Bayes algorithm to mark the second call segment of the same speaker as a unified label .
  7. 根据权利要求6所述的装置,其特征在于,所述第一通话片段获取模块,包括:The apparatus according to claim 6, wherein the first call segment acquisition module includes:
    转变点获取单元,用于基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;A transition point obtaining unit, configured to detect and obtain a speaker's transition point in the first call segment based on Bayesian information criterion and likelihood ratio;
    第二通话片段获取单元,用于根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The second call segment acquiring unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  8. 根据权利要求6所述的装置,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100015
    Figure PCTCN2018123553-appb-100016
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方差为L -1的高斯分布,所述统一标签模块,包括:
    The apparatus according to claim 6, wherein the expression φ m = y k + ∈ m of the target model, where φ m represents an i-vector feature extracted from the m-th second call segment, y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100015
    Figure PCTCN2018123553-appb-100016
    The speaker-independent vector ∈ representing the m-th second call segment is subject to a Gaussian distribution with mean 0 and covariance L -1 . The unified labeling module includes:
    第二通话片段后验概率获取单元,用于基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100017
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    A second call segment posterior probability acquisition unit for acquiring an expression of the posterior probability of the second call segment based on the target model and the variational Bayes algorithm
    Figure PCTCN2018123553-appb-100017
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    说话人后验概率获取单元,用于基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100018
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100019
    的高斯分布;
    A speaker posterior probability acquisition unit for acquiring an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100018
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100019
    Gaussian distribution
    更新单元,用于基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新;An updating unit, configured to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    确定单元,用于根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  9. 根据权利要求8所述的装置,其特征在于,所述装置还包括:The device according to claim 8, wherein the device further comprises:
    初始化单元,用于初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;An initialization unit, configured to initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a pair;
    距离单元,用于计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;The distance unit is used to calculate the distance between each pair of the speakers to obtain the two speakers with the longest distance;
    起点确定单元,用于重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数, 将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。The starting point determining unit is used to repeat a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a And calculating the distance between each pair of speakers, obtaining the two farthest speakers, and obtaining the two farthest speakers in the preset number of steps, and The two farthest speakers in the step of the preset number of times are used as the starting point of the variational Bayesian calculation.
  10. 根据权利要求8或9任一项所述的装置,其特征在于,所述更新单元具体用于:The device according to any one of claims 8 or 9, wherein the update unit is specifically configured to:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100020
    其中,
    Figure PCTCN2018123553-appb-100021
    Figure PCTCN2018123553-appb-100022
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100023
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100024
    Figure PCTCN2018123553-appb-100025
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100026
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100020
    among them,
    Figure PCTCN2018123553-appb-100021
    Figure PCTCN2018123553-appb-100022
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100023
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term to the speaker; the posterior probability Q (Y) of the speaker Is expressed as
    Figure PCTCN2018123553-appb-100024
    Figure PCTCN2018123553-appb-100025
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100026
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
  11. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that, when the processor executes the computer-readable instructions, it is implemented as follows step:
    获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;Obtain an original call segment, where the original call segment includes at least two call segments of different speakers;
    采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
    将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;Cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;Acquiring i-vector features of each of the second call segments, and using a pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each of the second call segments Target model
    基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。Based on the target model, a variational Bayes algorithm is used to determine the second conversation segment of the same speaker, and the second conversation segment of the same speaker is marked as a unified label.
  12. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device according to claim 11, wherein the processor further implements the following steps when executing the computer-readable instructions:
    基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;Based on the Bayesian information criterion and likelihood ratio, detect and obtain the speaker's transition point in the first call segment;
    根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The first call segment is cut according to the speaker's transition point to obtain at least three second call segments.
  13. 根据权利要求11所述的计算机设备,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100027
    Figure PCTCN2018123553-appb-100028
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方 差为L -1的高斯分布,所述基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,所述处理器执行所述计算机可读指令时还实现如下步骤:
    The computer device according to claim 11, wherein the expression of the target model φ m = y k + ∈ m , wherein φ m represents the i-vector feature extracted from the m-th second call segment , y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100027
    Figure PCTCN2018123553-appb-100028
    The speaker-independent vector ∈ representing the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . Based on the target model, the variational Bayesian algorithm is used to determine the same speech For the second call segment of the person, the processor also implements the following steps when executing the computer-readable instructions:
    基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100029
    Figure PCTCN2018123553-appb-100030
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    Obtaining the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100029
    Figure PCTCN2018123553-appb-100030
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100031
    Figure PCTCN2018123553-appb-100032
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100033
    的高斯分布;
    Obtain an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100031
    Figure PCTCN2018123553-appb-100032
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100033
    Gaussian distribution
    基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新;Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The second call segment of the same speaker is determined according to the updated Q (I) and the updated Q (Y).
  14. 根据权利要求13所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device according to claim 13, wherein the processor further implements the following steps when executing the computer-readable instructions:
    初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;Initializing the number of speakers in the posterior probability of the second call segment, and using each different speaker in the posterior probability of the second call segment as a pair;
    计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;Calculate the distance between each pair of the speakers to obtain the two farthest speakers;
    重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。Repeating a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, using each different speaker in the posterior probability of the second call segment as a pair and calculating each pair of The step of obtaining the two farthest speakers from the distance between the speakers, and the two farthest speakers from the step of the preset number of times The two farthest speakers in the step of times are used as the starting point of the variational Bayesian calculation.
  15. 根据权利要求13或14任一项所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device according to any one of claims 13 or 14, wherein the processor further implements the following steps when executing the computer-readable instructions:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100034
    其中,
    Figure PCTCN2018123553-appb-100035
    Figure PCTCN2018123553-appb-100036
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100037
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100038
    Figure PCTCN2018123553-appb-100039
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100040
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100034
    among them,
    Figure PCTCN2018123553-appb-100035
    Figure PCTCN2018123553-appb-100036
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100037
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term for the speaker; Is expressed as
    Figure PCTCN2018123553-appb-100038
    Figure PCTCN2018123553-appb-100039
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100040
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
  16. 一种计算机非易失性可读存储介质,所述计算机非易失性可读存储介质存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现如下步骤:A computer nonvolatile readable storage medium, the computer nonvolatile readable storage medium storing computer readable instructions, characterized in that the computer readable instructions are executed by a processor to implement the following steps:
    获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;Obtain an original call segment, where the original call segment includes at least two call segments of different speakers;
    采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
    将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;Cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;Acquiring i-vector features of each of the second call segments, and using a pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each of the second call segments Target model
    基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。Based on the target model, a variational Bayes algorithm is used to determine the second conversation segment of the same speaker, and the second conversation segment of the same speaker is marked as a unified label.
  17. 根据权利要求16所述的计算机非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:The computer non-volatile storage medium according to claim 16, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps :
    基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;Based on the Bayesian information criterion and likelihood ratio, detect and obtain the speaker's transition point in the first call segment;
    根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The first call segment is cut according to the speaker's transition point to obtain at least three second call segments.
  18. 根据权利要求16所述的计算机非易失性可读存储介质,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100041
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方差为L -1的高斯分布,所述基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:
    The computer non-volatile storage medium according to claim 16, wherein the expression of the target model is φ m = y k + ∈ m , where φ m represents the m-th second call i-vector extraction feature segment, y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100041
    The speaker-independent vector ∈ representing the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . Based on the target model, the variational Bayesian algorithm is used to determine the same speech In the second call segment of the person, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps:
    基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100042
    Figure PCTCN2018123553-appb-100043
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    Obtaining the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100042
    Figure PCTCN2018123553-appb-100043
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100044
    Figure PCTCN2018123553-appb-100045
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人 s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100046
    的高斯分布;
    Obtain an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100044
    Figure PCTCN2018123553-appb-100045
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100046
    Gaussian distribution
    基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新;Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The second call segment of the same speaker is determined according to the updated Q (I) and the updated Q (Y).
  19. 根据权利要求18所述的计算机非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:The computer non-volatile storage medium according to claim 18, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps :
    初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;Initializing the number of speakers in the posterior probability of the second call segment, and using each different speaker in the posterior probability of the second call segment as a pair;
    计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;Calculate the distance between each pair of the speakers to obtain the two farthest speakers;
    重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。Repeating a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, using each different speaker in the posterior probability of the second call segment as a pair and calculating each pair of The step of obtaining the two farthest speakers from the distance between the speakers, and the two farthest speakers from the step of the preset number of times The two farthest speakers in the step of times are used as the starting point of the variational Bayesian calculation.
  20. 根据权利要求18或19任一项所述的计算机非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:The computer non-volatile storage medium according to any one of claims 18 or 19, wherein when the computer-readable instructions are executed by one or more processors, the one or more processes The device also implements the following steps:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100047
    其中,
    Figure PCTCN2018123553-appb-100048
    Figure PCTCN2018123553-appb-100049
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100050
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100051
    Figure PCTCN2018123553-appb-100052
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100053
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100047
    among them,
    Figure PCTCN2018123553-appb-100048
    Figure PCTCN2018123553-appb-100049
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100050
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term to the speaker; the posterior probability Q (Y) of the speaker Is expressed as
    Figure PCTCN2018123553-appb-100051
    Figure PCTCN2018123553-appb-100052
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100053
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
PCT/CN2018/123553 2018-11-13 2018-12-25 Call separation method and apparatus, computer device and storage medium WO2020098083A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811347184.3A CN109360572B (en) 2018-11-13 2018-11-13 Call separation method and device, computer equipment and storage medium
CN201811347184.3 2018-11-13

Publications (1)

Publication Number Publication Date
WO2020098083A1 true WO2020098083A1 (en) 2020-05-22

Family

ID=65344905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/123553 WO2020098083A1 (en) 2018-11-13 2018-12-25 Call separation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN109360572B (en)
WO (1) WO2020098083A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390946A (en) * 2019-07-26 2019-10-29 龙马智芯(珠海横琴)科技有限公司 A kind of audio signal processing method, device, electronic equipment and storage medium
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN113129893B (en) * 2019-12-30 2022-09-02 Oppo(重庆)智能科技有限公司 Voice recognition method, device, equipment and storage medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112735384A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Turning point detection method, device and equipment applied to speaker separation
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113707173B (en) * 2021-08-30 2023-12-29 平安科技(深圳)有限公司 Voice separation method, device, equipment and storage medium based on audio segmentation


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018005620A1 (en) * 2016-06-28 2018-01-04 Pindrop Security, Inc. System and method for cluster-based audio event detection
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes
US20180254051A1 (en) * 2017-03-02 2018-09-06 International Business Machines Corporation Role modeling in call centers and work centers
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107452403A (en) * 2017-09-12 2017-12-08 清华大学 A kind of speaker's labeling method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071438A (en) * 2020-09-29 2020-12-11 武汉东湖大数据交易中心股份有限公司 Intelligent pertussis screening method and system
CN112071438B (en) * 2020-09-29 2022-06-14 武汉东湖大数据交易中心股份有限公司 Intelligent pertussis screening method and system
CN115168643A (en) * 2022-09-07 2022-10-11 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109360572A (en) 2019-02-19
CN109360572B (en) 2022-03-11

Similar Documents

Publication Publication Date Title
WO2020098083A1 (en) Call separation method and apparatus, computer device and storage medium
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
US11335352B2 (en) Voice identity feature extractor and classifier training
US9589564B2 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US9202462B2 (en) Key phrase detection
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
US9626970B2 (en) Speaker identification using spatial information
US20150325240A1 (en) Method and system for speech input
US9589560B1 (en) Estimating false rejection rate in a detection system
KR20200012963A (en) Object recognition method, computer device and computer readable storage medium
Tong et al. A comparative study of robustness of deep learning approaches for VAD
WO2020147256A1 (en) Conference content distinguishing method and apparatus, and computer device and storage medium
WO2014029099A1 (en) I-vector based clustering training data in speech recognition
WO2020253051A1 (en) Lip language recognition method and apparatus
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
WO2014114049A1 (en) Voice recognition method and device
US11756572B2 (en) Self-supervised speech representations for fake audio detection
WO2019237518A1 (en) Model library establishment method, voice recognition method and apparatus, and device and medium
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
JP2019144467A (en) Mask estimation apparatus, model learning apparatus, sound source separation apparatus, mask estimation method, model learning method, sound source separation method, and program
Jiang et al. Mobile phone identification from speech recordings using weighted support vector machine
Dang et al. Factor Analysis Based Speaker Normalisation for Continuous Emotion Prediction.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18939999

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939999

Country of ref document: EP

Kind code of ref document: A1