WO2020098083A1 - Call separation method and apparatus, computer device and storage medium - Google Patents

Call separation method and apparatus, computer device and storage medium

Info

Publication number
WO2020098083A1
WO2020098083A1 (PCT/CN2018/123553)
Authority
WO
WIPO (PCT)
Prior art keywords
call
segment
speaker
call segment
speakers
Prior art date
Application number
PCT/CN2018/123553
Other languages
French (fr)
Chinese (zh)
Inventor
刘博卿
贾雪丽
程宁
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020098083A1 publication Critical patent/WO2020098083A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • This application relates to the field of artificial intelligence, in particular to a call separation method, device, computer equipment and storage medium.
  • the embodiments of the present application provide a call separation method, device, computer equipment, and storage medium to solve the current problem of inaccurate call separation.
  • an embodiment of the present application provides a call separation method, including:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • an embodiment of the present application provides a call separation device, including:
  • An original call segment acquisition module for acquiring an original call segment, the original call segment includes at least two call segments of different speakers;
  • a first call segment acquisition module used to remove the mute segment in the original call segment using mute detection to obtain the first call segment
  • a second call segment acquisition module configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
  • the target model acquisition module is used to acquire the i-vector features of each of the second call segments, and to use the pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain a target model of each second call segment;
  • the unified labeling module is used to determine, based on the target model, the second call segments of the same speaker using the variational Bayes algorithm, and to mark the second call segments of the same speaker with a unified label.
  • In a third aspect, a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • an embodiment of the present application provides a computer non-volatile readable storage medium, including: computer readable instructions, which are used to execute the following steps when the computer readable instructions are executed:
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment
  • based on the target model, a variational Bayes algorithm is used to determine the second call segments of the same speaker, and the second call segments of the same speaker are marked with a unified label.
  • In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
  • FIG. 1 is a flowchart of a call separation method in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a call separation device in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a computer device in an embodiment of the present application.
  • first, second, third, etc. may be used to describe the preset ranges and the like in the embodiments of the present application, these preset ranges should not be limited to these terms. These terms are only used to distinguish the preset ranges from each other.
  • the first preset range may also be called a second preset range, and similarly, the second preset range may also be called a first preset range.
  • Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining" or "in response to detecting".
  • Similarly, depending on the context, the phrases "if determined" or "if (the stated condition or event) is detected" can be interpreted as "when determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
  • FIG. 1 shows a flowchart of the call separation method in this embodiment.
  • the call separation method can be applied to a terminal device that performs call separation, and is used to realize the function of call separation. Specifically, it can be applied to a phone call separation system installed on a computer device.
  • the computer device is a device that can perform human-computer interaction with a user, including but not limited to computers, smart phones, and tablets.
  • the call separation method includes the following steps:
  • the original call segment includes at least two call segments of different speakers.
  • the original call segment may be a call segment obtained by a recording device and including at least two different speakers. In an embodiment, it may specifically be an original call segment composed of multiple speakers recorded by a recording device in a conference scene.
  • Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment.
  • the mute detection refers to detecting the silent (no one speaking) parts of the original call segment.
  • it can be implemented with voice activity detection (VAD, also called voice endpoint detection), for example based on frame amplitude, frame energy, the short-time zero-crossing rate, or a deep neural network.
  • S30 Cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
  • the first call voice segment is continuous on the time axis, but the call voice segments of different speakers will alternately appear on the time axis. Therefore, the first call voice segment can be cut into call segments corresponding to different speakers, and these segments are the second call segments.
  • the obtained second call segments include at least three segments (with only two segments, call separation is unnecessary), and one speaker can correspond to one or more second call segments. For example, if there are 10 second call segments corresponding to a total of 4 speakers A, B, C, and D, then A may include 5 second call segments, B 2, C 1, and D 2.
  • step S30 the first call segment is cut to obtain at least three second call segments, specifically including:
  • the Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, then revises the probability of occurrence with the Bayes formula, and finally makes the optimal decision using the expected value and the revised probability.
  • Likelihood ratio (LR) is an indicator that reflects authenticity.
  • the specific time for changing the speaker in the first call segment can be determined, and the speaker's transition point in the first call segment can be detected.
  • S32 Cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  • cutting the first call segment according to the obtained transition point can achieve a preliminary call separation effect, and it can be determined that each obtained second call segment corresponds to a speaker.
  • In steps S31-S32, the first call segment is cut so that each second call segment obtained by cutting corresponds to one speaker, which provides an important technical premise for subsequently determining the second call segments of the same speaker.
  • S40 Obtain the i-vector features of each second call segment, and use a pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain the target model of each second call segment.
  • the i-vector feature refers to a more compact vector extracted from the Gaussian mixture model (GMM) mean supervector.
  • in addition to the speaker's identity, the i-vector feature also carries information about the vocal tract, the microphone, the speaking style, and the speech itself, and can therefore comprehensively reflect the voiceprint characteristics of the voice.
  • the double-covariance probability linear discriminant analysis model is used to extract speaker information from i-vector, which can be used to compare and distinguish voiceprint features.
  • the double-covariance probability linear discriminant analysis model assumes that each i-vector is generated from two latent variables: a speaker vector y (different speakers have different vectors) and a residual vector ε (different segments have different vectors).
  • the total number of speakers is S.
  • Let I = {i_1, ..., i_M} be a given set of indicator vectors of the second call segments.
  • the speaker-independent vector ε_m of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L^-1.
  • the two covariances in the double covariance probability linear discriminant analysis model thus come from y_k and ε_m, respectively. Understandably, the modeling process computes the representation of each second call segment in the double-covariance probability linear discriminant analysis model.
  • the variational Bayes algorithm (Variational Bayes, VB for short) is an approximate posterior inference method that provides a locally optimal but deterministic solution.
  • the problem of determining the second call segments of the same speaker can be reduced to computing the posterior probability that a speaker has spoken in a given second call segment, where the posterior probability of a random event or an uncertain assertion is its conditional probability after the relevant evidence or background has been given and taken into account. Because of the above assumptions, P(Y, I | Φ) is an intractable integral, so the variational Bayes algorithm is used to approximate P(Y | Φ) and P(I | Φ).
  • step S50 based on the target model, a variational Bayes algorithm is used to determine the second call segment of the same speaker, which specifically includes:
  • S512 Based on the target model and the variational Bayes algorithm, obtain the expression of the posterior probability of the speakers, Q(Y) = ∏_{s=1..S} N(y_s; μ_s, Σ_s), where s denotes a speaker, S is the total number of speakers, y_s is the speaker vector associated with speaker s, and Q(y_s) follows a Gaussian distribution with mean μ_s and covariance Σ_s.
  • S513 Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayesian algorithm.
  • the update procedure of the expectation-maximization (EM) algorithm is used in the calculation of the variational Bayes algorithm.
  • the EM algorithm includes an E-step and an M-step: the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers are updated in the variational E-step; in the M-step, each second call segment m is assigned to the speaker s with the largest q_ms.
  • Step S513 specifically includes the following.
  • A temperature parameter β can also be introduced, so that a deterministic-annealing variant of the variational Bayes algorithm updates the posterior probability of the segments and the posterior probability of the speakers.
  • the update sets q_ms = exp(β ln q̃_ms) / Σ_{s'} exp(β ln q̃_{ms'}), with ln q̃_ms = φ_m^T L μ_s - ½ tr(L(Σ_s + μ_s μ_s^T)) + ln π_s + const, where s' indexes the speakers in the normalization, β is the temperature parameter, T denotes matrix transposition, L is the inverse of the covariance L^-1, tr(·) is the matrix trace operation, and const collects the terms that do not depend on the speaker.
  • the update of the posterior probability of the speaker is expressed as Q(y_s) = N(y_s; μ_s, C_s^-1), where Λ is the inverse of the covariance Λ^-1, Σ_s = C_s^-1 is the covariance of the speaker posterior, and C_s is the inverse of that covariance.
  • S514 Determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  • the posterior probability that the speaker has spoken in a given second conversation segment can be obtained, thereby determining the second conversation segment of the same speaker.
  • before step S50, that is, before the variational Bayes algorithm is used to determine the second call segments of the same speaker based on the target model, the method further includes:
  • S521 Initialize the number of speakers in the posterior probability of the second call segments, and take every two different speakers in the posterior probability of the second call segments as a pair.
  • the number of speakers in the posterior probability of initializing the second call segment may specifically be initialized to 3 speakers.
  • S522 Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • cosine similarity and / or likelihood ratio score can be used as a criterion for measuring distance.
  • S523 Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking every two different speakers as a pair, calculating the distance between each pair of speakers, and obtaining the two speakers that are farthest apart; then take the two speakers that are farthest apart over all repetitions as the starting point of the variational Bayes calculation.
  • In this step, steps S521-S522 are repeated a preset number of times (for example, 10 times), and the two speakers that are farthest apart over all repetitions are used as the starting point of the variational Bayes calculation.
  • Steps S521-S523 optimize the initialization of the variational Bayes algorithm, which makes the result obtained when the variational Bayes algorithm iterates with the EM algorithm more accurate, and finally yields an accurate posterior probability that a speaker has spoken in a given second call segment, so that the second call speech can be better separated by speaker. A possible realisation is sketched below.
  • In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
  • the embodiments of the present application further provide device embodiments that implement the steps and methods in the above method embodiments.
  • FIG. 2 shows a functional block diagram of a call separation device corresponding to the call separation method in the embodiment.
  • the call separation device includes an original call segment acquisition module 10, a first call segment acquisition module 20, a second call segment acquisition module 30, a target model acquisition module 40 and a unified label module 50.
  • the implementation functions of the original call segment acquisition module 10, the first call segment acquisition module 20, the second call segment acquisition module 30, the target model acquisition module 40, and the unified label module 50 correspond to the steps of the call separation method in the embodiment one by one
  • this embodiment will not elaborate one by one.
  • the original call segment obtaining module 10 is used to obtain an original call segment, and the original call segment includes at least two call segments of different speakers.
  • the first call segment acquisition module 20 is used to remove the mute segment in the original call segment using mute detection to obtain the first call segment.
  • the second call segment acquisition module 30 is configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
  • the target model acquisition module 40 is used to acquire the i-vector features of each second call segment, and use the pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain each second The target model of the call segment.
  • the unified labeling module 50 is used to determine the second conversation segment of the same speaker based on the target model, and use the variational Bayes algorithm to mark the second conversation segment of the same speaker as a unified label.
  • the second call segment acquisition module 30 includes a transition point acquisition unit and a second call segment acquisition unit.
  • the transition point acquisition unit is used to detect and obtain the speaker's transition point in the first call segment based on the Bayesian information criterion and the likelihood ratio.
  • the second call segment acquisition unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  • the speaker-independent vector ε_m of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L^-1.
  • the unified labeling module 50 includes a second call segment posterior probability acquisition unit, a speaker posterior probability acquisition unit, an update unit, and a determination unit.
  • the speaker posterior probability acquisition unit is used to obtain, based on the target model and the variational Bayes algorithm, the expression of the posterior probability of the speakers, Q(Y) = ∏_{s=1..S} N(y_s; μ_s, Σ_s), where s denotes a speaker, S is the total number of speakers, y_s is the speaker vector associated with speaker s, and Q(y_s) follows a Gaussian distribution with mean μ_s and covariance Σ_s.
  • the updating unit is used to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm.
  • the determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  • the call separation device further includes an initialization unit, a distance unit, and a starting point determination unit.
  • the initialization unit is used for initializing the number of speakers in the posterior probability of the second conversation segment, and using each different speaker in the posterior probability of the second conversation segment as a pair.
  • the distance unit is used to calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
  • the starting point determining unit is used to repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking every two different speakers in the posterior probability of the second call segments as a pair, calculating the distance between each pair of speakers, and obtaining the two speakers that are farthest apart, and then to take the two speakers that are farthest apart over all repetitions as the starting point of the variational Bayes calculation.
  • the updating unit is configured to update q_ms in the posterior probability Q(I) of the second call segments to q_ms = exp(ln q̃_ms) / Σ_{s'} exp(ln q̃_{ms'}), with ln q̃_ms = φ_m^T L μ_s - ½ tr(L(Σ_s + μ_s μ_s^T)) + ln π_s + const, where s' indexes the speakers in the normalization and distinguishes them from the s in q_ms, T denotes matrix transposition, L is the inverse of the covariance L^-1, tr(·) is the matrix trace operation, and const collects the terms that do not depend on the speaker;
  • the posterior probability Q(Y) of the speakers is updated to Q(y_s) = N(y_s; μ_s, C_s^-1), where Λ is the inverse of the covariance Λ^-1, Σ_s = C_s^-1 is the covariance of the speaker posterior, and C_s is the inverse of that covariance.
  • In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
  • This embodiment provides a computer non-volatile readable storage medium.
  • the computer non-volatile readable storage medium stores computer readable instructions.
  • when the computer-readable instructions are executed by a processor, the call separation method in the embodiment is implemented. To avoid repetition, details are not repeated here.
  • the computer-readable instructions are executed by the processor, the functions of the modules / units in the call separation device in the embodiment are implemented. To avoid repetition, details are not described here one by one.
  • the computer device 60 of this embodiment includes: a processor 61, a memory 62, and computer-readable instructions 63 stored in the memory 62 and executable on the processor 61. When the computer-readable instructions 63 are executed by the processor 61,
  • the call separation method in the embodiment is implemented. To avoid repetition, details are not described here one by one.
  • the computer readable instructions are executed by the processor 61, the functions of each model / unit in the call separation device in the embodiment are implemented. To avoid repetition, they are not described here one by one.
  • the computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • Computer equipment may include, but is not limited to, a processor 61 and a memory 62.
  • FIG. 3 is only an example of the computer device 60 and does not constitute a limitation on the computer device 60, which may include more or fewer components than shown, or combine certain components, or have different components.
  • computer equipment may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 61 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or a memory of the computer device 60.
  • the memory 62 may also be an external storage device of the computer device 60, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 60, etc.
  • the memory 62 may also include both the internal storage unit of the computer device 60 and the external storage device.
  • the memory 62 is used to store computer readable instructions and other programs and data required by the computer device.
  • the memory 62 may also be used to temporarily store data that has been or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The present application discloses a call separation method and apparatus, a computer device and a storage medium, relating to the field of artificial intelligence. The call separation method comprises: acquiring original call segments; using mute detection to remove mute segments in the original call segments, to obtain a first call segment; segmenting the first call segment to obtain at least three second call segments, one speaker corresponding to one or more second call segments; acquiring i-vector features of each second call segment, and modeling each i-vector feature by using a pre-trained double-covariance probability linear discriminant analysis model, to obtain a target model of each second call segment; on the basis of the target models, using a variational Bayes algorithm to determine the second call segments of the same speaker, and marking the second call segments of the same speaker with a unified label. By using the call separation method, call segments corresponding to different speakers in a call can be precisely separated.

Description

Call separation method, apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201811347184.3, filed on November 13, 2018 and entitled "Call separation method, apparatus, computer device and storage medium".
[Technical Field]
This application relates to the field of artificial intelligence, and in particular to a call separation method, apparatus, computer device, and storage medium.
[Background Art]
At present, there is a lack of a reasonable design to guarantee the effect of call separation: without knowing the speaker information in advance, the call speech segments uttered by different speakers in the same call cannot be distinguished accurately, so the effect of call separation is still unsatisfactory.
[Summary of the Invention]
In view of this, the embodiments of the present application provide a call separation method, apparatus, computer device, and storage medium, to solve the current problem of inaccurate call separation.
In a first aspect, an embodiment of the present application provides a call separation method, including:
obtaining an original call segment, where the original call segment includes call segments of at least two different speakers;
removing the mute segments in the original call segment by mute detection to obtain a first call segment;
cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
obtaining the i-vector feature of each second call segment, and modeling each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
based on the target models, determining the second call segments of the same speaker using a variational Bayes algorithm, and marking the second call segments of the same speaker with a unified label.
In a second aspect, an embodiment of the present application provides a call separation apparatus, including:
an original call segment acquisition module, configured to obtain an original call segment, where the original call segment includes call segments of at least two different speakers;
a first call segment acquisition module, configured to remove the mute segments in the original call segment by mute detection to obtain a first call segment;
a second call segment acquisition module, configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
a target model acquisition module, configured to obtain the i-vector feature of each second call segment and model each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
a unified labeling module, configured to determine, based on the target models, the second call segments of the same speaker using a variational Bayes algorithm, and to mark the second call segments of the same speaker with a unified label.
In a third aspect, a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
obtaining an original call segment, where the original call segment includes call segments of at least two different speakers;
removing the mute segments in the original call segment by mute detection to obtain a first call segment;
cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
obtaining the i-vector feature of each second call segment, and modeling each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
based on the target models, determining the second call segments of the same speaker using a variational Bayes algorithm, and marking the second call segments of the same speaker with a unified label.
In a fourth aspect, an embodiment of the present application provides a computer non-volatile readable storage medium, including computer-readable instructions which, when executed, perform the following steps:
obtaining an original call segment, where the original call segment includes call segments of at least two different speakers;
removing the mute segments in the original call segment by mute detection to obtain a first call segment;
cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more of the second call segments;
obtaining the i-vector feature of each second call segment, and modeling each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment; and
based on the target models, determining the second call segments of the same speaker using a variational Bayes algorithm, and marking the second call segments of the same speaker with a unified label.
One of the above technical solutions has the following beneficial effects:
In the embodiments of the present application, mute detection is first performed on the original call speech, which removes the mute segments in which no one is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, which provides an important technical premise for subsequently determining the second call segments of the same speaker. Next, a pre-trained double covariance probability linear discriminant analysis model is used for modeling to obtain a target model of each second call segment, so that the characteristics of each second call segment can be represented more precisely. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm, which clusters the second call segments belonging to the same speaker with high accuracy and achieves a precise call separation effect.
[Brief Description of the Drawings]
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a call separation method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a call separation apparatus in an embodiment of the present application;
FIG. 3 is a schematic diagram of a computer device in an embodiment of the present application.
[Detailed Description]
For a better understanding of the technical solutions of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be clear that the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" used herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe preset ranges and the like, these preset ranges should not be limited by these terms. These terms are only used to distinguish the preset ranges from each other. For example, without departing from the scope of the embodiments of the present application, a first preset range may also be called a second preset range, and similarly, a second preset range may also be called a first preset range.
Depending on the context, the word "if" as used herein can be interpreted as "at the time of" or "when" or "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrases "if determined" or "if (the stated condition or event) is detected" can be interpreted as "when determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
FIG. 1 shows a flowchart of the call separation method in this embodiment. The call separation method can be applied to a terminal device that performs call separation to realize the call separation function, and can specifically be applied to a telephone call separation system installed on a computer device. The computer device is a device that can interact with a user, including but not limited to computers, smartphones, and tablets. The call separation method includes the following steps:
S10: Obtain an original call segment, where the original call segment includes call segments of at least two different speakers.
The original call segment may be a call segment obtained by a recording device that includes at least two different speakers. In an embodiment, it may specifically be an original call segment consisting of multiple speakers recorded by a recording device in a conference scene.
S20: Remove the mute segments in the original call segment by mute detection to obtain a first call segment.
The mute detection refers to detecting the silent (no one speaking) parts of the original call segment. In an embodiment, it can be implemented with voice activity detection (VAD, also called voice endpoint detection), for example based on frame amplitude, frame energy, the short-time zero-crossing rate, or a deep neural network. By removing the silent parts of the original call segment, the speech uttered by the speakers is retained, so that the interference of the silent parts can be excluded in the subsequent call separation, effectively improving the efficiency and accuracy of call separation.
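As an illustration of this step, the minimal sketch below implements a frame-energy variant of mute detection. The frame length and the -40 dB threshold are assumptions of this sketch, not values specified by the application, which equally allows frame amplitude, the zero-crossing rate, or a deep neural network.

```python
import numpy as np

def remove_silence(signal, sample_rate, frame_ms=25, threshold_db=-40.0):
    """Energy-based mute detection: keep only frames above a threshold relative to the peak.

    Splits the signal into non-overlapping frames, measures each frame's energy in dB
    relative to the loudest frame, and concatenates the frames judged to contain speech
    (i.e. the "first call segment").
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return np.array([]), np.array([], dtype=bool)
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len).astype(np.float64)
    energy = (frames ** 2).sum(axis=1) + 1e-12
    energy_db = 10.0 * np.log10(energy / energy.max())
    voiced = energy_db > threshold_db              # True where someone is speaking
    first_call_segment = frames[voiced].reshape(-1)
    return first_call_segment, voiced
```

For a 16 kHz recording, remove_silence(audio, 16000) would return the concatenated speech-only samples together with the per-frame decisions.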
S30: Cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
Understandably, the first call speech segment is continuous on the time axis, but the call speech of different speakers appears alternately on the time axis. Therefore, the first call segment can be cut into the call segments corresponding to the different speakers; these segments are the second call segments. The obtained second call segments include at least three segments (with only two segments, call separation is unnecessary), and one speaker may correspond to one or more second call segments. For example, if there are 10 second call segments corresponding to a total of 4 speakers A, B, C, and D, then A may include 5 second call segments, B 2, C 1, and D 2.
Further, in step S30, cutting the first call segment to obtain at least three second call segments specifically includes:
S31: Detect and obtain the speaker transition points in the first call segment based on the Bayesian information criterion and the likelihood ratio.
The Bayesian information criterion (BIC) estimates the partially unknown state with subjective probabilities under incomplete information, then revises the probability of occurrence with the Bayes formula, and finally makes the optimal decision using the expected value and the revised probability. The likelihood ratio (LR) is an indicator that reflects authenticity. In an embodiment, by combining the Bayesian information criterion with the likelihood ratio, the specific times at which the speaker changes in the first call segment can be determined, that is, the speaker transition points in the first call segment can be detected.
S32: Cut the first call segment at the speaker transition points to obtain at least three second call segments.
In an embodiment, cutting the first call segment at the obtained transition points achieves a preliminary call separation effect, and each obtained second call segment can be taken to correspond to one speaker.
In steps S31-S32, the first call segment is cut so that each second call segment obtained by cutting corresponds to one speaker, which provides an important technical premise for subsequently determining the second call segments of the same speaker.
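To make the change-point search concrete, the sketch below scores candidate boundaries with the classic ΔBIC test over windows of acoustic feature vectors (e.g. MFCCs). The window size, the penalty weight, and the omission of the likelihood-ratio refinement mentioned above are simplifications of this illustration, not details fixed by the application.

```python
import numpy as np

def delta_bic(window, t, penalty_weight=1.0):
    """ΔBIC for splitting `window` (N x d feature matrix) at frame index t.

    A positive value suggests that two Gaussians (speaker change at t) model the
    window better than a single Gaussian (no change).
    """
    n, d = window.shape
    left, right = window[:t], window[t:]

    def logdet_cov(x):
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)   # regularised full covariance
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(window)
            - 0.5 * len(left) * logdet_cov(left)
            - 0.5 * len(right) * logdet_cov(right)
            - penalty_weight * penalty)

def detect_change_points(features, window=300, step=50, min_seg=100):
    """Scan the feature sequence and keep boundaries whose ΔBIC is positive."""
    changes = []
    for start in range(0, len(features) - window, step):
        block = features[start:start + window]
        scores = [(t, delta_bic(block, t)) for t in range(min_seg, window - min_seg)]
        t_best, best = max(scores, key=lambda p: p[1])
        if best > 0:
            changes.append(start + t_best)
    return changes
```

Cutting the first call segment at the returned frame indices then yields the second call segments used in the following steps.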
S40: Obtain the i-vector feature of each second call segment, and model each i-vector feature with a pre-trained double covariance probability linear discriminant analysis model to obtain a target model of each second call segment.
The i-vector feature is a compact vector extracted from the Gaussian mixture model (GMM) mean supervector. In addition to the speaker's identity, it carries information about the vocal tract, the microphone, the speaking style, and the speech itself, and can therefore comprehensively reflect the voiceprint characteristics of the voice. In voiceprint recognition, the double covariance probability linear discriminant analysis model is used to extract speaker information from i-vectors, and can be used to compare and distinguish voiceprint features. The double covariance probability linear discriminant analysis model assumes that each i-vector is generated from two latent variables: a speaker vector y (different speakers have different vectors) and a residual vector ε (different segments have different vectors). Modeling each i-vector feature with the pre-trained double covariance probability linear discriminant analysis model represents the characteristics of the second call segments more precisely, so that a more accurate distinction can be achieved when determining the second call segments of the same speaker.
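In practice the i-vector is usually the posterior mean of the latent factor in a total-variability model built on a GMM universal background model (UBM). The sketch below computes that standard point estimate from Baum-Welch statistics; the pre-trained total-variability matrix T and the UBM parameters are assumed inputs of this illustration and are not described further by the application.

```python
import numpy as np

def extract_ivector(n_c, f_c, ubm_means, ubm_vars, T):
    """Posterior-mean i-vector w = (I + T' S^-1 N T)^-1 T' S^-1 F~ from Baum-Welch stats.

    n_c:       (C,)    zeroth-order statistics (soft frame counts per UBM component)
    f_c:       (C, D)  first-order statistics per component
    ubm_means: (C, D)  UBM component means
    ubm_vars:  (C, D)  UBM diagonal covariances
    T:         (C*D, R) total-variability matrix, R = i-vector dimension
    """
    C, D = ubm_means.shape
    R = T.shape[1]
    f_centered = (f_c - n_c[:, None] * ubm_means).reshape(C * D)  # centred first-order stats
    prec = (1.0 / ubm_vars).reshape(C * D)                        # diagonal UBM precision
    n_rep = np.repeat(n_c, D)                                     # frame counts per feature dim
    weighted_T = T * (n_rep * prec)[:, None]
    L = np.eye(R) + T.T @ weighted_T                              # posterior precision of w
    b = T.T @ (prec * f_centered)
    return np.linalg.solve(L, b)                                  # posterior mean = i-vector
```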
Before modeling, the following prerequisites hold. In one conversation, the total number of speakers is S. The i-vectors extracted from all second call segments are denoted Φ = {φ_1, ..., φ_M}. For each second call segment m = 1, ..., M, an indicator vector i_m of dimension S×1 is defined: if speaker s speaks in the second call segment m, the element i_ms of i_m is 1; if speaker s does not speak in the second call segment m, i_ms = 0. Let I = {i_1, ..., i_M} be the set of indicator vectors of the second call segments. The event that speaker s speaks in a segment is assigned a prior probability π_s.
For each speaker s, the sample y_s ∈ N(y; μ, Λ^-1), that is, the sample of each speaker s follows a normal distribution with mean μ and covariance Λ^-1; for each second call segment, the indicator i_m is a sample from the multinomial distribution Mult(Π), where Π = (π_1, ..., π_S).
With the above modeling prerequisites, the expression of the target model is φ_m = y_k + ε_m, where φ_m is the i-vector feature extracted from the m-th second call segment, y is the vector associated with the speaker of the second call segment (to distinguish it from the s in y_s above, k is taken to be the index for which i_mk = 1), i_m is the indicator vector of the second call segment, and ε_m, the speaker-independent vector of the m-th second call segment, follows a Gaussian distribution with mean 0 and covariance L^-1, i.e., ε_m ∈ N(ε; 0, L^-1). The two covariances in the double covariance probability linear discriminant analysis model thus come from y_k and ε_m, respectively. Understandably, the modeling process computes the representation of each second call segment in the double covariance probability linear discriminant analysis model. By establishing a target model for each second call segment, the target models can later be used to determine the second call segments of the same speaker.
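Under these prerequisites, the pre-trained model essentially consists of the prior mean μ, the between-speaker precision Λ, and the within-segment (residual) precision L. The sketch below holds those parameters and evaluates the per-segment log-likelihood ln N(φ_m; y, L^-1) that the later variational updates rely on. The simple moment-based training routine on labelled development i-vectors (several speakers, several segments per speaker) is an assumption of this illustration, since the application only states that the model is pre-trained.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TwoCovPLDA:
    mu: np.ndarray           # prior mean of the speaker vectors y_s
    lambda_prec: np.ndarray  # Λ: between-speaker precision (inverse of Λ^-1)
    l_prec: np.ndarray       # L: within-segment (residual) precision (inverse of L^-1)

    def log_likelihood(self, phi, y):
        """ln N(phi; y, L^-1) up to an additive constant shared by all speakers."""
        diff = phi - y
        return -0.5 * diff @ self.l_prec @ diff

def train_two_cov_plda(ivectors, labels):
    """Very small moment-based estimate of the two covariances from labelled i-vectors."""
    ivectors, labels = np.asarray(ivectors), np.asarray(labels)
    speaker_means = {s: ivectors[labels == s].mean(axis=0) for s in np.unique(labels)}
    mu = np.mean(list(speaker_means.values()), axis=0)
    between = np.cov(np.stack(list(speaker_means.values())), rowvar=False)
    within = np.cov(np.stack([iv - speaker_means[s] for iv, s in zip(ivectors, labels)]),
                    rowvar=False)
    reg = 1e-6 * np.eye(ivectors.shape[1])        # keep the covariances invertible
    return TwoCovPLDA(mu=mu,
                      lambda_prec=np.linalg.inv(between + reg),
                      l_prec=np.linalg.inv(within + reg))
```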
S50: Based on the target models, determine the second call segments of the same speaker using a variational Bayes algorithm, and mark the second call segments of the same speaker with a unified label.
The variational Bayes algorithm (Variational Bayes, VB for short) is an approximate posterior inference method that provides a locally optimal but deterministic solution.
In one embodiment, Y = {y_1, ..., y_S} is the set of speaker vectors. With the target model, the problem of determining the second call segments of the same speaker can be reduced to computing the posterior probability that a speaker has spoken in a given second call segment, where the posterior probability of a random event or an uncertain assertion is its conditional probability after the relevant evidence or background has been given and taken into account. Because of the above assumptions, P(Y, I | Φ) is an intractable integral; in this embodiment, approximate inference with the variational Bayes algorithm is used to approximate P(Y | Φ) and P(I | Φ). For brevity, P(I | Φ) is written as Q(I) and P(Y | Φ) as Q(Y), and the mean-field variational Bayes method assumes that the joint posterior can be approximated as Q(Y, I) = Q(Y)Q(I). Through approximate inference, the posterior probability that a speaker has spoken in a given second call segment can be determined, so the second call segments of the same speaker can be determined and marked with a unified label, distinguishing the second call segments by the speaker to which they belong.
Further, in step S50, determining the second call segments of the same speaker using the variational Bayes algorithm based on the target models specifically includes:
S511: Based on the target models and the variational Bayes algorithm, obtain the expression of the posterior probability of the second call segments, Q(I) = ∏_{m=1..M} ∏_{s=1..S} q_ms^{i_ms}, where m denotes a second call segment, M is the total number of second call segments, s denotes a speaker, S is the total number of speakers, q_ms is the posterior probability that s speaks in the second call segment m, and i_ms is the indicator of speaker s in the second call segment m: i_ms = 1 when speaker s speaks in the second call segment m, and i_ms = 0 when speaker s does not speak in the second call segment m.
S512: Based on the target models and the variational Bayes algorithm, obtain the expression of the posterior probability of the speakers, Q(Y) = ∏_{s=1..S} Q(y_s) with Q(y_s) = N(y_s; μ_s, Σ_s), where s denotes a speaker, S is the total number of speakers, y_s is the speaker vector associated with each speaker s, and Q(y_s) follows a Gaussian distribution with mean μ_s and covariance Σ_s.
S513: Update the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers based on the variational Bayes algorithm.
The update procedure of the expectation-maximization algorithm (EM algorithm) is used in the calculation of the variational Bayes algorithm. The EM algorithm includes an E-step and an M-step: the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers are updated in the variational E-step; in the M-step, each second call segment m is assigned to the speaker s with the largest q_ms.
进一步地,在步骤S513中,具体包括:Further, in step S513, it specifically includes:
将第二通话片段的后验概率Q(I)中的q ms更新为
Figure PCTCN2018123553-appb-000009
其中,
Figure PCTCN2018123553-appb-000010
Figure PCTCN2018123553-appb-000011
s′用于区分q ms中的s,表示更新前的s,
Figure PCTCN2018123553-appb-000012
中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;说话人的后验概率Q(Y)的更新表示为
Figure PCTCN2018123553-appb-000013
Figure PCTCN2018123553-appb-000014
Λ为协方差Λ -1的逆,
Figure PCTCN2018123553-appb-000015
是说话人后验概率的协方差,C s是协方差的逆。需要说明的是,以上公式中出现的参数在上文中均已解释,在此不一一再进行解释,只对首次出现的参数进行解释。
Update q ms in the posterior probability Q (I) of the second call segment to
Figure PCTCN2018123553-appb-000009
where,
Figure PCTCN2018123553-appb-000010
Figure PCTCN2018123553-appb-000011
s′ is used to distinguish it from the s in q ms and denotes s before the update,
Figure PCTCN2018123553-appb-000012
Where T denotes the matrix transpose operation, L is the inverse of the covariance L -1 , tr(.) denotes the matrix trace operation, and const denotes the terms independent of the speaker; the update of the speaker posterior probability Q(Y) is expressed as
Figure PCTCN2018123553-appb-000013
Figure PCTCN2018123553-appb-000014
Λ is the inverse of the covariance Λ -1 ,
Figure PCTCN2018123553-appb-000015
is the covariance of the speaker posterior probability, and C s is the inverse of that covariance. It should be noted that the parameters appearing in the above formulas have all been explained above and are not explained again here; only parameters appearing for the first time are explained.
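The q ms and Q(Y) update expressions themselves are rendered as images in the filing. As a rough reconstruction only, the sketch below implements the standard mean-field updates for the model φ m = y + ∈ with ∈ ~ N(0, L -1), under the additional assumptions that each speaker vector has a zero-mean Gaussian prior with precision Lam and that log_prior holds per-speaker prior log-weights; the function and parameter names are illustrative and are not the patent's notation.

```python
import numpy as np

def update_speaker_posteriors(phi, q, L, Lam):
    """Q(y_s): Gaussian with precision C_s = Lam + (sum_m q_ms) * L and mean
    mu_s = C_s^{-1} L sum_m q_ms phi_m (standard mean-field result for
    phi_m = y_s + eps, eps ~ N(0, L^{-1}), y_s ~ N(0, Lam^{-1}); assumed form)."""
    S = q.shape[1]
    D = phi.shape[1]
    mu = np.zeros((S, D))
    cov = np.zeros((S, D, D))
    for s in range(S):
        n_s = q[:, s].sum()
        C_s = Lam + n_s * L                     # posterior precision of y_s
        cov[s] = np.linalg.inv(C_s)             # posterior covariance of y_s
        first_moment = q[:, s] @ phi            # sum_m q_ms * phi_m
        mu[s] = cov[s] @ (L @ first_moment)     # posterior mean of y_s
    return mu, cov

def update_responsibilities(phi, mu, cov, L, log_prior):
    """q_ms proportional to exp(E_{Q(y_s)}[log N(phi_m; y_s, L^{-1})]) * prior_s;
    terms that are constant across speakers are dropped before normalisation."""
    M, S = phi.shape[0], mu.shape[0]
    log_rho = np.zeros((M, S))
    for s in range(S):
        quad = -0.5 * (mu[s] @ L @ mu[s] + np.trace(L @ cov[s]))
        log_rho[:, s] = phi @ (L @ mu[s]) + quad + log_prior[s]
    log_rho -= log_rho.max(axis=1, keepdims=True)   # log-sum-exp stabilisation
    q = np.exp(log_rho)
    return q / q.sum(axis=1, keepdims=True)
```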
进一步地,在更新第二通话片段的后验概率Q(I)和说话人的后验概率Q(Y)时,还可以引入温度参数β,采用变分贝叶斯算法的确定性退火变种对片段的后验概率和说话人的后验概率进行更新。具体地,更新过程为:q ms更新为
Figure PCTCN2018123553-appb-000016
s′用于区分q ms中的s,表示更新前的s,
Figure PCTCN2018123553-appb-000017
β表示温度参数,
Figure PCTCN2018123553-appb-000018
中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;说话人后验概率的更新表示为
Figure PCTCN2018123553-appb-000019
Figure PCTCN2018123553-appb-000020
Λ为协方差Λ -1的逆,
Figure PCTCN2018123553-appb-000021
是说话人后验概率的协方差,C s是协方差的逆。采用变分贝叶斯算法的确定性退火变种对片段的后验概率和说话人的后验概率进行更新可以有效避免说话人后验概率达到局部最优值。
Further, when updating the posterior probability Q(I) of the second call segments and the posterior probability Q(Y) of the speakers, a temperature parameter β may also be introduced, and the deterministic annealing variant of the variational Bayes algorithm may be used to update the segment posterior probability and the speaker posterior probability. Specifically, the update process is as follows: q ms is updated to
Figure PCTCN2018123553-appb-000016
s′ is used to distinguish it from the s in q ms and denotes s before the update,
Figure PCTCN2018123553-appb-000017
β represents the temperature parameter,
Figure PCTCN2018123553-appb-000018
Where T denotes the matrix transpose operation, L is the inverse of the covariance L -1 , tr(.) denotes the matrix trace operation, and const denotes the terms independent of the speaker; the update of the speaker posterior probability is expressed as
Figure PCTCN2018123553-appb-000019
Figure PCTCN2018123553-appb-000020
Λ is the inverse of the covariance Λ -1 ,
Figure PCTCN2018123553-appb-000021
is the covariance of the speaker posterior probability, and C s is the inverse of that covariance. Updating the segment posterior probability and the speaker posterior probability with the deterministic annealing variant of the variational Bayes algorithm effectively prevents the speaker posterior probability from getting stuck in a local optimum.
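The exact annealed update is again given by image-rendered formulas. One common realisation of a temperature parameter in such a variational e-step, assumed here purely for illustration, scales the unnormalised log-responsibilities by β, so that a small β flattens q ms and β is gradually raised towards 1:

```python
import numpy as np

def annealed_responsibilities(log_rho, beta):
    """Temperature-scaled e-step: q_ms proportional to exp(beta * log_rho_ms).

    log_rho: (M, S) unnormalised log-responsibilities from the ordinary update.
    beta:    temperature parameter; beta < 1 smooths q and helps avoid poor
             local optima, while beta = 1 recovers the ordinary variational update.
    """
    scaled = beta * log_rho
    scaled -= scaled.max(axis=1, keepdims=True)   # log-sum-exp stabilisation
    q = np.exp(scaled)
    return q / q.sum(axis=1, keepdims=True)

# A simple (assumed) annealing schedule: raise beta towards 1 over the iterations.
betas = np.linspace(0.3, 1.0, num=10)
```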
S514:根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的第二通话片段。S514: Determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
得到更新后的Q(I)和更新后的Q(Y)即可得到说话人在一个给定的第二通话片段中说过话的后验概率,从而确定相同的说话人的第二通话片段。By obtaining the updated Q (I) and the updated Q (Y), the posterior probability that the speaker has spoken in a given second conversation segment can be obtained, thereby determining the second conversation segment of the same speaker.
进一步地,在步骤S50之前,即在采用变分贝叶斯算法在目标模型中确定相同的说话人的第二通话片段之前,还包括:Further, before step S50, that is, before the variational Bayes algorithm is used to determine the second call segment of the same speaker in the target model, the method further includes:
S521:初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对。S521: Initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a pair.
在一实施例中,初始化第二通话片段的后验概率中说话人的个数具体可以是初始化为3个说话人。In one embodiment, initializing the number of speakers in the posterior probability of the second call segments may specifically mean initializing it to 3 speakers.
S522:计算每一对说话人之间的距离,得到距离最远的两个说话人。S522: Calculate the distance between each pair of speakers to obtain the two speakers that are farthest apart.
其中,在双协方差概率线性判别分析模型中,可以采用余弦相似度和/或似然比分数作为衡量距离的标准。Here, in the double covariance probabilistic linear discriminant analysis model, cosine similarity and/or a likelihood ratio score can be used as the criterion for measuring distance.
S523:重复预设次数的初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对说话人之间的距离,得到距离最远的两个说话人的步骤,得到在预设次数的步骤中距离最远的两个说话人,并将在预设次数的步骤中距离最远的两个说话人作为变分贝叶斯计算的起点。S523: Repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking each different speaker in that posterior probability as one of a pair, and calculating the distance between each pair of speakers to obtain the two speakers farthest apart; obtain the two speakers that are farthest apart over the preset number of repetitions, and use those two speakers as the starting point of the variational Bayes computation.
可以理解地,本步骤为重复预设次数(如10次)的步骤S521-S522,再将所有预设次数的步骤中距离最远的两个说话人作为变分贝叶斯计算的起点。It can be understood that this step repeats steps S521-S522 a preset number of times (for example, 10 times), and then uses the two speakers that are farthest apart over all repetitions as the starting point of the variational Bayes computation.
步骤S521-S523中是对变分贝叶斯算法的初始化进行的优化步骤,可以提高变分贝叶斯算法在采用最大期望算法进行迭代时得到的运算结果更加准确,并最终根据准确地得到说话人在一个给定的第二通话片段中说过话的后验概率,从而更好地对第二通话语音按说话人进行区分。Steps S521-S523 optimize the initialization of the variational Bayes algorithm; they make the results obtained when the variational Bayes algorithm iterates with the expectation-maximization algorithm more accurate, so that the posterior probability that a speaker has spoken in a given second call segment is finally obtained accurately and the second call speech is better distinguished by speaker.
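As an illustration of steps S521-S523, the sketch below repeats a random initialisation a preset number of times, scores every pair of provisional speakers with cosine distance (one of the distance criteria mentioned above), and keeps the farthest pair as the starting point. How the provisional speakers are drawn here, namely as i-vectors of randomly chosen second call segments, and the helper names are assumptions of the sketch.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; larger values mean the two vectors are farther apart."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def farthest_pair_init(phi, n_speakers=3, n_restarts=10, seed=0):
    """Repeat the initialisation n_restarts times and return the two provisional
    speakers that are farthest apart over all restarts (steps S521-S523)."""
    rng = np.random.default_rng(seed)
    best_pair, best_dist = None, -np.inf
    for _ in range(n_restarts):
        # S521: initialise n_speakers provisional speakers (assumed here to be the
        # i-vectors of randomly chosen second call segments).
        idx = rng.choice(phi.shape[0], size=n_speakers, replace=False)
        speakers = phi[idx]
        # S522: distance of every pair of speakers; remember the farthest pair.
        for a in range(n_speakers):
            for b in range(a + 1, n_speakers):
                d = cosine_distance(speakers[a], speakers[b])
                if d > best_dist:
                    best_dist, best_pair = d, (speakers[a].copy(), speakers[b].copy())
    # S523: the farthest pair over all restarts is the variational Bayes starting point.
    return best_pair
```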
本申请实施例的技术方案具有以下有益效果:The technical solutions of the embodiments of the present application have the following beneficial effects:
本申请实施例中,首先将原始通话语音进行静音检测,可以去除语音通话中无人发出声音的静音片段,有利于提高通话分离的效率和精确度。接着将第一通话片段进行切割,可以得到不同说话人的第二通话片段,为后续确定相同的说话人的第二通话片段提供重要的技术前提。然后采用预先训练好的双协方差概率线性判别分析模型进行建模,得到每个第二通话片段的目标模型,可以通过双协方差概率线性判别分析模型将第二通话片段的特征更精确地表示出来。最后通过变分贝叶斯算法确定相同的说话人的第二通话片段,采用变分贝叶斯算法可以将属于同一说话人的第二通话片段进行聚类,精确度高,能达到精确的通话分离效果。In the embodiments of the present application, silence detection is first performed on the original call speech, which removes the silent segments in which nobody is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, providing an important technical premise for subsequently determining the second call segments of the same speaker. A pre-trained double covariance probabilistic linear discriminant analysis model is then used for modeling to obtain the target model of each second call segment, so that the characteristics of the second call segments can be represented more accurately by this model. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm; the variational Bayes algorithm clusters the second call segments belonging to the same speaker with high accuracy, achieving an accurate call separation effect.
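Putting the stages of the method together, the flow described above can be wired as in the sketch below; the four stage implementations are passed in as callables because the filing specifies them only at the level of the steps (silence detection, cutting, i-vector extraction with double covariance PLDA modelling, variational Bayes clustering), so the parameter names and signatures here are assumptions.

```python
from typing import Callable, List

def separate_call(original_call,
                  remove_silence: Callable,        # silence detection stage
                  cut_at_change_points: Callable,  # speaker change-point cutting stage
                  extract_ivectors: Callable,      # i-vector extraction stage
                  vb_cluster: Callable) -> List[int]:
    """Return one speaker label per second call segment (same speaker, same label)."""
    first_call_segment = remove_silence(original_call)             # silent parts removed
    second_call_segments = cut_at_change_points(first_call_segment)
    ivectors = extract_ivectors(second_call_segments)              # one i-vector per segment
    labels = vb_cluster(ivectors)                                  # PLDA modelling + variational Bayes
    return labels
```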
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
基于实施例中所提供的通话分离方法,本申请实施例进一步给出实现上述方法实施例中各步骤及方法的装置实施例。Based on the call separation method provided in the embodiments, the embodiments of the present application further provide device embodiments that implement the steps and methods in the above method embodiments.
图2示出与实施例中通话分离方法一一对应的通话分离装置的原理框图。如图2所示,该通话分离装置包括原始通话片段获取模块10、第一通话片段获取模块20、第二通话片段获取模块30、目标模型获取模块40和统一标签模块50。其中,原始通话片段获取模块10、第一通话片段获取模块20、第二通话片段获取模块30、目标模型获取模块40和统一标签模块50的实现功能与实施例中通话分离方法对应的步骤一一对应,为避免赘述,本实施例不一一详述。FIG. 2 shows a functional block diagram of a call separation device corresponding to the call separation method in the embodiment. As shown in FIG. 2, the call separation device includes an original call segment acquisition module 10, a first call segment acquisition module 20, a second call segment acquisition module 30, a target model acquisition module 40 and a unified label module 50. Among them, the implementation functions of the original call segment acquisition module 10, the first call segment acquisition module 20, the second call segment acquisition module 30, the target model acquisition module 40, and the unified label module 50 correspond to the steps of the call separation method in the embodiment one by one Correspondingly, in order to avoid redundant description, this embodiment will not elaborate one by one.
原始通话片段获取模块10,用于获取原始通话片段,原始通话片段包括至少两个不同说话人的通话片段。The original call segment obtaining module 10 is used to obtain an original call segment, and the original call segment includes at least two call segments of different speakers.
第一通话片段获取模块20,用于采用静音检测去除原始通话片段中的静音片段,得到第一通话片段。The first call segment acquisition module 20 is used to remove the mute segment in the original call segment using mute detection to obtain the first call segment.
第二通话片段获取模块30,用于将第一通话片段进行切割,得到至少三个第二通话片段,其中,一个说话人对应一个或多个第二通话片段。The second call segment acquisition module 30 is configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments.
目标模型获取模块40,用于获取每个第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个i-vector特征进行建模,得到每个第二通话片段的目标模型。The target model acquisition module 40 is used to acquire the i-vector features of each second call segment, and use the pre-trained double covariance probability linear discriminant analysis model to model each i-vector feature to obtain each second The target model of the call segment.
统一标签模块50,用于基于目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将相同的说话人的第二通话片段标记成统一的标签。The unified labeling module 50 is used to determine the second conversation segment of the same speaker based on the target model, and use the variational Bayes algorithm to mark the second conversation segment of the same speaker as a unified label.
可选地,第一通话片段获取模块20包括转变点获取单元和第二通话片段获取单元。Optionally, the first call segment acquisition module 20 includes a transition point acquisition unit and a second call segment acquisition unit.
转变点获取单元,用于基于贝叶斯信息准则和似然比,在第一通话片段中检测并得到说话人的转变点。The transition point acquisition unit is used to detect and obtain the speaker's transition point in the first call segment based on the Bayesian information criterion and the likelihood ratio.
第二通话片段获取单元,用于根据说话人的转变点将第一通话片段进行切割,得到至少三个第二通话片段。The second call segment acquisition unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
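The transition point acquisition unit relies on the Bayesian information criterion. A standard ΔBIC test for a single candidate boundary inside a window of acoustic feature frames is sketched below; the penalty weight lam, the covariance regularisation, and the use of full-covariance Gaussians are assumptions of this sketch rather than details taken from the filing.

```python
import numpy as np

def delta_bic(window, boundary, lam=1.0):
    """ΔBIC for splitting `window` (N x d feature frames) at frame `boundary`.

    Compares one Gaussian over the whole window against two Gaussians, one per
    side of the boundary; a positive value suggests a speaker transition point.
    Both sides should contain enough frames for a stable covariance estimate.
    """
    n, d = window.shape
    x1, x2 = window[:boundary], window[boundary:]

    def logdet_cov(x):
        cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # regularised
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(window)
            - 0.5 * len(x1) * logdet_cov(x1)
            - 0.5 * len(x2) * logdet_cov(x2)
            - penalty)

# A boundary is typically kept where ΔBIC is positive and locally maximal, e.g.
# candidates = [b for b in range(10, len(window) - 10) if delta_bic(window, b) > 0].
```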
可选地,目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个第二通话片段提取的i-vector特征,y表示第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与第二通话片段的指示向量,
Figure PCTCN2018123553-appb-000022
表示第m个第二通话片段的说 话人无关向量∈服从均值为0,协方差为L -1的高斯分布,统一标签模块50包括第二通话片段后验概率获取单元、说话人后验概率获取单元、更新单元和确定单元。
Optionally, the expression of the target model is φ m = y k + ∈ m , where φ m denotes the i-vector feature extracted from the m-th second call segment, y denotes the speaker-associated vector of the second call segment, k is the index for which i mk = 1, and i m denotes the indicator vector of the second call segment,
Figure PCTCN2018123553-appb-000022
The speaker-independent vector ∈ of the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . The unified labeling module 50 includes a second call segment posterior probability acquisition unit, a speaker posterior probability acquisition unit, an update unit, and a determination unit.
第二通话片段后验概率获取单元,用于基于目标模型和变分贝叶斯算法获取第二通话片段的后验概率的表达式,
Figure PCTCN2018123553-appb-000023
其中,m表示第二通话片段,M表示第二通话片段的片段总数,s表示说话人,S表示说话人的总数,q ms是s在第二通话片段m中说话的后验概率,i ms为说话人s在第二通话片段m中的指示向量,当说话人s在第二通话片段m中说话时,i ms=1,当说话人s在第二通话片段m中没有说话时,i ms=0。
The second call segment posterior probability acquisition unit is used to obtain the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
Figure PCTCN2018123553-appb-000023
Where m denotes a second call segment, M the total number of second call segments, s a speaker, and S the total number of speakers; q ms is the posterior probability that speaker s speaks in the second call segment m, and i ms is the indicator vector of speaker s in the second call segment m: i ms = 1 when speaker s speaks in segment m, and i ms = 0 when speaker s does not speak in segment m.
说话人后验概率获取单元,用于基于目标模型和变分贝叶斯算法获取说话人的后验概率的表达式,
Figure PCTCN2018123553-appb-000024
其中,s表示说话人,S表示说话人的总数,y s表示每个说话人s的第二通话片段,Q(Y)服从均值是μ s,协方差为
Figure PCTCN2018123553-appb-000025
的高斯分布。
The speaker posterior probability acquisition unit is used to acquire the posterior probability expression of the speaker based on the target model and the variational Bayesian algorithm,
Figure PCTCN2018123553-appb-000024
Where s denotes a speaker, S the total number of speakers, and y s the second call segment of each speaker s; Q(Y) follows, with mean μ s and covariance
Figure PCTCN2018123553-appb-000025
a Gaussian distribution.
更新单元,用于基于变分贝叶斯算法对第二通话片段的后验概率Q(I)和说话人的后验概率Q(Y)进行更新。The updating unit is used to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm.
确定单元,用于根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的第二通话片段。The determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
可选地,通话分离装置还包括初始化单元、距离单元和起点确定单元。Optionally, the call separation device further includes an initialization unit, a distance unit, and a starting point determination unit.
初始化单元,用于初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对。The initialization unit is used for initializing the number of speakers in the posterior probability of the second conversation segment, and using each different speaker in the posterior probability of the second conversation segment as a pair.
距离单元,用于计算每一对说话人之间的距离,得到距离最远的两个说话人。The distance unit is used to calculate the distance between each pair of speakers to obtain the two speakers with the longest distance.
起点确定单元,用于重复预设次数的初始化第二通话片段的后验概率中说话人的个数,将第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对说话人之间的距离,得到距离最远的两个说话人的步骤,得到在预设次数的步骤中距离最远的两个说话人,并将在预设次数的步骤中距离最远的两个说话人作为变分贝叶斯计算的起点。The starting point determination unit is configured to repeat, a preset number of times, the steps of initializing the number of speakers in the posterior probability of the second call segments, taking each different speaker in that posterior probability as one of a pair, and calculating the distance between each pair of speakers to obtain the two speakers farthest apart; it obtains the two speakers that are farthest apart over the preset number of repetitions and uses those two speakers as the starting point of the variational Bayes computation.
可选地,更新单元包括:将第二通话片段的后验概率Q(I)中的q ms更新为
Figure PCTCN2018123553-appb-000026
Figure PCTCN2018123553-appb-000027
其中,
Figure PCTCN2018123553-appb-000028
s′用于区分q ms中的s,表示更新前的s,
Figure PCTCN2018123553-appb-000029
中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;说话人的后验概率Q(Y)的更新表示为
Figure PCTCN2018123553-appb-000030
Figure PCTCN2018123553-appb-000031
Λ为协方差Λ -1的逆,
Figure PCTCN2018123553-appb-000032
是说话人后验概率的协方差,C s是协方差的逆。
Optionally, the updating unit includes: updating q ms in the posterior probability Q (I) of the second call segment to
Figure PCTCN2018123553-appb-000026
Figure PCTCN2018123553-appb-000027
where,
Figure PCTCN2018123553-appb-000028
s′ is used to distinguish it from the s in q ms and denotes s before the update,
Figure PCTCN2018123553-appb-000029
Where T denotes the matrix transpose operation, L is the inverse of the covariance L -1 , tr(.) denotes the matrix trace operation, and const denotes the terms independent of the speaker; the update of the speaker posterior probability Q(Y) is expressed as
Figure PCTCN2018123553-appb-000030
Figure PCTCN2018123553-appb-000031
Λ is the inverse of the covariance Λ -1 ,
Figure PCTCN2018123553-appb-000032
is the covariance of the speaker posterior probability, and C s is the inverse of that covariance.
本申请实施例的技术方案具有以下有益效果:The technical solutions of the embodiments of the present application have the following beneficial effects:
本申请实施例中,首先将原始通话语音进行静音检测,可以去除语音通话中无人发出声音的静音片段,有利于提高通话分离的效率和精确度。接着将第一通话片段进行切割,可以得到不同说话人的第二通话片段,为后续确定相同的说话人的第二通话片段提供重要的技术前提。然后采用预先训练好的双协方差概率线性判别分析模型进行建模,得到每个第二通话片段的目标模型,可以通过双协方差概率线性判别分析模型将第二通话片段的特征更精确地表示出来。最后通过变分贝叶斯算法确定相同的说话人的第二通话片段,采用变分贝叶斯算法可以将属于同一说话人的第二通话片段进行聚类,精确度高,能达到精确的通话分离效果。In the embodiments of the present application, silence detection is first performed on the original call speech, which removes the silent segments in which nobody is speaking and helps improve the efficiency and accuracy of call separation. The first call segment is then cut to obtain second call segments of different speakers, providing an important technical premise for subsequently determining the second call segments of the same speaker. A pre-trained double covariance probabilistic linear discriminant analysis model is then used for modeling to obtain the target model of each second call segment, so that the characteristics of the second call segments can be represented more accurately by this model. Finally, the second call segments of the same speaker are determined by the variational Bayes algorithm; the variational Bayes algorithm clusters the second call segments belonging to the same speaker with high accuracy, achieving an accurate call separation effect.
本实施例提供一计算机非易失性可读存储介质,该计算机非易失性可读存储介质上存储有计算机可读指令,该计算机可读指令被处理器执行时实现实施例中通话分离方法,为避免重复,此处不一一赘述。或者,该计算机可读指令被处理器执行时实现实施例中通话分离装置中各模块/单元的功能,为避免重复,此处不一一赘述。This embodiment provides a computer non-volatile readable storage medium. The computer non-volatile readable storage medium stores computer readable instructions. When the computer readable instructions are executed by a processor, the call separation method in the embodiment is implemented. To avoid repetition, I will not repeat them here. Alternatively, when the computer-readable instructions are executed by the processor, the functions of the modules / units in the call separation device in the embodiment are implemented. To avoid repetition, details are not described here one by one.
图3是本申请一实施例提供的计算机设备的示意图。如图3所示,该实施例的计算机设备60包括:处理器61、存储器62以及存储在存储器62中并可在处理器61上运行的计算机可读指令63,该计算机可读指令63被处理器61执行时实现实施例中的通话分离方法,为避免重复,此处不一一赘述。或者,该计算机可读指令被处理器61执行时实现实施例中通话分离装置中各模型/单元的功能,为避免重复,此处不一一赘述。3 is a schematic diagram of a computer device provided by an embodiment of the present application. As shown in FIG. 3, the computer device 60 of this embodiment includes: a processor 61, a memory 62, and computer readable instructions 63 stored in the memory 62 and executable on the processor 61, and the computer readable instructions 63 are processed When the device 61 is executed, the call separation method in the embodiment is implemented. To avoid repetition, details are not described here one by one. Alternatively, when the computer readable instructions are executed by the processor 61, the functions of each model / unit in the call separation device in the embodiment are implemented. To avoid repetition, they are not described here one by one.
计算机设备60可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可包括,但不仅限于,处理器61、存储器62。本领域技术人员可以理解,图3仅仅是计算机设备60的示例,并不构成对计算机设备60的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如计算机设备还可以包括输入输出设备、网络接入设备、总线等。The computer device 60 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server. Computer equipment may include, but is not limited to, a processor 61 and a memory 62. Those skilled in the art may understand that FIG. 3 is only an example of the computer device 60, and does not constitute a limitation on the computer device 60, and may include more or less components than shown, or combine certain components, or different components. For example, computer equipment may also include input and output devices, network access devices, buses, and so on.
所称处理器61可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 61 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
存储器62可以是计算机设备60的内部存储单元,例如计算机设备60的硬盘或内存。存储器62也可以是计算机设备60的外部存储设备,例如计算机设备60上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存 卡(Flash Card)等。进一步地,存储器62还可以既包括计算机设备60的内部存储单元也包括外部存储设备。存储器62用于存储计算机可读指令以及计算机设备所需的其他程序和数据。存储器62还可以用于暂时地存储已经输出或者将要输出的数据。The memory 62 may be an internal storage unit of the computer device 60, such as a hard disk or a memory of the computer device 60. The memory 62 may also be an external storage device of the computer device 60, for example, a plug-in hard disk equipped on the computer device 60, a smart memory card (Smart) Card (SMC), a secure digital (SD) card, and a flash memory card (Flash Card) etc. Further, the memory 62 may also include both the internal storage unit of the computer device 60 and the external storage device. The memory 62 is used to store computer readable instructions and other programs and data required by the computer device. The memory 62 may also be used to temporarily store data that has been or will be output.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the above-mentioned division of each functional unit and module is used as an example for illustration. In practical applications, the above-mentioned functions may be allocated by different functional units, Module completion means that the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they can still apply the technical solutions of the foregoing embodiments. The recorded technical solutions are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of this application, and should be included in this application. Within the scope of protection.

Claims (20)

  1. 一种通话分离方法,其特征在于,所述方法包括:A call separation method, characterized in that the method includes:
    获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;Obtain an original call segment, where the original call segment includes at least two call segments of different speakers;
    采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
    将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;Cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;Acquiring i-vector features of each of the second call segments, and using a pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each of the second call segments Target model
    基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。Based on the target model, a variational Bayes algorithm is used to determine the second conversation segment of the same speaker, and the second conversation segment of the same speaker is marked as a unified label.
  2. 根据权利要求1所述的方法,其特征在于,所述将所述第一通话片段进行切割,得到至少三个第二通话片段,包括:The method according to claim 1, wherein the cutting the first call segment to obtain at least three second call segments includes:
    基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;Based on the Bayesian information criterion and likelihood ratio, detect and obtain the speaker's transition point in the first call segment;
    根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The first call segment is cut according to the speaker's transition point to obtain at least three second call segments.
  3. 根据权利要求1所述的方法,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100001
    Figure PCTCN2018123553-appb-100002
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方差为L -1的高斯分布,所述基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,包括:
    The method according to claim 1, wherein the expression φ m = y k + ∈ m of the target model, where φ m represents an i-vector feature extracted from the m-th second call segment, y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100001
    Figure PCTCN2018123553-appb-100002
    The speaker-independent vector ∈ representing the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . Based on the target model, the variational Bayesian algorithm is used to determine the same speech The second call segment of the person includes:
    基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100003
    Figure PCTCN2018123553-appb-100004
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    Obtaining the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100003
    Figure PCTCN2018123553-appb-100004
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100005
    Figure PCTCN2018123553-appb-100006
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100007
    的高斯分布;
    Obtain an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100005
    Figure PCTCN2018123553-appb-100006
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100007
    Gaussian distribution
    基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y) 进行更新;Updating the posterior probability Q (I) of the second conversation segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The second call segment of the same speaker is determined according to the updated Q (I) and the updated Q (Y).
  4. 根据权利要求3所述的方法,其特征在于,在所述采用变分贝叶斯算法在所述目标模型中确定相同的说话人的第二通话片段之前,还包括:The method according to claim 3, characterized in that, before the adopting the variational Bayes algorithm to determine the second conversation segment of the same speaker in the target model, further comprising:
    初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;Initializing the number of speakers in the posterior probability of the second call segment, and using each different speaker in the posterior probability of the second call segment as a pair;
    计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;Calculate the distance between each pair of the speakers to obtain the two farthest speakers;
    重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。Repeating a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, using each different speaker in the posterior probability of the second call segment as a pair and calculating each pair of The step of obtaining the two farthest speakers from the distance between the speakers, and the two farthest speakers from the step of the preset number of times The two farthest speakers in the step of times are used as the starting point of the variational Bayesian calculation.
  5. 根据权利要求3或4任一项所述的方法,其特征在于,所述采用变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新,包括:The method according to any one of claims 3 or 4, wherein the posterior probability Q (I) of the second conversation segment using the variational Bayes algorithm and the posterior of the speaker The probability Q (Y) is updated, including:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100008
    其中,
    Figure PCTCN2018123553-appb-100009
    Figure PCTCN2018123553-appb-100010
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100011
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100012
    Figure PCTCN2018123553-appb-100013
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100014
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100008
    among them,
    Figure PCTCN2018123553-appb-100009
    Figure PCTCN2018123553-appb-100010
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100011
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term to the speaker; the posterior probability Q (Y) of the speaker Is expressed as
    Figure PCTCN2018123553-appb-100012
    Figure PCTCN2018123553-appb-100013
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100014
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
  6. 一种通话分离装置,其特征在于,所述装置包括:A call separation device, characterized in that the device includes:
    原始通话片段获取模块,用于获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;An original call segment acquisition module, for acquiring an original call segment, the original call segment includes at least two call segments of different speakers;
    第一通话片段获取模块,用于采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;A first call segment acquisition module, used to remove the mute segment in the original call segment using mute detection to obtain the first call segment;
    第二通话片段获取模块,用于将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;A second call segment acquisition module, configured to cut the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    目标模型获取模块,用于获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;The target model acquisition module is used to acquire the i-vector features of each of the second call segments, and use the pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each A target model of the second call segment;
    统一标签模块,用于基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。The unified labeling module is used to determine the second call segment of the same speaker based on the target model, and to use the variational Bayes algorithm to mark the second call segment of the same speaker as a unified label .
  7. 根据权利要求6所述的装置,其特征在于,所述第一通话片段获取模块,包括:The apparatus according to claim 6, wherein the first call segment acquisition module includes:
    转变点获取单元,用于基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;A transition point obtaining unit, configured to detect and obtain a speaker's transition point in the first call segment based on Bayesian information criterion and likelihood ratio;
    第二通话片段获取单元,用于根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The second call segment acquiring unit is configured to cut the first call segment according to the speaker's transition point to obtain at least three second call segments.
  8. 根据权利要求6所述的装置,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100015
    Figure PCTCN2018123553-appb-100016
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方差为L -1的高斯分布,所述统一标签模块,包括:
    The apparatus according to claim 6, wherein the expression φ m = y k + ∈ m of the target model, where φ m represents an i-vector feature extracted from the m-th second call segment, y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100015
    Figure PCTCN2018123553-appb-100016
    The speaker-independent vector ∈ representing the m-th second call segment is subject to a Gaussian distribution with mean 0 and covariance L -1 . The unified labeling module includes:
    第二通话片段后验概率获取单元,用于基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100017
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    A second call segment posterior probability acquisition unit for acquiring an expression of the posterior probability of the second call segment based on the target model and the variational Bayes algorithm
    Figure PCTCN2018123553-appb-100017
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    说话人后验概率获取单元,用于基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100018
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100019
    的高斯分布;
    A speaker posterior probability acquisition unit for acquiring an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100018
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100019
    Gaussian distribution
    更新单元,用于基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新;An updating unit, configured to update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    确定单元,用于根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The determining unit is configured to determine the second call segment of the same speaker according to the updated Q (I) and the updated Q (Y).
  9. 根据权利要求8所述的装置,其特征在于,所述装置还包括:The device according to claim 8, wherein the device further comprises:
    初始化单元,用于初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;An initialization unit, configured to initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a pair;
    距离单元,用于计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;The distance unit is used to calculate the distance between each pair of the speakers to obtain the two speakers with the longest distance;
    起点确定单元,用于重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数, 将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。The starting point determining unit is used to repeat a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, and use each different speaker in the posterior probability of the second call segment as a And calculating the distance between each pair of speakers, obtaining the two farthest speakers, and obtaining the two farthest speakers in the preset number of steps, and The two farthest speakers in the step of the preset number of times are used as the starting point of the variational Bayesian calculation.
  10. 根据权利要求8或9任一项所述的装置,其特征在于,所述更新单元具体用于:The device according to any one of claims 8 or 9, wherein the update unit is specifically configured to:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100020
    其中,
    Figure PCTCN2018123553-appb-100021
    Figure PCTCN2018123553-appb-100022
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100023
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100024
    Figure PCTCN2018123553-appb-100025
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100026
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100020
    among them,
    Figure PCTCN2018123553-appb-100021
    Figure PCTCN2018123553-appb-100022
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100023
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term to the speaker; the posterior probability Q (Y) of the speaker Is expressed as
    Figure PCTCN2018123553-appb-100024
    Figure PCTCN2018123553-appb-100025
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100026
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
  11. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其特征在于,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that, when the processor executes the computer-readable instructions, it is implemented as follows step:
    获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;Obtain an original call segment, where the original call segment includes at least two call segments of different speakers;
    采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
    将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;Cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;Acquiring i-vector features of each of the second call segments, and using a pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each of the second call segments Target model
    基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。Based on the target model, a variational Bayes algorithm is used to determine the second conversation segment of the same speaker, and the second conversation segment of the same speaker is marked as a unified label.
  12. 根据权利要求11所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device according to claim 11, wherein the processor further implements the following steps when executing the computer-readable instructions:
    基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;Based on the Bayesian information criterion and likelihood ratio, detect and obtain the speaker's transition point in the first call segment;
    根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The first call segment is cut according to the speaker's transition point to obtain at least three second call segments.
  13. 根据权利要求11所述的计算机设备,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100027
    Figure PCTCN2018123553-appb-100028
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方 差为L -1的高斯分布,所述基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,所述处理器执行所述计算机可读指令时还实现如下步骤:
    The computer device according to claim 11, wherein the expression of the target model φ m = y k + ∈ m , wherein φ m represents the i-vector feature extracted from the m-th second call segment , y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100027
    Figure PCTCN2018123553-appb-100028
    The speaker-independent vector ∈ representing the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . Based on the target model, the variational Bayesian algorithm is used to determine the same speech For the second call segment of the person, the processor also implements the following steps when executing the computer-readable instructions:
    基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100029
    Figure PCTCN2018123553-appb-100030
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    Obtaining the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100029
    Figure PCTCN2018123553-appb-100030
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100031
    Figure PCTCN2018123553-appb-100032
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100033
    的高斯分布;
    Obtain an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100031
    Figure PCTCN2018123553-appb-100032
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100033
    Gaussian distribution
    基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新;Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The second call segment of the same speaker is determined according to the updated Q (I) and the updated Q (Y).
  14. 根据权利要求13所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device according to claim 13, wherein the processor further implements the following steps when executing the computer-readable instructions:
    初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;Initializing the number of speakers in the posterior probability of the second call segment, and using each different speaker in the posterior probability of the second call segment as a pair;
    计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;Calculate the distance between each pair of the speakers to obtain the two farthest speakers;
    重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。Repeating a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, using each different speaker in the posterior probability of the second call segment as a pair and calculating each pair of The step of obtaining the two farthest speakers from the distance between the speakers, and the two farthest speakers from the step of the preset number of times The two farthest speakers in the step of times are used as the starting point of the variational Bayesian calculation.
  15. 根据权利要求13或14任一项所述的计算机设备,其特征在于,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device according to any one of claims 13 or 14, wherein the processor further implements the following steps when executing the computer-readable instructions:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100034
    其中,
    Figure PCTCN2018123553-appb-100035
    Figure PCTCN2018123553-appb-100036
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100037
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100038
    Figure PCTCN2018123553-appb-100039
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100040
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100034
    among them,
    Figure PCTCN2018123553-appb-100035
    Figure PCTCN2018123553-appb-100036
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100037
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term for the speaker; Is expressed as
    Figure PCTCN2018123553-appb-100038
    Figure PCTCN2018123553-appb-100039
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100040
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
  16. 一种计算机非易失性可读存储介质,所述计算机非易失性可读存储介质存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现如下步骤:A computer nonvolatile readable storage medium, the computer nonvolatile readable storage medium storing computer readable instructions, characterized in that the computer readable instructions are executed by a processor to implement the following steps:
    获取原始通话片段,所述原始通话片段包括至少两个不同说话人的通话片段;Obtain an original call segment, where the original call segment includes at least two call segments of different speakers;
    采用静音检测去除所述原始通话片段中的静音片段,得到第一通话片段;Mute detection is used to remove the mute segment in the original call segment to obtain the first call segment;
    将所述第一通话片段进行切割,得到至少三个第二通话片段,其中,一个所述说话人对应一个或多个所述第二通话片段;Cutting the first call segment to obtain at least three second call segments, where one speaker corresponds to one or more second call segments;
    获取每个所述第二通话片段的i-vector特征,采用预先训练好的双协方差概率线性判别分析模型对每个所述i-vector特征进行建模,得到每个所述第二通话片段的目标模型;Acquiring i-vector features of each of the second call segments, and using a pre-trained double covariance probability linear discriminant analysis model to model each of the i-vector features to obtain each of the second call segments Target model
    基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,并将所述相同的说话人的所述第二通话片段标记成统一的标签。Based on the target model, a variational Bayes algorithm is used to determine the second conversation segment of the same speaker, and the second conversation segment of the same speaker is marked as a unified label.
  17. 根据权利要求16所述的计算机非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:The computer non-volatile storage medium according to claim 16, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps :
    基于贝叶斯信息准则和似然比,在所述第一通话片段中检测并得到说话人的转变点;Based on the Bayesian information criterion and likelihood ratio, detect and obtain the speaker's transition point in the first call segment;
    根据所述说话人的转变点将所述第一通话片段进行切割,得到至少三个所述第二通话片段。The first call segment is cut according to the speaker's transition point to obtain at least three second call segments.
  18. 根据权利要求16所述的计算机非易失性可读存储介质,其特征在于,所述目标模型的表达式φ m=y k+∈ m,其中,φ m表示第m个所述第二通话片段提取的i-vector特征,y表示所述第二通话片段的与说话人关联向量,k为使i mk=1的索引,i m表示与所述第二通话片段的指示向量,
    Figure PCTCN2018123553-appb-100041
    表示第m个所述第二通话片段的说话人无关向量∈服从均值为0,协方差为L -1的高斯分布,所述基于所述目标模型,采用变分贝叶斯算法确定相同的说话人的第二通话片段,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:
    The computer non-volatile storage medium according to claim 16, wherein the expression of the target model is φ m = y k + ∈ m , where φ m represents the m-th second call i-vector extraction feature segment, y represents the second segment of the call with the speaker associated vector, k is the index that the i mk = 1, i m denotes a vector indicative of the second call segment,
    Figure PCTCN2018123553-appb-100041
    The speaker-independent vector ∈ representing the m-th second call segment follows a Gaussian distribution with mean 0 and covariance L -1 . Based on the target model, the variational Bayesian algorithm is used to determine the same speech In the second call segment of the person, when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps:
    基于所述目标模型和所述变分贝叶斯算法获取第二通话片段的后验概率的表达式,
    Figure PCTCN2018123553-appb-100042
    Figure PCTCN2018123553-appb-100043
    其中,m表示所述第二通话片段,M表示所述第二通话片段的片段总数,s表示说话人,S表示所述说话人的总数,q ms是s在所述第二通话片段m中说话的后验概率,i ms为所述说话人s在所述第二通话片段m中的指示向量,当所述说话人s在所述第二通话片段m中说话时,i ms=1,当所述说话人s在所述第二通话片段m中没有说话时,i ms=0;
    Obtaining the expression of the posterior probability of the second call segment based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100042
    Figure PCTCN2018123553-appb-100043
    Where m is the second call segment, M is the total number of segments of the second call segment, s is the speaker, S is the total number of speakers, and q ms is s in the second call segment m The posterior probability of speaking, i ms is the indicator vector of the speaker s in the second conversation segment m, when the speaker s speaks in the second conversation segment m, i ms = 1, When the speaker s does not speak in the second call segment m, ims = 0;
    基于所述目标模型和所述变分贝叶斯算法获取说话人的后验概率的表达式,
    Figure PCTCN2018123553-appb-100044
    Figure PCTCN2018123553-appb-100045
    其中,s表示说话人,S表示所述说话人的总数,y s表示每个所述说话人 s的所述第二通话片段,Q(Y)服从均值是μ s,协方差为
    Figure PCTCN2018123553-appb-100046
    的高斯分布;
    Obtain an expression of the posterior probability of the speaker based on the target model and the variational Bayesian algorithm,
    Figure PCTCN2018123553-appb-100044
    Figure PCTCN2018123553-appb-100045
    Where s represents the speaker, S represents the total number of speakers, y s represents the second conversation segment of each of the speakers s, Q (Y) obeys the mean value is μ s , and the covariance is
    Figure PCTCN2018123553-appb-100046
    Gaussian distribution
    基于变分贝叶斯算法对所述第二通话片段的后验概率Q(I)和所述说话人的后验概率Q(Y)进行更新;Update the posterior probability Q (I) of the second call segment and the posterior probability Q (Y) of the speaker based on the variational Bayes algorithm;
    根据更新后的Q(I)和更新后的Q(Y)确定相同的说话人的所述第二通话片段。The second call segment of the same speaker is determined according to the updated Q (I) and the updated Q (Y).
  19. 根据权利要求18所述的计算机非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:The computer non-volatile storage medium according to claim 18, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps :
    初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对;Initializing the number of speakers in the posterior probability of the second call segment, and using each different speaker in the posterior probability of the second call segment as a pair;
    计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人;Calculate the distance between each pair of the speakers to obtain the two farthest speakers;
    重复预设次数的初始化所述第二通话片段的后验概率中说话人的个数,将所述第二通话片段的后验概率中每个不同的说话人作为一对和计算每一对所述说话人之间的距离,得到距离最远的两个所述说话人的步骤,得到在所述预设次数的步骤中距离最远的两个所述说话人,并将在所述预设次数的步骤中距离最远的两个所述说话人作为变分贝叶斯计算的起点。Repeating a preset number of times to initialize the number of speakers in the posterior probability of the second call segment, using each different speaker in the posterior probability of the second call segment as a pair and calculating each pair of The step of obtaining the two farthest speakers from the distance between the speakers, and the two farthest speakers from the step of the preset number of times The two farthest speakers in the step of times are used as the starting point of the variational Bayesian calculation.
  20. 根据权利要求18或19任一项所述的计算机非易失性可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器还实现如下步骤:The computer non-volatile storage medium according to any one of claims 18 or 19, wherein when the computer-readable instructions are executed by one or more processors, the one or more processes The device also implements the following steps:
    将所述第二通话片段的后验概率Q(I)中的q ms更新为
    Figure PCTCN2018123553-appb-100047
    其中,
    Figure PCTCN2018123553-appb-100048
    Figure PCTCN2018123553-appb-100049
    s′用于区分q ms中的s,表示更新前的s,
    Figure PCTCN2018123553-appb-100050
    中的T表示转置矩阵运算,L为协方差L -1的逆,tr(.)表示矩阵的迹运算,const表示与说话人的无关项;所述说话人的后验概率Q(Y)的更新表示为
    Figure PCTCN2018123553-appb-100051
    Figure PCTCN2018123553-appb-100052
    ∧为协方差∧ -1的逆,
    Figure PCTCN2018123553-appb-100053
    是说话人后验概率的协方差,C s是协方差的逆。
    Update q ms in the posterior probability Q (I) of the second call segment to
    Figure PCTCN2018123553-appb-100047
    among them,
    Figure PCTCN2018123553-appb-100048
    Figure PCTCN2018123553-appb-100049
    s ′ is used to distinguish s in q ms , which means s before update,
    Figure PCTCN2018123553-appb-100050
    Where T is the transposed matrix operation, L is the inverse of the covariance L -1 , tr (.) Is the trace operation of the matrix, and const is the irrelevant term to the speaker; the posterior probability Q (Y) of the speaker Is expressed as
    Figure PCTCN2018123553-appb-100051
    Figure PCTCN2018123553-appb-100052
    ∧ is the inverse of covariance ∧ -1 ,
    Figure PCTCN2018123553-appb-100053
    Is the covariance of the posterior probability of the speaker, and C s is the inverse of the covariance.
PCT/CN2018/123553 2018-11-13 2018-12-25 Call separation method and apparatus, computer device and storage medium WO2020098083A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811347184.3A CN109360572B (en) 2018-11-13 2018-11-13 Call separation method and device, computer equipment and storage medium
CN201811347184.3 2018-11-13

Publications (1)

Publication Number Publication Date
WO2020098083A1 true WO2020098083A1 (en) 2020-05-22

Family

ID=65344905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/123553 WO2020098083A1 (en) 2018-11-13 2018-12-25 Call separation method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN109360572B (en)
WO (1) WO2020098083A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390946A (en) * 2019-07-26 2019-10-29 龙马智芯(珠海横琴)科技有限公司 A kind of audio signal processing method, device, electronic equipment and storage medium
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium
CN113129893B (en) * 2019-12-30 2022-09-02 Oppo(重庆)智能科技有限公司 Voice recognition method, device, equipment and storage medium
CN112669855A (en) * 2020-12-17 2021-04-16 北京沃东天骏信息技术有限公司 Voice processing method and device
CN112735384A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Turning point detection method, device and equipment applied to speaker separation
CN113051426A (en) * 2021-03-18 2021-06-29 深圳市声扬科技有限公司 Audio information classification method and device, electronic equipment and storage medium
CN113707173B (en) * 2021-08-30 2023-12-29 平安科技(深圳)有限公司 Voice separation method, device, equipment and storage medium based on audio segmentation


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018005620A1 (en) * 2016-06-28 2018-01-04 Pindrop Security, Inc. System and method for cluster-based audio event detection
CN106971713A (en) * 2017-01-18 2017-07-21 清华大学 Speaker's labeling method and system based on density peaks cluster and variation Bayes
US20180254051A1 (en) * 2017-03-02 2018-09-06 International Business Machines Corporation Role modeling in call centers and work centers
CN107342077A (en) * 2017-05-27 2017-11-10 国家计算机网络与信息安全管理中心 A kind of speaker segmentation clustering method and system based on factorial analysis
CN107452403A (en) * 2017-09-12 2017-12-08 清华大学 A kind of speaker's labeling method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071438A (en) * 2020-09-29 2020-12-11 武汉东湖大数据交易中心股份有限公司 Intelligent pertussis screening method and system
CN112071438B (en) * 2020-09-29 2022-06-14 武汉东湖大数据交易中心股份有限公司 Intelligent pertussis screening method and system
CN115168643A (en) * 2022-09-07 2022-10-11 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109360572A (en) 2019-02-19
CN109360572B (en) 2022-03-11

Similar Documents

Publication Publication Date Title
WO2020098083A1 (en) Call separation method and apparatus, computer device and storage medium
US11996091B2 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US10468032B2 (en) Method and system of speaker recognition using context aware confidence modeling
US11335352B2 (en) Voice identity feature extractor and classifier training
US9589564B2 (en) Multiple speech locale-specific hotword classifiers for selection of a speech locale
US9202462B2 (en) Key phrase detection
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
US9626970B2 (en) Speaker identification using spatial information
US20150325240A1 (en) Method and system for speech input
US9589560B1 (en) Estimating false rejection rate in a detection system
KR20200012963A (en) Object recognition method, computer device and computer readable storage medium
Tong et al. A comparative study of robustness of deep learning approaches for VAD
WO2020147256A1 (en) Conference content distinguishing method and apparatus, and computer device and storage medium
WO2014029099A1 (en) I-vector based clustering training data in speech recognition
WO2020253051A1 (en) Lip language recognition method and apparatus
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
WO2014114049A1 (en) Voice recognition method and device
US11756572B2 (en) Self-supervised speech representations for fake audio detection
WO2019237518A1 (en) Model library establishment method, voice recognition method and apparatus, and device and medium
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
JP2019144467A (en) Mask estimation apparatus, model learning apparatus, sound source separation apparatus, mask estimation method, model learning method, sound source separation method, and program
Jiang et al. Mobile phone identification from speech recordings using weighted support vector machine
Dang et al. Factor Analysis Based Speaker Normalisation for Continuous Emotion Prediction.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18939999

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939999

Country of ref document: EP

Kind code of ref document: A1