CN110299150A - A kind of real-time voice speaker separation method and system - Google Patents

A kind of real-time voice speaker separation method and system

Info

Publication number
CN110299150A
CN110299150A
Authority
CN
China
Prior art keywords
speaker
voice
model
real-time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910549060.1A
Other languages
Chinese (zh)
Inventor
周晓天
黄希
崔莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201910549060.1A priority Critical patent/CN110299150A/en
Publication of CN110299150A publication Critical patent/CN110299150A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a real-time voice speaker separation method and system. The method comprises the following steps: obtaining a voice segment, classifying the speaker corresponding to the voice segment, and obtaining a matched universal background model; extracting features from the voice segment, and establishing a temporary speaker model based on the extracted features and the universal background model; comparing the established temporary speaker model with existing similar speaker models, judging whether the speaker is an existing speaker, and updating the speaker models based on the judgment result. The invention can satisfy real-time execution of the speaker separation task on an intelligent terminal, can extend the capability of the intelligent terminal to obtain speaker separation results more quickly, saves the delay caused by network transmission, and reduces the transmission burden placed on the network as the number of intelligent terminals grows.

Description

Real-time voice speaker separation method and system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a real-time voice speaker separation method and a real-time voice speaker separation system.
Background
The speaker separation (speaker segregation) task, also known as speaker labeling or speaker segmentation and clustering, labels a voice stream with speaker information. As shown in fig. 1, the system first segments the original speech and then marks the speech segments with speaker information. Unlike speaker recognition, the speaker separation task does not concern the absolute identity of a speaker, only relative differences, and the user need not have appeared before or registered with the system. The obtained speaker separation result can be used by downstream systems for model adaptation, service selection, assisted segmentation, assisted retrieval, and the like.
Existing speaker separation methods mainly target the scenario in which speaker information is labeled after a complete voice file has been obtained. Such a method first segments the whole voice file and then assigns the voice segments to speakers by top-down splitting or bottom-up clustering. Neither approach suits intelligent-terminal applications in which voice data grows continuously and results are required in real time. Taking the currently common bottom-up clustering method as an example, as shown in fig. 2, its key steps are as follows: first, a voice activity detection module identifies intervals containing voice activity using energy, zero-crossing rate, or a model-based method, yielding voice segments; then each voice segment is scanned with a sliding window divided into a left half and a right half, the two halves are modeled separately, and their similarity or difference is computed to decide whether the current position is a change point between speakers. After scanning, a series of unlabeled voice segments is obtained, each of which is expected to contain the speech of only one speaker. These segments are then clustered bottom-up so that segments belonging to the same speaker fall into one class, giving the speaker separation result. Clustering keeps merging small speaker voice segments that the algorithm considers to come from the same speaker. When the number of clusters equals the true number of speakers (two in the example), the result matches the ground truth; stopping earlier, with more clusters than true speakers, is called under-clustering, and merging beyond that point is called over-clustering, and neither yields the best result.
The above speaker separation methods are mainly aimed at the task performed after complete voice data is available. In terms of real-time performance, they cannot be used directly in an intelligent-terminal scenario that requires results to be returned in real time. Even if the above algorithm is executed immediately after each speech segment is obtained, the required speaker separation result cannot be delivered within a short time, because the amount of computation is too large, the computation takes too long, and much of it is repeated.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a real-time online voice speaker separation method and a real-time online voice speaker separation system which are oriented to an intelligent terminal scene, wherein the real-time read voice is segmented, and the segments obtained by segmentation are marked with speaker information to complete a speaker separation task.
In order to achieve the above object, the present invention provides a real-time separation method of speakers, comprising the steps of:
step 101: carrying out voice activity state detection on voice data to obtain voice fragments;
step 102: matching a general background model corresponding to the voice segment;
step 103: extracting the characteristics of the voice segments, and establishing a speaker temporary model based on the extracted characteristics and the general background model;
step 104: and comparing the established speaker temporary model with the existing similar speaker model to judge whether the speaker is the existing speaker.
Preferably, the method may further include updating the speaker model according to the determination result. The corresponding generic background model is matched in step 102, for example, to match a male speaker model or a female speaker model.
In the above technical solution, the step 104 includes: and judging whether the speaker is an existing speaker or not based on the log-likelihood of the voice segments on the pre-stored model and the similarity of the speaker temporary model and the pre-stored model.
In the above technical solution, the step of obtaining the voice fragment includes:
the method comprises the steps of carrying out real-time batch processing on original voice digital waveform data according to a preset amount, calculating corresponding acoustic characteristics of each preset amount of voice waveform data, and segmenting the voice data to form voice segments based on the acoustic characteristics.
In the above technical solution, the step of obtaining the voice segment includes determining whether the voice data includes voice activity through an energy threshold based on short-time energy characteristics of the voice data, where the energy threshold is dynamically updated based on a mean μ and a variance σ of the non-voice data; the step of obtaining speech segments also includes a process of making a secondary decision on high energy speech frames using a model-based approach.
In the above technical solution, the step of updating the speaker model includes: for the existing speaker, fusing the speaker models; for new speakers, the speaker models are stored.
In the above technical solution, the method further includes recording a life cycle of each speaker model information, comparing the life cycle with a predetermined threshold, and processing the speaker models exceeding the life cycle threshold.
In another aspect, the present invention provides a real-time speaker separation system, comprising: the voice activity detection module, the speaker clustering mark module and the speaker model management module; wherein,
the voice activity detection module is used for respectively detecting and acquiring voice segments according to a preset amount on the basis of voice waveform data and extracting features of the voice segments;
the speaker clustering and marking module is used for classifying speakers corresponding to the voice segments to obtain a general background model matched with the speakers; establishing a speaker temporary model by using the extracted features and the general background model; comparing the established speaker temporary model with the existing similar speaker model, and judging whether the speaker is an existing speaker or not based on model similarity;
and the speaker model management module is used for updating the speaker model based on the judgment result.
In the above technical solution, the real-time separation system for speakers further comprises a data reading module, configured to read original voice digital waveform data and store the original voice digital waveform data in an audio data buffer area in a predetermined amount.
In the above technical solution, the voice activity detection module is configured to perform real-time batch processing on the original voice digital waveform data according to a predetermined amount, calculate, for each predetermined amount of voice waveform data, an acoustic feature corresponding to each predetermined amount of voice waveform data, and segment the voice data to form a voice segment based on the acoustic feature.
It should be noted that the system and method of the present invention also rely on several models obtained by pre-training, including but not limited to: a speech model and a non-speech model for voice activity detection; a universal background model (a male speaker model and a female speaker model in the example); and a classifier that classifies using Bayesian-criterion features and model-distance features, an SVM classifier in the example.
The invention has the following advantages:
the real-time voice speaker separation method provided by the invention is designed aiming at the real-time scene of intelligent terminal equipment, and can meet the requirement of real-time speaker separation in the scene through the combination of voice stream segmentation, speaker modeling and clustering marking and speaker model management.
The real-time voice speaker separation method is suitable for a speaker separation framework under a scene of real-time voice reading, meets the requirements of real-time voice segment labeling and dynamic speaker model management, and meets the requirements of speaker separation tasks under the scene of intelligent terminal equipment by using a speaker model establishing method which has short calculation time and can better utilize information contained in a short voice segment.
The real-time voice speaker separation method of the invention utilizes timeliness and data quantity as parameters to manage the storage, deletion and sequencing of the speaker model, solves the speaker management problem that the number of speakers in the intelligent terminal scene is increased continuously, and adapts to the requirements of the intelligent terminal scene.
The real-time voice speaker separation method can meet the requirement that a speaker separation task is executed on intelligent terminal equipment in real time, and continuously marks speaker information for detected voice activity in the voice acquisition process. Under the background of popularization of voice interaction of intelligent equipment, the method can utilize the computing power of the intelligent terminal equipment to obtain the result of voice speaker marking more quickly. Moreover, compared with a method of firstly transmitting voice data to the server and then executing the speaker separation task, the method saves delay caused by network transmission and reduces transmission load caused by the increase of intelligent terminal equipment to the network.
Drawings
FIG. 1 is a schematic diagram of a speaker separation task.
FIG. 2 is a process diagram of a bottom-up clustering speaker separation method.
FIG. 3 is a flowchart illustrating a real-time speaker separation method according to the present invention.
FIG. 4 is a flowchart illustrating a short-term energy-based voice activity detection method.
FIG. 5 is a schematic diagram of the workflow of the speaker clustering mark module.
FIG. 6 is a schematic diagram of the workflow of the speaker model management module.
FIG. 7 is a schematic diagram of a real-time speaker separation system according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 3, the real-time voice data stream D301 is used as the starting point of the system, and finally the speaker separation result D307 data corresponding to the current voice stream is obtained. The real-time voice speaker separation method of the present invention can be realized by the system shown in fig. 7, and the system shown in fig. 7 comprises four modules, namely a data reading module M301, a voice activity detection module M302, a speaker clustering and labeling module M303 and a speaker model management module M304.
The invention relates to a real-time voice speaker separation method, which comprises the following steps:
the data reading module M301 reads the real-time voice data stream D301 of the original voice digital waveform data from the device sound card, on one hand, the data is put into an audio data buffer D303, and on the other hand, the data is stored in the local database D302 for use by a subsequent system. The size and real-time of the audio data buffer D303 are related to the feature calculation, and the size thereof can be adjusted. Data is continuously read from the sound card when the buffer is not full, and voice activity detection module M302 performs voice activity detection when the buffer is full.
The voice activity detection module M302 computes the acoustic features corresponding to the data in the audio data buffer D303 and stores them in the audio feature data buffer D304; these features serve as the basis for judging voice activity and are later reused by the speaker clustering and labeling module. The voice activity detection module M302 then infers the current voice state, obtaining one of four voice activity results: no voice, start of voice, in voice, or end of voice. Unless the result is the end-of-voice mark, voice data continue to be read after the result is obtained and the previous process is repeated. Once the end of a segment of voice is detected, the features currently in the buffer correspond to the speech of one speaker.
The speaker clustering and labeling module M303 performs speaker modeling from the features extracted from the speech, using an externally trained universal background model (UBM) D305 and a maximum a posteriori (MAP) estimation method, and compares the resulting model with the existing speaker models in the speaker model information store D306. If it is very similar to an existing speaker model, the corresponding speaker label is given as the speaker separation result D307; if the difference is very large, indicating a new speaker, a new model and label are created and used as the speaker separation result D307, and the new speaker model is stored in the speaker model information store D306.
After the speaker separation result D307 is obtained, the speaker model management module M304 updates the state of the current speaker models. A model with little associated voice data that has not reappeared is judged to have left the conversation and is deleted or set aside from the current pool of candidate speakers, which reduces the number of comparisons needed during speaker clustering and labeling while preserving separation accuracy within the scope of the conversation. For some important speakers, a flag bit can be set through feedback or configuration from downstream systems so that the speaker model persists on a longer time scale.
Generally, intelligent terminal deployments fall into two common scenarios: a fixed-user scenario and a public-user scenario. In the former the users are relatively fixed and appear frequently, so the total number of speakers in a day's conversations is small; in the latter each speaker appears only during one session for a period of time, but the overall number of speakers in a day is large. The configuration for the two types of scenario can be completed by changing the settings of the speaker model management module, as sketched below.
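By way of illustration only (the patent does not specify concrete parameter values), the difference between the two scenarios could come down to a handful of management-module settings such as the following hypothetical presets:

```python
# Hypothetical configuration presets for the speaker model management module;
# all names and values below are illustrative assumptions, not taken from the patent.
MANAGEMENT_PRESETS = {
    # Fixed-user scenario: few, recurring speakers -> keep models for a long time.
    "fixed_users": {
        "max_session_gap_s": 8 * 3600,   # a model survives a full working day
        "max_stored_models": 20,
        "allow_high_priority": True,     # registered household/office members
    },
    # Public-user scenario: many one-off speakers -> expire models quickly.
    "public_users": {
        "max_session_gap_s": 30 * 60,    # half an hour of inactivity ends a session
        "max_stored_models": 200,
        "allow_high_priority": False,
    },
}
```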
The real-time speaker separation method and system of the present invention will be described in detail with reference to fig. 3 to 7.
(1) The data reading module M301.
Original voice digital waveform data is read, and the read data is cached according to a preset amount.
For the real-time intelligent-terminal scenario, the data reading module completes the reading of the original voice. Its purpose is to adapt to the sound cards of different platforms, configure the sound card, and, once the system starts running, poll and extract the content stored in the sound card buffer so that it can be read into the speaker separation system. Taking a common configuration as an example, when the sound reading object is initialized the sampling rate is set to 16000 Hz, the channel is set to mono, the audio coding is set to 16-bit little-endian PCM, and so on; the data reading module then obtains digital sound waveform data from the system and stores it in a buffer. The buffer setting depends on several factors: because data arrives continuously and the voice activity detection module computes features frame by frame, the buffer should hold one frame length of data plus an integer multiple of the frame shift, so that a properly sized buffer allows complete feature computation while voice data is read quickly. The specific settings depend on the platform hardware on which the system is actually deployed and can be adapted by those skilled in the art.
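As a concrete illustration of the buffer sizing described above, the following Python sketch (hypothetical; the patent publishes no code) sizes the buffer as one frame length plus an integer multiple of the frame shift at 16 kHz, 16-bit mono PCM, and fills it from a placeholder read_from_soundcard function that stands in for the platform-specific sound card API; the 25 ms frame length and 10 ms frame shift are assumed values.

```python
import numpy as np

SAMPLE_RATE = 16000                     # 16 kHz, mono, 16-bit PCM as in the example configuration
FRAME_LEN = int(0.025 * SAMPLE_RATE)    # 25 ms analysis frame (assumed value)
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 10 ms frame shift (assumed value)
N_SHIFTS = 99                           # buffer holds 1 frame + 99 shifts, roughly 1 s of audio

# Buffer size chosen so that framing covers the buffer exactly:
# one frame length plus an integer multiple of the frame shift.
BUFFER_SAMPLES = FRAME_LEN + N_SHIFTS * FRAME_SHIFT

def read_from_soundcard(n_samples: int) -> np.ndarray:
    """Placeholder for the platform-specific sound card polling call.

    A real deployment would poll the device driver here (e.g. ALSA, Android
    AudioRecord, or a portable wrapper); this stub returns silence.
    """
    return np.zeros(n_samples, dtype=np.int16)

def fill_audio_buffer() -> np.ndarray:
    """Fill one audio data buffer (D303) with 16-bit PCM samples."""
    buf = np.empty(BUFFER_SAMPLES, dtype=np.int16)
    filled = 0
    while filled < BUFFER_SAMPLES:
        chunk = read_from_soundcard(min(1024, BUFFER_SAMPLES - filled))
        buf[filled:filled + len(chunk)] = chunk
        filled += len(chunk)
    return buf   # a full buffer triggers voice activity detection (module M302)
```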
(2) Voice activity detection module M302.
When detecting voice activity, the voice activity detection module M302 combines the feature-threshold method and the model-based discrimination method by exploiting the distribution characteristics of the acoustic features of noise. The resulting algorithm retains the robustness of the model-based method while reducing the computational cost of the large number of non-speech sections that occur in a real-time voice interaction scenario.
The voice activity detection module M302 determines the real-time voice activity state: it examines the data in the current buffer to obtain the current voice state, which is one of no voice activity, voice activity start, in voice activity, or voice activity end. Many voice activity detection methods are available. There are feature-threshold methods, such as deciding by comparing the short-time energy feature against a threshold, and there are model-based methods, such as deciding with a pre-trained neural network or with Viterbi decoding over a hidden Markov model of speech. Considering the real-time requirement, the complex and varied deployment environments of intelligent terminals, and the fact that the intelligent-terminal scenario is dominated by human-machine dialogue, this embodiment detects voice activity with a hybrid of the feature-threshold method and the model-based method. The feature-threshold method is fast and highly real-time; here the decision is based on the short-time energy feature and a threshold. The model-based method is robust and handles audible noise well; this embodiment uses a Gaussian mixture model for that decision.
As shown in fig. 4, the flow of the method for performing voice activity detection based on short-term energy according to this embodiment includes the following steps:
Step S401: perform system initialization. Initialization is done once when the system starts: the initial values of the voice activity flag variables, the initial threshold, and the preset values used for voice activity decisions, such as the minimum voice length and the minimum silence length, are set. The data reading module fills the audio data buffer and triggers the voice activity detection module to examine the voice data currently in the buffer.
Step S402: frame and window the original digital voice waveform data and extract short-time energy features and MFCC features from the framed data. The voice activity detection module first extracts features from the original voice data; since the subsequent speaker clustering and labeling also operates on specific features, all features are computed together in this part. Common features include short-time energy, zero-crossing rate, and MFCC acoustic features; this embodiment uses the short-time energy feature.
Short-time energy calculation formula:
E = Σ_n (x[n] · w[n])²
wherein x is the speech frame signal in the current buffer and w is a window function that can be chosen as required; the sum of squares of the windowed data gives the short-time energy corresponding to the speech frame. The MFCC acoustic features are acoustic features commonly used in speaker recognition, speaker separation, voice recognition, and related fields; they are Mel-frequency cepstral coefficients, modeled on human auditory perception, and serve as the acoustic representation of the speaker's voice.
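A minimal numpy sketch of the short-time energy computation defined above follows; the 400-sample (25 ms) frame, 160-sample (10 ms) shift, and Hamming window are illustrative assumptions, and MFCC extraction is only indicated in a comment because the patent does not prescribe a particular implementation.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int, frame_shift: int) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(x: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Short-time energy E = sum_n (x[n] * w[n])^2 for each frame."""
    frames = frame_signal(x.astype(np.float64), frame_len, frame_shift)
    w = np.hamming(frame_len)              # any window function may be chosen
    return np.sum((frames * w) ** 2, axis=1)

# MFCC features for the same buffer could be computed with any standard toolkit,
# for example librosa.feature.mfcc(y=x.astype(float), sr=16000, n_mfcc=13).
```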
The obtained short-time energy feature is compared with the energy threshold set by the system. If it exceeds the threshold, voice activity is suspected and is further confirmed with the model-based method (the model-based method used here means that the parameters of Gaussian mixture models are estimated on existing labeled training data, such as voice and non-voice data, through the EM (expectation-maximization) algorithm, and the decision is made with these Gaussian mixture models). If the frame is determined to be a voice frame, step S403 is performed to accumulate the voice count; otherwise, if it is determined to contain no voice activity, step S404 is performed to accumulate the silence count and the data is added to the non-voice data buffer.
After the voice count has been accumulated, if the existing silence count is smaller than the preset minimum silence length, the preceding silence is regarded as a stretch of low-energy speech and step S405 is performed to clear the silence count before proceeding to step S411; otherwise the flow proceeds directly to step S411. After the silence count has been accumulated, it is checked whether the current silence count is greater than or equal to the minimum silence length; if so, a silence section has been entered and step S406 is performed to clear the voice count before proceeding to step S411; otherwise the flow proceeds directly to step S411.
In step S411 the flag is checked to see whether the state is currently marked as being in voice activity. If so, it is checked whether the silence count is greater than the minimum silence length and the voice count is greater than the minimum voice length; if both hold, the end of the voice activity has been reached and step S407 is performed to mark the voice activity end point; the currently buffered voice segment is then determined to be a complete speaker voice segment, which triggers speaker clustering and labeling for that segment. If the state is in voice activity but the above condition is not satisfied, the state is still considered to be in voice activity, step S408 is performed to mark it as such, data reading continues, and the above steps are repeated. If the flag indicates no voice activity, then there has been no voice activity previously; if the voice count is now greater than the minimum voice activity length, voice activity is considered to have started, i.e. the state moves from silence into voice activity, the voice start is marked and the current state is marked as being in voice activity (step S409 marks the voice start point); otherwise step S410 is performed to mark no voice activity.
After the voice state of the frame has been marked, step S412 is entered: the threshold used for voice activity detection is dynamically updated according to the update frequency, the preset update method, and the data in the non-voice data buffer; voice data then continue to be read and the above steps are repeated.
With respect to step S402, in this embodiment the threshold is updated dynamically when the threshold decision is performed. Assuming that the short-time energy follows a normal distribution, the dynamic update computes the mean μ and variance σ of the data in the non-speech buffer (non-speech data) and takes μ + 2σ as the new threshold.
The invention assumes that the background noise over a given period follows a normal distribution and estimates the threshold from a cached buffer of the most recent noise using the noise variance σ and mean μ; the formula is T = μ + 2σ, and the threshold is used to detect non-speech. Once the threshold is exceeded, a secondary decision is made with the model-based method. If the frame is judged to be a speech frame it is marked as such in the algorithm; otherwise it is marked as a non-speech frame and the noise buffer is updated, which keeps the threshold estimate accurate.
Because the model-based method is used for the secondary decision, the overall robustness of the algorithm is preserved; at the same time, a large amount of non-speech can be rejected directly in the first pass by the feature-threshold method, which reduces how often the model-based method is invoked and lowers the average computational complexity.
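The following sketch shows one way the dynamic threshold T = μ + 2σ could be combined with the GMM-based secondary decision, assuming speech and non-speech Gaussian mixture models trained offline with scikit-learn; the toolkit choice, the noise ring-buffer length, and the initial threshold are assumptions, not details given in the patent.

```python
import numpy as np
from collections import deque
from sklearn.mixture import GaussianMixture

class HybridVadFrameDecision:
    """Energy threshold first, GMM speech/non-speech check second (a sketch)."""

    def __init__(self, speech_gmm: GaussianMixture, nonspeech_gmm: GaussianMixture,
                 noise_buffer_len: int = 200):
        self.speech_gmm = speech_gmm
        self.nonspeech_gmm = nonspeech_gmm
        self.noise_energy = deque(maxlen=noise_buffer_len)  # recent non-speech energies
        self.threshold = 1e-3                               # initial threshold (assumed)

    def _update_threshold(self) -> None:
        if len(self.noise_energy) >= 10:
            e = np.asarray(self.noise_energy)
            self.threshold = e.mean() + 2.0 * e.std()       # T = mu + 2*sigma

    def is_speech_frame(self, energy: float, mfcc: np.ndarray) -> bool:
        if energy <= self.threshold:
            # Low energy: judged non-speech directly, update the noise statistics.
            self.noise_energy.append(energy)
            self._update_threshold()
            return False
        # High energy: secondary decision with the speech / non-speech GMMs.
        feat = mfcc.reshape(1, -1)
        if self.speech_gmm.score(feat) >= self.nonspeech_gmm.score(feat):
            return True
        self.noise_energy.append(energy)                    # audible noise, not speech
        self._update_threshold()
        return False
```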
(3) Speaker cluster labeling module M303.
The speaker clustering and labeling module M303 completes the speaker modeling and decision process for a voice segment and is the core step that produces the speaker clustering and labeling result. The traditional method converts the clustering problem into a model selection problem based on the Bayesian information criterion. In the method of the invention, the Bayesian information criterion is instead used as a classification feature, and a classifier is trained on it together with model-distance features; this classifier performs the detection of new speakers. If a new speaker is speaking, new speaker information is created and output; if an old speaker is speaking, the original speaker information is used as the result. Compared with the method based solely on the Bayesian information criterion, the method of the invention increases the discriminability between samples and improves the accuracy of speaker separation.
When the voice activity detection module M302 obtains a voice segment, the speaker clustering and labeling module M303 is triggered to perform the speaker modeling and clustering-labeling process. As shown in fig. 5, the specific workflow of the speaker clustering and labeling module M303 includes:
Step S501: first, a universal background model trained in advance on external data is used to determine the speaker category. (The universal background model is equivalent to an average speaker model and is trained in advance; in the example it is divided into a male model and a female model, which can be understood as speaker-independent average models of male and female voices.) The model likelihood is computed for the features of the voice segment data in the current buffer, and the segment is assigned to the category whose background model gives the higher log-likelihood, i.e. the best-matching universal background model is selected. This embodiment uses a Gaussian mixture model as the modeling model: all speaker models are described by Gaussian mixture models with the same number of mixture components. The Gaussian mixture model can be expressed by the following formula:
p(x) = Σ_{k=1..K} π_k · N(x; μ_k, Σ_k)
wherein x is the currently observed feature vector; the Gaussian mixture model is formed from K multivariate normal distributions, the k-th of which has mean vector μ_k and covariance matrix Σ_k; the components are weighted and summed with the mixture coefficients π_k, which satisfy:
Σ_{k=1..K} π_k = 1, with π_k ≥ 0.
the general background model can be obtained by various methods, such as a method of directly estimating a Gaussian mixture model on training data which is labeled and classified by using an EM algorithm, and the quantity and the structure of parameters are the same among different general background models.
Step S502: using the background model corresponding to the classification result of S501, estimate the speaker model for the voice segment by the maximum a posteriori (MAP) estimation method, completing the modeling of the speaker and yielding a speaker-dependent model. MAP adaptation makes full use of the information in the original data that describes the acoustic characteristics of a speaker: for example, with the two categories of male and female speaker models, the universal background models describe speaker-independent acoustic subclass information, i.e. average pronunciation, for male and female voices respectively. Because the method captures information common to the speakers of the corresponding category, it alleviates the difficulty of accurately estimating a speaker model from a voice segment that is too short when modeling directly from the features, and it also avoids the problem that direct estimation of the model parameters may fail to converge.
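The patent does not spell out the adaptation equations, so the sketch below uses the classical mean-only relevance MAP adaptation that is standard in GMM-UBM systems; the relevance factor of 16 is an assumed value.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, feats: np.ndarray, relevance: float = 16.0) -> np.ndarray:
    """Return MAP-adapted component means for a temporary speaker model.

    Mean-only relevance MAP: mixture weights and covariances are kept from the
    UBM, which stays robust even for very short speech segments.
    """
    resp = ubm.predict_proba(feats)            # (n_frames, K) responsibilities
    n_k = resp.sum(axis=0) + 1e-10             # soft frame counts per component
    e_k = resp.T @ feats / n_k[:, None]        # per-component mean of the segment data
    alpha = n_k / (n_k + relevance)            # adaptation coefficients in [0, 1)
    return alpha[:, None] * e_k + (1.0 - alpha)[:, None] * ubm.means_

# The adapted means, together with the UBM weights and covariances, define the
# temporary speaker model that is compared with the stored speaker models.
```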
Step S503: after the temporary speaker model is obtained, the speaker models of the same category currently held in the speaker model information store are examined one by one to decide whether the speaker is new or already known and to assign the speaker information. The existing speaker models are examined in turn; if the temporary model is the first of its category, it is used directly as the speaker result and no comparison with old speaker models is needed. This embodiment makes the decision by computing the log-likelihood, the Bayesian information criterion (BIC), and the speaker model distance. The log-likelihood uses the information of all speech frames and resembles a microscopic similarity computation; because a probability is computed for every data frame, this indicator tends to favor establishing a new model, which yields a larger probability. The Bayesian information criterion turns the clustering problem into a model selection problem and can be used to judge whether two speaker models represent the same speaker; it can be understood as model selection with structural risk minimization. The speaker model distance examines, at a macroscopic level, how different two models are; if the difference is small, the two models are more likely to belong to the same category. There are many choices of model distance, including cross entropy, KL divergence (Kullback-Leibler divergence), cross correlation, and so on. The KL divergence is given by the formula:
D_KL(p ‖ q) = ∫ p(x) · log( p(x) / q(x) ) dx
where p and q are the probability density functions of the two distributions under examination.
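There is no closed form for the KL divergence between two Gaussian mixture models, so a Monte Carlo estimate is a common practical choice; the sketch below assumes both models are scikit-learn GaussianMixture objects, and the sample size is an illustrative parameter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def kl_divergence_mc(p: GaussianMixture, q: GaussianMixture, n_samples: int = 2000) -> float:
    """Monte Carlo estimate of D_KL(p || q) = E_p[log p(x) - log q(x)]."""
    x, _ = p.sample(n_samples)
    return float(np.mean(p.score_samples(x) - q.score_samples(x)))

def symmetric_kl(p: GaussianMixture, q: GaussianMixture) -> float:
    """Symmetrised divergence, often more convenient as a model distance."""
    return 0.5 * (kl_divergence_mc(p, q) + kl_divergence_mc(q, p))
```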
In this embodiment, after the log-likelihood and model distance of the most similar model have been obtained, it is first necessary to determine whether that model is the universal background model. If it is, this means that the MAP-adapted models each describe the characteristics of too many other speakers, and the data of the speaker currently under examination differs from the information contained in the stored models of the same class; in other words, the speaker of the current speech segment is not yet present among the models. In that case the distance difference is compared against the preset different-class confidence distance: if it is larger, step S504 is performed to establish a new speaker; if not, the speaker is still considered to be the most similar one among the existing speaker models, and step S505 is performed to treat it as the most similar old speaker. Alternatively, the similarity to the universal background model may be lower than the likelihood under some existing speaker model, in which case that speaker model is the better description of this segment of speech. It is then judged whether the distance between the two models exceeds the preset same-class confidence distance: if the distance is smaller than the threshold, the segment is considered to come from the same speaker and step S505 is performed; otherwise step S504 is performed and the newly established temporary speaker model is stored in the speaker model information store. Several strategies exist for this similarity decision: thresholds can be set index by index on a development set, or an SVM (support vector machine) or logistic classification model can be trained as a new/old-speaker classifier to decide whether the model corresponds to an old speaker or a new one.
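A sketch of the classifier-based variant of this decision is given below: an SVM trained on development data whose features might be, for instance, the log-likelihood of the segment under the closest stored model, the delta-BIC value, and the model distance. The exact feature set, labels, and kernel settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Each training row: [log-likelihood under closest stored model, delta-BIC, model distance].
# Label 1 means the development-set segment really came from that stored speaker
# (old speaker); label 0 means it came from a speaker not yet in the store (new speaker).
def train_new_old_classifier(dev_features: np.ndarray, dev_labels: np.ndarray) -> SVC:
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(dev_features, dev_labels)
    return clf

def is_existing_speaker(clf: SVC, loglik: float, delta_bic: float, distance: float) -> bool:
    """Decide old speaker (True) vs new speaker (False) for one segment."""
    return bool(clf.predict(np.array([[loglik, delta_bic, distance]]))[0] == 1)
```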
(4) The speaker model management module M304.
During speaker clustering and labeling, the new data and its model are compared in turn with every stored speaker model, so the log-likelihood and model distance with respect to each speaker model are already available. On the one hand, the number of speakers encountered by a device deployed in an intelligent-terminal scenario keeps growing; on the other hand, each user has a certain session period when using the device, i.e. usage exhibits temporal locality. Managing the stored speaker models therefore effectively meets the needs of the real-time speaker separation task: it keeps the number of models under control while ensuring that the speaker model comparisons are carried out properly, so that correct speaker separation results are obtained. The speaker model management module M304 has two main functions: fusing speaker models and managing the life cycle of the speaker models.
The speaker model fusion and speaker model lifetime management process is shown in fig. 6. When the speaker clustering mark module M303 completes speaker modeling on the current speech segment, the speaker model management module M304 is triggered to manage the current speech data and the speaker model, and the specific management steps include:
step S601: and obtaining the result of judging the speaker of the voice segment after the speaker clustering mark. When the determination result is an existing model, whether to update the old speaker model with the current speech segment is considered. A speaker's model always corresponds to several most representative speech segments and their features. If the distance between the current statement model and the original speaker model of the judgment result is smaller than a certain preset threshold value, the statement is considered to represent the speaker model, the statement is updated and stored in a speaker model information storage, the statement can be selected to be stored all the time, or K voice features with the highest occurrence probability can be selected to be stored, the rest of the K voice features are discarded, and then the speaker model is regenerated by using new reserved speaker voice feature data to complete updating of the speaker model. When the speaker model is updated, the original relationship between the speaker models in the speaker model information storage is changed. A check is made to determine if model fusion is required to make the separation more accurate.
Step S602: the relative distances between models are checked pairwise; if the degree of similarity exceeds the threshold setting, the two stored speaker models are considered to represent the same speaker and the models are fused. The original speaker labels are modified so that the two speakers are re-marked as the same, keeping future results consistent. As in step S601, a speaker model corresponds to a series of stored utterance features; all of them may be retained, or only the K voice data features with the highest occurrence probability under the fused model may be kept as the features of the fused speaker model.
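A sketch of the pairwise fusion check in step S602 follows; it reuses a model-distance function such as the symmetric KL estimate above, pools the retained representative feature frames of the two models, and re-estimates a single GMM. The distance threshold and covariance type are assumed parameters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def maybe_fuse(model_a: GaussianMixture, feats_a: np.ndarray,
               model_b: GaussianMixture, feats_b: np.ndarray,
               distance_fn, same_speaker_threshold: float = 5.0):
    """Fuse two stored speaker models if their distance falls below the threshold."""
    if distance_fn(model_a, model_b) >= same_speaker_threshold:
        return None                                   # keep the two models separate
    pooled = np.vstack([feats_a, feats_b])            # retained representative frames
    fused = GaussianMixture(n_components=model_a.n_components,
                            covariance_type="diag", random_state=0).fit(pooled)
    return fused, pooled   # caller re-labels both speakers with one identity
```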
Step S603: after the modeling and clustering of a speech segment is completed, the conversation duration of the system increases by the length of that speech. Considering that voice is input continuously in a real-time scenario and that conversations in the intelligent-terminal scenario are always local in time, the current task ends once the conversation has been over for a period of time. To limit the number of speaker models stored in the system, restrictions can be applied along two dimensions, service-time characteristics and speaker characteristics. On the one hand, the life cycle of a conversation is limited: if a model does not reappear within the maximum conversation period, the speaker is considered to have left the conversation and will reappear as a new speaker in a new conversation; step S604 is performed to suspend or delete the model from the current model store, and it is no longer among the models compared during speaker clustering and labeling. On the other hand, speaker priorities can be set: for applications serving specific users, certain high-priority speakers can be determined through registration, total accumulated voice length, or later service information, and their models are retained for a long time, unaffected by the life-cycle limit, so as to meet service requirements.
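The lifetime bookkeeping of steps S603/S604 could be kept in a small store such as the sketch below; the record fields, the 30-minute maximum session gap, and the priority flag are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field

@dataclass
class SpeakerRecord:
    model: object                       # the speaker GMM
    last_seen: float = field(default_factory=time.time)
    total_speech_s: float = 0.0         # accumulated speech length in seconds
    high_priority: bool = False         # set via registration or downstream feedback

class SpeakerStore:
    def __init__(self, max_session_gap_s: float = 1800.0):   # 30 minutes, assumed
        self.records: dict = {}
        self.max_session_gap_s = max_session_gap_s

    def touch(self, speaker_id: str, speech_len_s: float) -> None:
        """Refresh a speaker's timestamp after it is matched to a new segment."""
        rec = self.records[speaker_id]
        rec.last_seen = time.time()
        rec.total_speech_s += speech_len_s

    def prune(self) -> None:
        """Suspend or remove models whose session lifetime has expired (step S604)."""
        now = time.time()
        for sid in list(self.records):
            rec = self.records[sid]
            if not rec.high_priority and now - rec.last_seen > self.max_session_gap_s:
                del self.records[sid]   # or move the model to long-term storage instead
```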
And after the steps are completed, the voice is continuously read, and the voice activity occurring next time is processed.
The real-time voice speaker separation method of the invention is designed aiming at the real-time scene of intelligent terminal equipment, and can meet the requirement of the real-time speaker separation scene by recombining the voice segmentation and speaker modeling methods and adding the management of speaker model modeling.
The invention provides a speaker separating frame suitable for a scene of real-time voice reading, which meets the requirements of real-time voice segment labeling and dynamic speaker model management, and uses a speaker model establishing method which has short calculation time and can better utilize information contained in a short voice segment so as to meet the requirements of an intelligent terminal device scene.
The real-time voice speaker separation method of the invention utilizes timeliness and data volume to manage the storage, deletion and sequencing of the speaker model, solves the speaker management problem that the number of speakers in the intelligent terminal scene is continuously increased, and adapts to the requirements of the intelligent terminal scene.
The real-time voice speaker separation method can meet the requirement that a speaker separation task is executed on intelligent terminal equipment in real time, and speaker marking is continuously completed for detected voice activity in the voice acquisition process. Under the background of popularization of voice interaction of intelligent equipment, the method can expand the capability of the intelligent terminal equipment and obtain the result of speaker separation more quickly. Moreover, compared with a method of firstly transmitting voice data to the server and then executing the speaker separation task, the method saves delay caused by network transmission and reduces transmission burden on the network along with increase of intelligent terminal devices.
While the principles of the invention have been described in detail in connection with the preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing embodiments are merely illustrative of exemplary implementations of the invention and are not limiting of the scope of the invention. The details of the embodiments are not to be interpreted as limiting the scope of the invention, and any obvious changes, such as equivalent alterations, simple substitutions and the like, based on the technical solution of the invention, can be interpreted without departing from the spirit and scope of the invention.

Claims (9)

1. A real-time voice speaker separation method is characterized by comprising the following steps:
step 101: carrying out voice activity state detection on voice data to obtain voice fragments;
step 102: matching a general background model corresponding to the voice segment;
step 103: extracting the characteristics of the voice segments, and establishing a speaker temporary model based on the extracted characteristics and the general background model;
step 104: and comparing the established speaker temporary model with the existing similar speaker model to judge whether the speaker is the existing speaker.
2. The method for separating real-time speakers as claimed in claim 1, wherein said step 104 comprises: and judging whether the speaker is an existing speaker or not based on the log-likelihood of the voice segments on the pre-stored model and the similarity of the speaker temporary model and the pre-stored model.
3. The real-time speaker segregation method of claim 1, wherein the step of obtaining the speech segments comprises:
the method comprises the steps of carrying out real-time batch processing on original voice digital waveform data according to a preset amount, calculating corresponding acoustic characteristics of each preset amount of voice waveform data, and detecting voice activity states by combining an energy threshold-based method and a model-based method to divide the voice data into voice segments.
4. The real-time speaker separation method of claim 1,
the step of obtaining the voice segment includes determining whether the voice data contains voice activity based on short-time energy characteristics of the voice data through an energy threshold, wherein the energy threshold is dynamically updated based on a mean μ and a variance σ of the non-voice data.
5. The method for separating a speaker from a real-time speech as recited in claim 1, further comprising the step of updating the speaker model by: for the existing speaker, fusing the speaker models; for new speakers, the speaker models are stored.
6. The real-time speaker separation method of claim 5,
the method also includes recording the life cycle of each speaker model information, comparing it with a predetermined threshold, and processing speaker models that exceed the life cycle threshold.
7. A real-time voice speaker separation system, comprising: a voice activity detection module (M302), a speaker clustering mark module (M303) and a speaker model management module (M304); wherein,
the voice activity detection module (M302) is used for respectively detecting and acquiring voice segments according to a preset amount on the basis of voice waveform data and extracting features of the voice segments;
the speaker clustering and marking module (M303) is used for classifying speakers corresponding to the voice segments to obtain a general background model matched with the speakers; establishing a speaker temporary model by using the extracted features and the general background model; comparing the established speaker temporary model with the existing similar speaker model, and judging whether the speaker is an existing speaker;
the speaker model management module (M304) is used for updating the speaker model based on the judgment result.
8. The real-time speaker separation system of claim 7,
the device also comprises a data reading module (M301) which is used for reading the original voice digital waveform data and storing the original voice digital waveform data into the audio data buffer area by a preset amount.
9. The real-time speaker separation system of claim 7,
the voice activity detection module (M302) is used for carrying out real-time batch processing on the original voice digital waveform data according to a preset amount, calculating the corresponding acoustic characteristic of each preset amount of voice waveform data, and segmenting the voice data to form voice segments based on the acoustic characteristics.
CN201910549060.1A 2019-06-24 2019-06-24 A kind of real-time voice speaker separation method and system Pending CN110299150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910549060.1A CN110299150A (en) 2019-06-24 2019-06-24 A kind of real-time voice speaker separation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910549060.1A CN110299150A (en) 2019-06-24 2019-06-24 A kind of real-time voice speaker separation method and system

Publications (1)

Publication Number Publication Date
CN110299150A true CN110299150A (en) 2019-10-01

Family

ID=68028677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910549060.1A Pending CN110299150A (en) 2019-06-24 2019-06-24 A kind of real-time voice speaker separation method and system

Country Status (1)

Country Link
CN (1) CN110299150A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706688A (en) * 2019-11-11 2020-01-17 广州国音智能科技有限公司 Method, system, terminal and readable storage medium for constructing voice recognition model
CN110910891A (en) * 2019-11-15 2020-03-24 复旦大学 Speaker segmentation labeling method and device based on long-time memory neural network
CN111402898A (en) * 2020-03-17 2020-07-10 北京远鉴信息技术有限公司 Audio signal processing method, device, equipment and storage medium
CN111524527A (en) * 2020-04-30 2020-08-11 合肥讯飞数码科技有限公司 Speaker separation method, device, electronic equipment and storage medium
CN111613249A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Voice analysis method and equipment
CN114780786A (en) * 2022-04-14 2022-07-22 新疆大学 Voice keyword retrieval method based on bottleneck characteristics and residual error network
WO2022166219A1 (en) * 2021-02-04 2022-08-11 深圳壹秘科技有限公司 Voice diarization method and voice recording apparatus thereof

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002067245A1 (en) * 2001-02-16 2002-08-29 Imagination Technologies Limited Speaker verification
CN1447278A (en) * 2002-11-15 2003-10-08 郑方 Method for recognizing voice print
CN102831890A (en) * 2011-06-15 2012-12-19 镇江佳得信息技术有限公司 Method for recognizing text-independent voice prints
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN103247293A (en) * 2013-05-14 2013-08-14 中国科学院自动化研究所 Coding method and decoding method for voice data
US20150112684A1 (en) * 2013-10-17 2015-04-23 Sri International Content-Aware Speaker Recognition
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
朱磊 et al.: "A Fast Speaker Search Algorithm", Journal of Chinese Information Processing *
熊华乔: "Research on Speaker Recognition Methods Based on Model Clustering", China Masters' Theses Full-text Database, Information Science and Technology *
陶佰睿 et al.: "Application of Codebook-Clustering Vector Quantization to Speaker Recognition", Journal of Henan University of Science and Technology (Natural Science) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706688A (en) * 2019-11-11 2020-01-17 广州国音智能科技有限公司 Method, system, terminal and readable storage medium for constructing voice recognition model
CN110910891A (en) * 2019-11-15 2020-03-24 复旦大学 Speaker segmentation labeling method and device based on long-time memory neural network
CN110910891B (en) * 2019-11-15 2022-02-22 复旦大学 Speaker segmentation labeling method based on long-time and short-time memory deep neural network
CN111402898A (en) * 2020-03-17 2020-07-10 北京远鉴信息技术有限公司 Audio signal processing method, device, equipment and storage medium
CN111402898B (en) * 2020-03-17 2023-07-25 北京远鉴信息技术有限公司 Audio signal processing method, device, equipment and storage medium
CN111524527A (en) * 2020-04-30 2020-08-11 合肥讯飞数码科技有限公司 Speaker separation method, device, electronic equipment and storage medium
CN111524527B (en) * 2020-04-30 2023-08-22 合肥讯飞数码科技有限公司 Speaker separation method, speaker separation device, electronic device and storage medium
CN111613249A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Voice analysis method and equipment
WO2022166219A1 (en) * 2021-02-04 2022-08-11 深圳壹秘科技有限公司 Voice diarization method and voice recording apparatus thereof
CN114780786A (en) * 2022-04-14 2022-07-22 新疆大学 Voice keyword retrieval method based on bottleneck characteristics and residual error network
CN114780786B (en) * 2022-04-14 2024-05-14 新疆大学 Voice keyword retrieval method based on bottleneck characteristics and residual error network

Similar Documents

Publication Publication Date Title
CN110299150A (en) A kind of real-time voice speaker separation method and system
US11900947B2 (en) Method and system for automatically diarising a sound recording
Zhou et al. Unsupervised audio stream segmentation and clustering via the Bayesian information criterion.
CN105632501B (en) A kind of automatic accent classification method and device based on depth learning technology
Kamppari et al. Word and phone level acoustic confidence scoring
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN109360572B (en) Call separation method and device, computer equipment and storage medium
CN110853654B (en) Model generation method, voiceprint recognition method and corresponding device
JP4132589B2 (en) Method and apparatus for tracking speakers in an audio stream
CN106847259B (en) Method for screening and optimizing audio keyword template
CN107393527A (en) The determination methods of speaker's number
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN104778230B (en) A kind of training of video data segmentation model, video data cutting method and device
CN110491375B (en) Target language detection method and device
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN111508505A (en) Speaker identification method, device, equipment and storage medium
US20210134300A1 (en) Speech processing device, speech processing method and speech processing program
CN110164417A (en) A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus
CN112530407A (en) Language identification method and system
CN110875044A (en) Speaker identification method based on word correlation score calculation
CN112992175B (en) Voice distinguishing method and voice recording device thereof
Xue et al. Computationally efficient audio segmentation through a multi-stage BIC approach
Castan et al. Segmentation-by-classification system based on factor analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20191001