WO2017076211A1 - Voice-based role separation method and device - Google Patents

Voice-based role separation method and device Download PDF

Info

Publication number
WO2017076211A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
role
voice
character
sequence
Prior art date
Application number
PCT/CN2016/103490
Other languages
French (fr)
Chinese (zh)
Inventor
Li Xiaohui
Li Hongyan
Original Assignee
Alibaba Group Holding Limited
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2017076211A1 publication Critical patent/WO2017076211A1/en

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L17/00 Speaker identification or verification
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of speech recognition, and in particular to a speech-based role separation method.
  • the application also relates to a speech-based role separation device.
  • Speech is the most natural way of communication for human beings.
  • Speech recognition technology is a technology that allows a machine to transform a speech signal into a corresponding text or command through a process of recognition and understanding.
  • Speech recognition is an interdisciplinary subject, including: signal processing, pattern recognition, probability theory and information theory, vocal mechanism and auditory mechanism, artificial intelligence, and so on.
  • GMM Gaussian Mixture Model
  • HMM Hidden Markov Model
  • The embodiments of the present application provide a voice-based role separation method and apparatus, to solve the problem that the accuracy of role separation in existing solutions based on the GMM and the HMM is relatively low.
  • the present application provides a voice-based role separation method, including:
  • the DNN model is configured to output a probability corresponding to each role according to the input feature vector, and the HMM is used to describe a jump relationship between the characters.
  • After the step of extracting the feature vectors frame by frame from the voice signal and before the step of assigning the character tags to the feature vectors, the following operation is performed: splitting the voice signal into voice segments by identifying and culling the audio frames that do not include voice content;
  • the assigning a role tag to the feature vector includes: assigning a character tag to the feature vector in each voice segment; and determining the character sequence corresponding to the feature vector sequence includes: determining a character sequence corresponding to the feature vector sequence included in each voice segment.
  • The assigning a role tag to the feature vector in each voice segment includes: assigning a role tag to the feature vectors in each voice segment by establishing a Gaussian mixture model GMM and an HMM, wherein a GMM is established for each role and is used to output, according to the input feature vector, a probability that the feature vector corresponds to that role;
  • The determining, according to the DNN model and the HMM obtained by training with the feature vectors, the role sequence corresponding to the feature vector sequence included in each voice segment includes: determining, according to the DNN model and the HMM used when assigning role tags to the feature vectors in each voice segment, the character sequence corresponding to the feature vector sequence included in each of the speech segments.
  • the role label is assigned to the feature vector in each voice segment, including:
  • the assigning a corresponding role to each voice segment according to the sequence of roles includes:
  • for each voice segment, the mode of the characters corresponding to its feature vectors is designated as the character of that speech segment.
  • The training the GMM and the HMM for each role according to the feature vectors in each voice segment and the corresponding roles includes: training the GMM and the HMM in an incremental manner on the basis of the models obtained from the last training.
  • Determining whether the current number of roles meets the preset requirement; if yes, performing the step of assigning role tags to the feature vectors in each voice segment according to the sequence of roles, and if not, performing the step of adjusting the number of roles.
  • the preset initial number of roles is 2, and the adjusting the number of roles includes: adding 1 to the current number of roles.
  • the extracting the feature vector from the voice signal frame by frame, and obtaining the feature vector sequence includes:
  • a feature vector of each audio frame is extracted to obtain the feature vector sequence.
  • the extracting the feature vector of each audio frame comprises: extracting an MFCC feature, a PLP feature, or an LPC feature.
  • the identifying and culling the audio frame that does not include the voice content comprises: using the VAD technology to identify the audio frame that does not include the voice content, and performing a corresponding culling operation.
  • After performing the identifying and culling operation by using the VAD technology and dividing the voice signal into voice segments, the following VAD smoothing operation is performed:
  • the speech segment whose duration is less than the preset threshold is merged with the adjacent speech segment.
  • the training the depth neural network DNN model by using the feature vector with the character tag comprises: training the DNN model by using a back propagation algorithm.
  • The determining, according to the DNN model and the Hidden Markov Model HMM obtained by training with the feature vectors, the character sequence corresponding to the feature vector sequence includes: performing a decoding operation according to the DNN model and the HMM, acquiring the character sequence whose probability value of outputting the feature vector sequence ranks first, and taking that character sequence as the character sequence corresponding to the feature vector sequence.
  • The outputting the role separation result includes: outputting, for each role and according to the role sequence corresponding to the feature vector sequence, the start and end time information of the audio frames to which the corresponding feature vectors belong.
  • the selecting the corresponding number of voice segments comprises: selecting the number of voice segments that meet the preset requirements.
  • the present application further provides a voice-based role separation device, including:
  • a feature extraction unit configured to extract a feature vector from a voice signal frame by frame to obtain a feature vector sequence
  • a label allocation unit configured to assign a character label to the feature vector
  • a DNN model training unit for training a DNN model with a feature vector having a character tag, wherein the DNN model is configured to output a probability corresponding to each character according to the input feature vector;
  • the role determining unit is configured to determine a character sequence corresponding to the feature vector sequence and output a role separation result according to the DNN model and the HMM obtained by using the feature vector training, wherein the HMM is used to describe a jump relationship between the characters.
  • the device further includes:
  • a voice segment segmentation unit configured to: after the feature extraction unit extracts the feature vectors and before the label allocation unit is triggered to work, identify and cull the audio frames that do not include voice content, and divide the voice signal into voice segments;
  • the label distribution unit is specifically configured to allocate a role label for a feature vector in each voice segment
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each voice segment according to the DNN model and the HMM obtained by using the feature vector training, and output a role separation result.
  • the label allocation unit is specifically configured to allocate a role label for the feature vectors in each voice segment by establishing a GMM and an HMM, where a GMM is established for each role and is configured to output, according to the input feature vector, the probability that the feature vector corresponds to that role;
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each of the voice segments according to the DNN model and an HMM used to assign a character tag to a feature vector in each voice segment.
  • the label distribution unit includes:
  • the initial role designating sub-unit is configured to select a corresponding number of voice segments according to a preset initial number of roles, and specify different roles for each voice segment;
  • An initial model training subunit for training a GMM and an HMM for each character by using a feature vector in a voice segment of a specified character
  • a decoding subunit configured to perform decoding according to the GMM and the HMM obtained by the training, and obtain a sequence of roles in which the probability values of the feature vector sequences included in each speech segment are outputted;
  • a probability judging subunit configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
  • a label allocation subunit configured to allocate a role label for the feature vector in each voice segment according to the role sequence when the output of the probability determination subunit is YES.
  • the label distribution unit further includes:
  • a voice-by-speech role designation sub-unit configured to specify a corresponding role for each voice segment according to the role sequence when the output of the probability determination sub-unit is negative;
  • the model update training subunit is configured to train the GMM and the HMM for each role according to the feature vector in each voice segment and the corresponding role, and trigger the decoding subunit to work.
  • the voice-by-speech segment role specifying sub-unit is specifically configured to specify, for each voice segment, a mode of a character corresponding to each feature vector as a role of the voice segment.
  • model update training subunit is specifically configured to train the GMM and the HMM in an incremental manner on the basis of the model obtained in the previous training.
  • the label distribution unit further includes:
  • a training number determining subunit configured to determine, when the output of the probability judging subunit is negative, whether the number of times the GMM and the HMM are trained under the current number of characters is less than a preset upper limit of the training times, and when the judgment result is yes, Triggering the voice-by-speech role designation sub-unit work.
  • a role quantity adjustment subunit configured to: when the output of the training times determining subunit is negative, adjust the number of roles, select a corresponding number of voice segments, respectively specify different roles for each voice segment, and trigger the initial model training subunit to work.
  • the label distribution unit further includes:
  • a role number determining subunit configured to determine, when the output of the training times determining subunit is negative, whether the current number of roles meets a preset requirement, and to trigger the label allocation subunit to work if it does, and the role quantity adjustment subunit to work otherwise.
  • the feature extraction unit includes:
  • a framing sub-unit configured to perform framing processing on the voice signal according to a preset frame length to obtain a plurality of audio frames
  • a feature extraction execution subunit is configured to extract a feature vector of each audio frame to obtain the feature vector sequence.
  • the feature extraction execution sub-unit is specifically configured to extract an MFCC feature, a PLP feature, or an LPC feature of each audio frame to obtain the feature vector sequence.
  • the voice segment segmentation unit is specifically configured to: identify and cull the audio frame that does not include voice content by using a VAD technology, and divide the voice signal into voice segments.
  • the device further includes:
  • the VAD smoothing unit is configured to merge the voice segment whose duration is less than the preset threshold with the adjacent voice segment after the voice segment segmentation unit uses the VAD technology to segment the voice segment.
  • the DNN model training unit is specifically configured to train the DNN model by using a back propagation algorithm.
  • the role determining unit is configured to: perform a decoding operation according to the DNN model and the HMM, obtain the character sequence whose probability value of outputting the feature vector sequence ranks first, and use that character sequence as the character sequence corresponding to the feature vector sequence.
  • the role determining unit outputs the role separation result in the following manner: according to the character sequence corresponding to the feature vector sequence, the start and end time information of the audio frame to which the corresponding feature vector belongs is output for each character.
  • the initial role designation subunit or the role quantity adjustment subunit specifically selects a corresponding number of voice segments by selecting the number of voice segments that meet the preset requirement.
  • The speech-based character separation method provided by the present application first extracts a feature vector sequence frame by frame from a speech signal, then trains the DNN model on the basis of assigning character tags to the feature vectors, and determines, according to the DNN model and the HMM obtained by training with the feature vectors, the character sequence corresponding to the feature vector sequence, thereby obtaining the role separation result.
  • The above method provided by the present application uses a DNN model with powerful feature extraction capability to model the speaker roles; compared with the traditional GMM it has stronger characterization ability and characterizes each role in a more refined and accurate manner, so more accurate role separation results can be obtained.
  • FIG. 1 is a flow chart of an embodiment of a voice-based role separation method of the present application
  • FIG. 2 is a flowchart of a process for extracting a feature vector sequence from a voice signal according to an embodiment of the present application
  • FIG. 3 is a flowchart of a process for assigning a role tag to a feature vector in each voice segment by using a GMM and an HMM according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of voice segment division provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a topology structure of a DNN network according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an embodiment of a voice-based role separation device of the present application.
  • a voice-based role separation method and a voice-based role separation device are respectively provided, which are described in detail in the following embodiments.
  • the technical background, technical solutions, and writing manners of the embodiments of the present application will be briefly described before describing the embodiments.
  • GMM Gaussian mixture model
  • HMM Hidden Markov Model
  • HMM is a statistical model used to describe a Markov process with implicit unknown parameters.
  • The Hidden Markov model is a kind of Markov chain whose state (called the hidden state) cannot be observed directly but is related to an observable observation vector. The HMM is therefore a double stochastic process consisting of two parts: a Markov chain with state transition probabilities (usually described by a transition matrix A), and a random process describing the output relationship between the hidden states and the observation vectors (usually described by a confusion matrix B, each element of which is the probability that a given hidden state outputs a given observation vector, also known as the emission probability).
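  • To make the two parts concrete, the following sketch writes down illustrative parameters for a two-role HMM (all numbers are invented for illustration and do not come from the application):

```python
import numpy as np

# Illustrative HMM for two roles (hidden states s1, s2); numbers are invented.
# Transition matrix A: A[i, j] = P(next state = j | current state = i).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Initial state distribution.
pi = np.array([0.5, 0.5])

# Discrete-observation confusion matrix B: B[i, k] = P(observation k | state i).
# With continuous feature vectors, each row is replaced by a per-role GMM
# (or, in the later steps of this application, by a DNN) that scores
# P(feature vector | role), i.e. the emission probability.
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
```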
  • the GMM can be simply understood as a superposition of multiple Gaussian density functions.
  • The core idea is to use a weighted combination of multiple Gaussian probability density functions to describe the distribution of feature vectors in the probability space, which can smoothly approximate distributions of arbitrary shape.
  • the parameters include: mixing weight, mean vector, and covariance matrix for each Gaussian distribution.
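  • As a minimal illustration of this weighted superposition (toy parameters; scipy is assumed to be available):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_k w_k * N(x; mu_k, Sigma_k): a weighted superposition of
    Gaussian densities, as described above."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Toy two-component mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]                    # mixing weights
means = [np.zeros(2), np.full(2, 3.0)]  # mean vectors
covs = [np.eye(2), 2.0 * np.eye(2)]     # covariance matrices
print(gmm_density(np.array([0.5, 0.5]), weights, means, covs))
```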
  • In role separation, a GMM is usually established for each role: the states of the HMM are the roles, the observation vectors are the feature vectors extracted frame by frame from the speech signal, and each state outputs feature vectors. The emission probability of each state is determined by its GMM (the confusion matrix can be derived from the GMMs), and the role separation process is the process of determining the sequence of roles corresponding to the sequence of feature vectors using the GMM and the HMM.
  • Against this background, the technical solution of the present application determines the emission probability of each state of the HMM by using a deep neural network (DNN) on the basis of pre-assigning a character tag to the feature vector of each speech frame, and determines the character sequence corresponding to the feature vector sequence according to the DNN and the HMM.
  • DNN deep neural network
  • the technical solution of the present application first assigns a role tag to a feature vector extracted from a voice signal.
  • The role tags assigned at this stage are usually not very accurate, but they provide a reference for the subsequent supervised learning process, and the DNN model trained on this basis can characterize the roles more accurately, thus making the role separation results more accurate.
  • In specific implementation, the function of assigning role tags may be implemented by a statistics-based algorithm or a classifier; the following embodiments describe an implementation that assigns the role tags by means of a GMM and an HMM.
  • FIG. 1 is a flowchart of an embodiment of a voice-based role separation method according to the present application. The method includes the following steps:
  • Step 101 Extract a feature vector from a voice signal frame by frame to obtain a feature vector sequence.
  • the speech signal to be separated by the character is usually a time domain signal.
  • a sequence of feature vectors capable of characterizing the speech signal is obtained through two processes of framing and extracting feature vectors, which will be further described below with reference to FIG. 2 .
  • Step 101-1 Perform frame processing on the voice signal according to a preset frame length to obtain a plurality of audio frames.
  • The frame length may be preset according to requirements, for example 10 ms or 15 ms; the time-domain voice signal is then segmented frame by frame according to this frame length, dividing the voice signal into multiple audio frames. Depending on the segmentation strategy employed, adjacent audio frames may or may not overlap.
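  • A minimal framing sketch under these assumptions (the function name and default frame/hop lengths are illustrative, not taken from the application):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=10, hop_ms=10):
    """Split a 1-D time-domain signal into fixed-length audio frames.
    hop_ms < frame_ms produces overlapping frames; hop_ms == frame_ms does not."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop_len)
    if n_frames == 0:
        return np.empty((0, frame_len))
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])
```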
  • Step 101-2 Extract a feature vector of each audio frame to obtain the feature vector sequence.
  • In this step, a feature vector capable of characterizing the speech signal is extracted frame by frame. Since the descriptive ability of the speech signal in the time domain is relatively weak, a Fourier transform is usually performed on each audio frame and frequency-domain features are then extracted as the feature vector of that audio frame.
  • MFCC Mel Frequency Cepstral Coefficient
  • PLP Perceptual Linear Predictive
  • LPC Linear Predictive Coding
  • Taking the MFCC feature as an example: the time-domain signal of the audio frame is converted by FFT (Fast Fourier Transformation) into the corresponding spectrum information; the spectrum information is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis (taking the logarithm and then applying a DCT) is performed on the Mel spectrum to obtain the MFCC feature vector.
  • DCT Discrete Cosine Transform
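  • A sketch of that pipeline for a single frame, assuming a precomputed Mel filter bank matrix of shape (n_mels, n_fft_bins); the function name and coefficient count are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frame(frame, mel_filterbank, n_coeffs=13):
    """MFCC pipeline described above: FFT -> Mel filter bank -> log -> DCT.
    mel_filterbank: precomputed (n_mels, n_fft_bins) matrix, with
    n_fft_bins == len(frame) // 2 + 1."""
    power_spectrum = np.abs(np.fft.rfft(frame)) ** 2      # FFT -> spectrum
    mel_energies = mel_filterbank @ power_spectrum        # Mel filter bank
    log_mel = np.log(mel_energies + 1e-10)                # cepstral analysis...
    return dct(log_mel, type=2, norm='ortho')[:n_coeffs]  # ...via log + DCT
```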
  • Step 102 Assign a character tag to the feature vector.
  • a role tag is assigned to a feature vector in a feature vector sequence by establishing a GMM and an HMM. It is considered that in addition to the speech signal corresponding to each character in a speech signal, there may be a portion without speech content, for example, a mute portion due to listening, thinking, and the like. Since these parts do not contain the information of the character, in order to improve the accuracy of the character separation, such an audio frame can be recognized and culled from the voice signal in advance.
  • the audio frame that does not include the voice content is removed, and the voice segment is divided, and then the character tag is assigned to the feature vector in each voice segment.
  • In this embodiment, the assignment of the role tags includes: performing an initial division of the roles, and iteratively training the GMM and the HMM on the basis of that initial division. If a model obtained by training does not satisfy the preset requirement, the number of roles is adjusted and the GMM and the HMM are retrained; once the trained models satisfy the preset requirement, character tags are assigned to the feature vectors in each voice segment according to the models.
  • Step 102-1 by identifying and culling an audio frame that does not contain voice content, and dividing the voice signal into voice segments.
  • the prior art generally adopts an acoustic segmentation method, that is, separating, for example, a "music segment”, a “speech segment”, a “silent segment”, and the like from a voice signal according to an existing model.
  • The technical solution of the present application may instead use VAD (Voice Activity Detection) technology to identify the portions that do not include voice content. Compared with techniques based on acoustic segmentation, VAD does not require acoustic models for different kinds of audio segments to be trained in advance, so its adaptability is stronger. For example, whether an audio frame is a silent frame can be identified by calculating the energy characteristics, the zero-crossing rate, and the like of the audio frame; the above methods may also be used in combination, or identification may be performed by establishing a noise model.
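  • A minimal sketch of the energy and zero-crossing-rate heuristic just mentioned (the thresholds are invented and would need tuning, or combination with a noise model, in practice):

```python
import numpy as np

def is_silent(frame, energy_thresh=1e-4, zcr_thresh=0.1):
    """Flag a frame as silent when both its short-time energy and its
    zero-crossing rate are low; low energy with a high ZCR may still be
    unvoiced speech, so it is not treated as silence here."""
    energy = np.mean(frame.astype(float) ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return energy < energy_thresh and zcr < zcr_thresh
```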
  • After an audio frame that does not contain voice content is identified, on the one hand, the frame can be removed from the voice signal to improve the accuracy of the role separation; on the other hand, identifying the audio frames without voice content also reveals the start and end points of each valid voice portion (the portions containing voice content), so the voice segments can be divided on this basis.
  • FIG. 4 is a schematic diagram of the segmentation of speech segments provided by the embodiment.
  • As shown in FIG. 4, each audio frame between times t2 and t3 and between t4 and t5 is detected as a silent frame by the VAD technique. This step removes these silent frames from the voice signal and correspondingly divides three voice segments: voice segment 1 (seg1) between t1 and t2, voice segment 2 (seg2) between t3 and t4, and voice segment 3 (seg3) between t5 and t6. Each voice segment includes a number of audio frames, and each audio frame has a corresponding feature vector.
  • role assignments can be roughly performed to provide a reasonable starting point for subsequent training.
  • the VAD smoothing operation can also be performed. This is mainly in consideration of the actual vocalization of human beings.
  • In a normal human conversation, the duration of a real speech segment is not too short. If, after the VAD operation described above, the duration of some of the obtained speech segments is less than a preset threshold (for example, a speech segment of 30 ms against a preset threshold of 100 ms), such a speech segment can be merged with its adjacent speech segment to form a longer speech segment.
  • the segmentation of the speech segment obtained after the VAD smoothing process is closer to the real situation, which helps to improve the accuracy of the character separation.
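  • A minimal sketch of this merging step, with segment boundaries in milliseconds; absorbing the intervening silent gap into the merged segment is a simplification for illustration:

```python
def smooth_segments(segments, min_ms=100):
    """Merge every speech segment shorter than min_ms into its neighbour;
    segments are (start_ms, end_ms) pairs in time order. Merging here simply
    extends the previous segment, absorbing the intervening gap."""
    merged = []
    for start, end in segments:
        prev_short = merged and merged[-1][1] - merged[-1][0] < min_ms
        if merged and (end - start < min_ms or prev_short):
            merged[-1][1] = end                  # absorb into previous segment
        else:
            merged.append([start, end])
    return merged

print(smooth_segments([(0, 30), (120, 400), (450, 470), (600, 900)]))
# -> [[0, 470], [600, 900]]
```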
  • the voice signal is divided into a plurality of voice segments by the VAD technique, and the tasks of the subsequent steps 102-2 to 102-11 are to use the GMM and the HMM to assign a character tag to the feature vector in each voice segment.
  • Step 102-2 Select a corresponding number of voice segments according to a preset initial number of roles, and specify different roles for each voice segment.
  • Specifically, voice segments equal in number to the preset initial number of roles can be randomly selected from the already-divided voice segments. However, the selected voice segments are to be used for the initial training of the GMM and the HMM: a segment that is too short provides too little training data, while a segment that is too long is more likely to contain more than one role, and both cases are unfavorable to the initial training. This embodiment therefore provides a preferred implementation: voice segments whose durations meet a preset requirement are selected according to the initial number of roles, and a different role is specified for each selected voice segment.
  • The initial number of roles preset in this embodiment is 2, and the preset requirement on the selected voice segments is a duration between 2 s and 4 s. This step therefore selects, from the already-divided voice segments, two voice segments that satisfy the above requirement and specifies a different role for each. Still taking the speech segmentation shown in FIG. 4 as an example, seg1 and seg2 each satisfy the duration requirement, so the two speech segments seg1 and seg2 can be selected, with role 1 (s1) assigned to seg1 and role 2 (s2) assigned to seg2.
  • Step 102-3 Train the GMM and the HMM for each character by using feature vectors in the voice segment of the specified character.
  • This step trains the GMM for each character and the HMM describing the jump relationship between the characters according to the feature vector contained in the speech segment of the specified character.
  • This step is the initial training performed under a specific number of roles. Still taking the speech segment division shown in FIG. 4 as an example, under the initial number of roles, the feature vectors included in seg1 are used to train the GMM of role 1 (gmm1) and the feature vectors included in seg2 are used to train the GMM of role 2 (gmm2). If the GMM and HMM trained under this number of roles do not meet the requirements, the number of roles can be adjusted and this step repeated, performing the corresponding initial training with the adjusted number of roles.
  • The process of training the GMM and the HMM for each role is the process of learning the various parameters of the HMM from a given observation sequence (i.e., the sequence of feature vectors included in each speech segment, which serves as the training sample). The parameters include: the transition matrix A of the HMM, and the mixing weights, mean vectors, and covariance matrices of the GMM corresponding to each role.
  • The Baum-Welch algorithm can be used for the training: initial values of the parameters are first estimated from the training samples; the posterior probability of being in state s_j at time t is then computed from the training samples and the current parameter values, and the parameters are re-estimated accordingly, iterating until convergence. In this way, the GMM and HMM are initially trained under the specific number of roles.
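  • One compact way to sketch this joint GMM/HMM training is with the third-party hmmlearn package (an assumption, not something the application specifies); feats_seg1 and feats_seg2 are hypothetical feature arrays from the two initially selected segments:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # third-party package, assumed available

# feats_seg1, feats_seg2: hypothetical (n_frames, n_dims) arrays of feature
# vectors (e.g. MFCCs) from the two initially selected voice segments.
X = np.vstack([feats_seg1, feats_seg2])
lengths = [len(feats_seg1), len(feats_seg2)]

# One hidden state per role, one GMM per state; fit() runs Baum-Welch (EM).
model = GMMHMM(n_components=2, n_mix=4, covariance_type='diag', n_iter=20)
model.fit(X, lengths)

# decode() (step 102-4 below) returns the log-probability of the best path
# and the most likely role for every frame.
logprob, frame_roles = model.decode(X)
```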
  • Step 102-4 Perform decoding according to the GMM and the HMM obtained by the training, and obtain a sequence of roles in which the probability values of the feature vector sequences included in each speech segment are output.
  • the speech signal has been divided into a plurality of speech segments in step 102-1, and each audio frame in each speech segment has a corresponding feature vector, which together constitute the feature vector sequence described in this step.
  • The task is to find the HMM state sequence that the feature vector sequence most likely depends on, that is, to find the character sequence; the function performed in this step is the commonly described HMM decoding process.
  • In this step, the character sequence whose probability of outputting the feature vector sequence ranks first is searched for and output. In specific implementation, the character sequence with the maximum probability value may be selected, that is, the character sequence most likely to output the feature vector sequence, also referred to as the optimal hidden state sequence.
  • an exhaustive search method may be used to calculate a probability value of each possible character sequence outputting the feature vector sequence, and select a maximum value therefrom.
  • Alternatively, the Viterbi algorithm may be employed to reduce the computational complexity by exploiting the transition probabilities of the HMM over time; after the maximum probability of outputting the feature vector sequence is found in the search, backtracking is performed according to the information recorded during the search, yielding the corresponding character sequence.
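  • A minimal log-space Viterbi sketch (the array names and shapes are ours; the log emission scores would come from the per-role GMMs here, or from the DNN in step 104):

```python
import numpy as np

def viterbi(log_emit, log_A, log_pi):
    """Most likely state (role) sequence for a frame sequence.
    log_emit: (T, S) log emission scores log P(x_t | s);
    log_A: (S, S) log transition matrix; log_pi: (S,) log initial probs."""
    T, S = log_emit.shape
    delta = log_pi + log_emit[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # records the best predecessor
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[i, j]: i -> j transition
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):           # backtrack along recorded info
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(np.max(delta))
```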
  • Step 102-5 Determine whether the probability value corresponding to the role sequence is greater than a preset threshold. If yes, go to step 102-6, otherwise go to step 102-7.
  • If the probability value corresponding to the character sequence obtained by the decoding in step 102-4 is greater than a preset threshold, for example 0.5, the current GMM and HMM can generally be considered stable, and step 102-6 can be performed to assign a role tag to the feature vectors in each voice segment (the subsequent step 104 may then use the stabilized HMM to determine the character sequence corresponding to the feature vector sequence); otherwise, the process proceeds to step 102-7 to determine whether to continue the iterative training.
  • Step 102-6 Assign a role tag to the feature vector in each voice segment according to the role sequence.
  • The feature vectors in the respective voice segments can be assigned role tags according to the character sequence obtained by decoding in step 102-4. Since the characters in the sequence correspond one-to-one with the feature vectors, a character tag can be assigned to each feature vector accordingly; at this point, every feature vector in each speech segment has its own role tag.
  • step 102-7 it is determined whether the number of times the GMM and the HMM are trained in the current number of roles is less than a preset upper limit of the number of training times; if yes, step 102-8 is performed; otherwise, the process proceeds to step 102-10.
  • Reaching this step shows that the GMM and HMM obtained by the current training are not yet stable and iterative training needs to continue. However, if the number of roles currently used in training is inconsistent with the actual number of roles (the number of real characters involved in the voice signal), the GMM and the HMM may never meet the requirement no matter how many iterations are performed (the probability value corresponding to the character sequence obtained by the decoding operation never exceeds the preset threshold). To avoid a meaningless iteration loop, an upper limit on the number of training rounds under each number of roles may be preset. If the number of training rounds under the current number of roles is less than this upper limit, the process proceeds to step 102-8 to assign a role to each voice segment and continue the iterative training; otherwise, the number of roles currently used may be inconsistent with the actual situation, so the process goes to step 102-10 to determine whether the number of roles needs to be adjusted.
  • Step 102-8 Specify a corresponding role for each voice segment according to the role sequence.
  • The character sequence has been acquired by decoding in step 102-4. Since the characters in the sequence correspond one-to-one with the feature vectors in the voice segments, the character corresponding to each feature vector in each voice segment is known. On this basis, a role is assigned to each voice segment by calculating the mode of the characters corresponding to its feature vectors. For example, if a voice segment includes 10 audio frames, that is, 10 feature vectors, of which 8 correspond to role 1 (s1) and 2 correspond to role 2 (s2), the mode of the characters corresponding to the feature vectors in the segment is role 1 (s1), so role 1 (s1) is designated as the role of that voice segment.
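  • With the Python standard library, the mode computation is one line; the example mirrors the 8-to-2 frame count above:

```python
from collections import Counter

def segment_role(frame_roles):
    """Return the mode (most frequent value) of the per-frame roles."""
    return Counter(frame_roles).most_common(1)[0][0]

# 8 frames of role 1 and 2 frames of role 2 -> the segment's role is 1.
print(segment_role([1, 1, 1, 2, 1, 1, 2, 1, 1, 1]))
```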
  • Step 102-9 Train the GMM and the HMM for each role according to the feature vector in each voice segment and the corresponding role, and go to step 102-4 to continue the execution.
  • On this basis, the GMM and HMM for each role can be retrained. Still taking the speech segment division shown in FIG. 4 as an example, if step 102-8 designates seg1 and seg3 as role 1 (s1) and seg2 as role 2 (s2), the feature vectors included in seg1 and seg3 are used to train the GMM of role 1 (gmm1), and the feature vectors included in seg2 are used to train the GMM of role 2 (gmm2).
  • the training methods of GMM and HMM refer to the related text in step 102-3, and details are not described here.
  • the technical solution is usually an iterative training process.
  • Therefore, in specific implementation, this step can train the new GMM and HMM incrementally on the basis of the GMM and HMM obtained from the previous training, that is, starting from the previously obtained parameters and continuing to adjust them with the current sample data, which improves the training speed. After the above training process is completed and the new GMM and HMM are obtained, the process goes to step 102-4 to perform decoding according to the new models and carry out the subsequent operations.
  • Step 102-10 Determine whether the current number of roles meets the preset requirement; if yes, go to step 102-6 to execute, otherwise continue to step 102-11.
  • Reaching this step usually indicates that the GMM and HMM trained under the current number of roles are not stable and the number of training rounds has reached or exceeded the preset upper limit. In this case, it is judged whether the current number of roles meets the preset requirement; if it does, the iteration can be stopped and the process goes to step 102-6 to assign the role tags; otherwise, the process continues to step 102-11 to adjust the number of roles.
  • Step 102-11 Adjust the number of roles, select a corresponding number of voice segments, and assign different roles to each voice segment; and go to step 102-3 to continue.
  • If step 102-10 determines that the current number of roles does not meet the preset requirement, this step adjusts the number of roles, for example by adding 1 to the current number of roles, updating it from 2 to 3.
  • a corresponding number of voice segments are selected from each voice segment included in the voice signal, and different characters are respectively assigned to each voice segment selected.
  • For the duration requirement of the selected voice segments, refer to the related text of step 102-2; details are not repeated here. Still taking the speech segmentation shown in FIG. 4 as an example, if the adjusted number of roles is 3, this step may select the three voice segments seg1, seg2, and seg3, and specify role 1 (s1) for seg1, role 2 (s2) for seg2, and role 3 (s3) for seg3. After the above operations of adjusting the number of roles and selecting voice segments are completed, the process goes to step 102-3 to perform the initial training of the GMM and the HMM under the adjusted number of roles.
  • Step 103 Train the DNN model with a feature vector having a character tag.
  • This step trains the DNN model using the feature vectors with character tags as samples; the DNN model is used to output, according to the input feature vector, the probability corresponding to each character. For ease of understanding, a brief description of the DNN is given first.
  • DNN Deep Neural Networks
  • DNN generally refers to a neural network that includes one input layer, three or more hidden layers (there may also be seven, nine, or even more hidden layers), and one output layer.
  • Each hidden layer can extract certain features and use the output of this layer as the input of the next layer.
  • FIG. 5 is a schematic diagram of the topology of the DNN network.
  • the DNN network in the figure has a total of n layers, each layer has multiple neurons, and the layers are fully connected; each layer has its own excitation function f (for example Sigmoid function).
  • Suppose the input is the feature vector v, the transfer matrix from the i-th layer to the (i+1)-th layer is w_i(i+1), the offset vector of the (i+1)-th layer is b_(i+1), the output of the i-th layer is out_i, and the input of the (i+1)-th layer is in_(i+1). The layer-by-layer calculation process is then: in_(i+1) = w_i(i+1) · out_i + b_(i+1), and out_(i+1) = f(in_(i+1)).
  • the parameters of the DNN model include the transition matrix w between the layers and the offset vector b of each layer.
  • the main task of training the DNN model is to determine the above parameters.
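  • The forward pass just described, as a short sketch (the sigmoid excitation and the list-of-matrices representation are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(v, weights, biases):
    """Layer-by-layer computation described above:
    in_(i+1) = w_i(i+1) . out_i + b_(i+1); out_(i+1) = f(in_(i+1))."""
    out = v
    for w, b in zip(weights, biases):   # one (w, b) pair per layer
        out = sigmoid(w @ out + b)      # excitation function f
    return out
```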
  • BP Back-propagation
  • The training process is a supervised learning process: the input signal (a labeled feature vector) propagates forward layer by layer until it reaches the output layer; the error between the actual output and the desired output is then propagated backward layer by layer, and the parameters of each layer are adjusted by the gradient descent method so that the actual output of the network continuously approaches the desired output. For a DNN with thousands of neurons per layer, the number of parameters may reach one million or more.
  • the DNN model obtained by the above training process usually has very powerful feature extraction ability and recognition ability.
  • In the technical solution of the present application, the DNN model is used to output the probability corresponding to each character according to the input feature vector. The output layer of the DNN model may therefore use a classifier (for example, Softmax) as its activation function and may include n nodes corresponding to the n characters, each node outputting, for each input feature vector, the probability that the feature vector corresponds to its character.
  • This step uses the feature vectors with character tags as samples to perform supervised training of the constructed DNN model. In specific implementation, the BP algorithm can be used directly for the training; however, considering that the BP algorithm alone may fall into a local minimum so that the resulting model cannot meet the application requirements, this embodiment combines pre-training with the BP algorithm to train the DNN model. Pre-training usually uses an unsupervised greedy layer-by-layer training algorithm: the network with one hidden layer is first trained in an unsupervised manner, the trained parameters are retained, one more hidden layer is added and the network with two hidden layers is trained, and so on, until the network with the largest number of hidden layers is reached.
  • the parameter values learned by the unsupervised training process are used as initial values, and then the traditional BP algorithm is used for supervised training, and finally the DNN model is obtained.
  • Since the initial parameter distribution obtained by pre-training is closer to the final convergence values than the random initialization used by the pure BP algorithm, the subsequent supervised training process starts from a good starting point; the DNN model trained in this way therefore usually does not fall into a local minimum and achieves a higher recognition rate.
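  • A minimal sketch of the supervised BP stage only (pre-training omitted; PyTorch, the layer sizes, and the 39-dimensional input are our assumptions, not the application's):

```python
import torch
from torch import nn

n_roles, dim = 2, 39   # illustrative: two roles, 39-dimensional features
model = nn.Sequential(
    nn.Linear(dim, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, n_roles),   # logits; softmax is folded into the loss
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent
loss_fn = nn.CrossEntropyLoss()                          # log-softmax + NLL

def train_step(feats, role_labels):
    """feats: (batch, dim) float tensor; role_labels: (batch,) long tensor
    holding the role tags assigned in step 102."""
    optimizer.zero_grad()
    loss = loss_fn(model(feats), role_labels)  # forward propagation
    loss.backward()                            # error back-propagation
    optimizer.step()                           # adjust each layer's parameters
    return loss.item()
```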
  • Step 104 Determine a character sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by using the feature vector training, and output a role separation result.
  • Since the DNN model outputs the probability of each character given the input feature vector, and the prior probability of each character can be obtained from the distribution of character tags over the feature vector sequence (the prior probability of each feature vector is usually also fixed), Bayes' theorem allows the probability of each character emitting the corresponding feature vector to be derived from the DNN output and these prior probabilities. That is, the DNN model trained in step 103 can be used to determine the emission probabilities of the HMM states.
  • In specific implementation, the HMM could be retrained with the feature vector sequence on the basis of determining the HMM emission probabilities with the DNN model as described above. However, considering that the HMM describing the jump relationship between the roles, which was used when assigning role tags to the feature vectors in step 102, is already basically stable, no additional training is necessary. This embodiment therefore uses that HMM directly and replaces the GMM with the trained DNN model, that is, the emission probability of each state of the HMM is determined by the DNN model.
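  • The Bayes argument above corresponds to the standard hybrid scaled-likelihood conversion; a sketch with our own names:

```python
import numpy as np

def emission_log_likelihoods(dnn_posteriors, role_priors):
    """Scaled-likelihood conversion implied by the Bayes argument above:
    P(x | role) is proportional to P(role | x) / P(role), because P(x) is the
    same constant for every role. dnn_posteriors: (T, S) softmax outputs;
    role_priors: (S,) relative frequencies of the role tags."""
    return np.log(dnn_posteriors + 1e-10) - np.log(role_priors)
```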
  • Since step 102-1 has divided the speech signal into speech segments, this step determines, according to the DNN model and the HMM used when the role tags were pre-assigned, the role sequence corresponding to the feature vector sequence included in each speech segment.
  • The process of determining a character sequence from a feature vector sequence is the commonly described decoding problem: a decoding operation is performed according to the DNN model and the HMM, the character sequence whose probability value of outputting the feature vector sequence ranks first (for example, has the maximum probability value) is obtained, and that character sequence is taken as the character sequence corresponding to the feature vector sequence. For details, refer to the related text of step 102-4; they are not repeated here.
  • On this basis, the corresponding role separation result can be output. Since the characters in the character sequence correspond one-to-one with the feature vectors, and the audio frame corresponding to each feature vector has its own start and end times, this step can output, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
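  • A sketch of collapsing the per-frame role sequence into time-stamped turns (a 10 ms frame step is assumed purely for illustration):

```python
def role_turns(frame_roles, frame_ms=10):
    """Collapse a per-frame role sequence into [role, start_ms, end_ms] turns,
    i.e. the start/end time information output for each role."""
    turns = []
    for i, role in enumerate(frame_roles):
        start, end = i * frame_ms, (i + 1) * frame_ms
        if turns and turns[-1][0] == role and turns[-1][2] == start:
            turns[-1][2] = end                 # same role keeps speaking
        else:
            turns.append([role, start, end])   # a new role takes over
    return turns

print(role_turns([1, 1, 1, 2, 2, 1]))
# -> [[1, 0, 30], [2, 30, 50], [1, 50, 60]]
```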
  • It should be noted that the method of pre-assigning character tags to the feature vectors in step 102 adopts a top-down approach that gradually increases the number of roles. In other embodiments, a bottom-up approach that gradually reduces the number of roles may also be adopted: initially, each voice segment obtained by the segmentation is assigned a different role, and the GMM and HMM for each role are trained. If the probability value obtained by decoding with the iteratively trained GMM and HMM remains not greater than the preset threshold, then, when adjusting the number of roles, the similarity between the GMMs of the roles can be evaluated (for example, by calculating the KL divergence), the voice segments corresponding to GMMs whose similarity meets a preset requirement are merged, and the number of roles is reduced accordingly. The above process is repeated iteratively until the probability value obtained by decoding is greater than the preset threshold or the number of roles meets the preset requirement; the iteration then stops, and character tags are assigned to the feature vectors in each speech segment according to the decoded character sequence.
  • In summary, the voice-based role separation method uses a DNN model with powerful feature extraction capability to model the speaker roles; compared with the traditional GMM it has stronger characterization ability and characterizes each role in a more refined and accurate manner, so more accurate role separation results can be obtained.
  • It should be noted that the technical solution of the present application can be applied not only to scenarios such as separating the dialogue of a customer service center or a conference recording, but also to any other scenario in which the roles in a voice signal need to be separated; as long as the voice signal contains two or more roles, the technical solution of the present application can be adopted and the corresponding beneficial effects obtained.
  • FIG. 6 is a schematic diagram of an embodiment of a voice-based role separation apparatus according to the present application. Since the device embodiment is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment. The device embodiments described below are merely illustrative.
  • The voice-based role separation device of this embodiment includes: a feature extraction unit 601, configured to extract feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assigning unit 602, configured to assign character tags to the feature vectors; a DNN model training unit 603, configured to train the DNN model with the feature vectors having character tags, wherein the DNN model is configured to output the probability corresponding to each character according to the input feature vector; and a role determining unit 604, configured to determine the character sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by training with the feature vectors, and to output the role separation result.
  • the HMM is used to describe a jump relationship between characters.
  • the device further includes:
  • a voice segment segmentation unit configured to: after the feature extraction unit extracts the feature vectors and before the label allocation unit is triggered to work, identify and cull the audio frames that do not include voice content, and divide the voice signal into voice segments;
  • the label distribution unit is specifically configured to allocate a role label for a feature vector in each voice segment
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each voice segment according to the DNN model and the HMM obtained by using the feature vector training, and output a role separation result.
  • the label allocation unit is specifically configured to pre-allocate a character label for the feature vectors in each voice segment by establishing a GMM and an HMM, where a GMM is established for each role and is configured to output, according to the input feature vector, the probability that the feature vector corresponds to that role;
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each of the voice segments according to the DNN model and an HMM used to assign a character tag to a feature vector in each voice segment.
  • the label distribution unit includes:
  • the initial role designating sub-unit is configured to select a corresponding number of voice segments according to a preset initial number of roles, and specify different roles for each voice segment;
  • An initial model training subunit for training a GMM and an HMM for each character by using a feature vector in a voice segment of a specified character
  • a decoding subunit configured to perform decoding according to the GMM and the HMM obtained by the training, and obtain a sequence of roles in which the probability values of the feature vector sequences included in each speech segment are outputted;
  • a probability judging subunit configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
  • a label allocation subunit configured to allocate a role label for the feature vector in each voice segment according to the role sequence when the output of the probability determination subunit is YES.
  • the label distribution unit further includes:
  • a voice-by-speech role designation sub-unit configured to specify a corresponding role for each voice segment according to the role sequence when the output of the probability determination sub-unit is negative;
  • the model update training subunit is configured to train the GMM and the HMM for each role according to the feature vector in each voice segment and the corresponding role, and trigger the decoding subunit to work.
  • the voice-by-speech segment role specifying sub-unit is specifically configured to specify, for each voice segment, a mode of a character corresponding to each feature vector as a role of the voice segment.
  • model update training subunit is specifically configured to train the GMM and the HMM in an incremental manner on the basis of the model obtained in the previous training.
  • the label distribution unit further includes:
  • a training number determining subunit configured to determine, when the output of the probability judging subunit is negative, whether the number of times the GMM and the HMM are trained under the current number of characters is less than a preset upper limit of the training times, and when the judgment result is yes, Triggering the voice-by-speech role designation sub-unit work.
  • a role quantity adjustment subunit configured to: when the output of the training times determining subunit is negative, adjust the number of roles, select a corresponding number of voice segments, respectively specify different roles for each voice segment, and trigger the initial model training subunit to work.
  • the label distribution unit further includes:
  • a role number determining subunit configured to determine, when the output of the training times determining subunit is negative, whether the current number of roles meets a preset requirement, and to trigger the label allocation subunit to work if it does, and the role quantity adjustment subunit to work otherwise.
  • the feature extraction unit includes:
  • a framing sub-unit configured to perform framing processing on the voice signal according to a preset frame length to obtain a plurality of audio frames
  • a feature extraction execution subunit is configured to extract a feature vector of each audio frame to obtain the feature vector sequence.
  • the feature extraction execution sub-unit is specifically configured to extract an MFCC feature, a PLP feature, or an LPC feature of each audio frame to obtain the feature vector sequence.
  • the voice segment segmentation unit is specifically configured to: identify and cull the audio frame that does not include voice content by using a VAD technology, and divide the voice signal into voice segments.
  • the device further includes:
  • the VAD smoothing unit is configured to merge the voice segment whose duration is less than the preset threshold with the adjacent voice segment after the voice segment segmentation unit uses the VAD technology to segment the voice segment.
  • the DNN model training unit is specifically configured to train the DNN model by using a back propagation algorithm.
  • the role determining unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the character sequence whose probability value of outputting the feature vector sequence ranks first, and use that character sequence as the character sequence corresponding to the feature vector sequence.
  • the role determining unit outputs the role separation result in the following manner: according to the character sequence corresponding to the feature vector sequence, the start and end time information of the audio frame to which the corresponding feature vector belongs is output for each character.
  • the initial role designation subunit or the role quantity adjustment subunit specifically selects a corresponding number of voice segments by selecting the number of voice segments that meet the preset requirement.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer usable program code embodied therein.

Abstract

A voice-based role separation method and device. The method comprises: extracting feature vectors from a voice signal frame by frame to obtain a feature vector sequence (101); assigning role labels to the feature vectors (102); training a deep neural network (DNN) model using the feature vectors with role labels (103); and determining, according to the DNN model and a hidden Markov model (HMM) trained using the feature vectors, a role sequence corresponding to the feature vector sequence, and outputting a role separation result (104), wherein the DNN model is configured to output, for an input feature vector, probabilities corresponding to the respective roles, and the HMM is configured to describe the transition relationships between the roles. By employing a DNN model with powerful feature extraction capability to model speaker roles, the method characterizes roles more finely and accurately than the conventional GMM, thereby providing a more accurate role separation result.

Description

Voice-based role separation method and device
This application claims priority to Chinese Patent Application No. 201510744743.4, filed on November 5, 2015 and entitled "Voice-based role separation method and device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of speech recognition, and in particular to a voice-based role separation method. The application also relates to a voice-based role separation device.
Background
Speech is the most natural form of human communication, and speech recognition technology allows a machine to convert a speech signal into corresponding text or commands through a process of recognition and understanding. Speech recognition is an interdisciplinary field involving signal processing, pattern recognition, probability theory and information theory, vocalization and auditory mechanisms, artificial intelligence, and so on.
In practical applications, analyzing a speech signal more accurately requires not only speech recognition but also identifying the speaker of each stretch of speech, so the need to separate speech by role arises naturally. Dialogue speech occurs in many scenarios such as daily life, meetings, and telephone conversations, and role separation of dialogue speech makes it possible to determine which parts were spoken by one person and which by another. Once dialogue speech has been separated by role, combining the result with speaker recognition and speech recognition opens up much broader applications; for example, the dialogue speech of a customer service center can be separated by role and then passed through speech recognition to determine what the agent said and what the customer said, enabling service quality inspection or the mining of potential customer needs.
In the prior art, a GMM (Gaussian mixture model) and an HMM (hidden Markov model) are typically used for role separation of dialogue speech: each role is modeled with a GMM, and transitions between roles are modeled with an HMM. Because GMM modeling was proposed relatively early and its ability to fit arbitrary functions depends on the number of Gaussian components, its capacity to characterize roles is limited, and the accuracy of role separation is usually too low to meet application requirements.
Summary of the invention
Embodiments of the present application provide a voice-based role separation method and device to solve the problem that the existing GMM- and HMM-based role separation technique has relatively low accuracy.
The present application provides a voice-based role separation method, including:
extracting feature vectors from a speech signal frame by frame to obtain a feature vector sequence;
assigning role labels to the feature vectors;
training a deep neural network (DNN) model using the feature vectors with role labels;
determining, according to the DNN model and a hidden Markov model (HMM) trained using the feature vectors, a role sequence corresponding to the feature vector sequence, and outputting a role separation result;
wherein the DNN model is configured to output, for an input feature vector, a probability corresponding to each role, and the HMM is used to describe the transition relationships between roles.
Optionally, after the step of extracting feature vectors from the speech signal frame by frame and before the step of assigning role labels to the feature vectors, the following operation is performed: identifying and removing audio frames that contain no speech content, and dividing the speech signal into speech segments;
the assigning of role labels to the feature vectors includes: assigning role labels to the feature vectors in each speech segment; and the determining of the role sequence corresponding to the feature vector sequence includes: determining the role sequence corresponding to the feature vector sequence contained in each speech segment.
Optionally, assigning role labels to the feature vectors in each speech segment includes: assigning role labels to the feature vectors in each speech segment by building a Gaussian mixture model (GMM) and an HMM, wherein the GMM is used, for each role, to output the probability that an input feature vector corresponds to that role;
the determining, according to the DNN model and the HMM trained using the feature vectors, of the role sequence corresponding to the feature vector sequence contained in each speech segment includes: determining the role sequence corresponding to the feature vector sequence contained in each speech segment according to the DNN model and the HMM used to assign role labels to the feature vectors in each speech segment.
Optionally, assigning role labels to the feature vectors in each speech segment by building a GMM and an HMM includes:
selecting a corresponding number of speech segments according to a preset initial number of roles, and assigning a different role to each selected speech segment;
training a GMM for each role, as well as an HMM, using the feature vectors in the speech segments with assigned roles;
decoding according to the trained GMMs and HMM, and obtaining the role sequence ranked highest by the probability of outputting the feature vector sequences contained in the speech segments;
determining whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, assigning role labels to the feature vectors in each speech segment according to the role sequence.
Optionally, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is negative, the following operations are performed:
assigning a corresponding role to each speech segment according to the role sequence;
training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles;
returning to the step of decoding according to the trained GMMs and HMM.
Optionally, assigning a corresponding role to each speech segment according to the role sequence includes:
for each speech segment, designating the mode of the roles corresponding to its feature vectors as the role of that speech segment.
Optionally, training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles includes: training the GMMs and the HMM incrementally on the basis of the models obtained in the previous round of training.
Optionally, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is negative, the following operations are performed:
determining whether the number of times the GMMs and the HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds;
if so, performing the step of assigning a corresponding role to each speech segment according to the role sequence;
if not, performing the following operations:
adjusting the number of roles, selecting a corresponding number of speech segments, and assigning a different role to each selected speech segment;
and returning to the step of training a GMM for each role, as well as an HMM, using the feature vectors in the speech segments with assigned roles.
Optionally, when the result of determining whether the number of training rounds under the current number of roles is less than the preset upper limit is negative, the following operations are performed:
determining whether the current number of roles meets a preset requirement; if so, returning to the step of assigning role labels to the feature vectors in each speech segment according to the role sequence; if not, performing the step of adjusting the number of roles.
Optionally, the preset initial number of roles is 2, and adjusting the number of roles includes: adding 1 to the current number of roles.
Optionally, extracting feature vectors from the speech signal frame by frame to obtain the feature vector sequence includes:
dividing the speech signal into frames according to a preset frame length to obtain a plurality of audio frames;
extracting a feature vector from each audio frame to obtain the feature vector sequence.
Optionally, extracting a feature vector from each audio frame includes: extracting MFCC features, PLP features, or LPC features.
Optionally, identifying and removing the audio frames that contain no speech content includes: identifying the audio frames that contain no speech content using VAD technology and performing the corresponding removal operation.
Optionally, after the identification and removal operations are performed using VAD technology and the speech signal is divided into speech segments, the following VAD smoothing operation is performed:
merging speech segments whose duration is less than a preset threshold with adjacent speech segments.
Optionally, training the deep neural network (DNN) model using the feature vectors with role labels includes: training the DNN model using a back propagation algorithm.
Optionally, determining, according to the DNN model and the hidden Markov model (HMM) trained using the feature vectors, the role sequence corresponding to the feature vector sequence includes: performing a decoding operation according to the DNN model and the HMM, obtaining the role sequence ranked highest by the probability of outputting the feature vector sequence, and taking that role sequence as the role sequence corresponding to the feature vector sequence.
Optionally, outputting the role separation result includes: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which its corresponding feature vectors belong.
Optionally, selecting a corresponding number of speech segments includes: selecting that number of speech segments whose durations meet a preset requirement.
Correspondingly, the present application also provides a voice-based role separation device, including:
a feature extraction unit configured to extract feature vectors from a speech signal frame by frame to obtain a feature vector sequence;
a label assignment unit configured to assign role labels to the feature vectors;
a DNN model training unit configured to train a DNN model using the feature vectors with role labels, wherein the DNN model is configured to output, for an input feature vector, a probability corresponding to each role;
a role determination unit configured to determine, according to the DNN model and an HMM trained using the feature vectors, the role sequence corresponding to the feature vector sequence and to output a role separation result, wherein the HMM is used to describe the transition relationships between roles.
Optionally, the device further includes:
a speech segment segmentation unit configured to, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, identify and remove audio frames that contain no speech content and divide the speech signal into speech segments;
the label assignment unit is specifically configured to assign role labels to the feature vectors in each speech segment;
the role determination unit is specifically configured to determine, according to the DNN model and the HMM trained using the feature vectors, the role sequence corresponding to the feature vector sequence contained in each speech segment and to output the role separation result.
Optionally, the label assignment unit is specifically configured to assign role labels to the feature vectors in each speech segment by building a GMM and an HMM, wherein the GMM is used, for each role, to output the probability that an input feature vector corresponds to that role;
the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each speech segment according to the DNN model and the HMM used to assign role labels to the feature vectors in each speech segment.
Optionally, the label assignment unit includes:
an initial role designation subunit configured to select a corresponding number of speech segments according to a preset initial number of roles and to assign a different role to each selected speech segment;
an initial model training subunit configured to train a GMM for each role, as well as an HMM, using the feature vectors in the speech segments with assigned roles;
a decoding subunit configured to decode according to the trained GMMs and HMM and to obtain the role sequence ranked highest by the probability of outputting the feature vector sequences contained in the speech segments;
a probability judgment subunit configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
a label assignment subunit configured to, when the output of the probability judgment subunit is positive, assign role labels to the feature vectors in each speech segment according to the role sequence.
Optionally, the label assignment unit further includes:
a segment-wise role designation subunit configured to, when the output of the probability judgment subunit is negative, assign a corresponding role to each speech segment according to the role sequence;
a model update training subunit configured to train a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles, and to trigger the decoding subunit.
Optionally, the segment-wise role designation subunit is specifically configured to, for each speech segment, designate the mode of the roles corresponding to its feature vectors as the role of that speech segment.
Optionally, the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous round of training.
Optionally, the label assignment unit further includes:
a training count judgment subunit configured to, when the output of the probability judgment subunit is negative, determine whether the number of times the GMMs and the HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds, and, when the result is positive, to trigger the segment-wise role designation subunit;
a role count adjustment subunit configured to, when the output of the training count judgment subunit is negative, adjust the number of roles, select a corresponding number of speech segments, assign a different role to each selected speech segment, and trigger the initial model training subunit.
Optionally, the label assignment unit further includes:
a role count judgment subunit configured to, when the output of the training count judgment subunit is negative, determine whether the current number of roles meets a preset requirement, and to trigger the label assignment subunit if it does, or the role count adjustment subunit otherwise.
Optionally, the feature extraction unit includes:
a framing subunit configured to divide the speech signal into frames according to a preset frame length to obtain a plurality of audio frames;
a feature extraction execution subunit configured to extract a feature vector from each audio frame to obtain the feature vector sequence.
Optionally, the feature extraction execution subunit is specifically configured to extract MFCC features, PLP features, or LPC features from each audio frame to obtain the feature vector sequence.
Optionally, the speech segment segmentation unit is specifically configured to identify and remove the audio frames that contain no speech content using VAD technology and to divide the speech signal into speech segments.
Optionally, the device further includes:
a VAD smoothing unit configured to, after the speech segment segmentation unit divides the speech using VAD technology, merge speech segments whose duration is less than a preset threshold with adjacent speech segments.
Optionally, the DNN model training unit is specifically configured to train the DNN model using a back propagation algorithm.
Optionally, the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence ranked highest by the probability of outputting the feature vector sequence, and take that role sequence as the role sequence corresponding to the feature vector sequence.
Optionally, the role determination unit outputs the role separation result as follows: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which its corresponding feature vectors belong.
Optionally, the initial role designation subunit or the role count adjustment subunit selects a corresponding number of speech segments by selecting that number of speech segments whose durations meet a preset requirement.
Compared with the prior art, the present application has the following advantages:
In the voice-based role separation method provided by the present application, a feature vector sequence is first extracted from the speech signal frame by frame; a DNN model is then trained on the basis of role labels assigned to the feature vectors; and the role sequence corresponding to the feature vector sequence is determined according to the DNN model and an HMM trained using the feature vectors, yielding the role separation result. Because this method models speaker roles with a DNN, which has powerful feature extraction capability and far stronger characterization ability than the conventional GMM, roles are characterized more finely and accurately, so a more accurate role separation result can be obtained.
Brief description of the drawings
FIG. 1 is a flowchart of an embodiment of a voice-based role separation method of the present application;
FIG. 2 is a flowchart of a process for extracting a feature vector sequence from a speech signal according to an embodiment of the present application;
FIG. 3 is a flowchart of a process for assigning role labels to the feature vectors in each speech segment using a GMM and an HMM according to an embodiment of the present application;
FIG. 4 is a schematic diagram of speech segment division according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the topology of a DNN network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of a voice-based role separation device of the present application.
Detailed description
Numerous specific details are set forth in the following description in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the present application is therefore not limited by the specific embodiments disclosed below.
The present application provides a voice-based role separation method and a voice-based role separation device, which are described in detail one by one in the following embodiments. For ease of understanding, the technical background, the technical solution, and the organization of the embodiments are briefly explained before the embodiments themselves are described.
Existing role separation techniques in the speech field typically model each role with a GMM (Gaussian mixture model) and model the transitions between roles with an HMM (hidden Markov model).
An HMM is a statistical model that describes a Markov process with hidden, unknown parameters. A hidden Markov model is a kind of Markov chain whose states (called hidden states) cannot be observed directly but are probabilistically related to an observable sequence of observation vectors. An HMM is therefore a doubly stochastic process comprising two parts: a Markov chain with state transition probabilities (usually described by a transition matrix A), and a stochastic process describing the output relationship between the hidden states and the observation vectors (usually described by a confusion matrix B, each element of which is the probability of a hidden state emitting an observation vector, also called the emission probability). An HMM with N states can be represented by the parameter triplet λ = {π, A, B}, where π is the initial probability of each state.
A GMM can be understood simply as a superposition of several Gaussian density functions. Its core idea is to describe the distribution of feature vectors in probability space with a combination of Gaussian probability density functions; such a model can smoothly approximate a density distribution of arbitrary shape. Its parameters include the mixing weight, mean vector, and covariance matrix of each Gaussian component.
In existing voice-based role separation applications, each role is typically modeled with a GMM: the states of the HMM are the roles, the observation vectors are the feature vectors extracted frame by frame from the speech signal, and the probability of each state emitting a feature vector is determined by its GMM (from which the confusion matrix can be obtained). The role separation process is then the process of determining the role sequence corresponding to the feature vector sequence using the GMMs and the HMM.
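As an illustration of this structure (not part of the patent text), the following minimal Python sketch models each role's emission density as a GMM and evaluates the per-role emission probabilities of a single feature vector; the dimensions, weights, and values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

class RoleGMM:
    """Emission model for one role: a weighted sum of Gaussian densities."""
    def __init__(self, weights, means, covs):
        self.weights = np.asarray(weights)  # mixing weights, summing to 1
        self.means = means                  # mean vector of each component
        self.covs = covs                    # covariance matrix of each component

    def density(self, x):
        # p(x | role) = sum_k w_k * N(x; mu_k, Sigma_k)
        return sum(w * multivariate_normal.pdf(x, m, c)
                   for w, m, c in zip(self.weights, self.means, self.covs))

dim = 2  # two hypothetical roles in a 2-dimensional feature space
role_models = [
    RoleGMM([0.6, 0.4], [np.zeros(dim), np.ones(dim)], [np.eye(dim)] * 2),
    RoleGMM([1.0], [2 * np.ones(dim)], [np.eye(dim)]),
]
pi = np.array([0.5, 0.5])               # initial state (role) probabilities
A = np.array([[0.9, 0.1], [0.1, 0.9]])  # role-to-role transition matrix

x = np.array([0.2, -0.1])               # one frame's feature vector
emissions = [m.density(x) for m in role_models]
print(emissions)  # per-role emission probabilities for this frame
```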
Because the function fitting capability of a GMM is limited by the number of Gaussian density functions it uses, its expressive power is inherently limited, which is why the accuracy of existing GMM- and HMM-based role separation is relatively low. To address this problem, the technical solution of the present application pre-assigns role labels to the feature vectors of the speech frames, uses a deep neural network (DNN) to determine the emission probabilities of the HMM states, and determines the role sequence corresponding to the feature vector sequence according to the DNN and the HMM. Because a DNN has the powerful ability to combine low-level features into more abstract high-level features, it can characterize roles more precisely and therefore obtain more accurate role separation results.
In the technical solution of the present application, role labels are first assigned to the feature vectors extracted from the speech signal. The labels assigned at this stage are usually not very accurate, but they provide a reference for the subsequent supervised learning process; the DNN model trained on this basis can characterize the roles more accurately, making the role separation result more accurate. When implementing the technical solution, the role label assignment function can be realized with a statistics-based algorithm or with a classifier; the embodiments provided below assign role labels to the feature vectors according to a GMM and an HMM.
Embodiments of the present application are described in detail below. Please refer to FIG. 1, a flowchart of an embodiment of a voice-based role separation method of the present application. The method includes the following steps:
Step 101: extract feature vectors from the speech signal frame by frame to obtain a feature vector sequence.
The speech signal to be separated by role is usually a time-domain signal. This step obtains a feature vector sequence that characterizes the speech signal through two processes, framing and feature extraction, which are explained further below with reference to FIG. 2.
Step 101-1: divide the speech signal into frames according to a preset frame length to obtain a plurality of audio frames.
In a specific implementation, the frame length can be preset as required, for example to 10 ms or 15 ms, and the time-domain speech signal is then cut frame by frame according to this length, dividing it into a plurality of audio frames. Depending on the segmentation strategy adopted, adjacent audio frames may or may not overlap.
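For illustration only (not part of the patent text), the following sketch frames a time-domain signal with NumPy; the 25 ms window and 10 ms hop are hypothetical values in the spirit of the frame lengths mentioned above, and overlapping frames result because the hop is shorter than the window.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a 1-D time-domain signal into (possibly overlapping) frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])

# One second of a hypothetical 16 kHz signal -> frames of 400 samples each.
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)  # (98, 400)
```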
Step 101-2: extract a feature vector from each audio frame to obtain the feature vector sequence.
After the time-domain speech signal has been divided into audio frames, feature vectors that characterize the speech signal can be extracted frame by frame. Because the descriptive power of a speech signal in the time domain is relatively weak, a Fourier transform is usually applied to each audio frame and frequency-domain features are extracted as the frame's feature vector; for example, MFCC (Mel frequency cepstrum coefficient) features, PLP (perceptual linear predictive) features, or LPC (linear predictive coding) features can be extracted.
Taking the extraction of MFCC features from an audio frame as an example, the feature extraction process is as follows. The time-domain signal of the audio frame is first transformed by an FFT (fast Fourier transform) to obtain the corresponding spectrum; the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis, whose core is generally an inverse transform implemented by a DCT (discrete cosine transform), is performed on the Mel spectrum, after which a preset number N of coefficients (for example N = 12 or 38) is taken, yielding the frame's feature vector: its MFCC features. Processing every audio frame in this way produces a series of feature vectors characterizing the speech signal, i.e. the feature vector sequence described in this application.
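If a library such as librosa is available, the whole FFT, Mel filter bank, and DCT pipeline can be sketched in a few lines; the file name and parameter values below are illustrative assumptions, not values taken from the patent.

```python
import librosa

# Hypothetical input: a mono recording resampled to 16 kHz.
y, sr = librosa.load("dialogue.wav", sr=16000)

# FFT -> Mel filter bank -> log -> DCT, keeping N = 12 coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=400, hop_length=160)
feature_sequence = mfcc.T      # one 12-dimensional feature vector per frame
print(feature_sequence.shape)  # (num_frames, 12)
```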
Step 102: assign role labels to the feature vectors.
This embodiment assigns role labels to the feature vectors in the feature vector sequence by building a GMM and an HMM. Note that besides the speech signals corresponding to the roles, a recording may also contain portions with no speech content, for example silence caused by listening or thinking. Since these portions carry no role information, such audio frames can be identified and removed from the speech signal in advance to improve the accuracy of role separation.
Based on the above consideration, this embodiment first removes the audio frames containing no speech content and divides the signal into speech segments before assigning role labels, and then assigns role labels to the feature vectors in each speech segment. The label assignment includes: performing an initial division of roles, and iteratively training the GMMs and the HMM on that basis; if the trained models do not satisfy the preset requirement, the number of roles is adjusted and the GMMs and HMM are retrained until the trained models do satisfy it, whereupon role labels are assigned to the feature vectors in each speech segment according to those models. This process is described in detail below with reference to FIG. 3.
Step 102-1: identify and remove audio frames that contain no speech content, and divide the speech signal into speech segments.
The prior art usually adopts acoustic segmentation, i.e. separating, for example, 'music segments', 'speech segments', and 'silence segments' from the speech signal according to existing models. This approach requires the acoustic models corresponding to the various audio segment types to be trained in advance; based on, say, the acoustic model corresponding to 'music segments', the corresponding audio segments can be separated from the signal.
Preferably, the technical solution of the present application can use VAD (voice activity detection) technology to identify the portions that contain no speech content. Unlike acoustic segmentation, this does not require acoustic models for different audio segment types to be trained in advance and is therefore more adaptable. For example, whether an audio frame is a silent frame can be determined by computing features such as the frame's energy and zero-crossing rate; where environmental noise is strong, several such means can be combined, or a noise model can be built for the identification.
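A minimal sketch of such an energy and zero-crossing-rate check follows; the thresholds are hypothetical and would need tuning to the recording conditions.

```python
import numpy as np

def is_silent(frame, energy_thresh=1e-4, zcr_thresh=0.4):
    """Classify one audio frame as silent from its energy and zero-crossing rate."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    # Very low energy, or a noise-like high zero-crossing rate at low energy,
    # is treated here as the absence of speech content.
    return energy < energy_thresh or (zcr > zcr_thresh and energy < 10 * energy_thresh)

# voiced[i] is True for frames judged to contain speech content;
# 'frames' is the array produced by the framing sketch above.
frames = np.random.randn(98, 400) * 0.01
voiced = np.array([not is_silent(f) for f in frames])
```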
Once the audio frames containing no speech content have been identified, these frames can, on the one hand, be removed from the speech signal to improve the accuracy of role separation; on the other hand, identifying them amounts to identifying the start and end points of every stretch of valid speech (speech with content), so the division into speech segments can be performed on this basis.
Please refer to FIG. 4, a schematic diagram of the speech segment division provided by this embodiment. In the figure, VAD detects that the audio frames between times t2 and t3 and between t4 and t5 are silent frames. This step removes these silent frames from the speech signal and divides it accordingly into three speech segments: speech segment 1 (seg1) between t1 and t2, speech segment 2 (seg2) between t3 and t4, and speech segment 3 (seg3) between t5 and t6. Each speech segment contains a number of audio frames, and each audio frame has a corresponding feature vector. On the basis of this segment division, roles can be assigned roughly, providing a reasonable starting point for the subsequent training.
Preferably, after the above VAD processing, a VAD smoothing operation can also be performed. This mainly accounts for how humans actually speak: a real speech segment does not last too short a time. If some segments obtained after the VAD operation have durations below a preset threshold (for example, a 30 ms segment against a preset threshold of 100 ms), such segments can be merged with adjacent segments to form longer ones. The segment division obtained after VAD smoothing is closer to reality, which helps improve the accuracy of role separation.
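The merging step might be sketched as follows, with segments given as (start, end) frame indices; the 10-frame minimum mirrors the 100 ms example above at a hypothetical 10 ms hop.

```python
def smooth_segments(segments, min_len=10):
    """Merge segments shorter than min_len frames into the previous segment."""
    smoothed = []
    for start, end in segments:
        if smoothed and end - start < min_len:
            prev_start, _ = smoothed.pop()
            smoothed.append((prev_start, end))  # absorb into the neighbour
        else:
            smoothed.append((start, end))
    return smoothed

print(smooth_segments([(0, 50), (52, 55), (60, 120)]))
# [(0, 55), (60, 120)] -- the 3-frame segment was merged with its neighbour
```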
This step divides the speech signal into a number of speech segments using VAD; the task of the subsequent steps 102-2 through 102-11 is to assign role labels to the feature vectors in each speech segment using the GMMs and the HMM.
Step 102-2: select a corresponding number of speech segments according to the preset initial number of roles, and assign a different role to each selected speech segment.
This step could select at random, from the segments already divided, as many speech segments as there are initial roles. However, the selected segments will be used for the initial training of the GMMs and the HMM: a segment that is too short provides little training data, while one that is too long is more likely to contain more than one role, and neither case is conducive to initial training. This embodiment therefore provides a preferred implementation: selecting, according to the initial number of roles, speech segments whose durations meet a preset requirement, and assigning a different role to each.
In this embodiment the preset initial number of roles is 2, and the preset requirement for selecting speech segments is a duration between 2 s and 4 s; this step therefore selects, from the segments already divided, 2 speech segments meeting this requirement and assigns a different role to each. Taking the segment division in FIG. 4 as an example, seg1 and seg2 each satisfy the duration requirement, so these two segments can be selected, with role 1 (s1) assigned to seg1 and role 2 (s2) assigned to seg2.
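A sketch of this selection rule; the duration bounds follow the 2 s to 4 s example, and the segment durations are hypothetical.

```python
def pick_initial_segments(durations, n_roles=2, lo=2.0, hi=4.0):
    """Return the indices of the first n_roles segments whose duration in
    seconds falls within [lo, hi], one segment per initial role."""
    eligible = [i for i, d in enumerate(durations) if lo <= d <= hi]
    if len(eligible) < n_roles:
        raise ValueError("not enough segments meet the duration requirement")
    return eligible[:n_roles]

# seg1 = 3.1 s, seg2 = 2.4 s, seg3 = 5.0 s -> seg1 and seg2 are selected.
print(pick_initial_segments([3.1, 2.4, 5.0]))  # [0, 1]
```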
Step 102-3: train a GMM for each role, as well as the HMM, using the feature vectors in the speech segments with assigned roles.
This step trains a GMM for each role, together with the HMM describing the transition relationships between roles, from the feature vectors contained in the speech segments with assigned roles; it is the initial training performed under a particular number of roles. Taking the segment division in FIG. 4 as an example again, under the initial number of roles the feature vectors contained in seg1 are used to train the GMM of role 1 (gmm1) and those contained in seg2 to train the GMM of role 2 (gmm2). If the GMMs and HMM trained under this number of roles do not satisfy the requirement, the number of roles can be adjusted and this step repeated to perform the corresponding initial training under the adjusted number of roles.
Training the GMMs and HMM for the roles is the process of learning the HMM-related parameters given the observation sequences (i.e. the feature vector sequences contained in the speech segments, which serve as training samples). These parameters include the transition matrix A of the HMM and, for the GMM of each role, parameters such as the mean vectors and covariance matrices. In a specific implementation the Baum-Welch algorithm can be used for training: initial parameter values are estimated from the training samples; from the training samples and the initial values, the posterior probability γt(sj) of being in state sj at time t is estimated; the HMM parameters are then updated according to the computed posterior probabilities, and the posterior probabilities γt(sj) are re-estimated from the training samples and the updated parameters. This process is iterated until a set of HMM parameters is found that maximizes the probability of outputting the observation sequences. Once parameters satisfying this requirement have been obtained, the initial training of the GMMs and HMM under the given number of roles is complete.
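As an illustration outside the patent text, a library such as hmmlearn wraps exactly this Baum-Welch (EM) re-estimation of an HMM with GMM emissions; the feature data and component counts below are stand-in assumptions.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# Stack the feature vectors of the role-labelled segments; 'lengths'
# tells the trainer where each segment's observation sequence ends.
seg1 = np.random.randn(300, 12)  # stand-ins for real MFCC sequences
seg2 = np.random.randn(240, 12)
X = np.vstack([seg1, seg2])
lengths = [len(seg1), len(seg2)]

# 2 hidden states (one per role), each emitting from a 4-component GMM.
model = GMMHMM(n_components=2, n_mix=4, covariance_type="diag", n_iter=20)
model.fit(X, lengths)   # Baum-Welch / EM parameter re-estimation
print(model.transmat_)  # the learned role-to-role transition matrix A
```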
Step 102-4: decode according to the trained GMMs and HMM to obtain the role sequence ranked highest by the probability of outputting the feature vector sequences contained in the speech segments.
In step 102-1 the speech signal was divided into a number of speech segments, and every audio frame in each segment has a corresponding feature vector; together these form the feature vector sequence referred to in this step. Given that feature vector sequence and the trained GMMs and HMM, this step finds the HMM state sequence, i.e. the role sequence, from which the feature vector sequence plausibly originates.
The function performed in this step is the usual HMM decoding process: given the feature vector sequence, search for the role sequences ranked highest by the probability of outputting that sequence. As a preferred implementation, the role sequence with the maximum probability can be selected, i.e. the role sequence most likely to output the feature vector sequence, also called the optimal hidden state sequence.
In a specific implementation, an exhaustive search could compute, for every possible role sequence, the probability of outputting the feature vector sequence and select the maximum. To improve computational efficiency, a preferred implementation uses the Viterbi algorithm, which exploits the time-invariance of the HMM transition probabilities to reduce the computational complexity; after the search obtains the maximum probability of outputting the feature vector sequence, the corresponding role sequence is recovered by backtracking through the information recorded during the search.
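A log-domain Viterbi sketch over the quantities introduced above (initial probabilities, transition matrix, and per-frame emission probabilities); all numeric values are hypothetical.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_B[t, s] is the log emission probability of frame t under role s.
    Returns the most probable role sequence and its log probability."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each role
    back = np.zeros((T, S), dtype=int)   # backpointers for backtracking
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):        # backtrack along recorded pointers
        path.append(back[t][path[-1]])
    return path[::-1], float(np.max(delta))

log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log(np.random.dirichlet([1, 1], size=6))  # 6 frames, 2 roles
print(viterbi(log_pi, log_A, log_B))
```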
Step 102-5: determine whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, perform step 102-6, otherwise go to step 102-7.
If the probability value corresponding to the role sequence obtained by the decoding in step 102-4 is greater than a preset threshold, for example 0.5, the current GMMs and HMM can generally be considered stable, and step 102-6 can be performed to assign role labels to the feature vectors in each speech segment (the subsequent step 104 can use the stabilized HMM to determine the role sequence corresponding to the feature vector sequence); otherwise, go to step 102-7 to determine whether to continue the iterative training.
Step 102-6: assign role labels to the feature vectors in each speech segment according to the role sequence.
Since the current GMMs and HMM are stable, role labels can be assigned to the feature vectors in each speech segment according to the role sequence obtained by decoding in step 102-4. In a specific implementation, because each role in the role sequence corresponds one-to-one with a feature vector in the speech segments, a role label can be assigned to each feature vector according to this one-to-one correspondence. At this point every feature vector in every speech segment has its own role label, step 102 is complete, and step 103 can be performed.
Step 102-7: determine whether the number of times the GMMs and HMM have been trained under the current number of roles is less than the preset upper limit on training rounds; if so, perform step 102-8, otherwise go to step 102-10.
Reaching this step means that the currently trained GMMs and HMM are not yet stable and iterative training should continue. When the number of roles used during training does not match the actual number of roles (the number of real speakers involved in the speech signal), the GMMs and HMM may fail to satisfy the requirement even after many training iterations (the probability value corresponding to the role sequence obtained by decoding never exceeds the preset threshold); to avoid a meaningless iteration loop, an upper limit on the number of training rounds under each number of roles can be preset. If this step determines that the number of training rounds under the current number of roles is below the limit, step 102-8 is performed to assign roles to the speech segments so that the iterative training can continue; otherwise the number of roles currently used may not match reality, so go to step 102-10 to determine whether the number of roles needs to be adjusted.
Step 102-8: assign a corresponding role to each speech segment according to the role sequence.
The role sequence was obtained by decoding in step 102-4, and since each role in the sequence corresponds one-to-one with a feature vector in the speech segments, the role corresponding to each feature vector in each speech segment is known. For each speech segment in the speech signal, this step assigns a role to the segment by computing the mode of the roles corresponding to its feature vectors. For example, if a speech segment contains 10 audio frames, i.e. 10 feature vectors, of which 8 correspond to role 1 (s1) and 2 correspond to role 2 (s2), then the mode of the roles corresponding to the segment's feature vectors is role 1 (s1), so role 1 (s1) is designated as the role of that segment.
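The mode computation in this step reduces to a per-segment majority vote, for example:

```python
from collections import Counter

def segment_role(frame_roles):
    """Designate a segment's role as the mode of its per-frame roles."""
    return Counter(frame_roles).most_common(1)[0][0]

# 8 frames decoded as role 1 and 2 frames as role 2 -> the segment gets role 1.
print(segment_role([1, 1, 1, 1, 1, 1, 1, 1, 2, 2]))  # 1
```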
Step 102-9: train a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles, then return to step 102-4 and continue.
On the basis of the roles designated for the voice segments in step 102-8, a GMM for each role, together with the HMM, can be trained. Taking the voice segment division shown in FIG. 4 as an example again, if step 102-8 designates seg1 and seg3 as role 1 (s1) and seg2 as role 2 (s2), the feature vectors contained in seg1 and seg3 can be used to train the GMM of role 1 (gmm1), while the feature vectors contained in seg2 are used to train the GMM of role 2 (gmm2). For the training method of the GMM and HMM, see the description of step 102-3; it is not repeated here.
In a specific implementation this technical solution is usually an iterative training process. To improve training efficiency, this step may train the new GMM and HMM incrementally on the basis of the GMM and HMM obtained in the previous round, i.e., starting from the previously obtained parameters and continuing to adjust them with the current sample data, which increases training speed.
After the above training process is completed and the new GMM and HMM are obtained, execution may return to step 102-4 to decode with the new models and perform the subsequent operations.
Step 102-10: determine whether the current number of roles meets the preset requirement; if so, go to step 102-6; otherwise continue with step 102-11.
Reaching this step usually indicates that the GMM and HMM trained under the current role count have not stabilized and that the number of training iterations has reached or exceeded the preset upper limit. In this case it can be determined whether the current number of roles meets the preset requirement: if it does, the role separation process can stop and execution moves to step 102-6 to assign the role labels; otherwise execution continues with step 102-11 to adjust the number of roles.
Step 102-11: adjust the number of roles, select a corresponding number of voice segments and designate a different role for each of them; then go to step 102-3 and continue.
For example, suppose the current number of roles is 2 and the preset requirement on the role count is "the number of roles equals 4". Step 102-10 determines that the current role count does not yet meet the preset requirement, so this step can be executed to adjust it, for example by adding 1 to the current number of roles, i.e., updating it to 3.
According to the adjusted role count, a corresponding number of voice segments are selected from the voice segments contained in the voice signal, and a different role is designated for each selected segment. For the duration requirement on the selected segments, see the description in step 102-2; it is not repeated here.
Taking the voice segment division shown in FIG. 4 as an example again, if the current role count is increased from 2 to 3 and seg1, seg2 and seg3 all satisfy the duration requirement for selected segments, this step may select these three segments and designate role 1 (s1) for seg1, role 2 (s2) for seg2, and role 3 (s3) for seg3.
After the above adjustment of the role count and selection of voice segments, execution may go to step 102-3 to initially train the GMM and HMM for the adjusted number of roles.
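Purely as an illustrative sketch (not the patent's implementation), the overall iteration of steps 102-3 through 102-11 can be summarized in the following control flow. The model operations are passed in as callables whose names and signatures are assumptions, and assign_segment_role is the helper sketched under step 102-8 above:

```python
def assign_role_labels(segments, train_initial, decode, retrain,
                       init_roles=2, log_prob_threshold=-50.0,
                       max_iters_per_count=10, target_roles=None):
    """Sketch of the iterative label-assignment loop (steps 102-3 .. 102-11).

    Assumed callables:
      train_initial(segments, n_roles)        -> (gmms, hmm)             # step 102-3
      decode(segments, gmms, hmm)             -> (frame_roles, log_prob) # step 102-4
      retrain(segments, seg_roles, gmms, hmm) -> (gmms, hmm)             # step 102-9
    """
    n_roles = init_roles
    gmms, hmm = train_initial(segments, n_roles)
    while True:
        for _ in range(max_iters_per_count):                  # step 102-7: iteration cap
            frame_roles, log_prob = decode(segments, gmms, hmm)
            if log_prob > log_prob_threshold:                 # step 102-5: models stable
                return frame_roles                            # step 102-6: assign labels
            seg_roles = [assign_segment_role(r) for r in frame_roles]  # step 102-8
            gmms, hmm = retrain(segments, seg_roles, gmms, hmm)        # step 102-9
        if target_roles is not None and n_roles >= target_roles:      # step 102-10
            return frame_roles
        n_roles += 1                                          # step 102-11: add a role
        gmms, hmm = train_initial(segments, n_roles)
```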
Step 103: train a DNN model with the feature vectors carrying role labels.
At this point role labels have been assigned to the feature vectors in each voice segment. On this basis, this step trains a DNN model using the labeled feature vectors as samples; the DNN model outputs, for an input feature vector, the probability of each role. For ease of understanding, a brief description of the DNN is given first.
A DNN (Deep Neural Network) usually refers to a neural network comprising one input layer, three or more hidden layers (there may also be 7, 9 or even more hidden layers), and one output layer. Each hidden layer extracts certain features and feeds its output to the next layer as input; by extracting features layer by layer, low-level features are composed into more abstract high-level features, enabling the recognition of objects or categories.
Referring to FIG. 5, a schematic diagram of the DNN topology: the network has n layers in total, each layer has multiple neurons, and adjacent layers are fully connected; each layer has its own activation function f (for example the sigmoid function). The input is the feature vector v; the transfer matrix from layer i to layer i+1 is w_i(i+1); the bias vector of layer i+1 is b_(i+1); the output of layer i is out_i and the input of layer i+1 is in_(i+1). The computation is:
in_(i+1) = out_i * w_i(i+1) + b_(i+1)
out_(i+1) = f(in_(i+1))
From this it can be seen that the parameters of the DNN model include the inter-layer transfer matrices w and the bias vector b of each layer; the main task of training a DNN model is to determine these parameters. In practice the BP (back-propagation) algorithm is usually used: training is a supervised learning process in which a labeled feature vector is fed in and propagated forward layer by layer; upon reaching the output layer the error is propagated back layer by layer, and the parameters of each layer are adjusted by gradient descent so that the network's actual output steadily approaches the desired output. For a DNN with thousands of neurons per layer the number of parameters may be in the millions or more, and the DNN model obtained by this training process usually has very strong feature extraction and recognition capabilities.
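As an illustration, the layer-by-layer forward computation above might be sketched in NumPy as follows, taking sigmoid as the activation f; the layer sizes and random initialization are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_forward(v, weights, biases):
    """Forward pass described in the text:
    in_(i+1) = out_i * w_i(i+1) + b_(i+1);  out_(i+1) = f(in_(i+1))."""
    out = v
    for w, b in zip(weights, biases):
        out = sigmoid(out @ w + b)
    return out

# Tiny example: a 13-dimensional feature vector through two layers (sizes assumed)
rng = np.random.default_rng(0)
v = rng.standard_normal(13)
weights = [rng.standard_normal((13, 8)), rng.standard_normal((8, 4))]
biases = [np.zeros(8), np.zeros(4)]
print(dnn_forward(v, weights, biases).shape)  # (4,)
```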
In this embodiment the DNN model outputs, for an input feature vector, the probability of each role, so the output layer of the DNN model may use a classifier (for example softmax) as its activation function. After the role labels have been pre-assigned in step 102, if the number of roles involved in the labels is n, the output layer of the DNN model may comprise n nodes corresponding to the n roles; for an input feature vector, each node outputs the probability that the feature vector belongs to the corresponding role.
This step performs supervised training of the constructed DNN model, using the feature vectors carrying role labels as samples. In a specific implementation, the BP algorithm could be used directly; however, considering that training with the BP algorithm alone may get trapped in a local minimum, so that the resulting model cannot satisfy the application's requirements, this embodiment trains the DNN model by combining pre-training with the BP algorithm.
Pre-training usually employs an unsupervised greedy layer-wise training algorithm: a network with one hidden layer is first trained in an unsupervised manner; the trained parameters are retained and the number of layers is increased by one; a network with two hidden layers is then trained, and so on, up to the network with the maximum number of hidden layers. After this layer-by-layer training, the parameter values learned by the unsupervised process are used as initial values, and supervised training with the conventional BP algorithm then yields the final DNN model.
Because the initial distribution obtained by pre-training is closer to the final convergence values than the random initial parameters used by a pure BP algorithm, the subsequent supervised training effectively starts from a good point; the trained DNN model therefore usually does not fall into a local minimum and achieves a higher recognition rate.
Step 104: determine the role sequence corresponding to the feature vector sequence according to the DNN model and the HMM trained with feature vectors, and output the role separation result.
Since the DNN model outputs, for an input feature vector, the probability of each role, and the prior probability of each role can be obtained from the distribution of role labels over the feature vector sequence, while the prior probability of each feature vector is usually fixed, Bayes' theorem allows the probability of each role emitting the corresponding feature vector to be derived from the DNN output and these priors. In other words, the DNN model trained in step 103 can be used to determine the emission probabilities of the HMM states.
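A hedged sketch of this Bayes conversion (the function name and array shapes are assumptions): because p(x) is the same for every role, the scaled emission likelihood can be taken as log p(x|s) = log p(s|x) - log p(s), up to a constant:

```python
import numpy as np

def emission_log_likelihoods(dnn_posteriors, role_priors):
    """Convert DNN posteriors p(role | x) into scaled HMM emission
    likelihoods p(x | role) via Bayes' theorem, dropping the constant p(x)."""
    # dnn_posteriors: (num_frames, num_roles), each row a softmax output
    # role_priors:    (num_roles,), estimated from the role-label distribution
    return np.log(dnn_posteriors) - np.log(role_priors)
```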
The HMM may be one trained with the feature vector sequence on the basis of using the above DNN model to determine the HMM emission probabilities. Considering that the HMM used when assigning role labels to the feature vectors in step 102 already describes the transition relationships between roles in an essentially stable way, no additional training is needed; this embodiment therefore uses that HMM directly and replaces the GMM with the trained DNN model, i.e., the emission probability of each HMM state is determined by the DNN model.
In this embodiment, step 102-1 performed the segmentation into voice segments; this step determines the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when pre-assigning the role labels.
Determining a role sequence from a feature vector sequence is the decoding problem described above: a decoding operation can be performed according to the DNN model and the HMM to obtain the role sequence whose probability of emitting the feature vector sequence ranks highest (for example, the sequence with the largest probability value), and that role sequence is taken as the role sequence corresponding to the feature vector sequence. For details see the description of step 102-4; it is not repeated here.
After the role sequence corresponding to the feature vector sequence of each voice segment has been obtained by decoding, the corresponding role separation result can be output. Since each role in the role sequence corresponds one-to-one with a feature vector, and the audio frame corresponding to each feature vector has its own start and end time, this step can output, for each role, the start and end time information of the audio frames to which its feature vectors belong.
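As a small illustrative sketch (the frame length and frame shift values are assumptions, not taken from the patent), the frame-level role sequence could be turned into per-role start/end times as follows:

```python
def roles_to_time_spans(frame_roles, frame_len=0.025, frame_shift=0.010):
    """Collapse a frame-level role sequence into (role, start, end) spans,
    using each frame's start time (index * shift) and duration."""
    spans, start = [], 0
    for i in range(1, len(frame_roles) + 1):
        if i == len(frame_roles) or frame_roles[i] != frame_roles[start]:
            begin = start * frame_shift
            end = (i - 1) * frame_shift + frame_len
            spans.append((frame_roles[start], begin, end))
            start = i
    return spans

print(roles_to_time_spans(["s1"] * 3 + ["s2"] * 2))
# approximately: [('s1', 0.0, 0.045), ('s2', 0.03, 0.065)]
```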
This completes, through steps 101 to 104, the detailed description of a specific implementation of the voice-based role separation method provided by this application. It should be noted that in step 102 this embodiment pre-assigns role labels to the feature vectors in a top-down manner, gradually increasing the number of roles. In other implementations a bottom-up manner that gradually reduces the number of roles may also be used: initially, each voice segment obtained by segmentation may be designated as a different role, and a GMM for each role and the HMM are trained. If the probability values obtained by decoding with the iteratively trained GMMs and HMM never exceed the preset threshold, then, when adjusting the role count, the similarity between the GMMs of the roles can be evaluated (for example by computing the KL divergence), the voice segments corresponding to GMMs whose similarity meets a preset requirement are merged, and the role count is reduced accordingly. This process is repeated iteratively until the probability value obtained by decoding with the HMM exceeds the preset threshold or the role count meets the preset requirement, at which point the iteration stops and role labels are assigned to the feature vectors in each voice segment according to the decoded role sequence.
In summary, because the voice-based role separation method provided by this application models roles with a DNN model that has strong feature extraction capability, it characterizes roles more finely and accurately than a conventional GMM and can therefore obtain more accurate role separation results. The technical solution of this application can be applied not only to scenarios that separate the roles in conversational speech such as call-center or conference recordings, but also to any other scenario that requires separating the roles in a voice signal: as long as the voice signal involves two or more roles, the technical solution of this application can be adopted, with the corresponding beneficial effects.
The above embodiment provides a voice-based role separation method; correspondingly, this application also provides a voice-based role separation device. Please refer to FIG. 6, a schematic diagram of an embodiment of the voice-based role separation device of this application. Since the device embodiment is essentially similar to the method embodiment, it is described relatively simply; for relevant details see the description of the method embodiment. The device embodiment described below is merely illustrative.
A voice-based role separation device of this embodiment comprises: a feature extraction unit 601, configured to extract feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assignment unit 602, configured to assign role labels to the feature vectors; a DNN model training unit 603, configured to train a DNN model with the feature vectors carrying role labels, wherein the DNN model outputs, for an input feature vector, the probability of each role; and a role determination unit 604, configured to determine the role sequence corresponding to the feature vector sequence according to the DNN model and an HMM trained with feature vectors and to output the role separation result, wherein the HMM describes the transition relationships between roles.
Optionally, the device further comprises:
a voice segment segmentation unit, configured to, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, segment the voice signal into voice segments by identifying and discarding audio frames that contain no voice content;
the label assignment unit being specifically configured to assign role labels to the feature vectors in each voice segment;
and the role determination unit being specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM trained with feature vectors, and to output the role separation result.
Optionally, the label assignment unit is specifically configured to pre-assign role labels to the feature vectors in each voice segment by establishing GMMs and an HMM, wherein the GMM of each role outputs, for an input feature vector, the probability that the feature vector corresponds to that role;
and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in each voice segment.
Optionally, the label assignment unit comprises:
an initial role designation subunit, configured to select a number of voice segments corresponding to a preset initial role count and designate a different role for each of them;
an initial model training subunit, configured to train a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles;
a decoding subunit, configured to decode according to the trained GMMs and HMM and obtain the role sequence whose probability of emitting the feature vector sequences contained in the voice segments ranks highest;
a probability judgment subunit, configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
a label assignment subunit, configured to, when the output of the probability judgment subunit is yes, assign role labels to the feature vectors in each voice segment according to the role sequence.
Optionally, the label assignment unit further comprises:
a per-segment role designation subunit, configured to, when the output of the probability judgment subunit is no, designate a corresponding role for each voice segment according to the role sequence;
a model update training subunit, configured to train a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles, and to trigger the decoding subunit.
Optionally, the per-segment role designation subunit is specifically configured to, for each voice segment, designate the mode of the roles corresponding to its feature vectors as the role of that segment.
Optionally, the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous training round.
Optionally, the label assignment unit further comprises:
a training count judgment subunit, configured to, when the output of the probability judgment subunit is no, determine whether the number of times the GMMs and HMM have been trained under the current role count is less than a preset upper limit on training iterations, and, when the result is yes, trigger the per-segment role designation subunit;
a role count adjustment subunit, configured to, when the output of the training count judgment subunit is no, adjust the number of roles, select a corresponding number of voice segments, designate a different role for each of them, and trigger the initial model training subunit.
Optionally, the label assignment unit further comprises:
a role count judgment subunit, configured to, when the output of the training count judgment subunit is no, determine whether the current number of roles meets a preset requirement; if it does, trigger the label assignment subunit, otherwise trigger the role count adjustment subunit.
Optionally, the feature extraction unit comprises:
a framing subunit, configured to divide the voice signal into frames according to a preset frame length to obtain multiple audio frames;
a feature extraction execution subunit, configured to extract the feature vector of each audio frame to obtain the feature vector sequence.
Optionally, the feature extraction execution subunit is specifically configured to extract the MFCC, PLP, or LPC features of each audio frame to obtain the feature vector sequence.
Optionally, the voice segment segmentation unit is specifically configured to segment the voice signal into voice segments by using VAD technology to identify and discard the audio frames that contain no voice content.
Optionally, the device further comprises:
a VAD smoothing unit, configured to, after the voice segment segmentation unit has segmented the voice segments using VAD technology, merge voice segments whose duration is less than a preset threshold with adjacent voice segments.
Optionally, the DNN model training unit is specifically configured to train the DNN model with a back-propagation algorithm.
Optionally, the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence whose probability of emitting the feature vector sequence ranks highest, and take that role sequence as the role sequence corresponding to the feature vector sequence.
Optionally, the role determination unit outputs the role separation result in the following manner: according to the role sequence corresponding to the feature vector sequence, it outputs, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
Optionally, the initial role designation subunit or the role count adjustment subunit selects the corresponding number of voice segments specifically by selecting that number of voice segments whose duration meets a preset requirement.
Although this application is disclosed above by way of preferred embodiments, they are not intended to limit it; any person skilled in the art can make possible variations and modifications without departing from the spirit and scope of this application, so the scope of protection of this application shall be the scope defined by its claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent storage in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Claims (35)

  1. A voice-based role separation method, comprising:
    extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence;
    assigning role labels to the feature vectors;
    training a deep neural network (DNN) model with the feature vectors carrying role labels;
    determining a role sequence corresponding to the feature vector sequence according to the DNN model and a hidden Markov model (HMM) trained with feature vectors, and outputting a role separation result;
    wherein the DNN model outputs, for an input feature vector, the probability of each role, and the HMM describes the transition relationships between roles.
  2. The voice-based role separation method according to claim 1, wherein, after the step of extracting feature vectors frame by frame from the voice signal and before the step of assigning role labels to the feature vectors, the following operation is performed: segmenting the voice signal into voice segments by identifying and discarding audio frames that contain no voice content;
    assigning role labels to the feature vectors comprises: assigning role labels to the feature vectors in each voice segment; and determining the role sequence corresponding to the feature vector sequence comprises: determining the role sequence corresponding to the feature vector sequence contained in each voice segment.
  3. The voice-based role separation method according to claim 2, wherein assigning role labels to the feature vectors in each voice segment comprises: assigning role labels to the feature vectors in each voice segment by establishing Gaussian mixture models (GMMs) and an HMM, wherein the GMM of each role outputs, for an input feature vector, the probability that the feature vector corresponds to that role;
    determining the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM trained with feature vectors comprises: determining the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in each voice segment.
  4. The voice-based role separation method according to claim 3, wherein assigning role labels to the feature vectors in each voice segment by establishing Gaussian mixture models (GMMs) and an HMM comprises:
    selecting a number of voice segments corresponding to a preset initial role count, and designating a different role for each of them;
    training a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles;
    decoding according to the trained GMMs and HMM to obtain the role sequence whose probability of emitting the feature vector sequences contained in the voice segments ranks highest;
    determining whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, assigning role labels to the feature vectors in each voice segment according to the role sequence.
  5. The voice-based role separation method according to claim 4, wherein, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is no, the following operations are performed:
    designating a corresponding role for each voice segment according to the role sequence;
    training a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles;
    returning to the step of decoding according to the trained GMMs and HMM.
  6. The voice-based role separation method according to claim 5, wherein designating a corresponding role for each voice segment according to the role sequence comprises:
    for each voice segment, designating the mode of the roles corresponding to its feature vectors as the role of that segment.
  7. The voice-based role separation method according to claim 5, wherein training a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles comprises: training the GMMs and the HMM incrementally on the basis of the models obtained in the previous training round.
  8. The voice-based role separation method according to claim 5, wherein, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is no, the following operations are performed:
    determining whether the number of times the GMMs and HMM have been trained under the current role count is less than a preset upper limit on training iterations;
    if so, performing the step of designating a corresponding role for each voice segment according to the role sequence;
    if not, performing the following operations:
    adjusting the number of roles, selecting a corresponding number of voice segments and designating a different role for each of them;
    and returning to the step of training a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles.
  9. The voice-based role separation method according to claim 8, wherein, when the result of determining whether the number of times the GMMs and HMM have been trained under the current role count is less than the preset upper limit on training iterations is no, the following operation is performed:
    determining whether the current number of roles meets a preset requirement; if so, going to the step of assigning role labels to the feature vectors in each voice segment according to the role sequence; if not, performing the step of adjusting the number of roles.
  10. The voice-based role separation method according to claim 8, wherein the preset initial role count is 2, and adjusting the number of roles comprises: adding 1 to the current number of roles.
  11. The voice-based role separation method according to claim 1, wherein extracting feature vectors frame by frame from the voice signal to obtain a feature vector sequence comprises:
    dividing the voice signal into frames according to a preset frame length to obtain multiple audio frames;
    extracting the feature vector of each audio frame to obtain the feature vector sequence.
  12. The voice-based role separation method according to claim 11, wherein extracting the feature vector of each audio frame comprises: extracting MFCC features, PLP features, or LPC features.
  13. The voice-based role separation method according to claim 2, wherein identifying and discarding the audio frames that contain no voice content comprises: identifying the audio frames that contain no voice content using VAD technology and performing the corresponding discarding operation.
  14. The voice-based role separation method according to claim 13, wherein, after the identifying and discarding operations are performed using VAD technology and the voice signal is segmented into voice segments, the following VAD smoothing operation is performed:
    merging voice segments whose duration is less than a preset threshold with adjacent voice segments.
  15. The voice-based role separation method according to claim 1, wherein training the deep neural network DNN model with the feature vectors carrying role labels comprises: training the DNN model with a back-propagation algorithm.
  16. The voice-based role separation method according to claim 1, wherein determining the role sequence corresponding to the feature vector sequence according to the DNN model and the hidden Markov model HMM trained with feature vectors comprises: performing a decoding operation according to the DNN model and the HMM, obtaining the role sequence whose probability of emitting the feature vector sequence ranks highest, and taking that role sequence as the role sequence corresponding to the feature vector sequence.
  17. The voice-based role separation method according to claim 1, wherein outputting the role separation result comprises: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
  18. The voice-based role separation method according to claim 4 or 8, wherein selecting the corresponding number of voice segments comprises: selecting that number of voice segments whose duration meets a preset requirement.
  19. A voice-based role separation device, comprising:
    a feature extraction unit, configured to extract feature vectors frame by frame from a voice signal to obtain a feature vector sequence;
    a label assignment unit, configured to assign role labels to the feature vectors;
    a DNN model training unit, configured to train a DNN model with the feature vectors carrying role labels, wherein the DNN model outputs, for an input feature vector, the probability of each role;
    a role determination unit, configured to determine the role sequence corresponding to the feature vector sequence according to the DNN model and an HMM trained with feature vectors and to output a role separation result, wherein the HMM describes the transition relationships between roles.
  20. The voice-based role separation device according to claim 19, further comprising:
    a voice segment segmentation unit, configured to, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, segment the voice signal into voice segments by identifying and discarding audio frames that contain no voice content;
    wherein the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment;
    and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM trained with feature vectors, and to output the role separation result.
  21. The voice-based role separation device according to claim 20, wherein the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment by establishing GMMs and an HMM, wherein the GMM of each role outputs, for an input feature vector, the probability that the feature vector corresponds to that role;
    and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in each voice segment.
  22. The voice-based role separation device according to claim 21, wherein the label assignment unit comprises:
    an initial role designation subunit, configured to select a number of voice segments corresponding to a preset initial role count and designate a different role for each of them;
    an initial model training subunit, configured to train a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles;
    a decoding subunit, configured to decode according to the trained GMMs and HMM and obtain the role sequence whose probability of emitting the feature vector sequences contained in the voice segments ranks highest;
    a probability judgment subunit, configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
    a label assignment subunit, configured to, when the output of the probability judgment subunit is yes, assign role labels to the feature vectors in each voice segment according to the role sequence.
  23. The voice-based role separation device according to claim 22, wherein the label assignment unit further comprises:
    a per-segment role designation subunit, configured to, when the output of the probability judgment subunit is no, designate a corresponding role for each voice segment according to the role sequence;
    a model update training subunit, configured to train a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles, and to trigger the decoding subunit.
  24. The voice-based role separation device according to claim 23, wherein the per-segment role designation subunit is specifically configured to, for each voice segment, designate the mode of the roles corresponding to its feature vectors as the role of that segment.
  25. The voice-based role separation device according to claim 23, wherein the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous training round.
  26. The voice-based role separation device according to claim 23, wherein the label assignment unit further comprises:
    a training count judgment subunit, configured to, when the output of the probability judgment subunit is no, determine whether the number of times the GMMs and HMM have been trained under the current role count is less than a preset upper limit on training iterations, and, when the result is yes, trigger the per-segment role designation subunit;
    a role count adjustment subunit, configured to, when the output of the training count judgment subunit is no, adjust the number of roles, select a corresponding number of voice segments, designate a different role for each of them, and trigger the initial model training subunit.
  27. The voice-based role separation device according to claim 26, wherein the label assignment unit further comprises:
    a role count judgment subunit, configured to, when the output of the training count judgment subunit is no, determine whether the current number of roles meets a preset requirement; if it does, trigger the label assignment subunit, otherwise trigger the role count adjustment subunit.
  28. The voice-based role separation device according to claim 19, wherein the feature extraction unit comprises:
    a framing subunit, configured to divide the voice signal into frames according to a preset frame length to obtain multiple audio frames;
    a feature extraction execution subunit, configured to extract the feature vector of each audio frame to obtain the feature vector sequence.
  29. The voice-based role separation device according to claim 28, wherein the feature extraction execution subunit is specifically configured to extract the MFCC, PLP, or LPC features of each audio frame to obtain the feature vector sequence.
  30. The voice-based role separation device according to claim 20, wherein the voice segment segmentation unit is specifically configured to segment the voice signal into voice segments by using VAD technology to identify and discard the audio frames that contain no voice content.
  31. The voice-based role separation device according to claim 30, further comprising:
    a VAD smoothing unit, configured to, after the voice segment segmentation unit has segmented the voice segments using VAD technology, merge voice segments whose duration is less than a preset threshold with adjacent voice segments.
  32. The voice-based role separation device according to claim 19, wherein the DNN model training unit is specifically configured to train the DNN model with a back-propagation algorithm.
  33. The voice-based role separation device according to claim 19, wherein the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence whose probability of emitting the feature vector sequence ranks highest, and take that role sequence as the role sequence corresponding to the feature vector sequence.
  34. The voice-based role separation device according to claim 19, wherein the role determination unit outputs the role separation result in the following manner: according to the role sequence corresponding to the feature vector sequence, it outputs, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
  35. The voice-based role separation device according to claim 22 or 26, wherein the initial role designation subunit or the role count adjustment subunit selects the corresponding number of voice segments specifically by selecting that number of voice segments whose duration meets a preset requirement.
PCT/CN2016/103490 2015-11-05 2016-10-27 Voice-based role separation method and device WO2017076211A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510744743.4A CN106683661B (en) 2015-11-05 2015-11-05 Role separation method and device based on voice
CN201510744743.4 2015-11-05

Publications (1)

Publication Number Publication Date
WO2017076211A1 true WO2017076211A1 (en) 2017-05-11

Family

ID=58661656

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103490 WO2017076211A1 (en) 2015-11-05 2016-10-27 Voice-based role separation method and device

Country Status (2)

Country Link
CN (1) CN106683661B (en)
WO (1) WO2017076211A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545898A (en) * 2017-08-07 2018-01-05 清华大学 A kind of processing method and processing device for distinguishing speaker's voice
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 A kind of voice dividing method and system based on shot and long term memory models
US20190207946A1 (en) * 2016-12-20 2019-07-04 Google Inc. Conditional provision of access by interactive assistant modules
WO2019196648A1 (en) * 2018-04-10 2019-10-17 Huawei Technologies Co., Ltd. A method and device for processing whispered speech
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
US10685187B2 (en) 2017-05-15 2020-06-16 Google Llc Providing access to user-controlled resources by automated assistants
WO2020222922A1 (en) * 2019-04-29 2020-11-05 Microsoft Technology Licensing, Llc System and method for speaker role determination and scrubbing identifying information
US11087023B2 (en) 2018-08-07 2021-08-10 Google Llc Threshold-based assembly of automated assistant responses
US11436417B2 (en) 2017-05-15 2022-09-06 Google Llc Providing access to user-controlled resources by automated assistants

Families Citing this family (20)

Publication number Priority date Publication date Assignee Title
CN108346436B (en) 2017-08-22 2020-06-23 腾讯科技(深圳)有限公司 Voice emotion detection method and device, computer equipment and storage medium
CN107885723B (en) * 2017-11-03 2021-04-09 广州杰赛科技股份有限公司 Conversation role distinguishing method and system
CN108109619B (en) * 2017-11-15 2021-07-06 中国科学院自动化研究所 Auditory selection method and device based on memory and attention model
CN108074576B (en) * 2017-12-14 2022-04-08 讯飞智元信息科技有限公司 Speaker role separation method and system under interrogation scene
CN107993665B (en) * 2017-12-14 2021-04-30 科大讯飞股份有限公司 Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN108564952B (en) * 2018-03-12 2019-06-07 新华智云科技有限公司 The method and apparatus of speech roles separation
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN108766440B (en) * 2018-05-28 2020-01-14 平安科技(深圳)有限公司 Speaker separation model training method, two-speaker separation method and related equipment
CN108806707B (en) 2018-06-11 2020-05-12 百度在线网络技术(北京)有限公司 Voice processing method, device, equipment and storage medium
CN109065076B (en) * 2018-09-05 2020-11-27 深圳追一科技有限公司 Audio label setting method, device, equipment and storage medium
CN109344195B (en) * 2018-10-25 2021-09-21 电子科技大学 HMM model-based pipeline security event recognition and knowledge mining method
CN109256128A (en) * 2018-11-19 2019-01-22 广东小天才科技有限公司 A kind of method and system determining user role automatically according to user's corpus
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110444223B (en) * 2019-06-26 2023-05-23 平安科技(深圳)有限公司 Speaker separation method and device based on cyclic neural network and acoustic characteristics
CN110337030B (en) * 2019-08-08 2020-08-11 腾讯科技(深圳)有限公司 Video playing method, device, terminal and computer readable storage medium
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN112861509B (en) * 2021-02-08 2023-05-12 青牛智胜(深圳)科技有限公司 Role analysis method and system based on multi-head attention mechanism
CN113413613A (en) * 2021-06-17 2021-09-21 网易(杭州)网络有限公司 Method and device for optimizing voice chat in game, electronic equipment and medium
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
CN1366295A (en) * 2000-07-05 2002-08-28 松下电器产业株式会社 Speaker's inspection and speaker's identification system and method based on prior knowledge
CN101814159A (en) * 2009-02-24 2010-08-25 余华 Speaker verification method based on combination of auto-associative neural network and Gaussian mixture model-universal background model
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound identification method on basis of rapid sparse decomposition and deep learning
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN104751227A (en) * 2013-12-31 2015-07-01 安徽科大讯飞信息科技股份有限公司 Method and system for constructing deep neural network

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
CN101650944A (en) * 2009-09-17 2010-02-17 浙江工业大学 Method for distinguishing speakers based on protective kernel Fisher distinguishing method
US9257121B2 (en) * 2010-12-10 2016-02-09 Panasonic Intellectual Property Corporation Of America Device and method for pass-phrase modeling for speaker verification, and verification system
CN102129860B (en) * 2011-04-07 2012-07-04 南京邮电大学 Text-dependent speaker recognition method based on an infinite-state hidden Markov model
US9489950B2 (en) * 2012-05-31 2016-11-08 Agency For Science, Technology And Research Method and system for dual scoring for text-dependent speaker verification
US9401148B2 (en) * 2013-11-04 2016-07-26 Google Inc. Speaker verification using neural networks
US9336781B2 (en) * 2013-10-17 2016-05-10 Sri International Content-aware speaker recognition
CN103700370B (en) * 2013-12-04 2016-08-17 北京中科模识科技有限公司 Radio and television speech recognition method and system
CN103714812A (en) * 2013-12-23 2014-04-09 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
CN104751842B (en) * 2013-12-31 2019-11-15 科大讯飞股份有限公司 The optimization method and system of deep neural network
CN104064189A (en) * 2014-06-26 2014-09-24 厦门天聪智能软件有限公司 Voiceprint dynamic password modeling and verification method
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Live-person identity verification method based on sound and image features
CN104575504A (en) * 2014-12-24 2015-04-29 上海师范大学 Method for personalized television voice wake-up using voiceprint and speech recognition
CN104575490B (en) * 2014-12-30 2017-11-07 苏州驰声信息科技有限公司 Spoken pronunciation evaluation method based on a deep neural network posterior probability algorithm
CN104835497A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Voiceprint card swiping system and method based on dynamic password
CN104934028B (en) * 2015-06-17 2017-11-17 百度在线网络技术(北京)有限公司 Training method and device for a deep neural network model for speech synthesis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1366295A (en) * 2000-07-05 2002-08-28 松下电器产业株式会社 Speaker verification and identification system and method based on prior knowledge
CN101814159A (en) * 2009-02-24 2010-08-25 余华 Speaker verification method based on combination of auto-associative neural network and Gaussian mixture model-universal background model
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound recognition method based on fast sparse decomposition and deep learning
CN104751227A (en) * 2013-12-31 2015-07-01 安徽科大讯飞信息科技股份有限公司 Method and system for constructing deep neural network
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on deep learning
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIAO, YAJIE et al.: "Improving Low-Resource CD-DNN-HMM Using Dropout and Multilingual DNN Training", Proceedings of INTERSPEECH, 29 August 2013 (2013-08-29), pages 2237-2241, XP055380510 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190207946A1 (en) * 2016-12-20 2019-07-04 Google Inc. Conditional provision of access by interactive assistant modules
US11436417B2 (en) 2017-05-15 2022-09-06 Google Llc Providing access to user-controlled resources by automated assistants
US10685187B2 (en) 2017-05-15 2020-06-16 Google Llc Providing access to user-controlled resources by automated assistants
CN107545898B (en) * 2017-08-07 2020-07-14 清华大学 Processing method and device for distinguishing speakers' voices
CN107545898A (en) * 2017-08-07 2018-01-05 清华大学 Processing method and device for distinguishing speakers' voices
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 Voice segmentation method and system based on long short-term memory models
US10832660B2 (en) 2018-04-10 2020-11-10 Futurewei Technologies, Inc. Method and device for processing whispered speech
WO2019196648A1 (en) * 2018-04-10 2019-10-17 Huawei Technologies Co., Ltd. A method and device for processing whispered speech
US11314890B2 (en) 2018-08-07 2022-04-26 Google Llc Threshold-based assembly of remote automated assistant responses
US11087023B2 (en) 2018-08-07 2021-08-10 Google Llc Threshold-based assembly of automated assistant responses
US20220083687A1 (en) 2018-08-07 2022-03-17 Google Llc Threshold-based assembly of remote automated assistant responses
US11455418B2 (en) 2018-08-07 2022-09-27 Google Llc Assembling and evaluating automated assistant responses for privacy concerns
US11790114B2 (en) 2018-08-07 2023-10-17 Google Llc Threshold-based assembly of automated assistant responses
US11822695B2 (en) 2018-08-07 2023-11-21 Google Llc Assembling and evaluating automated assistant responses for privacy concerns
US11966494B2 (en) 2018-08-07 2024-04-23 Google Llc Threshold-based assembly of remote automated assistant responses
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
US11062706B2 (en) 2019-04-29 2021-07-13 Microsoft Technology Licensing, Llc System and method for speaker role determination and scrubbing identifying information
WO2020222922A1 (en) * 2019-04-29 2020-11-05 Microsoft Technology Licensing, Llc System and method for speaker role determination and scrubbing identifying information

Also Published As

Publication number Publication date
CN106683661A (en) 2017-05-17
CN106683661B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
WO2017076211A1 (en) Voice-based role separation method and device
US10902843B2 (en) Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
US10249292B2 (en) Using long short-term memory recurrent neural network for speaker diarization segmentation
Tong et al. A comparative study of robustness of deep learning approaches for VAD
US10074363B2 (en) Method and apparatus for keyword speech recognition
WO2019037700A1 (en) Speech emotion detection method and apparatus, computer device, and storage medium
US10872599B1 (en) Wakeword training
EP3469582A1 (en) Neural network-based voiceprint information extraction method and apparatus
JP6732703B2 (en) Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
WO2018192186A1 (en) Speech recognition method and apparatus
Liu et al. Graph-based semi-supervised acoustic modeling in DNN-based speech recognition
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
Rosdi et al. Isolated Malay speech recognition using Hidden Markov Models
JP5704071B2 (en) Audio data analysis apparatus, audio data analysis method, and audio data analysis program
Li et al. Semi-supervised ensemble DNN acoustic model training
Liu et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion
US11557292B1 (en) Speech command verification
CN113823265A (en) Voice recognition method and device and computer equipment
JP6594251B2 (en) Acoustic model learning device, speech synthesizer, method and program thereof
KR100776729B1 (en) Speaker-independent variable-word keyword spotting system including garbage modeling unit using decision tree-based state clustering and method thereof
Walter et al. An evaluation of unsupervised acoustic model training for a dysarthric speech interface
Bai et al. Phone Classification Using a Non-Linear Manifold with Broad Phone Class Dependent DNNs.
Banjara et al. Nepali speech recognition using CNN and sequence models
TWI725111B (en) Voice-based role separation method and device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 16861479
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry into the European phase
Ref document number: 16861479
Country of ref document: EP
Kind code of ref document: A1