WO2017076211A1 - Voice-based role separation method and device - Google Patents

Voice-based role separation method and device Download PDF

Info

Publication number
WO2017076211A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
role
voice
character
sequence
Prior art date
Application number
PCT/CN2016/103490
Other languages
French (fr)
Chinese (zh)
Inventor
Li Xiaohui
Li Hongyan
Original Assignee
Alibaba Group Holding Limited
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2017076211A1 publication Critical patent/WO2017076211A1/en

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L17/00 Speaker identification or verification
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of speech recognition, and in particular to a speech-based role separation method.
  • the application also relates to a speech-based role separation device.
  • Speech is the most natural way of communication for human beings.
  • Speech recognition technology is a technology that allows a machine to transform a speech signal into a corresponding text or command through a process of recognition and understanding.
  • Speech recognition is an interdisciplinary subject, including: signal processing, pattern recognition, probability theory and information theory, vocal mechanism and auditory mechanism, artificial intelligence, and so on.
  • GMM Gaussian Mixture Model
  • HMM Hidden Markov Model
  • The embodiments of the present application provide a voice-based role separation method and apparatus, to solve the problem that the accuracy of role separation in existing solutions based on the GMM and the HMM is relatively low.
  • the present application provides a voice-based role separation method, including:
  • the DNN model is configured to output a probability corresponding to each role according to the input feature vector, and the HMM is used to describe a jump relationship between the characters.
  • After the step of extracting the feature vectors frame by frame from the voice signal and before the step of assigning the character tags to the feature vectors, the following operation is performed: splitting the voice signal into voice segments by identifying and culling the audio frames that do not include voice content;
  • the assigning a role tag to the feature vector includes: assigning a character tag to the feature vector in each voice segment; and determining the character sequence corresponding to the feature vector sequence includes: determining a character sequence corresponding to the feature vector sequence included in each voice segment.
  • The assigning a role tag to the feature vector in each voice segment includes: assigning a role tag to the feature vectors in each voice segment by establishing a Gaussian mixture model GMM and an HMM, wherein a GMM is established for each role and is used to output, according to the input feature vector, a probability that the feature vector corresponds to that role;
  • The determining, according to the DNN model and the HMM obtained by training with the feature vectors, the role sequence corresponding to the feature vector sequence included in each voice segment includes: determining, according to the DNN model and the HMM used when assigning role tags to the feature vectors in each voice segment, the character sequence corresponding to the feature vector sequence included in each of the speech segments.
  • the role label is assigned to the feature vector in each voice segment, including:
  • the assigning a corresponding role to each voice segment according to the sequence of roles includes:
  • for each voice segment, the mode of the characters corresponding to its feature vectors is designated as the character of that speech segment.
  • The training the GMM and the HMM for each role according to the feature vectors in each voice segment and the corresponding roles includes: training the GMM and the HMM in an incremental manner on the basis of the models obtained from the last training.
  • Determining whether the current number of roles meets the preset requirement; if yes, performing the step of assigning role tags to the feature vectors in each voice segment according to the sequence of roles, and if not, performing the step of adjusting the number of roles.
  • the preset initial number of roles is 2, and the adjusting the number of roles includes: adding 1 to the current number of roles.
  • the extracting the feature vector from the voice signal frame by frame, and obtaining the feature vector sequence includes:
  • a feature vector of each audio frame is extracted to obtain the feature vector sequence.
  • the extracting the feature vector of each audio frame comprises: extracting an MFCC feature, a PLP feature, or an LPC feature.
  • the identifying and culling the audio frame that does not include the voice content comprises: using the VAD technology to identify the audio frame that does not include the voice content, and performing a corresponding culling operation.
  • After performing the identifying and culling operation by using the VAD technology and dividing the voice signal into voice segments, the following VAD smoothing operation is performed:
  • the speech segment whose duration is less than the preset threshold is merged with the adjacent speech segment.
  • the training the depth neural network DNN model by using the feature vector with the character tag comprises: training the DNN model by using a back propagation algorithm.
  • The determining, according to the DNN model and the Hidden Markov Model HMM obtained by training with the feature vectors, the character sequence corresponding to the feature vector sequence includes: performing a decoding operation according to the DNN model and the HMM, acquiring the character sequence whose probability value of outputting the feature vector sequence ranks first, and taking that character sequence as the character sequence corresponding to the feature vector sequence.
  • The outputting the role separation result includes: outputting, for each role and according to the role sequence corresponding to the feature vector sequence, the start and end time information of the audio frames to which the corresponding feature vectors belong.
  • the selecting the corresponding number of voice segments comprises: selecting the number of voice segments that meet the preset requirements.
  • the present application further provides a voice-based role separation device, including:
  • a feature extraction unit configured to extract a feature vector from a voice signal frame by frame to obtain a feature vector sequence
  • a label allocation unit configured to assign a character label to the feature vector
  • a DNN model training unit for training a DNN model with a feature vector having a character tag, wherein the DNN model is configured to output a probability corresponding to each character according to the input feature vector;
  • the role determining unit is configured to determine a character sequence corresponding to the feature vector sequence and output a role separation result according to the DNN model and the HMM obtained by using the feature vector training, wherein the HMM is used to describe a jump relationship between the characters.
  • the device further includes:
  • a voice segment segmentation unit configured to: after the feature extraction unit extracts the feature vectors and before the label allocation unit is triggered to work, identify and cull the audio frames that do not include voice content, and divide the voice signal into voice segments;
  • the label distribution unit is specifically configured to allocate a role label for a feature vector in each voice segment
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each voice segment according to the DNN model and the HMM obtained by using the feature vector training, and output a role separation result.
  • the label allocation unit is specifically configured to allocate a role label for the feature vectors in each voice segment by establishing a GMM and an HMM, where a GMM is established for each role and is configured to output, according to the input feature vector, the probability that the feature vector corresponds to that role;
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each of the voice segments according to the DNN model and an HMM used to assign a character tag to a feature vector in each voice segment.
  • the label distribution unit includes:
  • the initial role designating sub-unit is configured to select a corresponding number of voice segments according to a preset initial number of roles, and specify different roles for each voice segment;
  • An initial model training subunit for training a GMM and an HMM for each character by using a feature vector in a voice segment of a specified character
  • a decoding subunit configured to perform decoding according to the GMM and the HMM obtained by the training, and obtain a sequence of roles in which the probability values of the feature vector sequences included in each speech segment are outputted;
  • a probability judging subunit configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
  • a label allocation subunit configured to allocate a role label for the feature vector in each voice segment according to the role sequence when the output of the probability determination subunit is YES.
  • the label distribution unit further includes:
  • a voice-by-speech role designation sub-unit configured to specify a corresponding role for each voice segment according to the role sequence when the output of the probability determination sub-unit is negative;
  • the model update training subunit is configured to train the GMM and the HMM for each role according to the feature vector in each voice segment and the corresponding role, and trigger the decoding subunit to work.
  • the voice-by-speech segment role specifying sub-unit is specifically configured to specify, for each voice segment, a mode of a character corresponding to each feature vector as a role of the voice segment.
  • model update training subunit is specifically configured to train the GMM and the HMM in an incremental manner on the basis of the model obtained in the previous training.
  • the label distribution unit further includes:
  • a training number determining subunit configured to determine, when the output of the probability judging subunit is negative, whether the number of times the GMM and the HMM are trained under the current number of characters is less than a preset upper limit of the training times, and when the judgment result is yes, Triggering the voice-by-speech role designation sub-unit work.
  • a role quantity adjustment subunit configured to: when the output of the training times determining subunit is negative, adjust the number of roles, select a corresponding number of voice segments, respectively specify different roles for each voice segment, and trigger the initial model training subunit to work.
  • the label distribution unit further includes:
  • a role number determining subunit configured to determine, when the output of the training times determining subunit is negative, whether the current number of roles meets a preset requirement, and to trigger the label allocation subunit to work if it does, and the role quantity adjustment subunit to work otherwise.
  • the feature extraction unit includes:
  • a framing sub-unit configured to perform framing processing on the voice signal according to a preset frame length to obtain a plurality of audio frames
  • a feature extraction execution subunit is configured to extract a feature vector of each audio frame to obtain the feature vector sequence.
  • the feature extraction execution sub-unit is specifically configured to extract an MFCC feature, a PLP feature, or an LPC feature of each audio frame to obtain the feature vector sequence.
  • the voice segment segmentation unit is specifically configured to: identify and cull the audio frame that does not include voice content by using a VAD technology, and divide the voice signal into voice segments.
  • the device further includes:
  • the VAD smoothing unit is configured to merge the voice segment whose duration is less than the preset threshold with the adjacent voice segment after the voice segment segmentation unit uses the VAD technology to segment the voice segment.
  • the DNN model training unit is specifically configured to train the DNN model by using a back propagation algorithm.
  • the role determining unit is configured to: perform a decoding operation according to the DNN model and the HMM, obtain the character sequence whose probability value of outputting the feature vector sequence ranks first, and use that character sequence as the character sequence corresponding to the feature vector sequence.
  • the role determining unit outputs the role separation result in the following manner: according to the character sequence corresponding to the feature vector sequence, the start and end time information of the audio frame to which the corresponding feature vector belongs is output for each character.
  • the initial role designation subunit or the role quantity adjustment subunit specifically selects a corresponding number of voice segments by selecting the number of voice segments that meet the preset requirement.
  • The speech-based character separation method provided by the present application first extracts a feature vector sequence frame by frame from a speech signal, then trains the DNN model on the basis of assigning character tags to the feature vectors, and determines, according to the DNN model and the HMM obtained by training with the feature vectors, the character sequence corresponding to the feature vector sequence, thereby obtaining the role separation result.
  • The above method provided by the present application uses a DNN model with powerful feature extraction capability to model the speaker roles; compared with the traditional GMM it has stronger characterization ability and characterizes each role in a more refined and accurate manner, so more accurate role separation results can be obtained.
  • FIG. 1 is a flow chart of an embodiment of a voice-based role separation method of the present application
  • FIG. 2 is a flowchart of a process for extracting a feature vector sequence from a voice signal according to an embodiment of the present application
  • FIG. 3 is a flowchart of a process for assigning a role tag to a feature vector in each voice segment by using a GMM and an HMM according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of voice segment division provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a topology structure of a DNN network according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an embodiment of a voice-based role separation device of the present application.
  • a voice-based role separation method and a voice-based role separation device are respectively provided, which are described in detail in the following embodiments.
  • the technical background, technical solutions, and writing manners of the embodiments of the present application will be briefly described before describing the embodiments.
  • GMM Gaussian mixture model
  • HMM Hidden Markov Model
  • HMM is a statistical model used to describe a Markov process with implicit unknown parameters.
  • The Hidden Markov model is a kind of Markov chain whose state (called the hidden state) cannot be observed directly but is related to an observable observation vector. The HMM is therefore a double stochastic process consisting of two parts: a Markov chain with state transition probabilities (usually described by a transition matrix A), and a random process describing the output relationship between the hidden states and the observation vectors (usually described by a confusion matrix B, each element of which is the probability that a given hidden state outputs a given observation vector, also known as the emission probability).
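  • To make the two parts concrete, the following sketch writes down illustrative parameters for a two-role HMM (all numbers are invented for illustration and do not come from the application):

```python
import numpy as np

# Illustrative HMM for two roles (hidden states s1, s2); numbers are invented.
# Transition matrix A: A[i, j] = P(next state = j | current state = i).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Initial state distribution.
pi = np.array([0.5, 0.5])

# Discrete-observation confusion matrix B: B[i, k] = P(observation k | state i).
# With continuous feature vectors, each row is replaced by a per-role GMM
# (or, in the later steps of this application, by a DNN) that scores
# P(feature vector | role), i.e. the emission probability.
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
```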
  • the GMM can be simply understood as a superposition of multiple Gaussian density functions.
  • The core idea is to use a weighted combination of multiple Gaussian probability density functions to describe the distribution of feature vectors in the probability space, which can smoothly approximate distributions of arbitrary shape.
  • the parameters include: mixing weight, mean vector, and covariance matrix for each Gaussian distribution.
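  • As a minimal illustration of this weighted superposition (toy parameters; scipy is assumed to be available):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_k w_k * N(x; mu_k, Sigma_k): a weighted superposition of
    Gaussian densities, as described above."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Toy two-component mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]                    # mixing weights
means = [np.zeros(2), np.full(2, 3.0)]  # mean vectors
covs = [np.eye(2), 2.0 * np.eye(2)]     # covariance matrices
print(gmm_density(np.array([0.5, 0.5]), weights, means, covs))
```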
  • In role separation, a GMM is usually established for each role: the states of the HMM are the roles, the observation vectors are the feature vectors extracted frame by frame from the speech signal, and each state outputs feature vectors. The emission probability of each state is determined by its GMM (the confusion matrix can be derived from the GMMs), and the role separation process is the process of determining the sequence of roles corresponding to the sequence of feature vectors using the GMM and the HMM.
  • Against this background, the technical solution of the present application determines the emission probability of each state of the HMM by using a deep neural network (DNN) on the basis of pre-assigning a character tag to the feature vector of each speech frame, and determines the character sequence corresponding to the feature vector sequence according to the DNN and the HMM.
  • DNN deep neural network
  • the technical solution of the present application first assigns a role tag to a feature vector extracted from a voice signal.
  • The role tags assigned at this stage are usually not very accurate, but they provide a reference for the subsequent supervised learning process, and the DNN model trained on this basis can characterize the roles more accurately, thus making the role separation results more accurate.
  • In specific implementation, the function of assigning role tags may be implemented by a statistics-based algorithm or a classifier; the following embodiments describe an implementation that assigns the role tags by means of a GMM and an HMM.
  • FIG. 1 is a flowchart of an embodiment of a voice-based role separation method according to the present application. The method includes the following steps:
  • Step 101 Extract a feature vector from a voice signal frame by frame to obtain a feature vector sequence.
  • the speech signal to be separated by the character is usually a time domain signal.
  • a sequence of feature vectors capable of characterizing the speech signal is obtained through two processes of framing and extracting feature vectors, which will be further described below with reference to FIG. 2 .
  • Step 101-1 Perform frame processing on the voice signal according to a preset frame length to obtain a plurality of audio frames.
  • The frame length may be preset according to requirements, for example 10 ms or 15 ms; the time-domain voice signal is then segmented frame by frame according to this frame length, dividing the voice signal into multiple audio frames. Depending on the segmentation strategy employed, adjacent audio frames may or may not overlap.
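  • A minimal framing sketch under these assumptions (the function name and default frame/hop lengths are illustrative, not taken from the application):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=10, hop_ms=10):
    """Split a 1-D time-domain signal into fixed-length audio frames.
    hop_ms < frame_ms produces overlapping frames; hop_ms == frame_ms does not."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop_len)
    if n_frames == 0:
        return np.empty((0, frame_len))
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])
```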
  • Step 101-2 Extract a feature vector of each audio frame to obtain the feature vector sequence.
  • In this step, a feature vector capable of characterizing the speech signal is extracted frame by frame. Since the descriptive ability of the speech signal in the time domain is relatively weak, a Fourier transform is usually performed on each audio frame and frequency-domain features are then extracted as the feature vector of that audio frame.
  • MFCC Mel Frequency Cepstral Coefficient
  • PLP Perceptual Linear Predictive
  • LPC Linear Predictive Coding
  • Taking the MFCC feature as an example: the time-domain signal of the audio frame is converted by FFT (Fast Fourier Transformation) into the corresponding spectrum information; the spectrum information is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis (taking the logarithm and then applying a DCT) is performed on the Mel spectrum to obtain the MFCC feature vector.
  • DCT Discrete Cosine Transform
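  • A sketch of that pipeline for a single frame, assuming a precomputed Mel filter bank matrix of shape (n_mels, n_fft_bins); the function name and coefficient count are illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frame(frame, mel_filterbank, n_coeffs=13):
    """MFCC pipeline described above: FFT -> Mel filter bank -> log -> DCT.
    mel_filterbank: precomputed (n_mels, n_fft_bins) matrix, with
    n_fft_bins == len(frame) // 2 + 1."""
    power_spectrum = np.abs(np.fft.rfft(frame)) ** 2      # FFT -> spectrum
    mel_energies = mel_filterbank @ power_spectrum        # Mel filter bank
    log_mel = np.log(mel_energies + 1e-10)                # cepstral analysis...
    return dct(log_mel, type=2, norm='ortho')[:n_coeffs]  # ...via log + DCT
```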
  • Step 102 Assign a character tag to the feature vector.
  • a role tag is assigned to a feature vector in a feature vector sequence by establishing a GMM and an HMM. It is considered that in addition to the speech signal corresponding to each character in a speech signal, there may be a portion without speech content, for example, a mute portion due to listening, thinking, and the like. Since these parts do not contain the information of the character, in order to improve the accuracy of the character separation, such an audio frame can be recognized and culled from the voice signal in advance.
  • the audio frame that does not include the voice content is removed, and the voice segment is divided, and then the character tag is assigned to the feature vector in each voice segment.
  • In this embodiment, the assignment of the role tags includes: performing an initial division of the roles, and iteratively training the GMM and the HMM on the basis of that initial division. If a model obtained by training does not satisfy the preset requirement, the number of roles is adjusted and the GMM and the HMM are retrained; once the trained models satisfy the preset requirement, character tags are assigned to the feature vectors in each voice segment according to the models.
  • Step 102-1 by identifying and culling an audio frame that does not contain voice content, and dividing the voice signal into voice segments.
  • the prior art generally adopts an acoustic segmentation method, that is, separating, for example, a "music segment”, a “speech segment”, a “silent segment”, and the like from a voice signal according to an existing model.
  • The technical solution of the present application may instead use VAD (Voice Activity Detection) technology to identify the portions that do not include voice content. Compared with techniques based on acoustic segmentation, VAD does not require acoustic models for different kinds of audio segments to be trained in advance, so its adaptability is stronger. For example, whether an audio frame is a silent frame can be identified by calculating the energy characteristics, the zero-crossing rate, and the like of the audio frame; the above methods may also be used in combination, or identification may be performed by establishing a noise model.
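  • A minimal sketch of the energy and zero-crossing-rate heuristic just mentioned (the thresholds are invented and would need tuning, or combination with a noise model, in practice):

```python
import numpy as np

def is_silent(frame, energy_thresh=1e-4, zcr_thresh=0.1):
    """Flag a frame as silent when both its short-time energy and its
    zero-crossing rate are low; low energy with a high ZCR may still be
    unvoiced speech, so it is not treated as silence here."""
    energy = np.mean(frame.astype(float) ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return energy < energy_thresh and zcr < zcr_thresh
```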
  • After an audio frame that does not contain voice content is identified, on the one hand, the frame can be removed from the voice signal to improve the accuracy of the role separation; on the other hand, identifying the audio frames without voice content also reveals the start and end points of each valid voice portion (the portions containing voice content), so the voice segments can be divided on this basis.
  • FIG. 4 is a schematic diagram of the segmentation of speech segments provided by the embodiment.
  • As shown in FIG. 4, each audio frame between times t2 and t3 and between t4 and t5 is detected as a silent frame by the VAD technique. This step removes these silent frames from the voice signal and correspondingly divides three voice segments: voice segment 1 (seg1) between t1 and t2, voice segment 2 (seg2) between t3 and t4, and voice segment 3 (seg3) between t5 and t6. Each voice segment includes a number of audio frames, and each audio frame has a corresponding feature vector.
  • role assignments can be roughly performed to provide a reasonable starting point for subsequent training.
  • the VAD smoothing operation can also be performed. This is mainly in consideration of the actual vocalization of human beings.
  • In a normal human conversation, the duration of a real speech segment is not too short. If, after the VAD operation described above, the duration of some of the obtained speech segments is less than a preset threshold (for example, a speech segment of 30 ms against a preset threshold of 100 ms), such a speech segment can be merged with its adjacent speech segment to form a longer speech segment.
  • the segmentation of the speech segment obtained after the VAD smoothing process is closer to the real situation, which helps to improve the accuracy of the character separation.
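  • A minimal sketch of this merging step, with segment boundaries in milliseconds; absorbing the intervening silent gap into the merged segment is a simplification for illustration:

```python
def smooth_segments(segments, min_ms=100):
    """Merge every speech segment shorter than min_ms into its neighbour;
    segments are (start_ms, end_ms) pairs in time order. Merging here simply
    extends the previous segment, absorbing the intervening gap."""
    merged = []
    for start, end in segments:
        prev_short = merged and merged[-1][1] - merged[-1][0] < min_ms
        if merged and (end - start < min_ms or prev_short):
            merged[-1][1] = end                  # absorb into previous segment
        else:
            merged.append([start, end])
    return merged

print(smooth_segments([(0, 30), (120, 400), (450, 470), (600, 900)]))
# -> [[0, 470], [600, 900]]
```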
  • the voice signal is divided into a plurality of voice segments by the VAD technique, and the tasks of the subsequent steps 102-2 to 102-11 are to use the GMM and the HMM to assign a character tag to the feature vector in each voice segment.
  • Step 102-2 Select a corresponding number of voice segments according to a preset initial number of roles, and specify different roles for each voice segment.
  • Specifically, voice segments equal in number to the preset initial number of roles can be randomly selected from the already-divided voice segments. However, the selected voice segments are to be used for the initial training of the GMM and the HMM: a segment that is too short provides too little training data, while a segment that is too long is more likely to contain more than one role, and both cases are unfavorable to the initial training. This embodiment therefore provides a preferred implementation: voice segments whose durations meet a preset requirement are selected according to the initial number of roles, and a different role is specified for each selected voice segment.
  • The initial number of roles preset in this embodiment is 2, and the preset requirement on the selected voice segments is a duration between 2 s and 4 s. This step therefore selects, from the already-divided voice segments, two voice segments that satisfy the above requirement and specifies a different role for each. Still taking the speech segmentation shown in FIG. 4 as an example, seg1 and seg2 each satisfy the duration requirement, so the two speech segments seg1 and seg2 can be selected, with role 1 (s1) assigned to seg1 and role 2 (s2) assigned to seg2.
  • Step 102-3 Train the GMM and the HMM for each character by using feature vectors in the voice segment of the specified character.
  • This step trains the GMM for each character and the HMM describing the jump relationship between the characters according to the feature vector contained in the speech segment of the specified character.
  • This step is the initial training performed under a specific number of roles. Still taking the speech segment division shown in FIG. 4 as an example, under the initial number of roles, the feature vectors included in seg1 are used to train the GMM of role 1 (gmm1) and the feature vectors included in seg2 are used to train the GMM of role 2 (gmm2). If the GMM and HMM trained under this number of roles do not meet the requirements, the number of roles can be adjusted and this step repeated, performing the corresponding initial training with the adjusted number of roles.
  • The process of training the GMM and the HMM for each role is the process of learning the various parameters of the HMM from a given observation sequence (i.e., the sequence of feature vectors included in each speech segment, which serves as the training sample). The parameters include: the transition matrix A of the HMM, and the mixing weights, mean vectors, and covariance matrices of the GMM corresponding to each role.
  • The Baum-Welch algorithm can be used for the training: initial values of the parameters are first estimated from the training samples; the posterior probability of being in state s_j at time t is then computed from the training samples and the current parameter values, and the parameters are re-estimated accordingly, iterating until convergence. In this way, the GMM and HMM are initially trained under the specific number of roles.
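  • One compact way to sketch this joint GMM/HMM training is with the third-party hmmlearn package (an assumption, not something the application specifies); feats_seg1 and feats_seg2 are hypothetical feature arrays from the two initially selected segments:

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # third-party package, assumed available

# feats_seg1, feats_seg2: hypothetical (n_frames, n_dims) arrays of feature
# vectors (e.g. MFCCs) from the two initially selected voice segments.
X = np.vstack([feats_seg1, feats_seg2])
lengths = [len(feats_seg1), len(feats_seg2)]

# One hidden state per role, one GMM per state; fit() runs Baum-Welch (EM).
model = GMMHMM(n_components=2, n_mix=4, covariance_type='diag', n_iter=20)
model.fit(X, lengths)

# decode() (step 102-4 below) returns the log-probability of the best path
# and the most likely role for every frame.
logprob, frame_roles = model.decode(X)
```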
  • Step 102-4 Perform decoding according to the GMM and the HMM obtained by the training, and obtain a sequence of roles in which the probability values of the feature vector sequences included in each speech segment are output.
  • the speech signal has been divided into a plurality of speech segments in step 102-1, and each audio frame in each speech segment has a corresponding feature vector, which together constitute the feature vector sequence described in this step.
  • The task is to find the HMM state sequence that the feature vector sequence most likely depends on, that is, to find the character sequence; the function performed in this step is the commonly described HMM decoding process.
  • In this step, the character sequence whose probability of outputting the feature vector sequence ranks first is searched for and output. In specific implementation, the character sequence with the maximum probability value may be selected, that is, the character sequence most likely to output the feature vector sequence, also referred to as the optimal hidden state sequence.
  • an exhaustive search method may be used to calculate a probability value of each possible character sequence outputting the feature vector sequence, and select a maximum value therefrom.
  • Alternatively, the Viterbi algorithm may be employed to reduce the computational complexity by exploiting the transition probabilities of the HMM over time; after the maximum probability of outputting the feature vector sequence is found in the search, backtracking is performed according to the information recorded during the search, yielding the corresponding character sequence.
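  • A minimal log-space Viterbi sketch (the array names and shapes are ours; the log emission scores would come from the per-role GMMs here, or from the DNN in step 104):

```python
import numpy as np

def viterbi(log_emit, log_A, log_pi):
    """Most likely state (role) sequence for a frame sequence.
    log_emit: (T, S) log emission scores log P(x_t | s);
    log_A: (S, S) log transition matrix; log_pi: (S,) log initial probs."""
    T, S = log_emit.shape
    delta = log_pi + log_emit[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # records the best predecessor
    for t in range(1, T):
        scores = delta[:, None] + log_A     # scores[i, j]: i -> j transition
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):           # backtrack along recorded info
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(np.max(delta))
```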
  • Step 102-5 Determine whether the probability value corresponding to the role sequence is greater than a preset threshold. If yes, go to step 102-6, otherwise go to step 102-7.
  • If the probability value corresponding to the character sequence obtained by the decoding in step 102-4 is greater than a preset threshold, for example 0.5, the current GMM and HMM can generally be considered stable, and step 102-6 can be performed to assign a role tag to the feature vectors in each voice segment (the subsequent step 104 may then use the stabilized HMM to determine the character sequence corresponding to the feature vector sequence); otherwise, the process proceeds to step 102-7 to determine whether to continue the iterative training.
  • Step 102-6 Assign a role tag to the feature vector in each voice segment according to the role sequence.
  • The feature vectors in the respective voice segments can be assigned role tags according to the character sequence obtained by decoding in step 102-4. Since the characters in the sequence correspond one-to-one with the feature vectors, a character tag can be assigned to each feature vector accordingly; at this point, every feature vector in each speech segment has its own role tag.
  • step 102-7 it is determined whether the number of times the GMM and the HMM are trained in the current number of roles is less than a preset upper limit of the number of training times; if yes, step 102-8 is performed; otherwise, the process proceeds to step 102-10.
  • Reaching this step shows that the GMM and HMM obtained by the current training are not yet stable and iterative training needs to continue. However, if the number of roles currently used in training is inconsistent with the actual number of roles (the number of real characters involved in the voice signal), the GMM and the HMM may never meet the requirement no matter how many iterations are performed (the probability value corresponding to the character sequence obtained by the decoding operation never exceeds the preset threshold). To avoid a meaningless iteration loop, an upper limit on the number of training rounds under each number of roles may be preset. If the number of training rounds under the current number of roles is less than this upper limit, the process proceeds to step 102-8 to assign a role to each voice segment and continue the iterative training; otherwise, the number of roles currently used may be inconsistent with the actual situation, so the process goes to step 102-10 to determine whether the number of roles needs to be adjusted.
  • Step 102-8 Specify a corresponding role for each voice segment according to the role sequence.
  • The character sequence has been acquired by decoding in step 102-4. Since the characters in the sequence correspond one-to-one with the feature vectors in the voice segments, the character corresponding to each feature vector in each voice segment is known. On this basis, a role is assigned to each voice segment by calculating the mode of the characters corresponding to its feature vectors. For example, if a voice segment includes 10 audio frames, that is, 10 feature vectors, of which 8 correspond to role 1 (s1) and 2 correspond to role 2 (s2), the mode of the characters corresponding to the feature vectors in the segment is role 1 (s1), so role 1 (s1) is designated as the role of that voice segment.
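  • With the Python standard library, the mode computation is one line; the example mirrors the 8-to-2 frame count above:

```python
from collections import Counter

def segment_role(frame_roles):
    """Return the mode (most frequent value) of the per-frame roles."""
    return Counter(frame_roles).most_common(1)[0][0]

# 8 frames of role 1 and 2 frames of role 2 -> the segment's role is 1.
print(segment_role([1, 1, 1, 2, 1, 1, 2, 1, 1, 1]))
```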
  • Step 102-9 Train the GMM and the HMM for each role according to the feature vector in each voice segment and the corresponding role, and go to step 102-4 to continue the execution.
  • On this basis, the GMM and HMM for each role can be retrained. Still taking the speech segment division shown in FIG. 4 as an example, if step 102-8 designates seg1 and seg3 as role 1 (s1) and seg2 as role 2 (s2), the feature vectors included in seg1 and seg3 are used to train the GMM of role 1 (gmm1), and the feature vectors included in seg2 are used to train the GMM of role 2 (gmm2).
  • the training methods of GMM and HMM refer to the related text in step 102-3, and details are not described here.
  • the technical solution is usually an iterative training process.
  • Therefore, in specific implementation, this step can train the new GMM and HMM incrementally on the basis of the GMM and HMM obtained from the previous training, that is, starting from the previously obtained parameters and continuing to adjust them with the current sample data, which improves the training speed. After the above training process is completed and the new GMM and HMM are obtained, the process goes to step 102-4 to perform decoding according to the new models and carry out the subsequent operations.
  • Step 102-10 Determine whether the current number of roles meets the preset requirement; if yes, go to step 102-6 to execute, otherwise continue to step 102-11.
  • Reaching this step usually indicates that the GMM and HMM trained under the current number of roles are not stable and the number of training rounds has reached or exceeded the preset upper limit. In this case, it is judged whether the current number of roles meets the preset requirement; if it does, the iteration can be stopped and the process goes to step 102-6 to assign the role tags; otherwise, the process continues to step 102-11 to adjust the number of roles.
  • Step 102-11 Adjust the number of roles, select a corresponding number of voice segments, and assign different roles to each voice segment; and go to step 102-3 to continue.
  • If step 102-10 determines that the current number of roles does not meet the preset requirement, this step adjusts the number of roles, for example by adding 1 to the current number of roles, updating it from 2 to 3.
  • a corresponding number of voice segments are selected from each voice segment included in the voice signal, and different characters are respectively assigned to each voice segment selected.
  • For the duration requirement of the selected voice segments, refer to the related text of step 102-2; details are not repeated here. Still taking the speech segmentation shown in FIG. 4 as an example, if the adjusted number of roles is 3, this step may select the three voice segments seg1, seg2, and seg3, and specify role 1 (s1) for seg1, role 2 (s2) for seg2, and role 3 (s3) for seg3. After the above operations of adjusting the number of roles and selecting voice segments are completed, the process goes to step 102-3 to perform the initial training of the GMM and the HMM under the adjusted number of roles.
  • Step 103 Train the DNN model with a feature vector having a character tag.
  • This step trains the DNN model using the feature vectors with character tags as samples; the DNN model is used to output, according to the input feature vector, the probability corresponding to each character. For ease of understanding, a brief description of the DNN is given first.
  • DNN Deep Neural Networks
  • DNN generally refers to a neural network that includes one input layer, three or more hidden layers (there may also be seven, nine, or even more hidden layers), and one output layer.
  • Each hidden layer can extract certain features and use the output of this layer as the input of the next layer.
  • FIG. 5 is a schematic diagram of the topology of the DNN network.
  • the DNN network in the figure has a total of n layers, each layer has multiple neurons, and the layers are fully connected; each layer has its own excitation function f (for example Sigmoid function).
  • Suppose the input is the feature vector v, the transfer matrix from the i-th layer to the (i+1)-th layer is w_i(i+1), the offset vector of the (i+1)-th layer is b_(i+1), the output of the i-th layer is out_i, and the input of the (i+1)-th layer is in_(i+1). The layer-by-layer calculation process is then: in_(i+1) = w_i(i+1) · out_i + b_(i+1), and out_(i+1) = f(in_(i+1)).
  • the parameters of the DNN model include the transition matrix w between the layers and the offset vector b of each layer.
  • the main task of training the DNN model is to determine the above parameters.
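  • The forward pass just described, as a short sketch (the sigmoid excitation and the list-of-matrices representation are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(v, weights, biases):
    """Layer-by-layer computation described above:
    in_(i+1) = w_i(i+1) . out_i + b_(i+1); out_(i+1) = f(in_(i+1))."""
    out = v
    for w, b in zip(weights, biases):   # one (w, b) pair per layer
        out = sigmoid(w @ out + b)      # excitation function f
    return out
```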
  • BP Back-propagation
  • The training process is a supervised learning process: the input signal (a labeled feature vector) propagates forward layer by layer until it reaches the output layer; the error between the actual output and the desired output is then propagated backward layer by layer, and the parameters of each layer are adjusted by the gradient descent method so that the actual output of the network continuously approaches the desired output. For a DNN with thousands of neurons per layer, the number of parameters may reach one million or more.
  • the DNN model obtained by the above training process usually has very powerful feature extraction ability and recognition ability.
  • In the technical solution of the present application, the DNN model is used to output the probability corresponding to each character according to the input feature vector. The output layer of the DNN model may therefore use a classifier (for example, Softmax) as its activation function and may include n nodes corresponding to the n characters, each node outputting, for each input feature vector, the probability that the feature vector corresponds to its character.
  • This step uses the feature vectors with character tags as samples to perform supervised training of the constructed DNN model. In specific implementation, the BP algorithm can be used directly for the training; however, considering that the BP algorithm alone may fall into a local minimum so that the resulting model cannot meet the application requirements, this embodiment combines pre-training with the BP algorithm to train the DNN model. Pre-training usually uses an unsupervised greedy layer-by-layer training algorithm: the network with one hidden layer is first trained in an unsupervised manner, the trained parameters are retained, one more hidden layer is added and the network with two hidden layers is trained, and so on, until the network with the largest number of hidden layers is reached.
  • the parameter values learned by the unsupervised training process are used as initial values, and then the traditional BP algorithm is used for supervised training, and finally the DNN model is obtained.
  • Since the initial parameter distribution obtained by pre-training is closer to the final convergence values than the random initialization used by the pure BP algorithm, the subsequent supervised training process starts from a good starting point; the DNN model trained in this way therefore usually does not fall into a local minimum and achieves a higher recognition rate.
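  • A minimal sketch of the supervised BP stage only (pre-training omitted; PyTorch, the layer sizes, and the 39-dimensional input are our assumptions, not the application's):

```python
import torch
from torch import nn

n_roles, dim = 2, 39   # illustrative: two roles, 39-dimensional features
model = nn.Sequential(
    nn.Linear(dim, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, 256), nn.Sigmoid(),
    nn.Linear(256, n_roles),   # logits; softmax is folded into the loss
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # gradient descent
loss_fn = nn.CrossEntropyLoss()                          # log-softmax + NLL

def train_step(feats, role_labels):
    """feats: (batch, dim) float tensor; role_labels: (batch,) long tensor
    holding the role tags assigned in step 102."""
    optimizer.zero_grad()
    loss = loss_fn(model(feats), role_labels)  # forward propagation
    loss.backward()                            # error back-propagation
    optimizer.step()                           # adjust each layer's parameters
    return loss.item()
```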
  • Step 104 Determine a character sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by using the feature vector training, and output a role separation result.
  • Since the DNN model outputs the probability of each character given the input feature vector, and the prior probability of each character can be obtained from the distribution of character tags over the feature vector sequence (the prior probability of each feature vector is usually also fixed), Bayes' theorem allows the probability of each character emitting the corresponding feature vector to be derived from the DNN output and these prior probabilities. That is, the DNN model trained in step 103 can be used to determine the emission probabilities of the HMM states.
  • In specific implementation, the HMM could be retrained with the feature vector sequence on the basis of determining the HMM emission probabilities with the DNN model as described above. However, considering that the HMM describing the jump relationship between the roles, which was used when assigning role tags to the feature vectors in step 102, is already basically stable, no additional training is necessary. This embodiment therefore uses that HMM directly and replaces the GMM with the trained DNN model, that is, the emission probability of each state of the HMM is determined by the DNN model.
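  • The Bayes argument above corresponds to the standard hybrid scaled-likelihood conversion; a sketch with our own names:

```python
import numpy as np

def emission_log_likelihoods(dnn_posteriors, role_priors):
    """Scaled-likelihood conversion implied by the Bayes argument above:
    P(x | role) is proportional to P(role | x) / P(role), because P(x) is the
    same constant for every role. dnn_posteriors: (T, S) softmax outputs;
    role_priors: (S,) relative frequencies of the role tags."""
    return np.log(dnn_posteriors + 1e-10) - np.log(role_priors)
```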
  • Since step 102-1 has divided the speech signal into speech segments, this step determines, according to the DNN model and the HMM used when the role tags were pre-assigned, the role sequence corresponding to the feature vector sequence included in each speech segment.
  • The process of determining a character sequence from a feature vector sequence is the commonly described decoding problem: a decoding operation is performed according to the DNN model and the HMM, the character sequence whose probability value of outputting the feature vector sequence ranks first (for example, has the maximum probability value) is obtained, and that character sequence is taken as the character sequence corresponding to the feature vector sequence. For details, refer to the related text of step 102-4; they are not repeated here.
  • On this basis, the corresponding role separation result can be output. Since the characters in the character sequence correspond one-to-one with the feature vectors, and the audio frame corresponding to each feature vector has its own start and end times, this step can output, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
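  • A sketch of collapsing the per-frame role sequence into time-stamped turns (a 10 ms frame step is assumed purely for illustration):

```python
def role_turns(frame_roles, frame_ms=10):
    """Collapse a per-frame role sequence into [role, start_ms, end_ms] turns,
    i.e. the start/end time information output for each role."""
    turns = []
    for i, role in enumerate(frame_roles):
        start, end = i * frame_ms, (i + 1) * frame_ms
        if turns and turns[-1][0] == role and turns[-1][2] == start:
            turns[-1][2] = end                 # same role keeps speaking
        else:
            turns.append([role, start, end])   # a new role takes over
    return turns

print(role_turns([1, 1, 1, 2, 2, 1]))
# -> [[1, 0, 30], [2, 30, 50], [1, 50, 60]]
```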
  • It should be noted that the method of pre-assigning character tags to the feature vectors in step 102 adopts a top-down approach that gradually increases the number of roles. In other embodiments, a bottom-up approach that gradually reduces the number of roles may also be adopted: initially, each voice segment obtained by the segmentation is assigned a different role, and the GMM and HMM for each role are trained. If the probability value obtained by decoding with the iteratively trained GMM and HMM remains not greater than the preset threshold, then, when adjusting the number of roles, the similarity between the GMMs of the roles can be evaluated (for example, by calculating the KL divergence), the voice segments corresponding to GMMs whose similarity meets a preset requirement are merged, and the number of roles is reduced accordingly. The above process is repeated iteratively until the probability value obtained by decoding is greater than the preset threshold or the number of roles meets the preset requirement; the iteration then stops, and character tags are assigned to the feature vectors in each speech segment according to the decoded character sequence.
  • In summary, the voice-based role separation method uses a DNN model with powerful feature extraction capability to model the speaker roles; compared with the traditional GMM it has stronger characterization ability and characterizes each role in a more refined and accurate manner, so more accurate role separation results can be obtained.
  • It should be noted that the technical solution of the present application can be applied not only to scenarios such as separating the dialogue of a customer service center or a conference recording, but also to any other scenario in which the roles in a voice signal need to be separated; as long as the voice signal contains two or more roles, the technical solution of the present application can be adopted and the corresponding beneficial effects obtained.
  • FIG. 6 is a schematic diagram of an embodiment of a voice-based role separation apparatus according to the present application. Since the device embodiment is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the description of the method embodiment. The device embodiments described below are merely illustrative.
  • The voice-based role separation device of this embodiment includes: a feature extraction unit 601, configured to extract feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assigning unit 602, configured to assign character tags to the feature vectors; a DNN model training unit 603, configured to train the DNN model with the feature vectors having character tags, wherein the DNN model is configured to output the probability corresponding to each character according to the input feature vector; and a role determining unit 604, configured to determine the character sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by training with the feature vectors, and to output the role separation result.
  • the HMM is used to describe a jump relationship between characters.
  • the device further includes:
  • a voice segment segmentation unit configured to: after the feature extraction unit extracts the feature vectors and before the label allocation unit is triggered to work, identify and cull the audio frames that do not include voice content, and divide the voice signal into voice segments;
  • the label distribution unit is specifically configured to allocate a role label for a feature vector in each voice segment
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each voice segment according to the DNN model and the HMM obtained by using the feature vector training, and output a role separation result.
  • the label allocation unit is specifically configured to pre-allocate a character label for the feature vectors in each voice segment by establishing a GMM and an HMM, where a GMM is established for each role and is configured to output, according to the input feature vector, the probability that the feature vector corresponds to that role;
  • the role determining unit is specifically configured to determine a character sequence corresponding to the feature vector sequence included in each of the voice segments according to the DNN model and an HMM used to assign a character tag to a feature vector in each voice segment.
  • the label distribution unit includes:
  • the initial role designating sub-unit is configured to select a corresponding number of voice segments according to a preset initial number of roles, and specify different roles for each voice segment;
  • An initial model training subunit for training a GMM and an HMM for each character by using a feature vector in a voice segment of a specified character
  • a decoding subunit configured to perform decoding according to the GMM and the HMM obtained by the training, and obtain a sequence of roles in which the probability values of the feature vector sequences included in each speech segment are outputted;
  • a probability judging subunit configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
  • a label allocation subunit configured to allocate a role label for the feature vector in each voice segment according to the role sequence when the output of the probability determination subunit is YES.
  • the label distribution unit further includes:
  • a voice-by-speech role designation sub-unit configured to specify a corresponding role for each voice segment according to the role sequence when the output of the probability determination sub-unit is negative;
  • the model update training subunit is configured to train the GMM and the HMM for each role according to the feature vector in each voice segment and the corresponding role, and trigger the decoding subunit to work.
  • the voice-by-speech segment role specifying sub-unit is specifically configured to specify, for each voice segment, a mode of a character corresponding to each feature vector as a role of the voice segment.
  • model update training subunit is specifically configured to train the GMM and the HMM in an incremental manner on the basis of the model obtained in the previous training.
  • the label distribution unit further includes:
  • a training number determining subunit configured to determine, when the output of the probability judging subunit is negative, whether the number of times the GMM and the HMM are trained under the current number of characters is less than a preset upper limit of the training times, and when the judgment result is yes, Triggering the voice-by-speech role designation sub-unit work.
  • a role quantity adjustment subunit configured to: when the output of the training times determining subunit is negative, adjust the number of roles, select a corresponding number of voice segments, respectively specify different roles for each voice segment, and trigger the initial model training subunit to work.
  • the label distribution unit further includes:
  • a role number determining subunit configured to determine, when the output of the training times determining subunit is negative, whether the current number of roles meets a preset requirement, and to trigger the label allocation subunit to work if it does, and the role quantity adjustment subunit to work otherwise.
  • the feature extraction unit includes:
  • a framing sub-unit configured to perform framing processing on the voice signal according to a preset frame length to obtain a plurality of audio frames
  • a feature extraction execution subunit is configured to extract a feature vector of each audio frame to obtain the feature vector sequence.
  • the feature extraction execution sub-unit is specifically configured to extract an MFCC feature, a PLP feature, or an LPC feature of each audio frame to obtain the feature vector sequence.
  • the voice segment segmentation unit is specifically configured to: identify and cull the audio frame that does not include voice content by using a VAD technology, and divide the voice signal into voice segments.
  • the device further includes:
  • the VAD smoothing unit is configured to merge the voice segment whose duration is less than the preset threshold with the adjacent voice segment after the voice segment segmentation unit uses the VAD technology to segment the voice segment.
  • the DNN model training unit is specifically configured to train the DNN model by using a back propagation algorithm.
  • the role determining unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the character sequence whose probability value of outputting the feature vector sequence ranks first, and use that character sequence as the character sequence corresponding to the feature vector sequence.
  • the role determining unit outputs the role separation result in the following manner: according to the character sequence corresponding to the feature vector sequence, the start and end time information of the audio frame to which the corresponding feature vector belongs is output for each character.
  • the initial role designation subunit or the role quantity adjustment subunit specifically selects a corresponding number of voice segments by selecting the number of voice segments that meet the preset requirement.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer usable program code embodied therein.

Abstract

A voice-based role separation method and device. The method comprises: extracting feature vectors from a voice signal frame by frame to obtain a feature vector sequence (101); assigning role labels to the feature vectors (102); training a deep neural network (DNN) model using the feature vectors with role labels (103); and determining, according to the DNN model and a hidden Markov model (HMM) trained using the feature vectors, a role sequence corresponding to the feature vector sequence, and outputting a role separation result (104), wherein the DNN model is configured to output, for an input feature vector, probabilities corresponding to the respective roles, and the HMM is configured to describe the transition relationships between the roles. By employing a DNN model with powerful feature extraction capability to model speaker roles, the method characterizes roles more finely and accurately than the conventional GMM, thereby providing a more accurate role separation result.

Description

Voice-based role separation method and device
This application claims priority to Chinese Patent Application No. 201510744743.4, filed on November 5, 2015 and entitled "Voice-based role separation method and device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of speech recognition, and in particular to a voice-based role separation method. The application also relates to a voice-based role separation device.
Background
Speech is the most natural form of human communication, and speech recognition technology allows a machine to convert a speech signal into corresponding text or commands through a process of recognition and understanding. Speech recognition is an interdisciplinary field involving signal processing, pattern recognition, probability theory and information theory, vocalization and auditory mechanisms, artificial intelligence, and so on.
In practical applications, analyzing a speech signal more accurately requires not only speech recognition but also identifying the speaker of each stretch of speech, so the need to separate speech by role arises naturally. Dialogue speech occurs in many scenarios such as daily life, meetings, and telephone conversations, and role separation of dialogue speech makes it possible to determine which parts were spoken by one person and which by another. Once dialogue speech has been separated by role, combining the result with speaker recognition and speech recognition opens up much broader applications; for example, the dialogue speech of a customer service center can be separated by role and then passed through speech recognition to determine what the agent said and what the customer said, enabling service quality inspection or the mining of potential customer needs.
In the prior art, a GMM (Gaussian mixture model) and an HMM (hidden Markov model) are typically used for role separation of dialogue speech: each role is modeled with a GMM, and transitions between roles are modeled with an HMM. Because GMM modeling was proposed relatively early and its ability to fit arbitrary functions depends on the number of Gaussian components, its capacity to characterize roles is limited, and the accuracy of role separation is usually too low to meet application requirements.
Summary of the invention
Embodiments of the present application provide a voice-based role separation method and device to solve the problem that the existing GMM- and HMM-based role separation technique has relatively low accuracy.
The present application provides a voice-based role separation method, including:
extracting feature vectors from a speech signal frame by frame to obtain a feature vector sequence;
assigning role labels to the feature vectors;
training a deep neural network (DNN) model using the feature vectors with role labels;
determining, according to the DNN model and a hidden Markov model (HMM) trained using the feature vectors, a role sequence corresponding to the feature vector sequence, and outputting a role separation result;
wherein the DNN model is configured to output, for an input feature vector, a probability corresponding to each role, and the HMM is used to describe the transition relationships between roles.
Optionally, after the step of extracting feature vectors from the speech signal frame by frame and before the step of assigning role labels to the feature vectors, the following operation is performed: identifying and removing audio frames that contain no speech content, and dividing the speech signal into speech segments;
the assigning of role labels to the feature vectors includes: assigning role labels to the feature vectors in each speech segment; and the determining of the role sequence corresponding to the feature vector sequence includes: determining the role sequence corresponding to the feature vector sequence contained in each speech segment.
Optionally, assigning role labels to the feature vectors in each speech segment includes: assigning role labels to the feature vectors in each speech segment by building a Gaussian mixture model (GMM) and an HMM, wherein the GMM is used, for each role, to output the probability that an input feature vector corresponds to that role;
the determining, according to the DNN model and the HMM trained using the feature vectors, of the role sequence corresponding to the feature vector sequence contained in each speech segment includes: determining the role sequence corresponding to the feature vector sequence contained in each speech segment according to the DNN model and the HMM used to assign role labels to the feature vectors in each speech segment.
Optionally, assigning role labels to the feature vectors in each speech segment by building a GMM and an HMM includes:
selecting a corresponding number of speech segments according to a preset initial number of roles, and assigning a different role to each selected speech segment;
training a GMM for each role, as well as an HMM, using the feature vectors in the speech segments with assigned roles;
decoding according to the trained GMMs and HMM, and obtaining the role sequence ranked highest by the probability of outputting the feature vector sequences contained in the speech segments;
determining whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, assigning role labels to the feature vectors in each speech segment according to the role sequence.
Optionally, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is negative, the following operations are performed:
assigning a corresponding role to each speech segment according to the role sequence;
training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles;
returning to the step of decoding according to the trained GMMs and HMM.
Optionally, assigning a corresponding role to each speech segment according to the role sequence includes:
for each speech segment, designating the mode of the roles corresponding to its feature vectors as the role of that speech segment.
Optionally, training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles includes: training the GMMs and the HMM incrementally on the basis of the models obtained in the previous round of training.
Optionally, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is negative, the following operations are performed:
determining whether the number of times the GMMs and the HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds;
if so, performing the step of assigning a corresponding role to each speech segment according to the role sequence;
if not, performing the following operations:
adjusting the number of roles, selecting a corresponding number of speech segments, and assigning a different role to each selected speech segment;
and returning to the step of training a GMM for each role, as well as an HMM, using the feature vectors in the speech segments with assigned roles.
Optionally, when the result of determining whether the number of training rounds under the current number of roles is less than the preset upper limit is negative, the following operations are performed:
determining whether the current number of roles meets a preset requirement; if so, returning to the step of assigning role labels to the feature vectors in each speech segment according to the role sequence; if not, performing the step of adjusting the number of roles.
Optionally, the preset initial number of roles is 2, and adjusting the number of roles includes: adding 1 to the current number of roles.
Optionally, extracting feature vectors from the speech signal frame by frame to obtain the feature vector sequence includes:
dividing the speech signal into frames according to a preset frame length to obtain a plurality of audio frames;
extracting a feature vector from each audio frame to obtain the feature vector sequence.
Optionally, extracting a feature vector from each audio frame includes: extracting MFCC features, PLP features, or LPC features.
Optionally, identifying and removing the audio frames that contain no speech content includes: identifying the audio frames that contain no speech content using VAD technology and performing the corresponding removal operation.
Optionally, after the identification and removal operations are performed using VAD technology and the speech signal is divided into speech segments, the following VAD smoothing operation is performed:
merging speech segments whose duration is less than a preset threshold with adjacent speech segments.
Optionally, training the deep neural network (DNN) model using the feature vectors with role labels includes: training the DNN model using a back propagation algorithm.
Optionally, determining, according to the DNN model and the hidden Markov model (HMM) trained using the feature vectors, the role sequence corresponding to the feature vector sequence includes: performing a decoding operation according to the DNN model and the HMM, obtaining the role sequence ranked highest by the probability of outputting the feature vector sequence, and taking that role sequence as the role sequence corresponding to the feature vector sequence.
Optionally, outputting the role separation result includes: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which its corresponding feature vectors belong.
Optionally, selecting a corresponding number of speech segments includes: selecting that number of speech segments whose durations meet a preset requirement.
Correspondingly, the present application also provides a voice-based role separation device, including:
a feature extraction unit configured to extract feature vectors from a speech signal frame by frame to obtain a feature vector sequence;
a label assignment unit configured to assign role labels to the feature vectors;
a DNN model training unit configured to train a DNN model using the feature vectors with role labels, wherein the DNN model is configured to output, for an input feature vector, a probability corresponding to each role;
a role determination unit configured to determine, according to the DNN model and an HMM trained using the feature vectors, the role sequence corresponding to the feature vector sequence and to output a role separation result, wherein the HMM is used to describe the transition relationships between roles.
Optionally, the device further includes:
a speech segment segmentation unit configured to, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, identify and remove audio frames that contain no speech content and divide the speech signal into speech segments;
the label assignment unit is specifically configured to assign role labels to the feature vectors in each speech segment;
the role determination unit is specifically configured to determine, according to the DNN model and the HMM trained using the feature vectors, the role sequence corresponding to the feature vector sequence contained in each speech segment and to output the role separation result.
Optionally, the label assignment unit is specifically configured to assign role labels to the feature vectors in each speech segment by building a GMM and an HMM, wherein the GMM is used, for each role, to output the probability that an input feature vector corresponds to that role;
the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each speech segment according to the DNN model and the HMM used to assign role labels to the feature vectors in each speech segment.
Optionally, the label assignment unit includes:
an initial role designation subunit configured to select a corresponding number of speech segments according to a preset initial number of roles and to assign a different role to each selected speech segment;
an initial model training subunit configured to train a GMM for each role, as well as an HMM, using the feature vectors in the speech segments with assigned roles;
a decoding subunit configured to decode according to the trained GMMs and HMM and to obtain the role sequence ranked highest by the probability of outputting the feature vector sequences contained in the speech segments;
a probability judgment subunit configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
a label assignment subunit configured to, when the output of the probability judgment subunit is positive, assign role labels to the feature vectors in each speech segment according to the role sequence.
Optionally, the label assignment unit further includes:
a segment-wise role designation subunit configured to, when the output of the probability judgment subunit is negative, assign a corresponding role to each speech segment according to the role sequence;
a model update training subunit configured to train a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles, and to trigger the decoding subunit.
Optionally, the segment-wise role designation subunit is specifically configured to, for each speech segment, designate the mode of the roles corresponding to its feature vectors as the role of that speech segment.
Optionally, the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous round of training.
Optionally, the label assignment unit further includes:
a training count judgment subunit configured to, when the output of the probability judgment subunit is negative, determine whether the number of times the GMMs and the HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds, and, when the result is positive, to trigger the segment-wise role designation subunit;
a role count adjustment subunit configured to, when the output of the training count judgment subunit is negative, adjust the number of roles, select a corresponding number of speech segments, assign a different role to each selected speech segment, and trigger the initial model training subunit.
Optionally, the label assignment unit further includes:
a role count judgment subunit configured to, when the output of the training count judgment subunit is negative, determine whether the current number of roles meets a preset requirement, and to trigger the label assignment subunit if it does, or the role count adjustment subunit otherwise.
Optionally, the feature extraction unit includes:
a framing subunit configured to divide the speech signal into frames according to a preset frame length to obtain a plurality of audio frames;
a feature extraction execution subunit configured to extract a feature vector from each audio frame to obtain the feature vector sequence.
Optionally, the feature extraction execution subunit is specifically configured to extract MFCC features, PLP features, or LPC features from each audio frame to obtain the feature vector sequence.
Optionally, the speech segment segmentation unit is specifically configured to identify and remove the audio frames that contain no speech content using VAD technology and to divide the speech signal into speech segments.
Optionally, the device further includes:
a VAD smoothing unit configured to, after the speech segment segmentation unit divides the speech using VAD technology, merge speech segments whose duration is less than a preset threshold with adjacent speech segments.
Optionally, the DNN model training unit is specifically configured to train the DNN model using a back propagation algorithm.
Optionally, the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence ranked highest by the probability of outputting the feature vector sequence, and take that role sequence as the role sequence corresponding to the feature vector sequence.
Optionally, the role determination unit outputs the role separation result as follows: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which its corresponding feature vectors belong.
Optionally, the initial role designation subunit or the role count adjustment subunit selects a corresponding number of speech segments by selecting that number of speech segments whose durations meet a preset requirement.
Compared with the prior art, the present application has the following advantages:
In the voice-based role separation method provided by the present application, a feature vector sequence is first extracted from the speech signal frame by frame; a DNN model is then trained on the basis of role labels assigned to the feature vectors; and the role sequence corresponding to the feature vector sequence is determined according to the DNN model and an HMM trained using the feature vectors, yielding the role separation result. Because this method models speaker roles with a DNN, which has powerful feature extraction capability and far stronger characterization ability than the conventional GMM, roles are characterized more finely and accurately, so a more accurate role separation result can be obtained.
Brief description of the drawings
FIG. 1 is a flowchart of an embodiment of a voice-based role separation method of the present application;
FIG. 2 is a flowchart of a process for extracting a feature vector sequence from a speech signal according to an embodiment of the present application;
FIG. 3 is a flowchart of a process for assigning role labels to the feature vectors in each speech segment using a GMM and an HMM according to an embodiment of the present application;
FIG. 4 is a schematic diagram of speech segment division according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the topology of a DNN network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of a voice-based role separation device of the present application.
Detailed description
Numerous specific details are set forth in the following description in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the present application is therefore not limited by the specific embodiments disclosed below.
The present application provides a voice-based role separation method and a voice-based role separation device, which are described in detail one by one in the following embodiments. For ease of understanding, the technical background, the technical solution, and the organization of the embodiments are briefly explained before the embodiments themselves are described.
Existing role separation techniques in the speech field typically model each role with a GMM (Gaussian mixture model) and model the transitions between roles with an HMM (hidden Markov model).
An HMM is a statistical model that describes a Markov process with hidden, unknown parameters. A hidden Markov model is a kind of Markov chain whose states (called hidden states) cannot be observed directly but are probabilistically related to an observable sequence of observation vectors. An HMM is therefore a doubly stochastic process comprising two parts: a Markov chain with state transition probabilities (usually described by a transition matrix A), and a stochastic process describing the output relationship between the hidden states and the observation vectors (usually described by a confusion matrix B, each element of which is the probability of a hidden state emitting an observation vector, also called the emission probability). An HMM with N states can be represented by the parameter triplet λ = {π, A, B}, where π is the initial probability of each state.
A GMM can be understood simply as a superposition of several Gaussian density functions. Its core idea is to describe the distribution of feature vectors in probability space with a combination of Gaussian probability density functions; such a model can smoothly approximate a density distribution of arbitrary shape. Its parameters include the mixing weight, mean vector, and covariance matrix of each Gaussian component.
In existing voice-based role separation applications, each role is typically modeled with a GMM: the states of the HMM are the roles, the observation vectors are the feature vectors extracted frame by frame from the speech signal, and the probability of each state emitting a feature vector is determined by its GMM (from which the confusion matrix can be obtained). The role separation process is then the process of determining the role sequence corresponding to the feature vector sequence using the GMMs and the HMM.
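As an illustration of this structure (not part of the patent text), the following minimal Python sketch models each role's emission density as a GMM and evaluates the per-role emission probabilities of a single feature vector; the dimensions, weights, and values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

class RoleGMM:
    """Emission model for one role: a weighted sum of Gaussian densities."""
    def __init__(self, weights, means, covs):
        self.weights = np.asarray(weights)  # mixing weights, summing to 1
        self.means = means                  # mean vector of each component
        self.covs = covs                    # covariance matrix of each component

    def density(self, x):
        # p(x | role) = sum_k w_k * N(x; mu_k, Sigma_k)
        return sum(w * multivariate_normal.pdf(x, m, c)
                   for w, m, c in zip(self.weights, self.means, self.covs))

dim = 2  # two hypothetical roles in a 2-dimensional feature space
role_models = [
    RoleGMM([0.6, 0.4], [np.zeros(dim), np.ones(dim)], [np.eye(dim)] * 2),
    RoleGMM([1.0], [2 * np.ones(dim)], [np.eye(dim)]),
]
pi = np.array([0.5, 0.5])               # initial state (role) probabilities
A = np.array([[0.9, 0.1], [0.1, 0.9]])  # role-to-role transition matrix

x = np.array([0.2, -0.1])               # one frame's feature vector
emissions = [m.density(x) for m in role_models]
print(emissions)  # per-role emission probabilities for this frame
```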
Because the function fitting capability of a GMM is limited by the number of Gaussian density functions it uses, its expressive power is inherently limited, which is why the accuracy of existing GMM- and HMM-based role separation is relatively low. To address this problem, the technical solution of the present application pre-assigns role labels to the feature vectors of the speech frames, uses a deep neural network (DNN) to determine the emission probabilities of the HMM states, and determines the role sequence corresponding to the feature vector sequence according to the DNN and the HMM. Because a DNN has the powerful ability to combine low-level features into more abstract high-level features, it can characterize roles more precisely and therefore obtain more accurate role separation results.
In the technical solution of the present application, role labels are first assigned to the feature vectors extracted from the speech signal. The labels assigned at this stage are usually not very accurate, but they provide a reference for the subsequent supervised learning process; the DNN model trained on this basis can characterize the roles more accurately, making the role separation result more accurate. When implementing the technical solution, the role label assignment function can be realized with a statistics-based algorithm or with a classifier; the embodiments provided below assign role labels to the feature vectors according to a GMM and an HMM.
Embodiments of the present application are described in detail below. Please refer to FIG. 1, a flowchart of an embodiment of a voice-based role separation method of the present application. The method includes the following steps:
Step 101: extract feature vectors from the speech signal frame by frame to obtain a feature vector sequence.
The speech signal to be separated by role is usually a time-domain signal. This step obtains a feature vector sequence that characterizes the speech signal through two processes, framing and feature extraction, which are explained further below with reference to FIG. 2.
Step 101-1: divide the speech signal into frames according to a preset frame length to obtain a plurality of audio frames.
In a specific implementation, the frame length can be preset as required, for example to 10 ms or 15 ms, and the time-domain speech signal is then cut frame by frame according to this length, dividing it into a plurality of audio frames. Depending on the segmentation strategy adopted, adjacent audio frames may or may not overlap.
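For illustration only (not part of the patent text), the following sketch frames a time-domain signal with NumPy; the 25 ms window and 10 ms hop are hypothetical values in the spirit of the frame lengths mentioned above, and overlapping frames result because the hop is shorter than the window.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a 1-D time-domain signal into (possibly overlapping) frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])

# One second of a hypothetical 16 kHz signal -> frames of 400 samples each.
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)  # (98, 400)
```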
Step 101-2: extract a feature vector from each audio frame to obtain the feature vector sequence.
After the time-domain speech signal has been divided into audio frames, feature vectors that characterize the speech signal can be extracted frame by frame. Because the descriptive power of a speech signal in the time domain is relatively weak, a Fourier transform is usually applied to each audio frame and frequency-domain features are extracted as the frame's feature vector; for example, MFCC (Mel frequency cepstrum coefficient) features, PLP (perceptual linear predictive) features, or LPC (linear predictive coding) features can be extracted.
Taking the extraction of MFCC features from an audio frame as an example, the feature extraction process is as follows. The time-domain signal of the audio frame is first transformed by an FFT (fast Fourier transform) to obtain the corresponding spectrum; the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; and cepstral analysis, whose core is generally an inverse transform implemented by a DCT (discrete cosine transform), is performed on the Mel spectrum, after which a preset number N of coefficients (for example N = 12 or 38) is taken, yielding the frame's feature vector: its MFCC features. Processing every audio frame in this way produces a series of feature vectors characterizing the speech signal, i.e. the feature vector sequence described in this application.
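If a library such as librosa is available, the whole FFT, Mel filter bank, and DCT pipeline can be sketched in a few lines; the file name and parameter values below are illustrative assumptions, not values taken from the patent.

```python
import librosa

# Hypothetical input: a mono recording resampled to 16 kHz.
y, sr = librosa.load("dialogue.wav", sr=16000)

# FFT -> Mel filter bank -> log -> DCT, keeping N = 12 coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=400, hop_length=160)
feature_sequence = mfcc.T      # one 12-dimensional feature vector per frame
print(feature_sequence.shape)  # (num_frames, 12)
```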
Step 102: assign role labels to the feature vectors.
This embodiment assigns role labels to the feature vectors in the feature vector sequence by building a GMM and an HMM. Note that besides the speech signals corresponding to the roles, a recording may also contain portions with no speech content, for example silence caused by listening or thinking. Since these portions carry no role information, such audio frames can be identified and removed from the speech signal in advance to improve the accuracy of role separation.
Based on the above consideration, this embodiment first removes the audio frames containing no speech content and divides the signal into speech segments before assigning role labels, and then assigns role labels to the feature vectors in each speech segment. The label assignment includes: performing an initial division of roles, and iteratively training the GMMs and the HMM on that basis; if the trained models do not satisfy the preset requirement, the number of roles is adjusted and the GMMs and HMM are retrained until the trained models do satisfy it, whereupon role labels are assigned to the feature vectors in each speech segment according to those models. This process is described in detail below with reference to FIG. 3.
Step 102-1: identify and remove audio frames that contain no speech content, and divide the speech signal into speech segments.
The prior art usually adopts acoustic segmentation, i.e. separating, for example, 'music segments', 'speech segments', and 'silence segments' from the speech signal according to existing models. This approach requires the acoustic models corresponding to the various audio segment types to be trained in advance; based on, say, the acoustic model corresponding to 'music segments', the corresponding audio segments can be separated from the signal.
Preferably, the technical solution of the present application can use VAD (voice activity detection) technology to identify the portions that contain no speech content. Unlike acoustic segmentation, this does not require acoustic models for different audio segment types to be trained in advance and is therefore more adaptable. For example, whether an audio frame is a silent frame can be determined by computing features such as the frame's energy and zero-crossing rate; where environmental noise is strong, several such means can be combined, or a noise model can be built for the identification.
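A minimal sketch of such an energy and zero-crossing-rate check follows; the thresholds are hypothetical and would need tuning to the recording conditions.

```python
import numpy as np

def is_silent(frame, energy_thresh=1e-4, zcr_thresh=0.4):
    """Classify one audio frame as silent from its energy and zero-crossing rate."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    # Very low energy, or a noise-like high zero-crossing rate at low energy,
    # is treated here as the absence of speech content.
    return energy < energy_thresh or (zcr > zcr_thresh and energy < 10 * energy_thresh)

# voiced[i] is True for frames judged to contain speech content;
# 'frames' is the array produced by the framing sketch above.
frames = np.random.randn(98, 400) * 0.01
voiced = np.array([not is_silent(f) for f in frames])
```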
Once the audio frames containing no speech content have been identified, these frames can, on the one hand, be removed from the speech signal to improve the accuracy of role separation; on the other hand, identifying them amounts to identifying the start and end points of every stretch of valid speech (speech with content), so the division into speech segments can be performed on this basis.
Please refer to FIG. 4, a schematic diagram of the speech segment division provided by this embodiment. In the figure, VAD detects that the audio frames between times t2 and t3 and between t4 and t5 are silent frames. This step removes these silent frames from the speech signal and divides it accordingly into three speech segments: speech segment 1 (seg1) between t1 and t2, speech segment 2 (seg2) between t3 and t4, and speech segment 3 (seg3) between t5 and t6. Each speech segment contains a number of audio frames, and each audio frame has a corresponding feature vector. On the basis of this segment division, roles can be assigned roughly, providing a reasonable starting point for the subsequent training.
Preferably, after the above VAD processing, a VAD smoothing operation can also be performed. This mainly accounts for how humans actually speak: a real speech segment does not last too short a time. If some segments obtained after the VAD operation have durations below a preset threshold (for example, a 30 ms segment against a preset threshold of 100 ms), such segments can be merged with adjacent segments to form longer ones. The segment division obtained after VAD smoothing is closer to reality, which helps improve the accuracy of role separation.
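The merging step might be sketched as follows, with segments given as (start, end) frame indices; the 10-frame minimum mirrors the 100 ms example above at a hypothetical 10 ms hop.

```python
def smooth_segments(segments, min_len=10):
    """Merge segments shorter than min_len frames into the previous segment."""
    smoothed = []
    for start, end in segments:
        if smoothed and end - start < min_len:
            prev_start, _ = smoothed.pop()
            smoothed.append((prev_start, end))  # absorb into the neighbour
        else:
            smoothed.append((start, end))
    return smoothed

print(smooth_segments([(0, 50), (52, 55), (60, 120)]))
# [(0, 55), (60, 120)] -- the 3-frame segment was merged with its neighbour
```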
This step divides the speech signal into a number of speech segments using VAD; the task of the subsequent steps 102-2 through 102-11 is to assign role labels to the feature vectors in each speech segment using the GMMs and the HMM.
Step 102-2: select a corresponding number of speech segments according to the preset initial number of roles, and assign a different role to each selected speech segment.
This step could select at random, from the segments already divided, as many speech segments as there are initial roles. However, the selected segments will be used for the initial training of the GMMs and the HMM: a segment that is too short provides little training data, while one that is too long is more likely to contain more than one role, and neither case is conducive to initial training. This embodiment therefore provides a preferred implementation: selecting, according to the initial number of roles, speech segments whose durations meet a preset requirement, and assigning a different role to each.
In this embodiment the preset initial number of roles is 2, and the preset requirement for selecting speech segments is a duration between 2 s and 4 s; this step therefore selects, from the segments already divided, 2 speech segments meeting this requirement and assigns a different role to each. Taking the segment division in FIG. 4 as an example, seg1 and seg2 each satisfy the duration requirement, so these two segments can be selected, with role 1 (s1) assigned to seg1 and role 2 (s2) assigned to seg2.
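A sketch of this selection rule; the duration bounds follow the 2 s to 4 s example, and the segment durations are hypothetical.

```python
def pick_initial_segments(durations, n_roles=2, lo=2.0, hi=4.0):
    """Return the indices of the first n_roles segments whose duration in
    seconds falls within [lo, hi], one segment per initial role."""
    eligible = [i for i, d in enumerate(durations) if lo <= d <= hi]
    if len(eligible) < n_roles:
        raise ValueError("not enough segments meet the duration requirement")
    return eligible[:n_roles]

# seg1 = 3.1 s, seg2 = 2.4 s, seg3 = 5.0 s -> seg1 and seg2 are selected.
print(pick_initial_segments([3.1, 2.4, 5.0]))  # [0, 1]
```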
Step 102-3: train a GMM for each role, as well as the HMM, using the feature vectors in the speech segments with assigned roles.
This step trains a GMM for each role, together with the HMM describing the transition relationships between roles, from the feature vectors contained in the speech segments with assigned roles; it is the initial training performed under a particular number of roles. Taking the segment division in FIG. 4 as an example again, under the initial number of roles the feature vectors contained in seg1 are used to train the GMM of role 1 (gmm1) and those contained in seg2 to train the GMM of role 2 (gmm2). If the GMMs and HMM trained under this number of roles do not satisfy the requirement, the number of roles can be adjusted and this step repeated to perform the corresponding initial training under the adjusted number of roles.
Training the GMMs and HMM for the roles is the process of learning the HMM-related parameters given the observation sequences (i.e. the feature vector sequences contained in the speech segments, which serve as training samples). These parameters include the transition matrix A of the HMM and, for the GMM of each role, parameters such as the mean vectors and covariance matrices. In a specific implementation the Baum-Welch algorithm can be used for training: initial parameter values are estimated from the training samples; from the training samples and the initial values, the posterior probability γt(sj) of being in state sj at time t is estimated; the HMM parameters are then updated according to the computed posterior probabilities, and the posterior probabilities γt(sj) are re-estimated from the training samples and the updated parameters. This process is iterated until a set of HMM parameters is found that maximizes the probability of outputting the observation sequences. Once parameters satisfying this requirement have been obtained, the initial training of the GMMs and HMM under the given number of roles is complete.
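As an illustration outside the patent text, a library such as hmmlearn wraps exactly this Baum-Welch (EM) re-estimation of an HMM with GMM emissions; the feature data and component counts below are stand-in assumptions.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# Stack the feature vectors of the role-labelled segments; 'lengths'
# tells the trainer where each segment's observation sequence ends.
seg1 = np.random.randn(300, 12)  # stand-ins for real MFCC sequences
seg2 = np.random.randn(240, 12)
X = np.vstack([seg1, seg2])
lengths = [len(seg1), len(seg2)]

# 2 hidden states (one per role), each emitting from a 4-component GMM.
model = GMMHMM(n_components=2, n_mix=4, covariance_type="diag", n_iter=20)
model.fit(X, lengths)   # Baum-Welch / EM parameter re-estimation
print(model.transmat_)  # the learned role-to-role transition matrix A
```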
Step 102-4: decode according to the trained GMMs and HMM to obtain the role sequence ranked highest by the probability of outputting the feature vector sequences contained in the speech segments.
In step 102-1 the speech signal was divided into a number of speech segments, and every audio frame in each segment has a corresponding feature vector; together these form the feature vector sequence referred to in this step. Given that feature vector sequence and the trained GMMs and HMM, this step finds the HMM state sequence, i.e. the role sequence, from which the feature vector sequence plausibly originates.
The function performed in this step is the usual HMM decoding process: given the feature vector sequence, search for the role sequences ranked highest by the probability of outputting that sequence. As a preferred implementation, the role sequence with the maximum probability can be selected, i.e. the role sequence most likely to output the feature vector sequence, also called the optimal hidden state sequence.
In a specific implementation, an exhaustive search could compute, for every possible role sequence, the probability of outputting the feature vector sequence and select the maximum. To improve computational efficiency, a preferred implementation uses the Viterbi algorithm, which exploits the time-invariance of the HMM transition probabilities to reduce the computational complexity; after the search obtains the maximum probability of outputting the feature vector sequence, the corresponding role sequence is recovered by backtracking through the information recorded during the search.
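A log-domain Viterbi sketch over the quantities introduced above (initial probabilities, transition matrix, and per-frame emission probabilities); all numeric values are hypothetical.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_B[t, s] is the log emission probability of frame t under role s.
    Returns the most probable role sequence and its log probability."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each role
    back = np.zeros((T, S), dtype=int)   # backpointers for backtracking
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):        # backtrack along recorded pointers
        path.append(back[t][path[-1]])
    return path[::-1], float(np.max(delta))

log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log(np.random.dirichlet([1, 1], size=6))  # 6 frames, 2 roles
print(viterbi(log_pi, log_A, log_B))
```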
Step 102-5: determine whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, perform step 102-6, otherwise go to step 102-7.
If the probability value corresponding to the role sequence obtained by the decoding in step 102-4 is greater than a preset threshold, for example 0.5, the current GMMs and HMM can generally be considered stable, and step 102-6 can be performed to assign role labels to the feature vectors in each speech segment (the subsequent step 104 can use the stabilized HMM to determine the role sequence corresponding to the feature vector sequence); otherwise, go to step 102-7 to determine whether to continue the iterative training.
Step 102-6: assign role labels to the feature vectors in each speech segment according to the role sequence.
Since the current GMMs and HMM are stable, role labels can be assigned to the feature vectors in each speech segment according to the role sequence obtained by decoding in step 102-4. In a specific implementation, because each role in the role sequence corresponds one-to-one with a feature vector in the speech segments, a role label can be assigned to each feature vector according to this one-to-one correspondence. At this point every feature vector in every speech segment has its own role label, step 102 is complete, and step 103 can be performed.
Step 102-7: determine whether the number of times the GMMs and HMM have been trained under the current number of roles is less than the preset upper limit on training rounds; if so, perform step 102-8, otherwise go to step 102-10.
Reaching this step means that the currently trained GMMs and HMM are not yet stable and iterative training should continue. When the number of roles used during training does not match the actual number of roles (the number of real speakers involved in the speech signal), the GMMs and HMM may fail to satisfy the requirement even after many training iterations (the probability value corresponding to the role sequence obtained by decoding never exceeds the preset threshold); to avoid a meaningless iteration loop, an upper limit on the number of training rounds under each number of roles can be preset. If this step determines that the number of training rounds under the current number of roles is below the limit, step 102-8 is performed to assign roles to the speech segments so that the iterative training can continue; otherwise the number of roles currently used may not match reality, so go to step 102-10 to determine whether the number of roles needs to be adjusted.
Step 102-8: assign a corresponding role to each speech segment according to the role sequence.
The role sequence was obtained by decoding in step 102-4, and since each role in the sequence corresponds one-to-one with a feature vector in the speech segments, the role corresponding to each feature vector in each speech segment is known. For each speech segment in the speech signal, this step assigns a role to the segment by computing the mode of the roles corresponding to its feature vectors. For example, if a speech segment contains 10 audio frames, i.e. 10 feature vectors, of which 8 correspond to role 1 (s1) and 2 correspond to role 2 (s2), then the mode of the roles corresponding to the segment's feature vectors is role 1 (s1), so role 1 (s1) is designated as the role of that segment.
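The mode computation in this step reduces to a per-segment majority vote, for example:

```python
from collections import Counter

def segment_role(frame_roles):
    """Designate a segment's role as the mode of its per-frame roles."""
    return Counter(frame_roles).most_common(1)[0][0]

# 8 frames decoded as role 1 and 2 frames as role 2 -> the segment gets role 1.
print(segment_role([1, 1, 1, 1, 1, 1, 1, 1, 2, 2]))  # 1
```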
Step 102-9: train a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles, then return to step 102-4 and continue.
On the basis of the roles designated for the voice segments in step 102-8, a GMM for each role, together with the HMM, can be trained. Taking the voice segment division shown in FIG. 4 as an example again, if step 102-8 designates seg1 and seg3 as role 1 (s1) and seg2 as role 2 (s2), the feature vectors contained in seg1 and seg3 can be used to train the GMM of role 1 (gmm1), while the feature vectors contained in seg2 are used to train the GMM of role 2 (gmm2). For the training method of the GMM and HMM, see the description of step 102-3; it is not repeated here.
In a specific implementation this technical solution is usually an iterative training process. To improve training efficiency, this step may train the new GMM and HMM incrementally on the basis of the GMM and HMM obtained in the previous round, i.e., starting from the previously obtained parameters and continuing to adjust them with the current sample data, which increases training speed.
After the above training process is completed and the new GMM and HMM are obtained, execution may return to step 102-4 to decode with the new models and perform the subsequent operations.
Step 102-10: determine whether the current number of roles meets the preset requirement; if so, go to step 102-6; otherwise continue with step 102-11.
Reaching this step usually indicates that the GMM and HMM trained under the current role count have not stabilized and that the number of training iterations has reached or exceeded the preset upper limit. In this case it can be determined whether the current number of roles meets the preset requirement: if it does, the role separation process can stop and execution moves to step 102-6 to assign the role labels; otherwise execution continues with step 102-11 to adjust the number of roles.
Step 102-11: adjust the number of roles, select a corresponding number of voice segments and designate a different role for each of them; then go to step 102-3 and continue.
For example, suppose the current number of roles is 2 and the preset requirement on the role count is "the number of roles equals 4". Step 102-10 determines that the current role count does not yet meet the preset requirement, so this step can be executed to adjust it, for example by adding 1 to the current number of roles, i.e., updating it to 3.
According to the adjusted role count, a corresponding number of voice segments are selected from the voice segments contained in the voice signal, and a different role is designated for each selected segment. For the duration requirement on the selected segments, see the description in step 102-2; it is not repeated here.
Taking the voice segment division shown in FIG. 4 as an example again, if the current role count is increased from 2 to 3 and seg1, seg2 and seg3 all satisfy the duration requirement for selected segments, this step may select these three segments and designate role 1 (s1) for seg1, role 2 (s2) for seg2, and role 3 (s3) for seg3.
After the above adjustment of the role count and selection of voice segments, execution may go to step 102-3 to initially train the GMM and HMM for the adjusted number of roles.
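Purely as an illustrative sketch (not the patent's implementation), the overall iteration of steps 102-3 through 102-11 can be summarized in the following control flow. The model operations are passed in as callables whose names and signatures are assumptions, and assign_segment_role is the helper sketched under step 102-8 above:

```python
def assign_role_labels(segments, train_initial, decode, retrain,
                       init_roles=2, log_prob_threshold=-50.0,
                       max_iters_per_count=10, target_roles=None):
    """Sketch of the iterative label-assignment loop (steps 102-3 .. 102-11).

    Assumed callables:
      train_initial(segments, n_roles)        -> (gmms, hmm)             # step 102-3
      decode(segments, gmms, hmm)             -> (frame_roles, log_prob) # step 102-4
      retrain(segments, seg_roles, gmms, hmm) -> (gmms, hmm)             # step 102-9
    """
    n_roles = init_roles
    gmms, hmm = train_initial(segments, n_roles)
    while True:
        for _ in range(max_iters_per_count):                  # step 102-7: iteration cap
            frame_roles, log_prob = decode(segments, gmms, hmm)
            if log_prob > log_prob_threshold:                 # step 102-5: models stable
                return frame_roles                            # step 102-6: assign labels
            seg_roles = [assign_segment_role(r) for r in frame_roles]  # step 102-8
            gmms, hmm = retrain(segments, seg_roles, gmms, hmm)        # step 102-9
        if target_roles is not None and n_roles >= target_roles:      # step 102-10
            return frame_roles
        n_roles += 1                                          # step 102-11: add a role
        gmms, hmm = train_initial(segments, n_roles)
```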
Step 103: train a DNN model with the feature vectors carrying role labels.
At this point role labels have been assigned to the feature vectors in each voice segment. On this basis, this step trains a DNN model using the labeled feature vectors as samples; the DNN model outputs, for an input feature vector, the probability of each role. For ease of understanding, a brief description of the DNN is given first.
A DNN (Deep Neural Network) usually refers to a neural network comprising one input layer, three or more hidden layers (there may also be 7, 9 or even more hidden layers), and one output layer. Each hidden layer extracts certain features and feeds its output to the next layer as input; by extracting features layer by layer, low-level features are composed into more abstract high-level features, enabling the recognition of objects or categories.
Referring to FIG. 5, a schematic diagram of the DNN topology: the network has n layers in total, each layer has multiple neurons, and adjacent layers are fully connected; each layer has its own activation function f (for example the sigmoid function). The input is the feature vector v; the transfer matrix from layer i to layer i+1 is w_i(i+1); the bias vector of layer i+1 is b_(i+1); the output of layer i is out_i and the input of layer i+1 is in_(i+1). The computation is:
in_(i+1) = out_i * w_i(i+1) + b_(i+1)
out_(i+1) = f(in_(i+1))
From this it can be seen that the parameters of the DNN model include the inter-layer transfer matrices w and the bias vector b of each layer; the main task of training a DNN model is to determine these parameters. In practice the BP (back-propagation) algorithm is usually used: training is a supervised learning process in which a labeled feature vector is fed in and propagated forward layer by layer; upon reaching the output layer the error is propagated back layer by layer, and the parameters of each layer are adjusted by gradient descent so that the network's actual output steadily approaches the desired output. For a DNN with thousands of neurons per layer the number of parameters may be in the millions or more, and the DNN model obtained by this training process usually has very strong feature extraction and recognition capabilities.
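As an illustration, the layer-by-layer forward computation above might be sketched in NumPy as follows, taking sigmoid as the activation f; the layer sizes and random initialization are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dnn_forward(v, weights, biases):
    """Forward pass described in the text:
    in_(i+1) = out_i * w_i(i+1) + b_(i+1);  out_(i+1) = f(in_(i+1))."""
    out = v
    for w, b in zip(weights, biases):
        out = sigmoid(out @ w + b)
    return out

# Tiny example: a 13-dimensional feature vector through two layers (sizes assumed)
rng = np.random.default_rng(0)
v = rng.standard_normal(13)
weights = [rng.standard_normal((13, 8)), rng.standard_normal((8, 4))]
biases = [np.zeros(8), np.zeros(4)]
print(dnn_forward(v, weights, biases).shape)  # (4,)
```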
In this embodiment the DNN model outputs, for an input feature vector, the probability of each role, so the output layer of the DNN model may use a classifier (for example softmax) as its activation function. After the role labels have been pre-assigned in step 102, if the number of roles involved in the labels is n, the output layer of the DNN model may comprise n nodes corresponding to the n roles; for an input feature vector, each node outputs the probability that the feature vector belongs to the corresponding role.
This step performs supervised training of the constructed DNN model, using the feature vectors carrying role labels as samples. In a specific implementation, the BP algorithm could be used directly; however, considering that training with the BP algorithm alone may get trapped in a local minimum, so that the resulting model cannot satisfy the application's requirements, this embodiment trains the DNN model by combining pre-training with the BP algorithm.
Pre-training usually employs an unsupervised greedy layer-wise training algorithm: a network with one hidden layer is first trained in an unsupervised manner; the trained parameters are retained and the number of layers is increased by one; a network with two hidden layers is then trained, and so on, up to the network with the maximum number of hidden layers. After this layer-by-layer training, the parameter values learned by the unsupervised process are used as initial values, and supervised training with the conventional BP algorithm then yields the final DNN model.
Because the initial distribution obtained by pre-training is closer to the final convergence values than the random initial parameters used by a pure BP algorithm, the subsequent supervised training effectively starts from a good point; the trained DNN model therefore usually does not fall into a local minimum and achieves a higher recognition rate.
Step 104: determine the role sequence corresponding to the feature vector sequence according to the DNN model and the HMM trained with feature vectors, and output the role separation result.
Since the DNN model outputs, for an input feature vector, the probability of each role, and the prior probability of each role can be obtained from the distribution of role labels over the feature vector sequence, while the prior probability of each feature vector is usually fixed, Bayes' theorem allows the probability of each role emitting the corresponding feature vector to be derived from the DNN output and these priors. In other words, the DNN model trained in step 103 can be used to determine the emission probabilities of the HMM states.
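A hedged sketch of this Bayes conversion (the function name and array shapes are assumptions): because p(x) is the same for every role, the scaled emission likelihood can be taken as log p(x|s) = log p(s|x) - log p(s), up to a constant:

```python
import numpy as np

def emission_log_likelihoods(dnn_posteriors, role_priors):
    """Convert DNN posteriors p(role | x) into scaled HMM emission
    likelihoods p(x | role) via Bayes' theorem, dropping the constant p(x)."""
    # dnn_posteriors: (num_frames, num_roles), each row a softmax output
    # role_priors:    (num_roles,), estimated from the role-label distribution
    return np.log(dnn_posteriors) - np.log(role_priors)
```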
The HMM may be one trained with the feature vector sequence on the basis of using the above DNN model to determine the HMM emission probabilities. Considering that the HMM used when assigning role labels to the feature vectors in step 102 already describes the transition relationships between roles in an essentially stable way, no additional training is needed; this embodiment therefore uses that HMM directly and replaces the GMM with the trained DNN model, i.e., the emission probability of each HMM state is determined by the DNN model.
In this embodiment, step 102-1 performed the segmentation into voice segments; this step determines the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when pre-assigning the role labels.
Determining a role sequence from a feature vector sequence is the decoding problem described above: a decoding operation can be performed according to the DNN model and the HMM to obtain the role sequence whose probability of emitting the feature vector sequence ranks highest (for example, the sequence with the largest probability value), and that role sequence is taken as the role sequence corresponding to the feature vector sequence. For details see the description of step 102-4; it is not repeated here.
After the role sequence corresponding to the feature vector sequence of each voice segment has been obtained by decoding, the corresponding role separation result can be output. Since each role in the role sequence corresponds one-to-one with a feature vector, and the audio frame corresponding to each feature vector has its own start and end time, this step can output, for each role, the start and end time information of the audio frames to which its feature vectors belong.
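As a small illustrative sketch (the frame length and frame shift values are assumptions, not taken from the patent), the frame-level role sequence could be turned into per-role start/end times as follows:

```python
def roles_to_time_spans(frame_roles, frame_len=0.025, frame_shift=0.010):
    """Collapse a frame-level role sequence into (role, start, end) spans,
    using each frame's start time (index * shift) and duration."""
    spans, start = [], 0
    for i in range(1, len(frame_roles) + 1):
        if i == len(frame_roles) or frame_roles[i] != frame_roles[start]:
            begin = start * frame_shift
            end = (i - 1) * frame_shift + frame_len
            spans.append((frame_roles[start], begin, end))
            start = i
    return spans

print(roles_to_time_spans(["s1"] * 3 + ["s2"] * 2))
# approximately: [('s1', 0.0, 0.045), ('s2', 0.03, 0.065)]
```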
This completes, through steps 101 to 104, the detailed description of a specific implementation of the voice-based role separation method provided by this application. It should be noted that in step 102 this embodiment pre-assigns role labels to the feature vectors in a top-down manner, gradually increasing the number of roles. In other implementations a bottom-up manner that gradually reduces the number of roles may also be used: initially, each voice segment obtained by segmentation may be designated as a different role, and a GMM for each role and the HMM are trained. If the probability values obtained by decoding with the iteratively trained GMMs and HMM never exceed the preset threshold, then, when adjusting the role count, the similarity between the GMMs of the roles can be evaluated (for example by computing the KL divergence), the voice segments corresponding to GMMs whose similarity meets a preset requirement are merged, and the role count is reduced accordingly. This process is repeated iteratively until the probability value obtained by decoding with the HMM exceeds the preset threshold or the role count meets the preset requirement, at which point the iteration stops and role labels are assigned to the feature vectors in each voice segment according to the decoded role sequence.
In summary, because the voice-based role separation method provided by this application models roles with a DNN model that has strong feature extraction capability, it characterizes roles more finely and accurately than a conventional GMM and can therefore obtain more accurate role separation results. The technical solution of this application can be applied not only to scenarios that separate the roles in conversational speech such as call-center or conference recordings, but also to any other scenario that requires separating the roles in a voice signal: as long as the voice signal involves two or more roles, the technical solution of this application can be adopted, with the corresponding beneficial effects.
The above embodiment provides a voice-based role separation method; correspondingly, this application also provides a voice-based role separation device. Please refer to FIG. 6, a schematic diagram of an embodiment of the voice-based role separation device of this application. Since the device embodiment is essentially similar to the method embodiment, it is described relatively simply; for relevant details see the description of the method embodiment. The device embodiment described below is merely illustrative.
A voice-based role separation device of this embodiment comprises: a feature extraction unit 601, configured to extract feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assignment unit 602, configured to assign role labels to the feature vectors; a DNN model training unit 603, configured to train a DNN model with the feature vectors carrying role labels, wherein the DNN model outputs, for an input feature vector, the probability of each role; and a role determination unit 604, configured to determine the role sequence corresponding to the feature vector sequence according to the DNN model and an HMM trained with feature vectors and to output the role separation result, wherein the HMM describes the transition relationships between roles.
Optionally, the device further comprises:
a voice segment segmentation unit, configured to, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, segment the voice signal into voice segments by identifying and discarding audio frames that contain no voice content;
the label assignment unit being specifically configured to assign role labels to the feature vectors in each voice segment;
and the role determination unit being specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM trained with feature vectors, and to output the role separation result.
Optionally, the label assignment unit is specifically configured to pre-assign role labels to the feature vectors in each voice segment by establishing GMMs and an HMM, wherein the GMM of each role outputs, for an input feature vector, the probability that the feature vector corresponds to that role;
and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in each voice segment.
Optionally, the label assignment unit comprises:
an initial role designation subunit, configured to select a number of voice segments corresponding to a preset initial role count and designate a different role for each of them;
an initial model training subunit, configured to train a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles;
a decoding subunit, configured to decode according to the trained GMMs and HMM and obtain the role sequence whose probability of emitting the feature vector sequences contained in the voice segments ranks highest;
a probability judgment subunit, configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
a label assignment subunit, configured to, when the output of the probability judgment subunit is yes, assign role labels to the feature vectors in each voice segment according to the role sequence.
Optionally, the label assignment unit further comprises:
a per-segment role designation subunit, configured to, when the output of the probability judgment subunit is no, designate a corresponding role for each voice segment according to the role sequence;
a model update training subunit, configured to train a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles, and to trigger the decoding subunit.
Optionally, the per-segment role designation subunit is specifically configured to, for each voice segment, designate the mode of the roles corresponding to its feature vectors as the role of that segment.
Optionally, the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous training round.
Optionally, the label assignment unit further comprises:
a training count judgment subunit, configured to, when the output of the probability judgment subunit is no, determine whether the number of times the GMMs and HMM have been trained under the current role count is less than a preset upper limit on training iterations, and, when the result is yes, trigger the per-segment role designation subunit;
a role count adjustment subunit, configured to, when the output of the training count judgment subunit is no, adjust the number of roles, select a corresponding number of voice segments, designate a different role for each of them, and trigger the initial model training subunit.
Optionally, the label assignment unit further comprises:
a role count judgment subunit, configured to, when the output of the training count judgment subunit is no, determine whether the current number of roles meets a preset requirement; if it does, trigger the label assignment subunit, otherwise trigger the role count adjustment subunit.
Optionally, the feature extraction unit comprises:
a framing subunit, configured to divide the voice signal into frames according to a preset frame length to obtain multiple audio frames;
a feature extraction execution subunit, configured to extract the feature vector of each audio frame to obtain the feature vector sequence.
Optionally, the feature extraction execution subunit is specifically configured to extract the MFCC, PLP, or LPC features of each audio frame to obtain the feature vector sequence.
Optionally, the voice segment segmentation unit is specifically configured to segment the voice signal into voice segments by using VAD technology to identify and discard the audio frames that contain no voice content.
Optionally, the device further comprises:
a VAD smoothing unit, configured to, after the voice segment segmentation unit has segmented the voice segments using VAD technology, merge voice segments whose duration is less than a preset threshold with adjacent voice segments.
Optionally, the DNN model training unit is specifically configured to train the DNN model with a back-propagation algorithm.
Optionally, the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence whose probability of emitting the feature vector sequence ranks highest, and take that role sequence as the role sequence corresponding to the feature vector sequence.
Optionally, the role determination unit outputs the role separation result in the following manner: according to the role sequence corresponding to the feature vector sequence, it outputs, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
Optionally, the initial role designation subunit or the role count adjustment subunit selects the corresponding number of voice segments specifically by selecting that number of voice segments whose duration meets a preset requirement.
Although this application is disclosed above by way of preferred embodiments, they are not intended to limit it; any person skilled in the art can make possible variations and modifications without departing from the spirit and scope of this application, so the scope of protection of this application shall be the scope defined by its claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent storage in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Claims (35)

  1. A voice-based role separation method, comprising:
    extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence;
    assigning role labels to the feature vectors;
    training a deep neural network (DNN) model with the feature vectors carrying role labels;
    determining a role sequence corresponding to the feature vector sequence according to the DNN model and a hidden Markov model (HMM) trained with feature vectors, and outputting a role separation result;
    wherein the DNN model outputs, for an input feature vector, the probability of each role, and the HMM describes the transition relationships between roles.
  2. The voice-based role separation method according to claim 1, wherein, after the step of extracting feature vectors frame by frame from the voice signal and before the step of assigning role labels to the feature vectors, the following operation is performed: segmenting the voice signal into voice segments by identifying and discarding audio frames that contain no voice content;
    assigning role labels to the feature vectors comprises: assigning role labels to the feature vectors in each voice segment; and determining the role sequence corresponding to the feature vector sequence comprises: determining the role sequence corresponding to the feature vector sequence contained in each voice segment.
  3. The voice-based role separation method according to claim 2, wherein assigning role labels to the feature vectors in each voice segment comprises: assigning role labels to the feature vectors in each voice segment by establishing Gaussian mixture models (GMMs) and an HMM, wherein the GMM of each role outputs, for an input feature vector, the probability that the feature vector corresponds to that role;
    determining the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM trained with feature vectors comprises: determining the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in each voice segment.
  4. The voice-based role separation method according to claim 3, wherein assigning role labels to the feature vectors in each voice segment by establishing Gaussian mixture models (GMMs) and an HMM comprises:
    selecting a number of voice segments corresponding to a preset initial role count, and designating a different role for each of them;
    training a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles;
    decoding according to the trained GMMs and HMM to obtain the role sequence whose probability of emitting the feature vector sequences contained in the voice segments ranks highest;
    determining whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, assigning role labels to the feature vectors in each voice segment according to the role sequence.
  5. The voice-based role separation method according to claim 4, wherein, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is no, the following operations are performed:
    designating a corresponding role for each voice segment according to the role sequence;
    training a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles;
    returning to the step of decoding according to the trained GMMs and HMM.
  6. The voice-based role separation method according to claim 5, wherein designating a corresponding role for each voice segment according to the role sequence comprises:
    for each voice segment, designating the mode of the roles corresponding to its feature vectors as the role of that segment.
  7. The voice-based role separation method according to claim 5, wherein training a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles comprises: training the GMMs and the HMM incrementally on the basis of the models obtained in the previous training round.
  8. The voice-based role separation method according to claim 5, wherein, when the result of determining whether the probability value corresponding to the role sequence is greater than the preset threshold is no, the following operations are performed:
    determining whether the number of times the GMMs and HMM have been trained under the current role count is less than a preset upper limit on training iterations;
    if so, performing the step of designating a corresponding role for each voice segment according to the role sequence;
    if not, performing the following operations:
    adjusting the number of roles, selecting a corresponding number of voice segments and designating a different role for each of them;
    and returning to the step of training a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles.
  9. The voice-based role separation method according to claim 8, wherein, when the result of determining whether the number of times the GMMs and HMM have been trained under the current role count is less than the preset upper limit on training iterations is no, the following operation is performed:
    determining whether the current number of roles meets a preset requirement; if so, going to the step of assigning role labels to the feature vectors in each voice segment according to the role sequence; if not, performing the step of adjusting the number of roles.
  10. The voice-based role separation method according to claim 8, wherein the preset initial role count is 2, and adjusting the number of roles comprises: adding 1 to the current number of roles.
  11. The voice-based role separation method according to claim 1, wherein extracting feature vectors frame by frame from the voice signal to obtain a feature vector sequence comprises:
    dividing the voice signal into frames according to a preset frame length to obtain multiple audio frames;
    extracting the feature vector of each audio frame to obtain the feature vector sequence.
  12. The voice-based role separation method according to claim 11, wherein extracting the feature vector of each audio frame comprises: extracting MFCC features, PLP features, or LPC features.
  13. The voice-based role separation method according to claim 2, wherein identifying and discarding the audio frames that contain no voice content comprises: identifying the audio frames that contain no voice content using VAD technology and performing the corresponding discarding operation.
  14. The voice-based role separation method according to claim 13, wherein, after the identifying and discarding operations are performed using VAD technology and the voice signal is segmented into voice segments, the following VAD smoothing operation is performed:
    merging voice segments whose duration is less than a preset threshold with adjacent voice segments.
  15. The voice-based role separation method according to claim 1, wherein training the deep neural network DNN model with the feature vectors carrying role labels comprises: training the DNN model with a back-propagation algorithm.
  16. The voice-based role separation method according to claim 1, wherein determining the role sequence corresponding to the feature vector sequence according to the DNN model and the hidden Markov model HMM trained with feature vectors comprises: performing a decoding operation according to the DNN model and the HMM, obtaining the role sequence whose probability of emitting the feature vector sequence ranks highest, and taking that role sequence as the role sequence corresponding to the feature vector sequence.
  17. The voice-based role separation method according to claim 1, wherein outputting the role separation result comprises: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
  18. The voice-based role separation method according to claim 4 or 8, wherein selecting the corresponding number of voice segments comprises: selecting that number of voice segments whose duration meets a preset requirement.
  19. A voice-based role separation device, comprising:
    a feature extraction unit, configured to extract feature vectors frame by frame from a voice signal to obtain a feature vector sequence;
    a label assignment unit, configured to assign role labels to the feature vectors;
    a DNN model training unit, configured to train a DNN model with the feature vectors carrying role labels, wherein the DNN model outputs, for an input feature vector, the probability of each role;
    a role determination unit, configured to determine the role sequence corresponding to the feature vector sequence according to the DNN model and an HMM trained with feature vectors and to output a role separation result, wherein the HMM describes the transition relationships between roles.
  20. The voice-based role separation device according to claim 19, further comprising:
    a voice segment segmentation unit, configured to, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, segment the voice signal into voice segments by identifying and discarding audio frames that contain no voice content;
    wherein the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment;
    and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM trained with feature vectors, and to output the role separation result.
  21. The voice-based role separation device according to claim 20, wherein the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment by establishing GMMs and an HMM, wherein the GMM of each role outputs, for an input feature vector, the probability that the feature vector corresponds to that role;
    and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in each voice segment.
  22. The voice-based role separation device according to claim 21, wherein the label assignment unit comprises:
    an initial role designation subunit, configured to select a number of voice segments corresponding to a preset initial role count and designate a different role for each of them;
    an initial model training subunit, configured to train a GMM for each role and the HMM using the feature vectors in the voice segments with designated roles;
    a decoding subunit, configured to decode according to the trained GMMs and HMM and obtain the role sequence whose probability of emitting the feature vector sequences contained in the voice segments ranks highest;
    a probability judgment subunit, configured to determine whether the probability value corresponding to the role sequence is greater than a preset threshold;
    a label assignment subunit, configured to, when the output of the probability judgment subunit is yes, assign role labels to the feature vectors in each voice segment according to the role sequence.
  23. The voice-based role separation device according to claim 22, wherein the label assignment unit further comprises:
    a per-segment role designation subunit, configured to, when the output of the probability judgment subunit is no, designate a corresponding role for each voice segment according to the role sequence;
    a model update training subunit, configured to train a GMM for each role and the HMM according to the feature vectors in each voice segment and their corresponding roles, and to trigger the decoding subunit.
  24. The voice-based role separation device according to claim 23, wherein the per-segment role designation subunit is specifically configured to, for each voice segment, designate the mode of the roles corresponding to its feature vectors as the role of that segment.
  25. The voice-based role separation device according to claim 23, wherein the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous training round.
  26. The voice-based role separation device according to claim 23, wherein the label assignment unit further comprises:
    a training count judgment subunit, configured to, when the output of the probability judgment subunit is no, determine whether the number of times the GMMs and HMM have been trained under the current role count is less than a preset upper limit on training iterations, and, when the result is yes, trigger the per-segment role designation subunit;
    a role count adjustment subunit, configured to, when the output of the training count judgment subunit is no, adjust the number of roles, select a corresponding number of voice segments, designate a different role for each of them, and trigger the initial model training subunit.
  27. The voice-based role separation device according to claim 26, wherein the label assignment unit further comprises:
    a role count judgment subunit, configured to, when the output of the training count judgment subunit is no, determine whether the current number of roles meets a preset requirement; if it does, trigger the label assignment subunit, otherwise trigger the role count adjustment subunit.
  28. The voice-based role separation device according to claim 19, wherein the feature extraction unit comprises:
    a framing subunit, configured to divide the voice signal into frames according to a preset frame length to obtain multiple audio frames;
    a feature extraction execution subunit, configured to extract the feature vector of each audio frame to obtain the feature vector sequence.
  29. The voice-based role separation device according to claim 28, wherein the feature extraction execution subunit is specifically configured to extract the MFCC, PLP, or LPC features of each audio frame to obtain the feature vector sequence.
  30. The voice-based role separation device according to claim 20, wherein the voice segment segmentation unit is specifically configured to segment the voice signal into voice segments by using VAD technology to identify and discard the audio frames that contain no voice content.
  31. The voice-based role separation device according to claim 30, further comprising:
    a VAD smoothing unit, configured to, after the voice segment segmentation unit has segmented the voice segments using VAD technology, merge voice segments whose duration is less than a preset threshold with adjacent voice segments.
  32. The voice-based role separation device according to claim 19, wherein the DNN model training unit is specifically configured to train the DNN model with a back-propagation algorithm.
  33. The voice-based role separation device according to claim 19, wherein the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence whose probability of emitting the feature vector sequence ranks highest, and take that role sequence as the role sequence corresponding to the feature vector sequence.
  34. The voice-based role separation device according to claim 19, wherein the role determination unit outputs the role separation result in the following manner: according to the role sequence corresponding to the feature vector sequence, it outputs, for each role, the start and end time information of the audio frames to which the corresponding feature vectors belong.
  35. The voice-based role separation device according to claim 22 or 26, wherein the initial role designation subunit or the role count adjustment subunit selects the corresponding number of voice segments specifically by selecting that number of voice segments whose duration meets a preset requirement.
PCT/CN2016/103490 2015-11-05 2016-10-27 Voice-based role separation method and device WO2017076211A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510744743.4A CN106683661B (en) 2015-11-05 2015-11-05 Role separation method and device based on voice
CN201510744743.4 2015-11-05

Publications (1)

Publication Number Publication Date
WO2017076211A1 true WO2017076211A1 (en) 2017-05-11

Family

ID=58661656

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103490 WO2017076211A1 (en) 2015-11-05 2016-10-27 Voice-based role separation method and device

Country Status (2)

Country Link
CN (1) CN106683661B (en)
WO (1) WO2017076211A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545898A (en) * 2017-08-07 2018-01-05 清华大学 A kind of processing method and processing device for distinguishing speaker's voice
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 A kind of voice dividing method and system based on shot and long term memory models
US20190207946A1 (en) * 2016-12-20 2019-07-04 Google Inc. Conditional provision of access by interactive assistant modules
WO2019196648A1 (en) * 2018-04-10 2019-10-17 Huawei Technologies Co., Ltd. A method and device for processing whispered speech
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
US10685187B2 (en) 2017-05-15 2020-06-16 Google Llc Providing access to user-controlled resources by automated assistants
WO2020222922A1 (en) * 2019-04-29 2020-11-05 Microsoft Technology Licensing, Llc System and method for speaker role determination and scrubbing identifying information
US11087023B2 (en) 2018-08-07 2021-08-10 Google Llc Threshold-based assembly of automated assistant responses
US11436417B2 (en) 2017-05-15 2022-09-06 Google Llc Providing access to user-controlled resources by automated assistants

Families Citing this family (20)

Publication number Priority date Publication date Assignee Title
CN108346436B (en) 2017-08-22 2020-06-23 腾讯科技(深圳)有限公司 Voice emotion detection method and device, computer equipment and storage medium
CN107885723B (en) * 2017-11-03 2021-04-09 广州杰赛科技股份有限公司 Conversation role distinguishing method and system
CN108109619B (en) * 2017-11-15 2021-07-06 中国科学院自动化研究所 Auditory selection method and device based on memory and attention model
CN108074576B (en) * 2017-12-14 2022-04-08 讯飞智元信息科技有限公司 Speaker role separation method and system under interrogation scene
CN107993665B (en) * 2017-12-14 2021-04-30 科大讯飞股份有限公司 Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN110085216A (en) * 2018-01-23 2019-08-02 中国科学院声学研究所 A kind of vagitus detection method and device
CN108564952B (en) * 2018-03-12 2019-06-07 新华智云科技有限公司 The method and apparatus of speech roles separation
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN108766440B (en) * 2018-05-28 2020-01-14 平安科技(深圳)有限公司 Speaker separation model training method, two-speaker separation method and related equipment
CN108806707B (en) 2018-06-11 2020-05-12 百度在线网络技术(北京)有限公司 Voice processing method, device, equipment and storage medium
CN109065076B (en) * 2018-09-05 2020-11-27 深圳追一科技有限公司 Audio label setting method, device, equipment and storage medium
CN109344195B (en) * 2018-10-25 2021-09-21 电子科技大学 HMM model-based pipeline security event recognition and knowledge mining method
CN109256128A (en) * 2018-11-19 2019-01-22 广东小天才科技有限公司 A kind of method and system determining user role automatically according to user's corpus
CN110111797A (en) * 2019-04-04 2019-08-09 湖北工业大学 Method for distinguishing speek person based on Gauss super vector and deep neural network
CN110444223B (en) * 2019-06-26 2023-05-23 平安科技(深圳)有限公司 Speaker separation method and device based on cyclic neural network and acoustic characteristics
CN110337030B (en) * 2019-08-08 2020-08-11 腾讯科技(深圳)有限公司 Video playing method, device, terminal and computer readable storage medium
CN111508505B (en) * 2020-04-28 2023-11-03 讯飞智元信息科技有限公司 Speaker recognition method, device, equipment and storage medium
CN112861509B (en) * 2021-02-08 2023-05-12 青牛智胜(深圳)科技有限公司 Role analysis method and system based on multi-head attention mechanism
CN113413613A (en) * 2021-06-17 2021-09-21 网易(杭州)网络有限公司 Method and device for optimizing voice chat in game, electronic equipment and medium
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium

Citations (8)

Publication number Priority date Publication date Assignee Title
CN1366295A (en) * 2000-07-05 2002-08-28 松下电器产业株式会社 Speaker's inspection and speaker's identification system and method based on prior knowledge
CN101814159A (en) * 2009-02-24 2010-08-25 余华 Speaker verification method based on combination of auto-associative neural network and Gaussian mixture model-universal background model
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound identification method on basis of rapid sparse decomposition and deep learning
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN104751227A (en) * 2013-12-31 2015-07-01 安徽科大讯飞信息科技股份有限公司 Method and system for constructing deep neural network

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
CN101650944A (en) * 2009-09-17 2010-02-17 浙江工业大学 Method for distinguishing speakers based on protective kernel Fisher distinguishing method
US9257121B2 (en) * 2010-12-10 2016-02-09 Panasonic Intellectual Property Corporation Of America Device and method for pass-phrase modeling for speaker verification, and verification system
CN102129860B (en) * 2011-04-07 2012-07-04 南京邮电大学 Text-dependent speaker recognition method based on an infinite-state hidden Markov model
US9489950B2 (en) * 2012-05-31 2016-11-08 Agency For Science, Technology And Research Method and system for dual scoring for text-dependent speaker verification
US9401148B2 (en) * 2013-11-04 2016-07-26 Google Inc. Speaker verification using neural networks
US9336781B2 (en) * 2013-10-17 2016-05-10 Sri International Content-aware speaker recognition
CN103700370B (en) * 2013-12-04 2016-08-17 北京中科模识科技有限公司 Radio and television speech recognition method and system
CN103714812A (en) * 2013-12-23 2014-04-09 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
CN104751842B (en) * 2013-12-31 2019-11-15 科大讯飞股份有限公司 The optimization method and system of deep neural network
CN104064189A (en) * 2014-06-26 2014-09-24 厦门天聪智能软件有限公司 Voiceprint dynamic password modeling and verification method
CN104376250A (en) * 2014-12-03 2015-02-25 优化科技(苏州)有限公司 Live-person identity verification method based on sound and image features
CN104575504A (en) * 2014-12-24 2015-04-29 上海师范大学 Method for personalized television voice wake-up using voiceprint and speech recognition
CN104575490B (en) * 2014-12-30 2017-11-07 苏州驰声信息科技有限公司 Spoken pronunciation evaluation method based on a deep neural network posterior probability algorithm
CN104835497A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Voiceprint card swiping system and method based on dynamic password
CN104934028B (en) * 2015-06-17 2017-11-17 百度在线网络技术(北京)有限公司 Training method and device for a deep neural network model for speech synthesis

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1366295A (en) * 2000-07-05 2002-08-28 松下电器产业株式会社 Speaker verification and identification system and method based on prior knowledge
CN101814159A (en) * 2009-02-24 2010-08-25 余华 Speaker verification method based on combination of auto-associative neural network and Gaussian mixture model-universal background model
CN103971690A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound recognition method based on fast sparse decomposition and deep learning
CN104751227A (en) * 2013-12-31 2015-07-01 安徽科大讯飞信息科技股份有限公司 Method and system for constructing deep neural network
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on deep learning
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIAO, YAJIE et al.: "Improving Low-Resource CD-DNN-HMM Using Dropout and Multilingual DNN Training", Proceedings of INTERSPEECH, 29 August 2013 (2013-08-29), pages 2237-2241, XP055380510 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190207946A1 (en) * 2016-12-20 2019-07-04 Google Inc. Conditional provision of access by interactive assistant modules
US11436417B2 (en) 2017-05-15 2022-09-06 Google Llc Providing access to user-controlled resources by automated assistants
US10685187B2 (en) 2017-05-15 2020-06-16 Google Llc Providing access to user-controlled resources by automated assistants
CN107545898B (en) * 2017-08-07 2020-07-14 清华大学 Processing method and device for distinguishing speakers' voices
CN107545898A (en) * 2017-08-07 2018-01-05 清华大学 Processing method and device for distinguishing speakers' voices
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 Voice segmentation method and system based on long short-term memory models
US10832660B2 (en) 2018-04-10 2020-11-10 Futurewei Technologies, Inc. Method and device for processing whispered speech
WO2019196648A1 (en) * 2018-04-10 2019-10-17 Huawei Technologies Co., Ltd. A method and device for processing whispered speech
US11314890B2 (en) 2018-08-07 2022-04-26 Google Llc Threshold-based assembly of remote automated assistant responses
US11087023B2 (en) 2018-08-07 2021-08-10 Google Llc Threshold-based assembly of automated assistant responses
US20220083687A1 (en) 2018-08-07 2022-03-17 Google Llc Threshold-based assembly of remote automated assistant responses
US11455418B2 (en) 2018-08-07 2022-09-27 Google Llc Assembling and evaluating automated assistant responses for privacy concerns
US11790114B2 (en) 2018-08-07 2023-10-17 Google Llc Threshold-based assembly of automated assistant responses
US11822695B2 (en) 2018-08-07 2023-11-21 Google Llc Assembling and evaluating automated assistant responses for privacy concerns
US11966494B2 (en) 2018-08-07 2024-04-23 Google Llc Threshold-based assembly of remote automated assistant responses
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
US11062706B2 (en) 2019-04-29 2021-07-13 Microsoft Technology Licensing, Llc System and method for speaker role determination and scrubbing identifying information
WO2020222922A1 (en) * 2019-04-29 2020-11-05 Microsoft Technology Licensing, Llc System and method for speaker role determination and scrubbing identifying information

Also Published As

Publication number Publication date
CN106683661A (en) 2017-05-17
CN106683661B (en) 2021-02-05

Similar Documents

Publication Publication Date Title
WO2017076211A1 (en) Voice-based role separation method and device
US10902843B2 (en) Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
US10249292B2 (en) Using long short-term memory recurrent neural network for speaker diarization segmentation
Tong et al. A comparative study of robustness of deep learning approaches for VAD
US10074363B2 (en) Method and apparatus for keyword speech recognition
WO2019037700A1 (en) Speech emotion detection method and apparatus, computer device, and storage medium
US10872599B1 (en) Wakeword training
EP3469582A1 (en) Neural network-based voiceprint information extraction method and apparatus
JP6732703B2 (en) Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
WO2018192186A1 (en) Speech recognition method and apparatus
Liu et al. Graph-based semi-supervised acoustic modeling in DNN-based speech recognition
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
Rosdi et al. Isolated Malay speech recognition using Hidden Markov Models
JP5704071B2 (en) Audio data analysis apparatus, audio data analysis method, and audio data analysis program
Li et al. Semi-supervised ensemble DNN acoustic model training
Liu et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion
US11557292B1 (en) Speech command verification
CN113823265A (en) Voice recognition method and device and computer equipment
JP6594251B2 (en) Acoustic model learning device, speech synthesizer, method and program thereof
KR100776729B1 (en) Speaker-independent variable-word keyword spotting system including garbage modeling unit using decision tree-based state clustering and method thereof
Walter et al. An evaluation of unsupervised acoustic model training for a dysarthric speech interface
Bai et al. Phone Classification Using a Non-Linear Manifold with Broad Phone Class Dependent DNNs.
Banjara et al. Nepali speech recognition using CNN and sequence models
TWI725111B (en) Voice-based role separation method and device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 16861479
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry into the European phase
Ref document number: 16861479
Country of ref document: EP
Kind code of ref document: A1