TWI725111B - Voice-based role separation method and device - Google Patents

Voice-based role separation method and device

Info

Publication number
TWI725111B
Authority
TW
Taiwan
Prior art keywords
role
voice
feature vector
sequence
hmm
Prior art date
Application number
TW106102244A
Other languages
Chinese (zh)
Other versions
TW201828283A (en)
Inventor
李曉輝
李宏言
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司
Priority to TW106102244A
Publication of TW201828283A
Application granted
Publication of TWI725111B

Abstract

This application discloses a voice-based role separation method, comprising: extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence; assigning role labels to the feature vectors; training a deep neural network (DNN) model with the role-labeled feature vectors; and determining the role sequence corresponding to the feature vector sequence according to the DNN model and a hidden Markov model (HMM) trained with the feature vectors, and outputting the role separation result. The DNN model outputs, for an input feature vector, the probability of the vector corresponding to each role, and the HMM describes the transition relationships between roles. This application also provides a voice-based role separation device. Because the above method models speaker roles with a DNN, whose powerful feature extraction capability characterizes roles more finely and accurately than the traditional GMM, it can obtain more accurate role separation results.

Description

Voice-based role separation method and device

This application relates to the field of speech recognition, and in particular to a voice-based role separation method. This application also relates to a voice-based role separation device.

Speech is the most natural way for humans to communicate, and speech recognition technology enables machines to convert speech signals into corresponding text or commands through a process of recognition and understanding. Speech recognition is an interdisciplinary subject involving signal processing, pattern recognition, probability and information theory, the mechanisms of speech production and hearing, artificial intelligence, and more.

In practical applications, analyzing a speech signal accurately requires not only speech recognition but also identifying the speaker of each segment of speech, so the need to separate speech by role arises naturally. Conversational speech occurs in many scenarios such as daily life, meetings, and telephone calls, and role separation of such speech makes it possible to determine which part was spoken by one person and which part by another. Once conversational speech is separated by role, combining it with speaker recognition and speech recognition opens up broader applications. For example, after separating the speech of a customer service call by role, speech recognition can determine what the agent said and what the customer said, enabling service quality inspection or the mining of customers' potential needs.

In the prior art, a GMM (Gaussian Mixture Model) and an HMM (Hidden Markov Model) are usually used for role separation of conversational speech: each role is modeled with a GMM, and the transitions between roles are modeled with an HMM. GMM modeling was proposed relatively early, and its ability to fit an arbitrary function depends on the number of mixed Gaussian components, so its capacity to characterize roles is limited. As a result, the accuracy of role separation is usually low and cannot meet application requirements.

The embodiments of this application provide a voice-based role separation method and device to address the relatively low accuracy of existing GMM- and HMM-based role separation techniques.

This application provides a voice-based role separation method, comprising: extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence; assigning role labels to the feature vectors; training a deep neural network (DNN) model with the role-labeled feature vectors; and determining the role sequence corresponding to the feature vector sequence according to the DNN model and a hidden Markov model (HMM) trained with the feature vectors, and outputting the role separation result; wherein the DNN model outputs, for an input feature vector, the probability of corresponding to each role, and the HMM describes the transition relationships between roles.

Optionally, after the step of extracting feature vectors frame by frame from the speech signal and before the step of assigning role labels to the feature vectors, the following is performed: identifying and removing audio frames that contain no speech content, thereby splitting the speech signal into speech segments. Assigning role labels to the feature vectors then comprises assigning role labels to the feature vectors in each speech segment, and determining the role sequence corresponding to the feature vector sequence comprises determining the role sequence corresponding to the feature vector sequence contained in each speech segment.

Optionally, assigning role labels to the feature vectors in each speech segment comprises assigning the labels by building a Gaussian mixture model (GMM) and an HMM, wherein the GMM outputs, for each role and a given input feature vector, the probability that the feature vector corresponds to that role. Determining, according to the DNN model and the HMM trained with the feature vectors, the role sequence corresponding to the feature vector sequence contained in each speech segment then comprises: determining that role sequence according to the DNN model and the HMM that was used to assign role labels to the feature vectors in each speech segment.

Optionally, assigning role labels to the feature vectors in each speech segment by building a GMM and an HMM comprises: selecting a number of speech segments equal to a preset initial number of roles, and designating a different role for each selected segment; training a GMM for each role, as well as the HMM, using the feature vectors in the role-designated segments; decoding with the trained GMMs and HMM to obtain the top-ranked role sequence by probability of outputting the feature vector sequences contained in the speech segments; judging whether the probability value corresponding to that role sequence is greater than a preset threshold; and if so, assigning role labels to the feature vectors in each speech segment according to the role sequence.

Optionally, when the probability value corresponding to the role sequence is judged not to be greater than the preset threshold, the following is performed: designating a corresponding role for each speech segment according to the role sequence; training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles; and returning to the step of decoding with the trained GMMs and HMM.

Optionally, designating a corresponding role for each speech segment according to the role sequence comprises: for each speech segment, designating the mode of the roles corresponding to its feature vectors as the role of that segment.

Optionally, training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles comprises: training the GMMs and the HMM incrementally on the basis of the models obtained in the previous round of training.

Optionally, when the probability value corresponding to the role sequence is judged not to be greater than the preset threshold, the following is performed: judging whether the number of times the GMMs and HMM have been trained under the current number of roles is less than a preset upper limit on training iterations; if so, performing the step of designating a corresponding role for each speech segment according to the role sequence; if not, adjusting the number of roles, selecting a corresponding number of speech segments, designating a different role for each, and returning to the step of training a GMM for each role, as well as the HMM, using the feature vectors in the role-designated segments.

Optionally, when the number of training iterations under the current number of roles is judged not to be less than the preset upper limit, the following is performed: judging whether the current number of roles meets a preset requirement; if so, proceeding to the step of assigning role labels to the feature vectors in each speech segment according to the role sequence; if not, performing the step of adjusting the number of roles.

Optionally, the preset initial number of roles is 2, and adjusting the number of roles comprises adding 1 to the current number of roles.

Optionally, extracting feature vectors frame by frame from the speech signal to obtain the feature vector sequence comprises: splitting the speech signal into multiple audio frames according to a preset frame length, and extracting a feature vector from each audio frame to obtain the feature vector sequence.

Optionally, extracting the feature vector of each audio frame comprises extracting MFCC, PLP, or LPC features.

Optionally, identifying and removing the audio frames that contain no speech content comprises identifying those frames with VAD technology and performing the corresponding removal.

Optionally, after the identification and removal are performed with VAD technology and the speech signal is split into speech segments, the following VAD smoothing is performed: merging any speech segment whose duration is less than a preset threshold with an adjacent segment.

Optionally, training the deep neural network (DNN) model with the role-labeled feature vectors comprises training the DNN model with the backpropagation algorithm.

Optionally, determining the role sequence corresponding to the feature vector sequence according to the DNN model and the hidden Markov model HMM trained with the feature vectors comprises: performing a decoding operation with the DNN model and the HMM, obtaining the top-ranked role sequence by probability of outputting the feature vector sequence, and taking that role sequence as the one corresponding to the feature vector sequence.

Optionally, outputting the role separation result comprises: according to the role sequence corresponding to the feature vector sequence, outputting for each role the start and end times of the audio frames to which its feature vectors belong.

Optionally, selecting the corresponding number of speech segments comprises selecting that number of segments whose durations meet a preset requirement.

Correspondingly, this application also provides a voice-based role separation device, comprising: a feature extraction unit for extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assignment unit for assigning role labels to the feature vectors; a DNN model training unit for training a DNN model with the role-labeled feature vectors, wherein the DNN model outputs, for an input feature vector, the probability of corresponding to each role; and a role determination unit for determining the role sequence corresponding to the feature vector sequence according to the DNN model and the HMM trained with the feature vectors and outputting the role separation result, wherein the HMM describes the transition relationships between roles.

Optionally, the device further comprises a speech segment splitting unit which, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, identifies and removes audio frames that contain no speech content and thereby splits the speech signal into speech segments. The label assignment unit is specifically configured to assign role labels to the feature vectors in each speech segment, and the role determination unit is specifically configured to determine, according to the DNN model and the HMM trained with the feature vectors, the role sequence corresponding to the feature vector sequence contained in each speech segment and to output the role separation result.

Optionally, the label assignment unit is specifically configured to assign role labels to the feature vectors in each speech segment by building a GMM and an HMM, wherein the GMM outputs, for each role and a given input feature vector, the probability that the feature vector corresponds to that role; and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each speech segment according to the DNN model and the HMM used to assign the role labels.

Optionally, the label assignment unit comprises: an initial role designation subunit for selecting a number of speech segments equal to a preset initial number of roles and designating a different role for each; an initial model training subunit for training a GMM for each role, as well as the HMM, using the feature vectors in the role-designated segments; a decoding subunit for decoding with the trained GMMs and HMM to obtain the top-ranked role sequence by probability of outputting the feature vector sequences contained in the speech segments; a probability judgment subunit for judging whether the probability value corresponding to the role sequence is greater than a preset threshold; and a label assignment subunit for assigning role labels to the feature vectors in each speech segment according to the role sequence when the probability judgment subunit outputs yes.

Optionally, the label assignment unit further comprises: a per-segment role designation subunit for designating, when the probability judgment subunit outputs no, a corresponding role for each speech segment according to the role sequence; and a model update training subunit for training a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles, and then triggering the decoding subunit.

Optionally, the per-segment role designation subunit is specifically configured to designate, for each speech segment, the mode of the roles corresponding to its feature vectors as the role of that segment.

Optionally, the model update training subunit is specifically configured to train the GMMs and the HMM incrementally on the basis of the models obtained in the previous round of training.

Optionally, the label assignment unit further comprises: a training count judgment subunit for judging, when the probability judgment subunit outputs no, whether the number of times the GMMs and HMM have been trained under the current number of roles is less than a preset upper limit, and triggering the per-segment role designation subunit when the judgment is yes; and a role count adjustment subunit for adjusting, when the training count judgment subunit outputs no, the number of roles, selecting a corresponding number of speech segments, designating a different role for each, and triggering the initial model training subunit.

Optionally, the label assignment unit further comprises a role count judgment subunit for judging, when the training count judgment subunit outputs no, whether the current number of roles meets a preset requirement, and triggering the label assignment subunit if it does, or the role count adjustment subunit if it does not.

Optionally, the feature extraction unit comprises: a framing subunit for splitting the speech signal into multiple audio frames according to a preset frame length; and a feature extraction execution subunit for extracting the feature vector of each audio frame to obtain the feature vector sequence.

Optionally, the feature extraction execution subunit is specifically configured to extract the MFCC, PLP, or LPC features of each audio frame to obtain the feature vector sequence.

Optionally, the speech segment splitting unit is specifically configured to split the speech signal into speech segments by identifying and removing, with VAD technology, the audio frames that contain no speech content.

Optionally, the device further comprises a VAD smoothing unit for merging, after the speech segment splitting unit splits the speech segments with VAD technology, any speech segment whose duration is less than a preset threshold with an adjacent segment.

Optionally, the DNN model training unit is specifically configured to train the DNN model with the backpropagation algorithm.

Optionally, the role determination unit is specifically configured to perform a decoding operation with the DNN model and the HMM, obtain the top-ranked role sequence by probability of outputting the feature vector sequence, and take that role sequence as the one corresponding to the feature vector sequence.

Optionally, the role determination unit outputs the role separation result as follows: according to the role sequence corresponding to the feature vector sequence, it outputs for each role the start and end times of the audio frames to which its feature vectors belong.

Optionally, the initial role designation subunit or the role count adjustment subunit selects the corresponding number of speech segments by selecting that number of segments whose durations meet a preset requirement.

Compared with the prior art, this application has the following advantages:

The voice-based role separation method provided by this application first extracts a feature vector sequence frame by frame from the speech signal, then trains a DNN model on the basis of role labels assigned to the feature vectors, and determines the role sequence corresponding to the feature vector sequence according to the DNN model and an HMM trained with the feature vectors, thereby obtaining the role separation result. Because this method models speaker roles with a DNN, whose powerful feature extraction capability characterizes roles more finely and accurately than the traditional GMM, it can obtain more accurate role separation results.

601: Feature extraction unit
602: Label assignment unit
603: DNN model training unit
604: Role determination unit

FIG. 1 is a flowchart of an embodiment of the voice-based role separation method of this application; FIG. 2 is a flowchart of the process of extracting a feature vector sequence from a speech signal, provided by an embodiment of this application; FIG. 3 is a flowchart of the process of assigning role labels to the feature vectors in each speech segment using a GMM and an HMM, provided by an embodiment of this application; FIG. 4 is a schematic diagram of speech segment division provided by an embodiment of this application; FIG. 5 is a schematic diagram of the topology of a DNN network provided by an embodiment of this application; FIG. 6 is a schematic diagram of an embodiment of the voice-based role separation device of this application.

Many specific details are set forth in the following description to facilitate a full understanding of this application. However, this application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its substance; this application is therefore not limited to the specific implementations disclosed below.

This application provides a voice-based role separation method and a voice-based role separation device, each described in detail in the embodiments below. For ease of understanding, the technical background, the technical solution, and the organization of the embodiments are briefly explained before the embodiments themselves are described.

Existing role separation techniques in the speech field usually model the roles with a GMM (Gaussian mixture model) and model the transitions between roles with an HMM (hidden Markov model).

The HMM is a statistical model that describes a Markov process with hidden, unknown parameters. A hidden Markov model is a kind of Markov chain whose states (called hidden states) cannot be observed directly but are probabilistically related to an observable observation vector. An HMM is therefore a doubly stochastic process with two parts: a Markov chain with state transition probabilities (usually described by a transition matrix A), and a stochastic process describing the output relationship between the hidden states and the observation vectors (usually described by a confusion matrix B, each element of which is the output probability, also called the emission probability, of an observation vector given a hidden state). An HMM with N states can be represented by the parameter triple λ = {π, A, B}, where π is the initial probability of each state.
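
For concreteness, a minimal sketch of the triple λ = {π, A, B} for a two-role diarization HMM (all numbers are illustrative assumptions, not values from this application):

```python
import numpy as np

# Illustrative two-role HMM, lambda = {pi, A, B}. States: 0 = role 1, 1 = role 2.
pi = np.array([0.5, 0.5])            # initial probability of each state
A = np.array([[0.99, 0.01],          # transition matrix: a speaker tends to
              [0.01, 0.99]])         # keep talking across consecutive frames
# For continuous observations, B is not a fixed matrix: B[j] is a density
# p(o | state j), modeled by a per-role GMM (or, in this application, by
# the per-role probabilities output by a DNN).
```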

The GMM can be understood simply as a superposition of several Gaussian density functions. Its core idea is to describe the distribution of feature vectors in probability space as a combination of the probability density functions of several Gaussian distributions; such a model can smoothly approximate a density distribution of arbitrary shape. Its parameters include the mixing weight, mean vector, and covariance matrix of each Gaussian component.
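
As an illustration of these parameters, a small sketch that evaluates a GMM density (a toy two-component mixture; the weights, means, and covariances are made-up values):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    # p(x) = sum_k w_k * N(x; mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Toy two-component mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]                        # mixing weights
means = [np.zeros(2), np.ones(2)]           # mean vectors
covs = [np.eye(2), 0.5 * np.eye(2)]         # covariance matrices
print(gmm_density(np.array([0.5, 0.5]), weights, means, covs))
```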

In existing voice-based role separation applications, each role is usually modeled with a GMM. The states of the HMM are the roles, the observation vectors are the feature vectors extracted frame by frame from the speech signal, and the emission probability of each state outputting a feature vector is determined by the corresponding GMM (from which the confusion matrix can be obtained). The role separation process is then the process of using the GMMs and the HMM to determine the role sequence corresponding to the feature vector sequence.

Because the function-fitting capability of a GMM is limited by the number of Gaussian density functions it uses, its expressive power has inherent limitations, so the accuracy of existing GMM-and-HMM role separation is relatively low. To address this, the technical solution of this application pre-assigns role labels to the feature vectors of the speech frames, uses a deep neural network (DNN) to determine the emission probabilities of the HMM states, and determines the role sequence corresponding to the feature vector sequence from the DNN and the HMM. Because a DNN has the powerful ability to combine low-level features into more abstract high-level features, it can characterize roles more precisely and therefore yields more accurate role separation results.

The technical solution of this application first assigns role labels to the feature vectors extracted from the speech signal. The labels assigned at this stage are usually not very accurate, but they provide a reference for the subsequent supervised learning process; a DNN model trained on this basis can characterize the roles more accurately, making the role separation result more accurate. When implementing this solution, the role-label assignment can be realized with a statistics-based algorithm or with a classifier; the embodiments below assign role labels to feature vectors using a GMM and an HMM.

The embodiments of this application are described in detail below. Please refer to FIG. 1, a flowchart of an embodiment of the voice-based role separation method of this application. The method comprises the following steps:

Step 101: Extract feature vectors frame by frame from the speech signal to obtain a feature vector sequence.

The speech signal to be separated by role is usually a time-domain signal. This step obtains a feature vector sequence that characterizes the speech signal through two processes, framing and feature extraction, which are further explained below with reference to FIG. 2.

Step 101-1: Split the speech signal into multiple audio frames according to a preset frame length.

In a specific implementation, the frame length can be preset as required, for example to 10 ms or 15 ms, and the time-domain speech signal is then split frame by frame according to that length into multiple audio frames. Depending on the splitting strategy, adjacent audio frames may or may not overlap.
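
A minimal sketch of such framing (non-overlapping 10 ms frames by default; the frame and hop lengths are illustrative assumptions):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=10, hop_ms=10):
    # Split a 1-D time-domain signal into fixed-length frames.
    # Choosing hop_ms < frame_ms would yield overlapping frames.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```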

Step 101-2: Extract the feature vector of each audio frame to obtain the feature vector sequence.

After the time-domain speech signal is split into multiple audio frames, a feature vector characterizing the speech can be extracted frame by frame. Because a speech signal is relatively weakly descriptive in the time domain, each audio frame is usually Fourier-transformed and frequency-domain features are extracted as the frame's feature vector: for example, MFCC (Mel Frequency Cepstrum Coefficient) features, PLP (Perceptual Linear Predictive) features, or LPC (Linear Predictive Coding) features.

Taking the extraction of the MFCC features of an audio frame as an example, the extraction process is as follows: the frame's time-domain signal is transformed with the FFT (Fast Fourier Transform) to obtain its spectrum; the spectrum is passed through a Mel filter bank to obtain the Mel spectrum; cepstral analysis is performed on the Mel spectrum, typically with an inverse transform using the DCT (Discrete Cosine Transform); and the first N preset coefficients (for example N = 12 or 38) are taken as the frame's feature vector, the MFCC features. Processing every audio frame this way yields a series of feature vectors characterizing the speech signal, i.e., the feature vector sequence described in this application.
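
The application does not mandate a particular toolkit; as one possible realization, the whole pipeline above (FFT, Mel filter bank, DCT) is available in the third-party librosa library. A hedged sketch, where the file name, the 8 kHz sample rate, and the choice of 13 coefficients are all assumptions for illustration:

```python
import librosa

# "dialog.wav" is a placeholder path; 8 kHz telephone-style audio is assumed.
y, sr = librosa.load("dialog.wav", sr=8000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=256, hop_length=80)  # 80 samples = 10 ms hop
feature_sequence = mfcc.T  # shape (num_frames, 13): one feature vector per frame
```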

Step 102: Assign role labels to the feature vectors.

This embodiment assigns role labels to the feature vectors in the feature vector sequence by building a GMM and an HMM. Besides the speech of the various roles, a speech signal may also contain portions without speech content, such as silence caused by listening or thinking. Because these portions carry no role information, such audio frames can be identified and removed from the speech signal in advance to improve the accuracy of role separation.

Based on the above, this embodiment removes audio frames that contain no speech content and divides the signal into speech segments before assigning role labels, and then assigns role labels to the feature vectors in each segment. The assignment proceeds as follows: perform an initial division of roles; on that basis, iteratively train the GMMs and HMM; if the trained models do not meet the preset requirement, adjust the number of roles and retrain the GMMs and HMM until they do; and then assign role labels to the feature vectors in each speech segment according to the resulting models. This process is described in detail below with reference to FIG. 3.

Step 102-1: Split the speech signal into speech segments by identifying and removing audio frames that contain no speech content.

The prior art usually adopts acoustic segmentation, i.e., separating, say, "music segments", "speech segments", and "silence segments" from the speech signal according to existing models. This approach requires acoustic models for the various audio types to be trained in advance; for instance, given an acoustic model for "music segments", the corresponding audio can be separated from the speech signal.

Preferably, the technical solution of this application can use VAD (Voice Activity Detection) technology to identify the portions that contain no speech content; compared with acoustic segmentation, this needs no pre-trained acoustic models for the different audio types and is more adaptable. For example, whether an audio frame is silent can be identified by computing its energy, zero-crossing rate, and the like; where environmental noise is present and relatively strong, several such measures can be combined, or a noise model can be built for the identification.
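
A minimal sketch of such a frame-level decision from short-time energy and zero-crossing rate (the thresholds and the threshold directions are heuristic assumptions; real systems tune or learn them):

```python
import numpy as np

def simple_vad(frames, energy_thresh, zcr_thresh):
    # frames: 2-D array (num_frames, frame_len) from the framing step.
    energy = (frames.astype(float) ** 2).mean(axis=1)            # short-time energy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)   # zero-crossing rate
    # Heuristic: speech frames have high energy, and a very high ZCR with
    # low energy suggests noise/silence. Returns True for speech frames.
    return (energy > energy_thresh) & (zcr < zcr_thresh)
```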

Once the audio frames without speech content are identified, they can be removed from the speech signal to improve the accuracy of role separation; moreover, identifying them amounts to identifying the start and end of each stretch of valid speech (speech with content), so the signal can be divided into speech segments on this basis.

Please refer to FIG. 4, a schematic diagram of the speech segment division provided by this embodiment. In the figure, VAD detects that the audio frames between times t2 and t3 and between t4 and t5 are silent. This step removes those silent frames from the speech signal and divides it into three speech segments accordingly: segment 1 (seg1) between t1 and t2, segment 2 (seg2) between t3 and t4, and segment 3 (seg3) between t5 and t6. Each segment contains several audio frames, and each frame has a corresponding feature vector. On the basis of this division, roles can be assigned roughly, providing a reasonable starting point for the subsequent training.

Preferably, after the above VAD processing, a VAD smoothing operation can also be performed. This reflects how people actually speak: a real speech segment does not last too short a time. If some segments obtained from the VAD operation are shorter than a preset threshold (for example, a 30 ms segment against a 100 ms threshold), such segments can be merged with adjacent segments to form longer ones. The segment division obtained after VAD smoothing is closer to reality, which helps improve the accuracy of role separation.
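
A greedy sketch of this smoothing, merging any too-short segment into the segment before it (a simplification: the text above only says "an adjacent segment", so the choice of neighbor is our assumption):

```python
def smooth_segments(segments, min_dur=0.1):
    # segments: time-ordered list of (start, end) pairs in seconds.
    out = []
    for start, end in segments:
        if out and (end - start) < min_dur:
            prev_start, _ = out[-1]
            out[-1] = (prev_start, end)   # absorb the short segment
        else:
            out.append((start, end))
    return out

# A 30 ms segment is merged into its predecessor under a 100 ms threshold.
print(smooth_segments([(0.0, 1.2), (1.5, 1.53), (2.0, 3.0)], min_dur=0.1))
```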

This step divides the speech signal into several speech segments with VAD; the task of the subsequent steps 102-2 through 102-11 is to assign role labels to the feature vectors in each segment using a GMM and an HMM.

Step 102-2: Select a number of speech segments equal to the preset initial number of roles, and designate a different role for each segment.

This step could pick segments at random from those already divided, as many as there are initial roles. However, the selected segments are used for the initial training of the GMMs and HMM: a very short segment provides little training data, while a very long one is increasingly likely to contain more than one role, and neither case is good for initial training. This embodiment therefore provides a preferred implementation: select segments whose durations meet a preset requirement, as many as the initial number of roles, and designate a different role for each.

In this embodiment the preset initial number of roles is 2 and the preset segment selection requirement is a duration between 2 s and 4 s, so this step selects two segments meeting that requirement from those already divided and designates a different role for each. Taking the segment division in FIG. 4 as an example, seg1 and seg2 each meet the duration requirement, so these two segments can be selected, with role 1 (s1) designated for seg1 and role 2 (s2) for seg2.

Step 102-3: Train a GMM for each role, as well as the HMM, using the feature vectors in the role-designated speech segments.

This step trains a GMM for each role, and an HMM describing the transitions between roles, from the feature vectors contained in the role-designated segments; it is the initial training performed under a particular number of roles. Continuing the FIG. 4 example, under the initial number of roles the feature vectors in seg1 are used to train role 1's GMM (gmm1) and those in seg2 to train role 2's GMM (gmm2). If the GMMs and HMM trained under this number of roles do not meet the requirement, the number of roles can be adjusted and this step repeated, performing the corresponding initial training for the adjusted number of roles.

Training the GMMs and HMM for the roles is the process of learning the HMM-related parameters from a given observation sequence (the feature vector sequences contained in the speech segments, i.e., the training samples): the HMM's transition matrix A, and each role's GMM parameters such as the mean vectors and covariance matrices. In a specific implementation the Baum-Welch algorithm can be used: first estimate initial values of the parameters from the training samples; from the samples and those initial values, estimate the posterior probability γt(sj) of being in state sj at time t; update the HMM parameters according to the computed posteriors; re-estimate γt(sj) from the samples and the updated parameters; and iterate this process until a set of HMM parameters is found that maximizes the probability of outputting the observation sequence. Once parameters meeting this requirement are obtained, the initial GMM and HMM training under the given number of roles is complete.
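
One off-the-shelf way to run this Baum-Welch training is the third-party hmmlearn package, whose GMMHMM class jointly estimates the transition matrix and a per-state GMM. A sketch under the assumption of two roles and two role-designated segments (seg1_features and seg2_features are placeholder arrays of shape (num_frames, feature_dim)):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # third-party library, one possible choice

X = np.concatenate([seg1_features, seg2_features])   # stacked frame vectors
lengths = [len(seg1_features), len(seg2_features)]    # per-segment frame counts

model = GMMHMM(n_components=2,        # one hidden state per role
               n_mix=4,               # Gaussians in each role's GMM
               covariance_type="diag",
               n_iter=20)             # EM (Baum-Welch) iterations
model.fit(X, lengths)                 # learns pi, A, and the per-role GMMs
```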

Step 102-4: Decode with the trained GMMs and HMM to obtain the top-ranked role sequence by probability of outputting the feature vector sequences contained in the speech segments.

In step 102-1 the speech signal was divided into several speech segments, and every audio frame in every segment has a corresponding feature vector; together these form the feature vector sequence of this step. Given that sequence and the trained GMMs and HMM, this step finds the HMM state sequence, i.e., the role sequence, that the feature vector sequence likely belongs to.

What this step accomplishes is the usual HMM decoding process: given the feature vector sequence, search for the role sequences ranked highest by the probability of outputting that sequence. As a preferred implementation, the role sequence with the maximum probability, i.e., the one most likely to output the feature vector sequence, also called the best hidden state sequence, is usually chosen.

In a specific implementation, an exhaustive search could compute, for every possible role sequence, the probability of outputting the feature vector sequence, and take the maximum. To improve computational efficiency, a preferred implementation uses the Viterbi algorithm, exploiting the time-invariance of the HMM's transition probabilities to reduce the complexity; after the search finds the maximum probability of outputting the feature vector sequence, the corresponding role sequence is recovered by backtracking through the information recorded during the search.
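
For illustration, a compact log-domain Viterbi sketch over per-frame emission scores (here log_B would come from each role's GMM; this is a generic implementation, not the patent's own code):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    # log_pi: (N,) initial log-probs; log_A: (N, N) transition log-probs;
    # log_B[t, j]: log p(o_t | state j). Returns the best state per frame.
    T, N = log_B.shape
    delta = log_pi + log_B[0]              # best score ending in each state
    back = np.zeros((T, N), dtype=int)     # recorded choices for backtracking
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # trace the recorded information back
        path.append(int(back[t, path[-1]]))
    return path[::-1]                      # role index for every frame
```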

Step 102-5: Judge whether the probability value corresponding to the role sequence is greater than a preset threshold; if so, perform step 102-6, otherwise go to step 102-7.

If the probability value of the role sequence obtained by decoding in step 102-4 exceeds a preset threshold, for example 0.5, the current GMMs and HMM can usually be considered stable, and step 102-6 can be performed to assign role labels to the feature vectors in each segment (the subsequent step 104 can use the stabilized HMM to determine the role sequence corresponding to the feature vector sequence); otherwise go to step 102-7 to decide whether to continue the iterative training.

Step 102-6: Assign role labels to the feature vectors in each speech segment according to the role sequence.

Since the current GMMs and HMM are stable, role labels can be assigned to the feature vectors in each segment according to the role sequence obtained by decoding in step 102-4. In a specific implementation, each role in the role sequence corresponds one-to-one with a feature vector in the speech segments, so each feature vector can be labeled according to that correspondence. At this point every feature vector in every segment has its role label, step 102 is complete, and execution can continue with step 103.

Step 102-7: Judge whether the number of times the GMMs and HMM have been trained under the current number of roles is less than the preset upper limit on training iterations; if so, perform step 102-8, otherwise go to step 102-10.

Reaching this step means the GMMs and HMM trained so far are not yet stable and iterative training must continue. If the number of roles currently used in training differs from the actual number of roles in the speech signal, the GMMs and HMM may fail to meet the requirement no matter how many iterations are run (the probability of the decoded role sequence never exceeds the preset threshold). To avoid a meaningless loop, an upper limit on training iterations can be preset for each number of roles. If this step finds the iteration count under the current number of roles below the limit, execution continues with step 102-8 to designate roles for the segments and keep iterating; otherwise the current number of roles may not match reality, so execution moves to step 102-10 to decide whether the number of roles needs adjusting.

Step 102-8: Designate a corresponding role for each speech segment according to the role sequence.

A role sequence was obtained by decoding in step 102-4, and since each role in it corresponds one-to-one with a feature vector in the segments, the role corresponding to each feature vector in each segment is known. For each segment of the speech signal, this step designates a role by computing the mode of the roles corresponding to its feature vectors. For example, if a segment contains 10 audio frames, hence 10 feature vectors, of which 8 correspond to role 1 (s1) and 2 to role 2 (s2), the mode of the roles in that segment is role 1 (s1), so role 1 (s1) is designated as the segment's role.
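
A one-line realization of this mode computation (a sketch; role labels are plain integers here):

```python
from collections import Counter

def assign_segment_role(frame_roles):
    # Return the most frequent per-frame role label in the segment.
    return Counter(frame_roles).most_common(1)[0][0]

# 8 frames of role 1 and 2 frames of role 2 -> the segment gets role 1.
print(assign_segment_role([1] * 8 + [2] * 2))   # prints 1
```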

Step 102-9: Train a GMM for each role, as well as the HMM, according to the feature vectors in each speech segment and the corresponding roles, then return to step 102-4.

With a role designated for each segment in step 102-8, a GMM can be trained for each role, along with the HMM. Continuing the FIG. 4 example, if step 102-8 designates seg1 and seg3 as role 1 (s1) and seg2 as role 2 (s2), the feature vectors in seg1 and seg3 can train role 1's GMM (gmm1) and those in seg2 can train role 2's GMM (gmm2). For the GMM and HMM training method, see the text of step 102-3; it is not repeated here.

In a specific implementation this solution is usually an iterative training process. To improve training efficiency, this step can train the new GMMs and HMM incrementally on the basis of those obtained in the previous round, i.e., continue adjusting the parameters from their previous values using the current sample data, which speeds up training.

After this training completes and new GMMs and HMM are obtained, execution returns to step 102-4 to decode with the new models and perform the subsequent operations.

Step 102-10: Judge whether the current number of roles meets the preset requirement; if so, go to step 102-6, otherwise continue with step 102-11.

Reaching this step usually means the GMMs and HMM trained under the current number of roles have not stabilized and the iteration count has reached or exceeded the preset upper limit. In this case, judge whether the current number of roles meets the preset requirement: if it does, the role separation process can stop and execution goes to step 102-6 to assign role labels; otherwise continue with step 102-11 to adjust the number of roles.

步驟102-11、調整角色數量,選擇相應數量的語音段並為每個語音段分別指定不同角色;並轉到步驟102-3繼續執行。 Step 102-11: Adjust the number of roles, select a corresponding number of voice segments and assign different roles for each voice segment; and go to step 102-3 to continue execution.

例如,當前角色數量為2,對角色數量的預設要求為“角色數量等於4”,步驟102-10判定當前角色數量尚未符合預設要求,這種情況下,可以執行本步驟進行角色數量的調整,例如:為當前角色數量加1,即將當前角色數量更新為3。 For example, if the current number of characters is 2, the preset requirement for the number of characters is "the number of characters is equal to 4", and step 102-10 determines that the current number of characters does not meet the preset requirement. In this case, you can perform this step to determine the number of characters. Adjust, for example: add 1 to the current number of characters, that is, update the current number of characters to 3.

根據調整後的角色數量,從語音信號包含的各個語音段中選擇相應數量的語音段,並為所選每個語音段分別指定不同的角色。其中對所選語音段的時長要求,可以參見步驟102-2中的相關文字,此處不再贅述。 According to the adjusted number of roles, a corresponding number of voice segments are selected from each voice segment included in the voice signal, and a different role is assigned to each selected voice segment. For the time length requirements of the selected speech segment, please refer to the relevant text in step 102-2, which will not be repeated here.

仍以圖4所示的語音段劃分為例,如果當前角色數量從2增加為3,並且seg1、seg2和seg3都滿足選擇語音段的時長要求,那麼本步驟可以選擇這3個語音段,並為seg1指定角色1(s1),為seg2指定角色2(s2),為seg3指定角色3(s3)。 Still taking the voice segment division shown in Figure 4 as an example, if the current number of roles increases from 2 to 3, and seg1, seg2, and seg3 all meet the duration requirements for selecting the voice segment, then these 3 voice segments can be selected in this step, And specify role 1 (s1) for seg1, role 2 (s2) for seg2, and role 3 (s3) for seg3.

完成上述調整角色數量以及選擇語音段的操作後,可以轉到步驟102-3針對調整後的角色數量初始訓練GMM 和HMM。 After completing the above operations of adjusting the number of characters and selecting voice segments, you can go to step 102-3 to initially train GMM and HMM for the adjusted number of characters.

Step 103: train the DNN model with the feature vectors carrying role labels.

At this point, role labels have been assigned to the feature vectors in each voice segment. On this basis, this step trains the DNN model using the labeled feature vectors as samples; the DNN model is used to output, for an input feature vector, the probability of each role. For ease of understanding, a brief description of DNNs is given first.

A DNN (Deep Neural Network) usually refers to a neural network comprising one input layer, three or more hidden layers (possibly 7, 9, or even more), and one output layer. Each hidden layer extracts certain features and feeds its output to the next layer as input; by extracting features layer by layer, low-level features are combined into more abstract high-level features, enabling recognition of objects or categories.

Please refer to Figure 5, a schematic diagram of the DNN topology. The network in the figure has n layers in total, each layer has multiple neurons, and adjacent layers are fully connected; each layer has its own activation function f (for example, the Sigmoid function). The input is the feature vector v, the transfer matrix from layer i to layer i+1 is w_(i(i+1)), the bias vector of layer i+1 is b_(i+1), the output of layer i is out_i, and the input of layer i+1 is in_(i+1). The computation is:

in_(i+1) = out_i * w_(i(i+1)) + b_(i+1)

out_(i+1) = f(in_(i+1))
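A minimal NumPy sketch of this layer-by-layer computation follows; the layer sizes and the softmax output (anticipating the classifier output layer described below) are illustrative assumptions, not part of the patent text:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - x.max())  # subtract the max for numerical stability
        return e / e.sum()

    def dnn_forward(v, weights, biases):
        """Forward pass: in_(i+1) = out_i @ W + b; out_(i+1) = f(in_(i+1))."""
        out = v
        for W, b in zip(weights[:-1], biases[:-1]):
            out = sigmoid(out @ W + b)  # hidden layers use the sigmoid activation
        # output layer: softmax over the roles, giving per-role probabilities
        return softmax(out @ weights[-1] + biases[-1])

    # Toy network: 39-dim feature vector -> two hidden layers -> 2 roles
    rng = np.random.default_rng(0)
    sizes = [39, 64, 64, 2]
    weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(b) for b in sizes[1:]]
    print(dnn_forward(rng.standard_normal(39), weights, biases))  # two values summing to 1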

As can be seen, the parameters of the DNN model include the inter-layer transfer matrices w and the bias vector b of each layer; the main task of training the DNN model is to determine these parameters. In practice, the BP (back-propagation) algorithm is usually used for training. Training is a supervised learning process: the input signal is a labeled feature vector that is propagated forward layer by layer; after reaching the output layer, the error is propagated backward layer by layer, and the parameters of each layer are adjusted by gradient descent so that the actual output of the network keeps approaching the expected output. For a DNN with thousands of neurons per layer, the number of parameters may be in the millions or more; a DNN model obtained through this training process usually has very strong feature extraction and recognition capabilities.

In this embodiment, the DNN model is used to output the probability of each role for an input feature vector, so the output layer of the DNN model can use a classifier (for example, Softmax) as its activation function. After the role labels are pre-assigned in step 102, if the labels involve n roles, the output layer of the DNN model can include n nodes corresponding to the n roles; for an input feature vector, each node outputs the probability that the feature vector belongs to its role.

This step performs supervised training of the DNN model constructed above, using the feature vectors carrying role labels as samples. In concrete implementations, the BP algorithm alone can be used; however, training with BP alone may get stuck in a local minimum, so that the resulting model cannot meet the requirements of the application. This embodiment therefore combines pre-training with the BP algorithm to train the DNN model.

Pre-training usually adopts an unsupervised greedy layer-by-layer algorithm: first a network with one hidden layer is trained in an unsupervised manner; the trained parameters are then retained, the number of layers is increased by one, and a network with two hidden layers is trained, and so on, until the network with the largest number of hidden layers has been trained. After this layer-by-layer training, the parameter values learned in the unsupervised stage are used as initial values, and the conventional BP algorithm then performs supervised training, finally yielding the DNN model.

Because the initial distribution obtained by pre-training is closer to the final convergence values than the random initial parameters used by pure BP, it effectively gives the subsequent supervised training a good starting point. The trained DNN model therefore usually does not get stuck in a local minimum and can achieve a higher recognition rate.

Step 104: determine the role sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by training with feature vectors, and output the role separation result.

The DNN model outputs, for an input feature vector, the posterior probability of each role; the prior probability of each role can be obtained from the distribution of role labels over the feature vector sequence, and the prior probability of each feature vector is usually fixed. Therefore, by Bayes' theorem, the probability of each role emitting a given feature vector can be derived from the DNN output and these priors. In other words, the DNN model trained in step 103 can be used to determine the emission probabilities of the HMM states.
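A short sketch of this Bayes conversion, known in hybrid DNN-HMM systems as the "scaled likelihood": dividing the DNN posterior p(role | x) by the role prior p(role) gives a quantity proportional to the emission likelihood p(x | role), since the shared factor p(x) can be dropped. Variable names are illustrative assumptions:

    import numpy as np

    def emission_log_likelihoods(dnn_posteriors, role_priors):
        """Convert frame-wise DNN posteriors into HMM emission scores.

        dnn_posteriors: (T, n_roles) array, each row a DNN softmax output
        role_priors:    (n_roles,) array, role label frequencies in training data

        Returns log p(x_t | role) up to a constant: log p(role | x_t) - log p(role).
        """
        return np.log(dnn_posteriors + 1e-10) - np.log(role_priors)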

The HMM may be obtained by training with the feature vector sequence on the basis of using the above DNN model to determine the HMM emission probabilities. Considering that the HMM used in step 102 to assign role labels already describes the transition relations between roles in an essentially stable way, no additional training is needed; this embodiment therefore uses that HMM directly and replaces the GMM with the trained DNN model, that is, the DNN model determines the emission probabilities of the HMM states.

In this embodiment, step 102-1 segmented the voice signal into voice segments; this step determines the role sequence corresponding to the feature vector sequence contained in each segment according to the DNN model and the HMM used when pre-assigning role labels.

Determining the role sequence from the feature vector sequence is the commonly described decoding problem: a decoding operation can be performed with the DNN model and the HMM to obtain the role sequence whose probability of producing the feature vector sequence ranks highest (for example, is the largest), and that role sequence is taken as the one corresponding to the feature vector sequence. For details, see the description of step 102-4, which is not repeated here.
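The standard algorithm for this decoding problem is Viterbi search; the sketch below is an illustrative assumption (the patent does not prescribe a specific decoder) that finds the highest-probability role sequence from the emission scores of the previous sketch and an HMM transition matrix:

    import numpy as np

    def viterbi(log_emissions, log_trans, log_start):
        """Most likely role sequence for a hybrid DNN-HMM.

        log_emissions: (T, n_roles) frame-wise emission log-scores
        log_trans:     (n_roles, n_roles) log transition probabilities
        log_start:     (n_roles,) log initial probabilities
        """
        T, n = log_emissions.shape
        delta = log_start + log_emissions[0]        # best score ending in each role
        backptr = np.zeros((T, n), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans    # scores[i, j]: come from i, go to j
            backptr[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emissions[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):               # trace back the best path
            path.append(int(backptr[t, path[-1]]))
        return path[::-1]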

After the decoding process yields the role sequence corresponding to the feature vector sequence of each voice segment, the corresponding role separation result can be output. Since each role in the role sequence corresponds one-to-one with a feature vector, and the audio frame corresponding to each feature vector has its own start and end times, this step can output, for each role, the start and end time information of the audio frames whose feature vectors belong to it.
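A small sketch of how frame-level role labels could be turned into per-role start/end times; the 10 ms frame shift is an assumption for illustration:

    def roles_to_time_spans(frame_roles, frame_shift_s=0.01):
        """Collapse frame-wise role labels into (role, start_s, end_s) spans."""
        spans, start = [], 0
        for t in range(1, len(frame_roles) + 1):
            if t == len(frame_roles) or frame_roles[t] != frame_roles[start]:
                spans.append((frame_roles[start], start * frame_shift_s, t * frame_shift_s))
                start = t
        return spans

    # prints: [('s1', 0.0, 0.03), ('s2', 0.03, 0.05)]
    print(roles_to_time_spans(["s1", "s1", "s1", "s2", "s2"]))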

At this point, through steps 101 to 104, a specific implementation of the voice-based role separation method provided by this application has been described in detail. It should be noted that, in the process of pre-assigning role labels to the feature vectors in step 102, this embodiment adopts a top-down approach that gradually increases the number of roles. Other implementations may adopt a bottom-up approach that gradually reduces the number of roles: initially, each voice segment obtained by segmentation can be designated as a different role, and a GMM and an HMM are then trained for each role. If the probability value obtained by decoding with the iteratively trained GMMs and HMM never exceeds the preset threshold, then when adjusting the number of roles, the similarity between the GMMs of the roles can be evaluated (for example, by computing the KL divergence), the voice segments corresponding to GMMs whose similarity meets a preset requirement are merged, and the number of roles is reduced accordingly. This process is repeated until the probability value obtained by decoding with the HMM exceeds the preset threshold or the number of roles meets the preset requirement, at which point the iteration stops and role labels are assigned to the feature vectors in each voice segment according to the role sequence obtained by decoding.
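The KL divergence between two GMMs has no closed form, so it is commonly estimated by Monte Carlo sampling; the sketch below does this with scikit-learn's GaussianMixture (the use of that library, and the symmetrization, are illustrative assumptions; both mixtures must already be fitted):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def mc_kl(p: GaussianMixture, q: GaussianMixture, n=10000):
        """Monte Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]."""
        X, _ = p.sample(n)  # draw samples from p
        return float(np.mean(p.score_samples(X) - q.score_samples(X)))

    def symmetric_kl(p, q):
        """Symmetrized divergence; a small value suggests two role GMMs should merge."""
        return mc_kl(p, q) + mc_kl(q, p)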

In summary, the voice-based role separation method provided by this application models the roles with a DNN that has powerful feature extraction capabilities; it characterizes the roles more finely and accurately than the traditional GMM and can therefore obtain more accurate role separation results. The technical solution of this application can be applied not only to scenarios that separate roles in conversational speech such as call-center or conference audio, but also to any other scenario that requires separating the roles in a voice signal: as long as the voice signal contains two or more roles, the technical solution of this application can be adopted and the corresponding beneficial effects obtained.

The above embodiment provides a voice-based role separation method; correspondingly, this application also provides a voice-based role separation device. Please refer to Figure 6, a schematic diagram of an embodiment of the voice-based role separation device of this application. Since the device embodiment is essentially similar to the method embodiment, it is described relatively briefly; for related details, see the corresponding parts of the method embodiment. The device embodiment described below is merely illustrative.

The voice-based role separation device of this embodiment includes: a feature extraction unit 601 for extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assignment unit 602 for assigning role labels to the feature vectors; a DNN model training unit 603 for training a DNN model with the feature vectors carrying role labels, where the DNN model is used to output, according to an input feature vector, the probability of each role; and a role determination unit 604 for determining the role sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by training with feature vectors and outputting the role separation result, where the HMM is used to describe the transition relations between roles.

Optionally, the device further includes a voice segment segmentation unit for segmenting the voice signal into voice segments, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, by identifying and discarding audio frames that contain no speech content. In this case, the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment, and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM obtained by training with feature vectors, and to output the role separation result.

Optionally, the label assignment unit is specifically configured to pre-assign role labels to the feature vectors in each voice segment by establishing a GMM and an HMM, where the GMM is used to output, for each role and according to an input feature vector, the probability that the feature vector corresponds to the role; the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used when assigning role labels to the feature vectors in the voice segments.

Optionally, the label assignment unit includes: an initial role designation subunit for selecting a number of voice segments corresponding to a preset initial number of roles and designating a different role for each selected segment; an initial model training subunit for training a GMM and an HMM for each role using the feature vectors in the voice segments of the designated roles; a decoding subunit for decoding according to the trained GMMs and HMM to obtain the role sequence ranking highest in the probability of producing the feature vector sequences contained in the voice segments; a probability judgment subunit for judging whether the probability value corresponding to the role sequence is greater than a preset threshold; and a label assignment subunit for assigning role labels to the feature vectors in each voice segment according to the role sequence when the output of the probability judgment subunit is yes.

Optionally, the label assignment unit further includes: a per-segment role designation subunit for designating, when the output of the probability judgment subunit is no, a corresponding role for each voice segment according to the role sequence; and a model update training subunit for training a GMM and an HMM for each role according to the feature vectors in each voice segment and the corresponding roles, and then triggering the decoding subunit.

Optionally, the per-segment role designation subunit is specifically configured, for each voice segment, to designate the mode of the roles corresponding to its feature vectors as the role of the segment.

Optionally, the model update training subunit is specifically configured to train the GMM and the HMM incrementally on the basis of the models obtained in the previous training round.

Optionally, the label assignment unit further includes: a training count judgment subunit for judging, when the output of the probability judgment subunit is no, whether the number of times the GMMs and HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds, and triggering the per-segment role designation subunit when the judgment result is yes; and a role count adjustment subunit for adjusting the number of roles when the output of the training count judgment subunit is no, selecting a corresponding number of voice segments, designating a different role for each selected segment, and triggering the initial model training subunit.

Optionally, the label assignment unit further includes a role count judgment subunit for judging, when the output of the training count judgment subunit is no, whether the current number of roles meets a preset requirement, triggering the label assignment subunit if it does, and triggering the role count adjustment subunit otherwise.

Optionally, the feature extraction unit includes: a framing subunit for dividing the voice signal into frames according to a preset frame length to obtain multiple audio frames; and a feature extraction execution subunit for extracting the feature vector of each audio frame to obtain the feature vector sequence.

Optionally, the feature extraction execution subunit is specifically configured to extract the MFCC features, PLP features, or LPC features of each audio frame to obtain the feature vector sequence.
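As one hedged illustration of frame-wise feature extraction, MFCC features can be computed with the librosa library; the file name, the 25 ms frame length, and the 10 ms frame shift are common choices assumed for illustration, not values mandated by the patent:

    import librosa

    # Load audio and extract 13-dimensional MFCCs, one vector per audio frame.
    y, sr = librosa.load("dialogue.wav", sr=16000)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # 25 ms frame length
        hop_length=int(0.010 * sr),  # 10 ms frame shift
    )
    feature_vectors = mfcc.T         # shape: (n_frames, 13)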

Optionally, the voice segment segmentation unit is specifically configured to identify and discard the audio frames that contain no speech content using VAD technology, and to segment the voice signal into voice segments.

Optionally, the device further includes a VAD smoothing unit for merging voice segments whose duration is less than a preset threshold with adjacent voice segments, after the voice segment segmentation unit has segmented the speech using VAD technology.
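A minimal sketch of this smoothing step under an assumed representation (segments as (start, end) pairs in seconds); the merging direction and the 0.3 s threshold are illustrative choices:

    def smooth_segments(segments, min_dur=0.3):
        """Merge segments shorter than min_dur into the previous segment.

        A leading short segment with no predecessor is kept as-is;
        a fuller version could also merge it forward into the next segment.
        """
        merged = []
        for start, end in segments:
            if merged and (end - start) < min_dur:
                prev_start, _ = merged.pop()      # absorb into the previous segment
                merged.append((prev_start, end))
            else:
                merged.append((start, end))
        return merged

    # The 0.12 s fragment is folded into its left neighbor:
    # prints [(0.0, 1.62), (1.62, 4.0)]
    print(smooth_segments([(0.0, 1.5), (1.5, 1.62), (1.62, 4.0)]))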

Optionally, the DNN model training unit is specifically configured to train the DNN model with a back-propagation algorithm.

Optionally, the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence ranking highest in the probability of producing the feature vector sequence, and take that role sequence as the one corresponding to the feature vector sequence.

Optionally, the role determination unit outputs the role separation result as follows: according to the role sequence corresponding to the feature vector sequence, it outputs, for each role, the start and end time information of the audio frames whose feature vectors belong to it.

Optionally, the initial role designation subunit or the role count adjustment subunit selects the corresponding number of voice segments as follows: selecting that number of voice segments whose duration meets a preset requirement.

Although this application is disclosed above with preferred embodiments, they are not intended to limit it. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of this application; therefore, the protection scope of this application shall be subject to the scope defined by the claims of this application.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

2. Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.

Claims (33)

1. A voice-based role separation method, characterized by comprising: extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence; assigning role labels to the feature vectors; training a deep neural network (DNN) model with the feature vectors carrying role labels; and determining a role sequence corresponding to the feature vector sequence according to the DNN model and a hidden Markov model (HMM) obtained by training with feature vectors, and outputting a role separation result; wherein the DNN model is used to output, according to an input feature vector, the probability of each role, and the HMM is used to describe the transition relations between roles; and wherein determining the role sequence corresponding to the feature vector sequence according to the DNN model and the HMM obtained by training with feature vectors comprises: performing a decoding operation according to the DNN model and the HMM, obtaining the role sequence with the greatest probability of producing the feature vector sequence, and taking that role sequence as the role sequence corresponding to the feature vector sequence.

2. The voice-based role separation method according to claim 1, wherein, after the step of extracting feature vectors frame by frame from the voice signal and before the step of assigning role labels to the feature vectors, the following operation is performed: segmenting the voice signal into voice segments by identifying and discarding audio frames that contain no speech content; wherein assigning role labels to the feature vectors comprises: assigning role labels to the feature vectors in each voice segment; and wherein determining the role sequence corresponding to the feature vector sequence comprises: determining the role sequence corresponding to the feature vector sequence contained in each voice segment.

3. The voice-based role separation method according to claim 2, wherein assigning role labels to the feature vectors in each voice segment comprises: assigning role labels to the feature vectors in each voice segment by establishing a Gaussian mixture model (GMM) and an HMM, wherein the GMM is used to output, for each role and according to an input feature vector, the probability that the feature vector corresponds to the role; and wherein determining, according to the DNN model and the HMM obtained by training with feature vectors, the role sequence corresponding to the feature vector sequence contained in each voice segment comprises: determining the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used for assigning role labels to the feature vectors in the voice segments.

4. The voice-based role separation method according to claim 3, wherein assigning role labels to the feature vectors in each voice segment by establishing a GMM and an HMM comprises: selecting a number of voice segments corresponding to a preset initial number of roles, and designating a different role for each selected voice segment; training a GMM and an HMM for each role using the feature vectors in the voice segments corresponding to the designated roles; decoding according to the trained GMMs and HMM to obtain the role sequence ranking highest in the probability of producing the feature vector sequences contained in the voice segments; judging whether the probability value corresponding to the role sequence is greater than a preset threshold; and, if so, assigning role labels to the feature vectors in each voice segment according to the role sequence.

5. The voice-based role separation method according to claim 4, wherein, when the result of judging whether the probability value corresponding to the role sequence is greater than the preset threshold is no, the following operations are performed: designating a corresponding role for each voice segment according to the role sequence; training a GMM and an HMM for each role according to the feature vectors in each voice segment and the corresponding roles; and returning to the step of decoding according to the trained GMMs and HMM.

6. The voice-based role separation method according to claim 5, wherein designating a corresponding role for each voice segment according to the role sequence comprises: for each voice segment, designating the mode of the roles corresponding to its feature vectors as the role of the voice segment.

7. The voice-based role separation method according to claim 5, wherein training a GMM and an HMM for each role according to the feature vectors in each voice segment and the corresponding roles comprises: training the GMM and the HMM incrementally on the basis of the models obtained in the previous training round.

8. The voice-based role separation method according to claim 5, wherein, when the result of judging whether the probability value corresponding to the role sequence is greater than the preset threshold is no, the following operations are performed: judging whether the number of times the GMMs and HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds; if so, performing the step of designating a corresponding role for each voice segment according to the role sequence; if not, performing the following operations: adjusting the number of roles, selecting a corresponding number of voice segments, and designating a different role for each selected voice segment; and returning to the step of training a GMM and an HMM for each role using the feature vectors in the voice segments of the designated roles.

9. The voice-based role separation method according to claim 8, characterized in that, when the result of judging whether the number of training rounds under the current number of roles is less than the preset upper limit is no, the following operations are performed: judging whether the current number of roles meets a preset requirement; if so, going to the step of assigning role labels to the feature vectors in each voice segment according to the role sequence; if not, performing the step of adjusting the number of roles.

10. The voice-based role separation method according to claim 8, characterized in that the preset initial number of roles is 2, and adjusting the number of roles comprises: adding 1 to the current number of roles.

11. The voice-based role separation method according to claim 1, wherein extracting feature vectors frame by frame from the voice signal to obtain the feature vector sequence comprises: dividing the voice signal into frames according to a preset frame length to obtain multiple audio frames; and extracting the feature vector of each audio frame to obtain the feature vector sequence.

12. The voice-based role separation method according to claim 11, wherein extracting the feature vector of each audio frame comprises: extracting MFCC features, PLP features, or LPC features.

13. The voice-based role separation method according to claim 2, wherein identifying and discarding the audio frames that contain no speech content comprises: identifying the audio frames that contain no speech content using VAD technology and performing the corresponding discarding operation.

14. The voice-based role separation method according to claim 13, wherein, after the identifying and discarding operations are performed using VAD technology and the voice signal is segmented into voice segments, the following VAD smoothing operation is performed: merging voice segments whose duration is less than a preset threshold with adjacent voice segments.

15. The voice-based role separation method according to claim 1, wherein training the deep neural network DNN model with the feature vectors carrying role labels comprises: training the DNN model with a back-propagation algorithm.

16. The voice-based role separation method according to claim 1, wherein outputting the role separation result comprises: according to the role sequence corresponding to the feature vector sequence, outputting, for each role, the start and end time information of the audio frames to which its corresponding feature vectors belong.

17. The voice-based role separation method according to claim 4 or 8, wherein selecting a corresponding number of voice segments comprises: selecting that number of voice segments whose duration meets a preset requirement.

18. A voice-based role separation device, characterized by comprising: a feature extraction unit for extracting feature vectors frame by frame from a voice signal to obtain a feature vector sequence; a label assignment unit for assigning role labels to the feature vectors; a DNN model training unit for training a DNN model with the feature vectors carrying role labels, wherein the DNN model is used to output, according to an input feature vector, the probability of each role; and a role determination unit for determining a role sequence corresponding to the feature vector sequence according to the DNN model and an HMM obtained by training with feature vectors and outputting a role separation result, wherein the HMM is used to describe the transition relations between roles; and wherein the role determination unit is specifically configured to perform a decoding operation according to the DNN model and the HMM, obtain the role sequence with the greatest probability of producing the feature vector sequence, and take that role sequence as the role sequence corresponding to the feature vector sequence.

19. The voice-based role separation device according to claim 18, further comprising: a voice segment segmentation unit for segmenting the voice signal into voice segments, after the feature extraction unit extracts the feature vectors and before the label assignment unit is triggered, by identifying and discarding audio frames that contain no speech content; wherein the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment, and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM obtained by training with feature vectors, and to output the role separation result.

20. The voice-based role separation device according to claim 19, wherein the label assignment unit is specifically configured to assign role labels to the feature vectors in each voice segment by establishing a GMM and an HMM, wherein the GMM is used to output, for each role and according to an input feature vector, the probability that the feature vector corresponds to the role; and the role determination unit is specifically configured to determine the role sequence corresponding to the feature vector sequence contained in each voice segment according to the DNN model and the HMM used for assigning role labels to the feature vectors in the voice segments.

21. The voice-based role separation device according to claim 20, wherein the label assignment unit comprises: an initial role designation subunit for selecting a number of voice segments corresponding to a preset initial number of roles and designating a different role for each selected voice segment; an initial model training subunit for training a GMM and an HMM for each role using the feature vectors in the voice segments of the designated roles; a decoding subunit for decoding according to the trained GMMs and HMM to obtain the role sequence ranking highest in the probability of producing the feature vector sequences contained in the voice segments; a probability judgment subunit for judging whether the probability value corresponding to the role sequence is greater than a preset threshold; and a label assignment subunit for assigning role labels to the feature vectors in each voice segment according to the role sequence when the output of the probability judgment subunit is yes.

22. The voice-based role separation device according to claim 21, wherein the label assignment unit further comprises: a per-segment role designation subunit for designating, when the output of the probability judgment subunit is no, a corresponding role for each voice segment according to the role sequence; and a model update training subunit for training a GMM and an HMM for each role according to the feature vectors in each voice segment and the corresponding roles, and triggering the decoding subunit.

23. The voice-based role separation device according to claim 22, wherein the per-segment role designation subunit is specifically configured, for each voice segment, to designate the mode of the roles corresponding to its feature vectors as the role of the voice segment.

24. The voice-based role separation device according to claim 22, wherein the model update training subunit is specifically configured to train the GMM and the HMM incrementally on the basis of the models obtained in the previous training round.

25. The voice-based role separation device according to claim 22, wherein the label assignment unit further comprises: a training count judgment subunit for judging, when the output of the probability judgment subunit is no, whether the number of times the GMMs and HMM have been trained under the current number of roles is less than a preset upper limit on the number of training rounds, and triggering the per-segment role designation subunit when the judgment result is yes; and a role count adjustment subunit for adjusting the number of roles when the output of the training count judgment subunit is no, selecting a corresponding number of voice segments, designating a different role for each selected voice segment, and triggering the initial model training subunit.

26. The voice-based role separation device according to claim 25, wherein the label assignment unit further comprises: a role count judgment subunit for judging, when the output of the training count judgment subunit is no, whether the current number of roles meets a preset requirement, triggering the label assignment subunit if it does, and triggering the role count adjustment subunit otherwise.

27. The voice-based role separation device according to claim 18, wherein the feature extraction unit comprises: a framing subunit for dividing the voice signal into frames according to a preset frame length to obtain multiple audio frames; and a feature extraction execution subunit for extracting the feature vector of each audio frame to obtain the feature vector sequence.

28. The voice-based role separation device according to claim 27, wherein the feature extraction execution subunit is specifically configured to extract the MFCC features, PLP features, or LPC features of each audio frame to obtain the feature vector sequence.

29. The voice-based role separation device according to claim 19, wherein the voice segment segmentation unit is specifically configured to identify and discard the audio frames that contain no speech content using VAD technology and to segment the voice signal into voice segments.

30. The voice-based role separation device according to claim 29, further comprising: a VAD smoothing unit for merging voice segments whose duration is less than a preset threshold with adjacent voice segments, after the voice segment segmentation unit has segmented the voice segments using VAD technology.

31. The voice-based role separation device according to claim 18, wherein the DNN model training unit is specifically configured to train the DNN model with a back-propagation algorithm.

32. The voice-based role separation device according to claim 18, wherein the role determination unit outputs the role separation result as follows: according to the role sequence corresponding to the feature vector sequence, it outputs, for each role, the start and end time information of the audio frames to which its corresponding feature vectors belong.

33. The voice-based role separation device according to claim 21 or 25, wherein the initial role designation subunit or the role count adjustment subunit selects the corresponding number of voice segments as follows: selecting that number of voice segments whose duration meets a preset requirement.
TW106102244A 2017-01-20 2017-01-20 Voice-based role separation method and device TWI725111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW106102244A TWI725111B (en) 2017-01-20 2017-01-20 Voice-based role separation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW106102244A TWI725111B (en) 2017-01-20 2017-01-20 Voice-based role separation method and device

Publications (2)

Publication Number Publication Date
TW201828283A TW201828283A (en) 2018-08-01
TWI725111B true TWI725111B (en) 2021-04-21

Family

ID=63960064

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106102244A TWI725111B (en) 2017-01-20 2017-01-20 Voice-based role separation method and device

Country Status (1)

Country Link
TW (1) TWI725111B (en)

Also Published As

Publication number Publication date
TW201828283A (en) 2018-08-01

Similar Documents

Publication Publication Date Title
WO2017076211A1 (en) Voice-based role separation method and device
US11475881B2 (en) Deep multi-channel acoustic modeling
JP6705008B2 (en) Speaker verification method and system
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN110349597B (en) Voice detection method and device
Wang et al. Acoustic segment modeling with spectral clustering methods
US20180166066A1 (en) Using long short-term memory recurrent neural network for speaker diarization segmentation
US11676625B2 (en) Unified endpointer using multitask and multidomain learning
JP2018523156A (en) Language model speech end pointing
US11205420B1 (en) Speech processing using a recurrent neural network
JP6732703B2 (en) Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
CN112368769A (en) End-to-end stream keyword detection
US11258671B1 (en) Functionality management for devices
US10909983B1 (en) Target-device resolution
US11398226B1 (en) Complex natural language processing
US20200090068A1 (en) State prediction of devices
Joshi et al. A Study of speech emotion recognition methods
US20190371302A1 (en) Voice interaction system, its processing method, and program therefor
CN112071308A (en) Awakening word training method based on speech synthesis data enhancement
KR20230116886A (en) Self-supervised speech representation for fake audio detection
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
Liu et al. Graph-based semi-supervised acoustic modeling in DNN-based speech recognition
KR20210009593A (en) Method and apparatus for speech end-point detection using acoustic and language modeling knowledge for robust speech recognition
US11557292B1 (en) Speech command verification
US11682400B1 (en) Speech processing