WO2020062679A1 - An end-to-end speaker segmentation method and system based on deep learning - Google Patents

An end-to-end speaker segmentation method and system based on deep learning

Info

Publication number
WO2020062679A1
WO2020062679A1, PCT/CN2018/124431, CN2018124431W
Authority
WO
WIPO (PCT)
Prior art keywords
voice
segmented
mixed
stft
stft feature
Prior art date
Application number
PCT/CN2018/124431
Other languages
English (en)
French (fr)
Inventor
叶志坚
李稀敏
肖龙源
蔡振华
刘晓葳
谭玉坤
Original Assignee
厦门快商通信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 厦门快商通信息技术有限公司
Publication of WO2020062679A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • The invention relates to the technical field of speech signal processing, and in particular to an end-to-end speaker segmentation method based on deep learning and a system applying the method.
  • The traditional speaker segmentation method consists of two steps:
  • Segmentation step: a mixed speech recording is split into multiple short speech segments. The current mainstream speech segmentation algorithms are distance-measurement methods and model-based methods. The distance-measurement method determines whether a speaker change point exists between two segments from the distance between adjacent speech segments; the model-based method commonly uses a GMM (Gaussian mixture model) and an SVM (support vector machine) model, computes the similarity distance between adjacent models, and decides from an empirical threshold whether a speaker change point exists. The mixed speech is then segmented at the speaker change points to obtain multiple short speech segments.
  • Clustering step: the short speech segments belonging to the same person are clustered with a clustering algorithm to obtain the long speech of that person.
  • With this traditional approach, segmentation followed by re-clustering is required to obtain the long speech belonging to the same person; not only is the algorithm complex and computationally inefficient, but the purity of the result is also affected by the accuracy of both the segmentation and the clustering step.
  • To solve these problems, the present invention provides an end-to-end speaker segmentation method and system based on deep learning: the mixed speech to be segmented is simply input into the trained model, which outputs the segmented speech of each speaker. This end-to-end approach avoids the accumulation of errors in intermediate steps and achieves higher segmentation accuracy.
  • An end-to-end speaker segmentation method based on deep learning includes the following steps:
  • a2: mixing the first single-person voice and the second single-person voice to obtain a training mixed voice, and calculating the mixed STFT feature of the training mixed voice;
  • a3: segmenting the mixed voice according to the mixed STFT feature of step a2 to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
  • In the speaker segmentation step, the segmented STFT features of the different speakers obtained by segmenting the mixed speech to be segmented are output, and the segmented speech corresponding to each speaker is further obtained through the ISTFT transform.
  • Mixing the first single-person voice and the second single-person voice means splitting the first single-person voice and the second single-person voice respectively into two or more short speech segments, mixing all the short segments and synthesizing them into a long speech to obtain the training mixed voice.
  • The mixed speech to be segmented refers to conversational speech between two or more speakers.
  • The real and imaginary parts of the mixed STFT feature of the training mixed voice, the first real STFT feature, the second real STFT feature, the first segmented STFT feature, the second segmented STFT feature, and the mixed STFT feature and segmented STFT features of the mixed speech to be segmented are further concatenated along the channel dimension.
  • Step a3, segmenting the mixed voice, further comprises:
  • a33: inputting the temporal-information features into a three-layer fully connected network to generate a mask for the first single-person voice and a mask for the second single-person voice, respectively;
  • a34: multiplying the mask of the first single-person voice with the mixed STFT feature of the training mixed voice to obtain the second segmented STFT feature corresponding to the second single-person voice, and multiplying the mask of the second single-person voice with the mixed STFT feature of the training mixed voice to obtain the first segmented STFT feature corresponding to the first single-person voice;
  • a35: transforming the first segmented STFT feature by ISTFT to obtain the first segmented voice, and transforming the second segmented STFT feature by ISTFT to obtain the second segmented voice.
  • In step a4, the mean square error is used as the loss function, i.e., the mean square error between the first real STFT feature and the first segmented STFT feature and between the second real STFT feature and the second segmented STFT feature is calculated.
  • In step a5, optimizing the model parameters according to the loss function means optimizing them with a stochastic gradient descent algorithm so that the value of the mean square error drops to a preset threshold.
  • the present invention also provides an end-to-end speaker segmentation system based on deep learning, which includes:
  • a model training module further comprising:
  • a voice collection unit configured to collect a first single voice and a second single voice
  • a voice mixing unit that mixes the first single voice and the second single voice to obtain a mixed voice for training
  • An STFT feature extraction unit configured to calculate a first real STFT feature corresponding to the first single person voice, a second real STFT feature corresponding to the second single person voice, and a mixed STFT feature of the training mixed voice;
  • a voice segmentation unit configured to segment the mixed voice according to the mixed STFT feature extracted in the STFT feature extraction unit to obtain a first segmented voice and a first segmented STFT corresponding to the first single person voice Features, and second segmented voice and second segmented STFT features corresponding to the second single person voice;
  • a loss function construction unit that constructs a loss function by comparing the first real STFT feature with the first segmented STFT feature, the second real STFT feature, and the second segmented STFT feature;
  • a model optimization unit that optimizes model parameters according to the loss function and completes model training
  • The speaker segmentation module is used to input the mixed speech to be segmented into the model and output the segmented speech of the different speakers; or to extract the mixed STFT feature of the mixed speech to be segmented, input it into the model, output the segmented STFT features of the different speakers, and further obtain the segmented speech corresponding to each speaker through the ISTFT transform.
  • With the speaker segmentation method of the present invention, it is not necessary to first split a mixed voice into multiple short speech segments and then cluster the segments belonging to the same speaker into that speaker's long speech with a clustering algorithm; instead, the mixed speech to be segmented is directly input into the trained model, which outputs the segmented speech of each speaker.
  • This end-to-end approach avoids the accumulation of errors in intermediate steps, and the segmentation accuracy is higher;
  • The present invention obtains a trained model by collecting a large number of single-person voices and training on mixtures of any two of them that are then re-segmented; this gives the model better performance and makes it particularly suitable for segmenting recordings of conversations between two or more speakers;
  • the present invention compares the real STFT features of a single person's voice with the segmented STFT features of segmentation training and constructs a loss function, thereby optimizing the model parameters and making the model more accurate;
  • the present invention performs feature extraction and segmentation through a CNN network, an LSTM network, and a three-layer fully connected network, so that the trained model has higher performance.
  • FIG. 1 is a schematic flowchart of an end-to-end speaker segmentation method based on deep learning according to the present invention
  • FIG. 2 is a schematic structural diagram of an end-to-end speaker segmentation system based on deep learning according to the present invention.
  • an end-to-end speaker segmentation method based on deep learning of the present invention includes the following steps:
  • a2: mixing the first single-person voice and the second single-person voice to obtain a training mixed voice, and calculating the mixed STFT feature of the training mixed voice;
  • a3: segmenting the mixed voice according to the mixed STFT feature of step a2 to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
  • In the speaker segmentation step, the segmented STFT features of the different speakers obtained by segmenting the mixed speech to be segmented are output, and the segmented speech corresponding to each speaker is further obtained through the ISTFT transform.
  • In step a1, collecting the first single-person voice and the second single-person voice means collecting a large number of single-person voices and training the model by mixing and re-segmenting any two of them; for example, single-person voices of several thousand people are first collected, the single-person voices of any two people are then mixed, and the model is trained and optimized by constructing a voice training set, a voice development set and a voice test set.
  • In step a2, mixing the first single-person voice and the second single-person voice means splitting the first single-person voice and the second single-person voice respectively into two or more short speech segments, mixing all the short segments and synthesizing them into a long speech to obtain the training mixed voice.
  • In step b, the mixed speech to be segmented refers to conversational speech between two or more speakers.
  • the step a3 performing segmentation processing on the mixed voice further includes:
  • a31: the mixed STFT features of the training mixed voice are input into a CNN to extract deep features. The Convolutional Neural Network (CNN) is a deep neural network with local-perception and weight-sharing capabilities, composed of convolutional layers, pooling layers and fully connected layers; a convolutional layer analyzes each small patch of its input in greater depth to obtain more abstract features, and the depth of the node matrix increases after a convolutional layer; a pooling layer does not change the depth of the three-dimensional matrix but can reduce the size of the matrix;
  • a32: the deep features are input into an LSTM network to extract temporal-information features. The Long Short-Term Memory (LSTM) network is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in time series;
  • a33: the temporal-information features are input into a three-layer fully connected network, which consists of an input layer, a hidden layer and an output layer, to generate the mask of the first single-person voice and the mask of the second single-person voice. A mask is a feature produced by the three-layer fully connected network; it is used to shield the second segmented STFT feature corresponding to the second single-person voice when the first segmented STFT feature corresponding to the first single-person voice is extracted, and to shield the first segmented STFT feature when the second segmented STFT feature corresponding to the second single-person voice is extracted;
  • a34: the mask of the first single-person voice is multiplied with the mixed STFT feature of the training mixed voice to obtain the second segmented STFT feature corresponding to the second single-person voice, and the mask of the second single-person voice is multiplied with the mixed STFT feature of the training mixed voice to obtain the first segmented STFT feature corresponding to the first single-person voice;
  • a35: the first segmented STFT feature is transformed by ISTFT to obtain the first segmented voice, and the second segmented STFT feature is transformed by ISTFT to obtain the second segmented voice (a minimal code sketch of steps a31 to a35 is given at the end of this section).
  • In step a31, the CNN adopts a 15-layer neural network architecture with the following parameters:
  • Layer 1: 1×7 convolution kernel, 96 channels, dilation 1×1;
  • Layer 2: 7×1 convolution kernel, 96 channels, dilation 1×1;
  • Layer 3: 5×5 convolution kernel, 96 channels, dilation 1×1;
  • Layer 4: 5×5 convolution kernel, 96 channels, dilation 2×1;
  • Layer 5: 5×5 convolution kernel, 96 channels, dilation 4×1;
  • Layer 6: 5×5 convolution kernel, 96 channels, dilation 8×1;
  • Layer 7: 5×5 convolution kernel, 96 channels, dilation 16×1;
  • Layer 8: 5×5 convolution kernel, 96 channels, dilation 32×1;
  • Layer 9: 5×5 convolution kernel, 96 channels, dilation 1×1;
  • Layer 10: 5×5 convolution kernel, 96 channels, dilation 2×2;
  • Layer 11: 5×5 convolution kernel, 96 channels, dilation 4×4;
  • Layer 12: 5×5 convolution kernel, 96 channels, dilation 8×8;
  • Layer 13: 5×5 convolution kernel, 96 channels, dilation 16×16;
  • Layer 14: 5×5 convolution kernel, 96 channels, dilation 32×32;
  • Layer 15: 1×1 convolution kernel, 8 channels, dilation 1×1.
  • In step a4, the mean square error is used as the loss function, i.e., the mean square error between the first real STFT feature and the first segmented STFT feature and between the second real STFT feature and the second segmented STFT feature is calculated.
  • In step a5, optimizing the model parameters according to the loss function means optimizing them with the stochastic gradient descent (SGD) algorithm, so that the mean square error (the loss function) becomes smaller and smaller until its value drops to a preset threshold.
  • Steps a1 to a5 are repeated until the value of the mean square error drops to the preset threshold, i.e., the loss function is minimized, at which point model training is complete.
  • The short-time Fourier transform (STFT), also known as the windowed Fourier transform, is a time-frequency analysis method: it represents the signal characteristics at a given moment by a segment of the signal within a time window.
  • The length of the window determines the time resolution and frequency resolution of the spectrogram: the longer the window, the longer the intercepted signal and the higher the frequency resolution after the Fourier transform, but the worse the time resolution; conversely, the shorter the window, the shorter the intercepted signal, the worse the frequency resolution and the better the time resolution.
  • The time window makes the signal valid only within a small interval, which overcomes the weakness of the traditional Fourier transform in local time-frequency representation and gives the Fourier transform a local-positioning capability.
  • The STFT feature is a complex number, i.e., a + bj, where a is the real part and b is the imaginary part. In this embodiment, to avoid complex-valued arithmetic, the real and imaginary parts of the mixed STFT feature of the training mixed voice, the first real STFT feature, the second real STFT feature, the first segmented STFT feature, the second segmented STFT feature, and the mixed STFT feature and segmented STFT features of the mixed speech to be segmented are concatenated along the channel dimension, which makes the computation simpler and more efficient.
  • the present invention also provides an end-to-end speaker segmentation system based on deep learning, which includes:
  • a model training module further comprising:
  • a voice collection unit configured to collect a first single voice and a second single voice
  • a voice mixing unit that mixes the first single voice and the second single voice to obtain a mixed voice for training
  • An STFT feature extraction unit configured to calculate a first real STFT feature corresponding to the first single person voice, a second real STFT feature corresponding to the second single person voice, and a mixed STFT feature of the training mixed voice;
  • a voice segmentation unit configured to segment the mixed voice according to the mixed STFT feature extracted in the STFT feature extraction unit to obtain a first segmented voice and a first segmented STFT corresponding to the first single person voice Features, and second segmented voice and second segmented STFT features corresponding to the second single person voice;
  • a loss function construction unit that constructs a loss function by comparing the first real STFT feature with the first segmented STFT feature, the second real STFT feature, and the second segmented STFT feature;
  • a model optimization unit that optimizes model parameters according to the loss function and completes model training
  • The speaker segmentation module is used to input the mixed speech to be segmented into the model and output the segmented speech of the different speakers; or to extract the mixed STFT feature of the mixed speech to be segmented, input it into the model, output the segmented STFT features of the different speakers, and further obtain the segmented speech corresponding to each speaker through the ISTFT transform.
  • The terms "comprising", "including" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
  • A person of ordinary skill in the art will understand that all or part of the steps for implementing the foregoing embodiments may be completed by hardware, or by a program instructing the relevant hardware.
  • the program may be stored in a computer-readable storage medium.
  • the aforementioned storage medium may be a read-only memory, a magnetic disk or an optical disk.
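A minimal PyTorch-style sketch of the mask-generation pipeline of steps a31 to a35 is given below. It is illustrative only: the stand-in CNN front end, the layer sizes, the sigmoid masks and the tensor layout (real and imaginary STFT parts stacked as two channels) are assumptions, and the actual front end is the 15-layer dilated CNN specified later in the description.

```python
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    def __init__(self, freq_bins=257, cnn_channels=96, lstm_hidden=256):
        super().__init__()
        # Stand-in CNN front end (a31); the description specifies a 15-layer dilated CNN.
        self.cnn = nn.Sequential(
            nn.Conv2d(2, cnn_channels, kernel_size=(1, 7), padding=(0, 3)), nn.ReLU(),
            nn.Conv2d(cnn_channels, 8, kernel_size=1), nn.ReLU(),
        )
        # LSTM over the frame axis to extract temporal-information features (a32).
        self.lstm = nn.LSTM(input_size=8 * freq_bins, hidden_size=lstm_hidden, batch_first=True)
        # Three-layer fully connected network: input, hidden and output layers (a33).
        self.fc = nn.Sequential(
            nn.Linear(lstm_hidden, lstm_hidden), nn.ReLU(),
            nn.Linear(lstm_hidden, lstm_hidden), nn.ReLU(),
            nn.Linear(lstm_hidden, 2 * 2 * freq_bins), nn.Sigmoid(),  # two masks, real/imag planes
        )

    def forward(self, mixed_stft):                      # (batch, 2, frames, freq): real/imag as channels
        feats = self.cnn(mixed_stft)                    # deep features, (batch, 8, frames, freq)
        b, c, t, f = feats.shape
        temporal, _ = self.lstm(feats.permute(0, 2, 1, 3).reshape(b, t, c * f))
        masks = self.fc(temporal).view(b, t, 2, 2, f).permute(0, 2, 3, 1, 4)  # (batch, speaker, 2, frames, freq)
        # a34: multiply each mask with the mixed STFT feature to get the two segmented STFT features.
        return masks[:, 0] * mixed_stft, masks[:, 1] * mixed_stft

mixed = torch.randn(1, 2, 100, 257)          # placeholder mixed STFT feature
seg_stft_a, seg_stft_b = MaskSeparator()(mixed)
```

In this layout each mask has the same shape as the mixed STFT feature, so the element-wise products of step a34 directly yield the two segmented STFT features that the ISTFT of step a35 turns back into waveforms.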

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an end-to-end speaker segmentation method and system based on deep learning. A first single-person voice and a second single-person voice are collected, and the required model is trained through real STFT feature extraction, voice mixing, extraction of the mixed STFT feature, computation of the segmented STFT features, comparison of the real STFT features with the segmented STFT features, and model optimization. In use, there is no need to first split the mixed speech into multiple short speech segments and then cluster the segments belonging to the same speaker into that speaker's long speech with a clustering algorithm; instead, the mixed speech to be segmented is directly input into the trained model, which outputs the segmented speech of each speaker. This end-to-end approach avoids the accumulation of errors in intermediate steps and achieves higher segmentation accuracy.

Description

An end-to-end speaker segmentation method and system based on deep learning
Technical field
The invention relates to the technical field of speech signal processing, and in particular to an end-to-end speaker segmentation method based on deep learning and a system applying the method.
Background art
With the rapid increase in the channels through which audio is acquired and in its volume, audio management has become more and more complex, and speaker segmentation and clustering has gradually become a hot research topic internationally in recent years; many universities and research institutes abroad have carried out related work. The National Institute of Standards and Technology (NIST) added a segmentation-and-clustering task for telephone speech between two people to the speaker recognition evaluation it organized in 1999, and the Rich Transcription (RT) evaluation proposed by NIST in 2002 formally launched research on speaker segmentation and clustering.
The traditional speaker segmentation method consists of two steps:
1. Segmentation step: a mixed speech recording is split into multiple short speech segments. The current mainstream speech segmentation algorithms are distance-measurement methods and model-based methods. The distance-measurement method determines whether a speaker change point exists between two segments from the distance between adjacent speech segments; the model-based method commonly uses a GMM (Gaussian mixture model) and an SVM (support vector machine) model, computes the similarity distance between adjacent models, and decides from an empirical threshold whether a speaker change point exists. The mixed speech is then segmented at the speaker change points to obtain multiple short speech segments.
2. Clustering step: the short speech segments belonging to the same person are clustered with a clustering algorithm to obtain the long speech of that person.
With the traditional speaker segmentation method described above, segmentation followed by re-clustering is required to obtain the long speech belonging to the same person; not only is the algorithm complex and computationally inefficient, but the purity of the result is also affected by the accuracy of both the segmentation and the clustering step.
Summary of the invention
To solve the above problems, the present invention provides an end-to-end speaker segmentation method and system based on deep learning: the mixed speech to be segmented simply has to be input into the trained model to output the segmented speech of each speaker. This end-to-end approach avoids the accumulation of errors in intermediate steps and gives higher segmentation accuracy.
To achieve the above objective, the technical solution adopted by the invention is:
An end-to-end speaker segmentation method based on deep learning, comprising the following steps:
a. Model training step:
a1. collecting a first single-person voice and a second single-person voice, and computing a first real STFT feature corresponding to the first single-person voice and a second real STFT feature corresponding to the second single-person voice;
a2. mixing the first single-person voice and the second single-person voice to obtain a training mixed voice, and computing the mixed STFT feature of the training mixed voice;
a3. segmenting the mixed voice according to the mixed STFT feature of step a2 to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
a4. comparing the first real STFT feature with the first segmented STFT feature and the second real STFT feature with the second segmented STFT feature to construct a loss function;
a5. optimizing the model parameters according to the loss function to complete model training;
b. Speaker segmentation step:
inputting the mixed speech to be segmented into the model and outputting the segmented speech of the different speakers; or extracting the mixed STFT feature of the mixed speech to be segmented, inputting this mixed STFT feature into the model, outputting the segmented STFT features of the different speakers after segmentation, and further obtaining the segmented speech corresponding to each speaker through the ISTFT transform.
Preferably, in step a2, mixing the first single-person voice and the second single-person voice means splitting the first single-person voice and the second single-person voice respectively into two or more short speech segments, mixing all the short segments and synthesizing them into a long speech to obtain the training mixed voice; in step b, the mixed speech to be segmented refers to conversational speech between two or more speakers.
Preferably, the real and imaginary parts of the mixed STFT feature of the training mixed voice, the first real STFT feature, the second real STFT feature, the first segmented STFT feature, the second segmented STFT feature, and the mixed STFT feature and segmented STFT features of the mixed speech to be segmented are further concatenated along the channel dimension.
Preferably, step a3, segmenting the mixed voice, further comprises:
a31. inputting the mixed STFT feature of the training mixed voice into a CNN to extract deep features;
a32. inputting the deep features into an LSTM network to extract temporal-information features;
a33. inputting the temporal-information features into a three-layer fully connected network to generate a mask for the first single-person voice and a mask for the second single-person voice, respectively;
a34. multiplying the mask of the first single-person voice with the mixed STFT feature of the training mixed voice to obtain the second segmented STFT feature corresponding to the second single-person voice, and multiplying the mask of the second single-person voice with the mixed STFT feature of the training mixed voice to obtain the first segmented STFT feature corresponding to the first single-person voice;
a35. transforming the first segmented STFT feature by ISTFT to obtain the first segmented voice, and transforming the second segmented STFT feature by ISTFT to obtain the second segmented voice.
Preferably, in step a4, the mean square error is used as the loss function, i.e., the mean square error between the first real STFT feature and the first segmented STFT feature and between the second real STFT feature and the second segmented STFT feature is computed.
Preferably, in step a5, optimizing the model parameters according to the loss function means optimizing them with a stochastic gradient descent algorithm until the value of the mean square error drops to a preset threshold.
Correspondingly, the invention also provides an end-to-end speaker segmentation system based on deep learning, comprising:
a. a model training module, which further comprises:
a voice collection unit for collecting a first single-person voice and a second single-person voice;
a voice mixing unit for mixing the first single-person voice and the second single-person voice to obtain a training mixed voice;
an STFT feature extraction unit for computing the first real STFT feature corresponding to the first single-person voice, the second real STFT feature corresponding to the second single-person voice, and the mixed STFT feature of the training mixed voice;
a voice segmentation unit for segmenting the mixed voice according to the mixed STFT feature extracted by the STFT feature extraction unit to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
a loss function construction unit, which constructs a loss function by comparing the first real STFT feature with the first segmented STFT feature and the second real STFT feature with the second segmented STFT feature;
a model optimization unit, which optimizes the model parameters according to the loss function to complete model training;
b. a speaker segmentation module:
used to input the mixed speech to be segmented into the model and output the segmented speech of the different speakers; or to extract the mixed STFT feature of the mixed speech to be segmented, input it into the model, output the segmented STFT features of the different speakers after segmentation, and further obtain the segmented speech corresponding to each speaker through the ISTFT transform.
The beneficial effects of the invention are:
(1) With the speaker segmentation method of the invention, there is no need to first split the mixed speech into multiple short speech segments and then cluster the segments belonging to the same speaker into that speaker's long speech with a clustering algorithm; instead, the mixed speech to be segmented is directly input into the trained model, which outputs the segmented speech of each speaker. This end-to-end approach avoids the accumulation of errors in intermediate steps and achieves higher segmentation accuracy;
(2) The invention obtains the trained model by collecting a large number of single-person voices and training on mixtures of any two of them that are then re-segmented, which gives the model better performance and makes it particularly suitable for segmenting recordings of conversations between two or more speakers;
(3) The invention compares the real STFT features of the single-person voices with the segmented STFT features produced during training and constructs a loss function from them, thereby optimizing the model parameters and making the model more accurate;
(4) The invention performs feature extraction and segmentation with a CNN, an LSTM network and a three-layer fully connected network, so that the trained model achieves higher performance.
Brief description of the drawings
The drawings described here are provided for a further understanding of the invention and form a part of it; the illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow diagram of the end-to-end speaker segmentation method based on deep learning of the invention;
Fig. 2 is a structural diagram of the end-to-end speaker segmentation system based on deep learning of the invention.
Detailed description of the embodiments
To make the technical problems to be solved, the technical solutions and the beneficial effects of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
As shown in Fig. 1, the end-to-end speaker segmentation method based on deep learning of the invention comprises the following steps:
a. Model training step:
a1. collecting a first single-person voice and a second single-person voice, and computing a first real STFT feature corresponding to the first single-person voice and a second real STFT feature corresponding to the second single-person voice;
a2. mixing the first single-person voice and the second single-person voice to obtain a training mixed voice, and computing the mixed STFT feature of the training mixed voice;
a3. segmenting the mixed voice according to the mixed STFT feature of step a2 to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
a4. comparing the first real STFT feature with the first segmented STFT feature and the second real STFT feature with the second segmented STFT feature to construct a loss function;
a5. optimizing the model parameters according to the loss function to complete model training;
b. Speaker segmentation step:
inputting the mixed speech to be segmented into the model and outputting the segmented speech of the different speakers; or extracting the mixed STFT feature of the mixed speech to be segmented, inputting this mixed STFT feature into the model, outputting the segmented STFT features of the different speakers after segmentation, and further obtaining the segmented speech corresponding to each speaker through the ISTFT transform.
In step a1, collecting the first single-person voice and the second single-person voice means collecting a large number of single-person voices and training the model by mixing and re-segmenting any two of them; for example, single-person voices of several thousand people are first collected, the single-person voices of any two people are then mixed, and the model is trained and optimized by constructing a voice training set, a voice development set and a voice test set.
In step a2, mixing the first single-person voice and the second single-person voice means splitting the first single-person voice and the second single-person voice respectively into two or more short speech segments, mixing all the short segments and synthesizing them into a long speech to obtain the training mixed voice; in step b, the mixed speech to be segmented refers to conversational speech between two or more speakers.
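For illustration, a minimal sketch of constructing one training mixture is given below. The description does not fix the exact mixing operation, so summing the short segments pairwise, as done here, is only one plausible reading (interleaving the segments into an alternating long recording would be another); the segment length and sampling rate are likewise assumptions.

```python
import numpy as np

sr, seg_sec = 16000, 2.0
voice_a = np.random.randn(10 * sr)      # placeholder single-person voice of speaker A
voice_b = np.random.randn(10 * sr)      # placeholder single-person voice of speaker B

n = int(seg_sec * sr)
length = min(len(voice_a), len(voice_b)) // n * n      # keep whole segments only
segs_a = voice_a[:length].reshape(-1, n)               # short segments of speaker A
segs_b = voice_b[:length].reshape(-1, n)               # short segments of speaker B

mixed_segments = segs_a + segs_b                       # mix the short segments pairwise
training_mixed_voice = mixed_segments.reshape(-1)      # concatenate into one long speech

# The "real" STFT features of voice_a and voice_b and the mixed STFT feature of
# training_mixed_voice would then be computed (e.g. with torch.stft) for training.
```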
Step a3, segmenting the mixed voice, further comprises:
a31. inputting the mixed STFT feature of the training mixed voice into a CNN to extract deep features; the Convolutional Neural Network (CNN) is a deep neural network with local-perception and weight-sharing capabilities composed of convolutional layers, pooling layers and fully connected layers; a convolutional layer analyzes each small patch of the network's input in greater depth to obtain more abstract features, and the depth of the node matrix increases after a convolutional layer; a pooling layer does not change the depth of the three-dimensional matrix but can reduce the size of the matrix;
a32. inputting the deep features into an LSTM network to extract temporal-information features; the Long Short-Term Memory (LSTM) network is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in time series;
a33. inputting the temporal-information features into a three-layer fully connected network to generate the mask of the first single-person voice and the mask of the second single-person voice, respectively; the three-layer fully connected network (FC) comprises an input layer, a hidden layer and an output layer; a mask is a feature produced by the three-layer fully connected network and is used to shield the second segmented STFT feature corresponding to the second single-person voice when the first segmented STFT feature corresponding to the first single-person voice is extracted, and to shield the first segmented STFT feature corresponding to the first single-person voice when the second segmented STFT feature corresponding to the second single-person voice is extracted;
a34. multiplying the mask of the first single-person voice with the mixed STFT feature of the training mixed voice to obtain the second segmented STFT feature corresponding to the second single-person voice, and multiplying the mask of the second single-person voice with the mixed STFT feature of the training mixed voice to obtain the first segmented STFT feature corresponding to the first single-person voice;
a35. transforming the first segmented STFT feature by ISTFT to obtain the first segmented voice, and transforming the second segmented STFT feature by ISTFT to obtain the second segmented voice.
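For concreteness, a minimal sketch of steps a34 and a35 is given below, assuming PyTorch's built-in STFT/ISTFT and placeholder masks; in the actual method the masks come from the three-layer fully connected network of step a33.

```python
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

mixture = torch.randn(16000 * 4)                            # placeholder 4 s mixed waveform
mixed_stft = torch.stft(mixture, n_fft, hop_length=hop,
                        window=window, return_complex=True)  # mixed STFT feature, (freq, frames)

mask1 = torch.rand_like(mixed_stft.real)                    # placeholder masks in [0, 1]
mask2 = 1.0 - mask1

seg_stft_1 = mask1 * mixed_stft                             # a34: segmented STFT features
seg_stft_2 = mask2 * mixed_stft

seg_voice_1 = torch.istft(seg_stft_1, n_fft, hop_length=hop, window=window)  # a35: segmented voices
seg_voice_2 = torch.istft(seg_stft_2, n_fft, hop_length=hop, window=window)
```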
In step a31, the CNN adopts a 15-layer neural network architecture with the following parameters:
Layer 1: 1×7 convolution kernel, 96 channels, dilation 1×1;
Layer 2: 7×1 convolution kernel, 96 channels, dilation 1×1;
Layer 3: 5×5 convolution kernel, 96 channels, dilation 1×1;
Layer 4: 5×5 convolution kernel, 96 channels, dilation 2×1;
Layer 5: 5×5 convolution kernel, 96 channels, dilation 4×1;
Layer 6: 5×5 convolution kernel, 96 channels, dilation 8×1;
Layer 7: 5×5 convolution kernel, 96 channels, dilation 16×1;
Layer 8: 5×5 convolution kernel, 96 channels, dilation 32×1;
Layer 9: 5×5 convolution kernel, 96 channels, dilation 1×1;
Layer 10: 5×5 convolution kernel, 96 channels, dilation 2×2;
Layer 11: 5×5 convolution kernel, 96 channels, dilation 4×4;
Layer 12: 5×5 convolution kernel, 96 channels, dilation 8×8;
Layer 13: 5×5 convolution kernel, 96 channels, dilation 16×16;
Layer 14: 5×5 convolution kernel, 96 channels, dilation 32×32;
Layer 15: 1×1 convolution kernel, 8 channels, dilation 1×1.
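The table above maps directly onto a stack of dilated 2-D convolutions. One possible PyTorch realization is sketched below; the kernel sizes, channel counts and dilations follow the table, while the input channel count (2, for the real and imaginary STFT parts), the "same" padding and the ReLU activations are assumptions not specified in the text.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, dilation):
    # padding chosen so the time-frequency map keeps its size
    pad = tuple(d * (k - 1) // 2 for k, d in zip(kernel, dilation))
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, padding=pad, dilation=dilation),
                         nn.ReLU())

specs = [                                   # (kernel, channels, dilation) for layers 1 to 15
    ((1, 7), 96, (1, 1)), ((7, 1), 96, (1, 1)), ((5, 5), 96, (1, 1)),
    ((5, 5), 96, (2, 1)), ((5, 5), 96, (4, 1)), ((5, 5), 96, (8, 1)),
    ((5, 5), 96, (16, 1)), ((5, 5), 96, (32, 1)), ((5, 5), 96, (1, 1)),
    ((5, 5), 96, (2, 2)), ((5, 5), 96, (4, 4)), ((5, 5), 96, (8, 8)),
    ((5, 5), 96, (16, 16)), ((5, 5), 96, (32, 32)), ((1, 1), 8, (1, 1)),
]

layers, in_ch = [], 2                       # 2 input channels: real and imaginary parts of the mixed STFT
for kernel, out_ch, dilation in specs:
    layers.append(conv_block(in_ch, out_ch, kernel, dilation))
    in_ch = out_ch
cnn = nn.Sequential(*layers)                # deep-feature extractor used in step a31
```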
In step a4, the mean square error (MSE) is used as the loss function, i.e., the mean square error between the first real STFT feature and the first segmented STFT feature and between the second real STFT feature and the second segmented STFT feature is computed.
In step a5, optimizing the model parameters according to the loss function means optimizing them with the stochastic gradient descent (SGD) algorithm, so that the mean square error (the loss function) becomes smaller and smaller until its value drops to a preset threshold.
Steps a1 to a5 above are repeated until the value of the mean square error drops to the preset threshold, i.e., the loss function is minimized, at which point model training is complete.
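A minimal sketch of this training loop (steps a4 and a5) is given below; the placeholder model, the placeholder STFT tensors and the threshold value are assumptions used only to make the sketch self-contained.

```python
import torch
import torch.nn as nn

freq = 257
model = nn.Sequential(nn.Linear(2 * freq, 512), nn.ReLU(), nn.Linear(512, 4 * freq))  # placeholder separator
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
threshold = 1e-3                                         # preset threshold for the mean square error

mixed_stft = torch.randn(64, 2 * freq)                   # placeholder mixed STFT frames (real/imag stacked)
real_stft_1 = torch.randn(64, 2 * freq)                  # placeholder first real STFT feature
real_stft_2 = torch.randn(64, 2 * freq)                  # placeholder second real STFT feature

for step in range(100000):
    seg_stft_1, seg_stft_2 = model(mixed_stft).chunk(2, dim=-1)   # segmented STFT features
    loss = criterion(seg_stft_1, real_stft_1) + criterion(seg_stft_2, real_stft_2)  # a4: MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < threshold:                          # a5: stop once the MSE reaches the threshold
        break
```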
The short-time Fourier transform (STFT), also known as the windowed Fourier transform, is a time-frequency analysis method that represents the signal characteristics at a given moment by a segment of the signal within a time window. In the short-time Fourier transform, the window length determines the time resolution and frequency resolution of the spectrogram: the longer the window, the longer the intercepted signal and the higher the frequency resolution after the Fourier transform, but the worse the time resolution; conversely, the shorter the window, the shorter the intercepted signal, the worse the frequency resolution and the better the time resolution. The time window makes the signal valid only within a small interval, which overcomes the weakness of the traditional Fourier transform in local time-frequency representation and gives the Fourier transform a local-positioning capability.
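The window-length trade-off can be read directly from the STFT output shape; the sketch below is illustrative, with an assumed 16 kHz signal and two assumed window sizes.

```python
import torch

signal = torch.randn(16000)                      # placeholder 1 s signal at 16 kHz

for n_fft in (256, 1024):                        # short window vs long window
    spec = torch.stft(signal, n_fft, hop_length=n_fft // 4,
                      window=torch.hann_window(n_fft), return_complex=True)
    print(n_fft, tuple(spec.shape))              # (freq_bins, frames): (129, 251) vs (513, 63)
# The longer window yields more frequency bins (finer frequency resolution)
# but fewer frames (coarser time resolution), and vice versa.
```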
Moreover, since an STFT feature is a complex number, i.e., a + bj, where a is the real part and b is the imaginary part, in this embodiment, to avoid complex-valued arithmetic, the real and imaginary parts of the mixed STFT feature of the training mixed voice, the first real STFT feature, the second real STFT feature, the first segmented STFT feature, the second segmented STFT feature, and the mixed STFT feature and segmented STFT features of the mixed speech to be segmented are concatenated along the channel dimension, which makes the computation simpler and more efficient.
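A minimal sketch of this real/imaginary handling is given below, assuming PyTorch's STFT; the window and FFT sizes are illustrative.

```python
import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
speech = torch.randn(16000 * 3)                                   # placeholder 3 s waveform

stft = torch.stft(speech, n_fft, hop_length=hop, window=window,
                  return_complex=True)                            # complex STFT feature, (freq, frames)
stft_ri = torch.stack((stft.real, stft.imag), dim=0)              # (2, freq, frames): real/imag as channels

# To go back to a complex STFT feature (e.g. before the ISTFT of step b):
stft_back = torch.complex(stft_ri[0], stft_ri[1])
assert torch.allclose(stft_back, stft)
```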
As shown in Fig. 2, the invention also provides an end-to-end speaker segmentation system based on deep learning, comprising:
a. a model training module, which further comprises:
a voice collection unit for collecting a first single-person voice and a second single-person voice;
a voice mixing unit for mixing the first single-person voice and the second single-person voice to obtain a training mixed voice;
an STFT feature extraction unit for computing the first real STFT feature corresponding to the first single-person voice, the second real STFT feature corresponding to the second single-person voice, and the mixed STFT feature of the training mixed voice;
a voice segmentation unit for segmenting the mixed voice according to the mixed STFT feature extracted by the STFT feature extraction unit to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
a loss function construction unit, which constructs a loss function by comparing the first real STFT feature with the first segmented STFT feature and the second real STFT feature with the second segmented STFT feature;
a model optimization unit, which optimizes the model parameters according to the loss function to complete model training;
b. a speaker segmentation module:
used to input the mixed speech to be segmented into the model and output the segmented speech of the different speakers; or to extract the mixed STFT feature of the mixed speech to be segmented, input it into the model, output the segmented STFT features of the different speakers after segmentation, and further obtain the segmented speech corresponding to each speaker through the ISTFT transform.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Since the system embodiment is basically similar to the method embodiment, its description is relatively brief, and reference may be made to the corresponding parts of the method embodiment.
Moreover, in this text the terms "comprising", "including" or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element. In addition, a person of ordinary skill in the art will understand that all or part of the steps for implementing the above embodiments may be completed by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description shows and describes preferred embodiments of the invention. It should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; it can be used in various other combinations, modifications and environments, and can be altered within the scope of the inventive concept described herein on the basis of the above teachings or of the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the claims appended to the invention.

Claims (7)

  1. An end-to-end speaker segmentation method based on deep learning, characterized by comprising the following steps:
    a. a model training step:
    a1. collecting a first single-person voice and a second single-person voice, and computing a first real STFT feature corresponding to the first single-person voice and a second real STFT feature corresponding to the second single-person voice;
    a2. mixing the first single-person voice and the second single-person voice to obtain a training mixed voice, and computing the mixed STFT feature of the training mixed voice;
    a3. segmenting the mixed voice according to the mixed STFT feature of step a2 to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
    a4. comparing the first real STFT feature with the first segmented STFT feature and the second real STFT feature with the second segmented STFT feature to construct a loss function;
    a5. optimizing the model parameters according to the loss function to complete model training;
    b. a speaker segmentation step:
    inputting the mixed speech to be segmented into the model and outputting the segmented speech of the different speakers; or extracting the mixed STFT feature of the mixed speech to be segmented, inputting this mixed STFT feature into the model, outputting the segmented STFT features of the different speakers after segmentation, and further obtaining the segmented speech corresponding to each speaker through the ISTFT transform.
  2. The end-to-end speaker segmentation method based on deep learning according to claim 1, characterized in that in step a2, mixing the first single-person voice and the second single-person voice means splitting the first single-person voice and the second single-person voice respectively into two or more short speech segments, mixing all the short segments and synthesizing them into a long speech to obtain the training mixed voice; and in step b, the mixed speech to be segmented refers to conversational speech between two or more speakers.
  3. The end-to-end speaker segmentation method based on deep learning according to claim 1, characterized in that the real and imaginary parts of the mixed STFT feature of the training mixed voice, the first real STFT feature, the second real STFT feature, the first segmented STFT feature, the second segmented STFT feature, and the mixed STFT feature and segmented STFT features of the mixed speech to be segmented are further concatenated along the channel dimension.
  4. The end-to-end speaker segmentation method based on deep learning according to claim 1, 2 or 3, characterized in that step a3, segmenting the mixed voice, further comprises:
    a31. inputting the mixed STFT feature of the training mixed voice into a CNN to extract deep features;
    a32. inputting the deep features into an LSTM network to extract temporal-information features;
    a33. inputting the temporal-information features into a three-layer fully connected network to generate a mask for the first single-person voice and a mask for the second single-person voice, respectively;
    a34. multiplying the mask of the first single-person voice with the mixed STFT feature of the training mixed voice to obtain the second segmented STFT feature corresponding to the second single-person voice, and multiplying the mask of the second single-person voice with the mixed STFT feature of the training mixed voice to obtain the first segmented STFT feature corresponding to the first single-person voice;
    a35. transforming the first segmented STFT feature by ISTFT to obtain the first segmented voice, and transforming the second segmented STFT feature by ISTFT to obtain the second segmented voice.
  5. The end-to-end speaker segmentation method based on deep learning according to claim 1, characterized in that in step a4 the mean square error is used as the loss function, i.e., the mean square error between the first real STFT feature and the first segmented STFT feature and between the second real STFT feature and the second segmented STFT feature is computed.
  6. The end-to-end speaker segmentation method based on deep learning according to claim 5, characterized in that in step a5, optimizing the model parameters according to the loss function means optimizing them with a stochastic gradient descent algorithm until the value of the mean square error drops to a preset threshold.
  7. An end-to-end speaker segmentation system based on deep learning, characterized by comprising:
    a. a model training module, which further comprises:
    a voice collection unit for collecting a first single-person voice and a second single-person voice;
    a voice mixing unit for mixing the first single-person voice and the second single-person voice to obtain a training mixed voice;
    an STFT feature extraction unit for computing the first real STFT feature corresponding to the first single-person voice, the second real STFT feature corresponding to the second single-person voice, and the mixed STFT feature of the training mixed voice;
    a voice segmentation unit for segmenting the mixed voice according to the mixed STFT feature extracted by the STFT feature extraction unit to obtain a first segmented voice and a first segmented STFT feature corresponding to the first single-person voice, and a second segmented voice and a second segmented STFT feature corresponding to the second single-person voice;
    a loss function construction unit, which constructs a loss function by comparing the first real STFT feature with the first segmented STFT feature and the second real STFT feature with the second segmented STFT feature;
    a model optimization unit, which optimizes the model parameters according to the loss function to complete model training;
    b. a speaker segmentation module:
    used to input the mixed speech to be segmented into the model and output the segmented speech of the different speakers; or to extract the mixed STFT feature of the mixed speech to be segmented, input it into the model, output the segmented STFT features of the different speakers after segmentation, and further obtain the segmented speech corresponding to each speaker through the ISTFT transform.
PCT/CN2018/124431 2018-09-30 2018-12-27 An end-to-end speaker segmentation method and system based on deep learning WO2020062679A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811158674.9 2018-09-30
CN201811158674.9A CN109461447B (zh) 2018-09-30 2018-09-30 An end-to-end speaker segmentation method and system based on deep learning

Publications (1)

Publication Number Publication Date
WO2020062679A1 true WO2020062679A1 (zh) 2020-04-02

Family

ID=65607277

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124431 WO2020062679A1 (zh) 2018-09-30 2018-12-27 An end-to-end speaker segmentation method and system based on deep learning

Country Status (2)

Country Link
CN (1) CN109461447B (zh)
WO (1) WO2020062679A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110289002B (zh) * 2019-06-28 2021-04-27 四川长虹电器股份有限公司 一种端到端的说话人聚类方法及系统
CN110544482B (zh) * 2019-09-09 2021-11-12 北京中科智极科技有限公司 一种单通道语音分离系统
CN110970053B (zh) * 2019-12-04 2022-03-15 西北工业大学深圳研究院 一种基于深度聚类的多通道与说话人无关语音分离方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952649A (zh) * 2017-05-14 2017-07-14 北京工业大学 基于卷积神经网络和频谱图的说话人识别方法
CN107680611A (zh) * 2017-09-13 2018-02-09 电子科技大学 基于卷积神经网络的单通道声音分离方法
CN108228915A (zh) * 2018-03-29 2018-06-29 华南理工大学 一种基于深度学习的视频检索方法
CN108510979A (zh) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 一种混合频率声学识别模型的训练方法及语音识别方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008052117A (ja) * 2006-08-25 2008-03-06 Oki Electric Ind Co Ltd 雑音除去装置、方法及びプログラム
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
CN102543063B (zh) * 2011-12-07 2013-07-24 华南理工大学 基于说话人分割与聚类的多说话人语速估计方法
US9159321B2 (en) * 2012-02-27 2015-10-13 Hong Kong Baptist University Lip-password based speaker verification system
CN106782507B (zh) * 2016-12-19 2018-03-06 平安科技(深圳)有限公司 语音分割的方法及装置
CN107358945A (zh) * 2017-07-26 2017-11-17 谢兵 一种基于机器学习的多人对话音频识别方法及系统
CN108376215A (zh) * 2018-01-12 2018-08-07 上海大学 一种身份认证方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510979A (zh) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 一种混合频率声学识别模型的训练方法及语音识别方法
CN106952649A (zh) * 2017-05-14 2017-07-14 北京工业大学 基于卷积神经网络和频谱图的说话人识别方法
CN107680611A (zh) * 2017-09-13 2018-02-09 电子科技大学 基于卷积神经网络的单通道声音分离方法
CN108228915A (zh) * 2018-03-29 2018-06-29 华南理工大学 一种基于深度学习的视频检索方法

Also Published As

Publication number Publication date
CN109461447A (zh) 2019-03-12
CN109461447B (zh) 2023-08-18

Similar Documents

Publication Publication Date Title
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
Du et al. Aishell-2: Transforming mandarin asr research into industrial scale
KR102134201B1 (ko) 숫자 음성 인식에 있어서 음성 복호화 네트워크를 구성하기 위한 방법, 장치, 및 저장 매체
WO2018227781A1 (zh) 语音识别方法、装置、计算机设备及存储介质
CN109192213B (zh) 庭审语音实时转写方法、装置、计算机设备及存储介质
CN109599093B (zh) 智能质检的关键词检测方法、装置、设备及可读存储介质
WO2018227780A1 (zh) 语音识别方法、装置、计算机设备及存储介质
CN106297776B (zh) 一种基于音频模板的语音关键词检索方法
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
CN106611604B (zh) 一种基于深度神经网络的自动语音叠音检测方法
CN104900235B (zh) 基于基音周期混合特征参数的声纹识别方法
CN108766418A (zh) 语音端点识别方法、装置及设备
CN108922541B (zh) 基于dtw和gmm模型的多维特征参数声纹识别方法
US20160189730A1 (en) Speech separation method and system
CN105469784B (zh) 一种基于概率线性鉴别分析模型的说话人聚类方法及系统
CN110178178A (zh) 具有环境自动语音识别(asr)的麦克风选择和多个讲话者分割
CN109545228A (zh) 一种端到端说话人分割方法及系统
WO2020062679A1 (zh) 一种基于深度学习的端到端说话人分割方法及系统
CN101923855A (zh) 文本无关的声纹识别系统
CN107146615A (zh) 基于匹配模型二次识别的语音识别方法及系统
CN109767778A (zh) 一种融合Bi-LSTM和WaveNet的语音转换方法
CN103065620A (zh) 在手机上或网页上接收用户输入的文字并实时合成为个性化声音的方法
CN110299142A (zh) 一种基于网络融合的声纹识别方法及装置
Li et al. Sams-net: A sliced attention-based neural network for music source separation
CN110268471A (zh) 具有嵌入式降噪的asr的方法和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18935622

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18935622

Country of ref document: EP

Kind code of ref document: A1