CN114155883B - Progressive type based speech deep neural network training method and device - Google Patents
Progressive type based speech deep neural network training method and device
- Publication number
- CN114155883B (application CN202210116109.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- neural network
- speech
- deep neural
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a progressive speech deep neural network training method and device, a storage medium and an electronic device. The progressive speech deep neural network training method comprises: acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, an encoder and a reconstructor; and determining the preset speech deep neural network model as the target speech deep neural network model when a loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. By training a speech deep neural network comprising a progressive extractor, an encoder and a reconstructor, the scheme solves the technical problem in the prior art that the target speech cannot be effectively separated from mixed speech.
Description
Technical Field
The invention relates to the field of speech signal processing, and in particular to a progressive speech deep neural network training method and device, a storage medium and an electronic device.
Background
Intelligent devices such as smart speakers, hearing aids and smart headsets have become an indispensable part of daily life. The rapid development of these devices has benefited from the continuous improvement of voice interaction technology in recent years. During voice interaction, the speaker often issues commands in complex acoustic scenes, so the speaker's voice is frequently corrupted by noise, reverberation or other speakers. If the background noise or overlapping speech cannot be removed in time, back-end applications such as speech recognition, semantic understanding and wake-up are severely affected. Speech extraction and separation has therefore become a focus of speech signal processing. Compared with multi-channel speech separation, single-channel speech separation has the advantages of low hardware requirements, low cost and low computation, but its algorithm design is more difficult, because single-channel separation relies only on the signal collected by a single microphone and must model the differences in time-frequency acoustic and statistical characteristics between the target speech and the interfering signals.
In recent years, the rapid development of neural networks and deep learning has driven extensive research on speech separation. The basic idea of deep-learning-based speech separation is as follows: build a speech separation model, extract feature parameters from the mixed speech, learn through network training the mapping between these feature parameters and the feature parameters of the target speech signal, and then, for any input mixture, output the target speech signal through the trained model, thereby achieving speech separation. A great deal of work has been done on end-to-end time-domain and frequency-domain algorithms; frequency-domain algorithms include Deep Clustering, DANet, uPIT and Deep CASA, while time-domain algorithms include Conv-TasNet, BLSTM-TasNet, FurcaNeXt and Wavesplit. Most of these algorithms are designed for separating clean speech mixtures, and although their separation performance is good in that setting, their accuracy drops sharply when applied in complex scenes. Real-life scenes are usually accompanied by background noise, reverberation and other speakers' voices, so speech separation research must address mixtures containing these additional interference factors if the algorithms are to be accurate and efficient in practice.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a progressive speech deep neural network training method and device, a storage medium and an electronic device, so as to at least solve the technical problem in the prior art that the target speech cannot be effectively separated from mixed speech.
According to an aspect of an embodiment of the present invention, a progressive speech deep neural network training method is provided, comprising: acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature; and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
Optionally, the encoder being used for performing feature extraction on the mixed speech to obtain the first feature comprises: inputting the mixed speech sample into the preset speech deep neural network model, and obtaining the first feature through the two convolutional network layers, the ReLU activation function and the batch normalization included in the encoder.
Optionally, the progressive extractor being used for computing the high-dimensional mapping relationship feature from the first feature comprises: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature.
Optionally, inputting each element in the first feature into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature comprises: in the case where the first feature is denoted H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, the progressive extractor comprises M progressive units, denoted J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and so on for each position, until hM-1 and pM-2 are added to obtain the corresponding output pM-1, yielding the high-dimensional mapping relationship feature P = {p0, …, pM-1}.
Optionally, the reconstructor being used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature comprises: inputting the mapping relationship P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition comprises: calculating the scale-invariant signal-to-noise ratio between the target sample speech and the predicted target speech, and determining the loss function from the scale-invariant signal-to-noise ratio; adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
According to another aspect of the embodiments of the present invention, a progressive speech deep neural network training apparatus is provided, comprising: an acquisition unit, used for acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech; a prediction unit, used for inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature; and a determining unit, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
Optionally, the prediction unit comprises: an encoding module, used for inputting the mixed speech sample into the preset speech deep neural network model and obtaining the first feature through the two convolutional network layers, the ReLU activation function and the batch normalization included in the encoder.
Optionally, the prediction unit is further configured to perform the following operations: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature.
Optionally, the prediction unit is further configured to perform the following operations: in the case where the first feature is denoted H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, the progressive extractor comprises M progressive units, denoted J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and so on for each position, until hM-1 and pM-2 are added to obtain the corresponding output pM-1, yielding the high-dimensional mapping relationship feature P = {p0, …, pM-1}.
Optionally, the prediction unit is further configured to perform the following operations: inputting the mapping relationship P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, the determining unit comprises: a calculation module, used for calculating the scale-invariant signal-to-noise ratio between the target sample speech and the predicted target speech and determining the loss function from the scale-invariant signal-to-noise ratio; an adjusting module, used for adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and a determining module, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
According to another aspect of the embodiments of the present application, a computer-readable storage medium is provided, in which a computer program is stored, the computer program being configured to execute the above progressive speech deep neural network training method when run.
According to another aspect of the embodiments of the present application, an electronic device is provided, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the above progressive speech deep neural network training method.
In the embodiments of the invention, a mixed speech sample and a target sample speech are acquired, wherein the mixed speech sample comprises the target speech and noise speech; the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature; and the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. By training a speech deep neural network model comprising a progressive extractor, an encoder and a reconstructor in this way, the scheme solves the technical problem in the prior art that the target speech cannot be effectively separated from mixed speech.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a mobile terminal for an alternative progressive speech deep neural network training method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative progressive speech deep neural network training method according to an embodiment of the present invention;
FIG. 3 is an overall structure diagram of an alternative progressive speech extraction network according to an embodiment of the present invention;
FIG. 4 is a structure diagram of an alternative encoder according to an embodiment of the present invention;
FIG. 5 is a structure diagram of an alternative progressive unit according to an embodiment of the present invention;
FIG. 6 is a structure diagram of an alternative progressive extractor according to an embodiment of the present invention;
FIG. 7 is a structure diagram of an alternative reconstructor according to an embodiment of the present invention;
FIG. 8 is a structure diagram of an alternative progressive speech deep neural network training apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a sequence of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the present application, some of the names are now described below:
the progressive speech deep neural network training method provided in the embodiments of the present application may be executed on a mobile terminal, a computer terminal or a similar computing device. Taking execution on a mobile terminal as an example, fig. 1 is a block diagram of the hardware structure of a mobile terminal running the progressive speech deep neural network training method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only illustrative and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the advanced speech deep neural network training method according to an embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a progressive speech deep neural network training method is also provided. Fig. 2 is a flowchart of a progressive speech deep neural network training method according to an embodiment of the present invention. As shown in fig. 2, the progressive speech deep neural network training method includes the following steps:
step S202, a mixed voice sample and a target sample voice are obtained, where the mixed voice sample includes the target voice and a noise voice.
Step S204: the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, where the preset speech deep neural network model includes a progressive extractor, a reconstructor and an encoder; the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature.
Step S206: the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
In this embodiment, the invention is directed to a single-channel speech separation algorithm that extracts the target speaker progressively, addressing the problem of extracting the target speaker against a background containing noise, reverberation and interference from other speakers. Compared with similar algorithms, this algorithm progressively enhances the extracted features of the target speech, thereby greatly improving the separation accuracy in noisy and reverberant scenes, reducing the distortion of the extracted speech and improving its intelligibility.
The noise speech may include, but is not limited to, speech from speakers other than the target user, such as conversations between the target user and other people, and may also include other sounds in the environment.
According to the embodiment provided in the present application, a mixed speech sample and a target sample speech are acquired, where the mixed speech sample includes the target speech and noise speech; the mixed speech sample is input into a preset speech deep neural network model to obtain a predicted target speech, where the preset speech deep neural network model includes a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor computes a high-dimensional mapping relationship feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature; and the preset speech deep neural network model is determined as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. By training a speech deep neural network model including a progressive extractor, an encoder and a reconstructor in this way, the technical problem that the target speech cannot be effectively separated from mixed speech in the prior art is solved.
Optionally, the encoder being used for performing feature extraction on the mixed speech to obtain the first feature may include: inputting the mixed speech sample into the preset speech deep neural network model, and obtaining the first feature through the two convolutional network layers, the ReLU activation function and the batch normalization included in the encoder.
Optionally, the progressive extractor being used for computing the high-dimensional mapping relationship feature from the first feature may include: the progressive extractor comprises a plurality of progressive units, each progressive unit comprising a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature.
Optionally, inputting each element in the first feature into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature may include: in the case where the first feature is denoted H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, the progressive extractor comprises M progressive units, denoted J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and so on for each position, until hM-1 and pM-2 are added to obtain the corresponding output pM-1, yielding the high-dimensional mapping relationship feature P = {p0, …, pM-1}.
Optionally, the reconstructor being used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature may include: inputting the mapping relationship P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition may include: calculating the scale-invariant signal-to-noise ratio between the target sample speech and the predicted target speech, and determining the loss function from the scale-invariant signal-to-noise ratio; adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
As an alternative embodiment, the present application further provides a progressive speech extraction algorithm. The method comprises the following steps.
Fig. 3 shows the overall structure of the progressive speech extraction network. In this embodiment, the progressive speech extraction algorithm includes a progressive extractor, an encoder and a reconstructor. As shown in fig. 4, the encoder mainly includes two convolutional layers (CNN) and one pooling layer (Pooling). As shown in fig. 5, each progressive unit may include: a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer. As shown in fig. 6, the progressive extractor mainly includes two time-delay neural network layers (TDNN), a pooling layer and a graph convolution network layer (GCN). As shown in fig. 7, the reconstructor is mainly composed of two deconvolution network layers (DCNN). The method mainly comprises the following parts:
a first part: preprocessing the mixed speech samples required for training and testing;
a second part: training the constructed progressive extraction deep neural network with a loss function to obtain a progressive extraction deep neural network model;
a third part: preprocessing the speech sample to be tested, and performing speech separation through the trained progressive extraction deep neural network model to obtain the separation result.
Each of the portions will be described in detail below.
Wherein, the first part specifically includes:
and 2, dividing the whole database obtained in the step into a training set, a verification set and a test set. The mixed voice is used as the input of the advanced extraction deep neural network, and one speaker voice in the mixed voice is used as the training target of the network.
The second part specifically comprises:
Step 2: the parameters of the progressive extraction deep neural network are randomly initialized, including the weights and biases between network neuron nodes.
Step 3: the deep neural network performs forward propagation. During forward propagation, activation functions increase the nonlinearity between network layers, so that a nonlinear mapping between the input and the output is finally produced.
Step 4: supervised training of the deep neural network is carried out according to the parameters initialized in step 2 and the network training target of the first part. In this embodiment, the weights and biases are updated by back-propagation with gradient descent, using a loss function based on the scale-invariant signal-to-noise ratio:

L(s, ŝ) = −10·log10( ‖s_target‖² / ‖e_noise‖² ), where s_target = (⟨ŝ, s⟩ / ‖s‖²)·s and e_noise = ŝ − s_target,

where s is the ideal target speech, ŝ is the estimated target speech, ⟨·,·⟩ denotes the dot product between two vectors, and ‖·‖₂ denotes the Euclidean norm.
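For illustration, the loss above can be written as a short PyTorch-style function; the function name si_snr_loss, the zero-mean step and the eps constant are assumptions of this sketch rather than details given in the embodiment:

```python
import torch

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR between estimated and reference speech (batch, samples)."""
    # Remove the mean so the measure is insensitive to DC offsets (an assumed preprocessing step).
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference: s_target = (<est, ref> / ||ref||^2) * ref
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    s_target = dot * ref / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps
    )
    return -si_snr.mean()  # minimizing the loss maximizes the SI-SNR
```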
Step 5: the parameters of the deep neural network are updated by gradient descent:
a. within a given iteration, the parameters in the network are held fixed and the gradient of the loss function at the output layer is computed;
b. the gradient of each layer is computed in turn for layers l = L-1, L-2, …, 2;
c. the weights and biases of the entire network are updated.
Step 6: after training is finished, the deep neural network model is obtained from the training result.
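Taken together, steps 2 to 6 correspond to a standard supervised training loop. In the sketch below, ProgressiveExtractionNet, train_loader, the learning rate and the epoch count are placeholders assumed for illustration, and plain stochastic gradient descent is used to match the gradient descent of step 5:

```python
import torch

# ProgressiveExtractionNet and train_loader are assumed to be defined elsewhere; they stand in
# for the progressive extraction network and the (mixture, target) training set of the first part.
model = ProgressiveExtractionNet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # plain gradient descent per step 5

for epoch in range(50):                       # assumed number of epochs
    for mixture, target in train_loader:      # mixed speech sample and target sample speech
        estimate = model(mixture)             # forward propagation (step 3)
        loss = si_snr_loss(estimate, target)  # SI-SNR based loss (step 4)
        optimizer.zero_grad()
        loss.backward()                       # back-propagate gradients
        optimizer.step()                      # update weights and biases (step 5)
```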
Note on the encoder section: the mixed audio y is input at the network input, and preliminary feature extraction of the target speech is then performed through two convolutional network layers, a ReLU activation function and batch normalization (BN), yielding H = {h0, …, hi, …, hM-1}, where i = 0 to M-1 and M is the output length of the last network layer of the encoder.
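A minimal sketch of such an encoder follows; the channel count, kernel sizes and strides are illustrative assumptions, since the description only specifies two convolutional layers, a ReLU activation and batch normalization:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps the mixed waveform y (batch, 1, samples) to the first feature H (batch, C, M)."""
    def __init__(self, channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, stride=stride),
            nn.ReLU(),
            nn.BatchNorm1d(channels),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(channels),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.net(y)  # H = {h0, ..., hM-1} along the last (time) dimension
```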
Progressive extractor section: this part is composed of a plurality of progressive units; the structure and computation of each progressive unit are shown in fig. 5, comprising a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer. H is input at the input of this module: h0 enters the first progressive unit directly to obtain the corresponding output p0; the sum of h1 and p0 enters the next progressive unit to obtain the output p1 corresponding to the position of h1; h2 is added to p1 and enters the next progressive unit to obtain the output p2 corresponding to the position of h2; each subsequent position is computed in the same way, until hM-1 and pM-2 are added to obtain the corresponding output pM-1. All the outputs together form the high-dimensional extraction mapping P = {p0, …, pM-1} corresponding to the target speech.
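A sketch of one progressive unit and of the progressive accumulation across units is given below. It assumes each element hi is one time step of the encoder output, approximates the TDNN layers by dilated 1-D convolutions and the graph convolution layer by a pointwise convolution; all of these are simplifications for illustration, not the exact structures of figs. 5 and 6:

```python
import torch
import torch.nn as nn

class ProgressiveUnit(nn.Module):
    """One progressive unit: TDNN -> ReLU -> BN -> TDNN -> pooling -> BN -> graph-conv-like layer."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # A TDNN layer is modelled here as a dilated 1-D convolution (an approximation).
        self.tdnn1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1, dilation=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.tdnn2 = nn.Conv1d(channels, channels, kernel_size=3, padding=2, dilation=2)
        self.pool = nn.AvgPool1d(kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        # The graph convolution layer is approximated by a pointwise (1x1) convolution.
        self.gcn = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn1(torch.relu(self.tdnn1(x)))
        x = self.bn2(self.pool(self.tdnn2(x)))
        return self.gcn(x)

class ProgressiveExtractor(nn.Module):
    """Applies M progressive units, feeding each unit the sum of hi and the previous output."""
    def __init__(self, num_units: int, channels: int = 256):
        super().__init__()
        self.units = nn.ModuleList(ProgressiveUnit(channels) for _ in range(num_units))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H has shape (batch, channels, M); hi is the i-th element along the last axis.
        outputs, prev = [], None
        for i, unit in enumerate(self.units):
            hi = H[..., i:i + 1]                   # keep the time dimension
            x = hi if prev is None else hi + prev  # h0 directly, then hi + p(i-1)
            prev = unit(x)                         # pi
            outputs.append(prev)
        return torch.cat(outputs, dim=-1)          # P = {p0, ..., pM-1}
```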
Reconstructor section: P is input at the input of this module, and after two deconvolution network layers, a ReLU activation function and batch normalization, the estimated speech corresponding to each speaker is obtained.
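A corresponding sketch of the reconstructor uses two transposed 1-D convolutions; the layer sizes mirror the assumed encoder above and are likewise illustrative:

```python
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    """Maps the high-dimensional mapping P back to an estimated target waveform."""
    def __init__(self, channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.dcnn1 = nn.ConvTranspose1d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(channels)
        self.dcnn2 = nn.ConvTranspose1d(channels, 1, kernel_size, stride=stride)

    def forward(self, P: torch.Tensor) -> torch.Tensor:
        x = self.bn(torch.relu(self.dcnn1(P)))
        return self.dcnn2(x)  # estimated target speech waveform
```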
The speech reconstruction operation in the third part is as follows: the speech sample to be tested from the first part is input into the trained progressive extraction separation network model, and the speech separation result for the target speaker is obtained directly through computation.
Through the embodiments provided in the present application, the single-channel speech separation algorithm that progressively extracts the target speaker addresses the difficulty of extracting the target speaker's speech and the degradation of separation performance under noise, reverberation and interference from other speakers. Compared with other single-channel speech separation methods, it can progressively extract the useful information of the target speech, improve the separation accuracy, reduce speech distortion and improve intelligibility.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
This embodiment also provides a progressive speech deep neural network training apparatus, which is used for implementing the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 8 is a structure diagram of a progressive speech deep neural network training apparatus according to an embodiment of the present invention. As shown in fig. 8, the progressive speech deep neural network training apparatus includes:
an obtaining unit 81 is configured to obtain a mixed voice sample and a target sample voice, where the mixed voice sample includes the target voice and a noise voice.
The prediction unit 83 is configured to input the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, where the preset speech deep neural network model includes a progressive extractor, a reconstructor and an encoder; the encoder is configured to perform feature extraction on the mixed speech to obtain a first feature, the progressive extractor is configured to compute a high-dimensional mapping relationship feature from the first feature, and the reconstructor is configured to obtain the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature.
And the determining unit 85 is configured to determine the preset speech deep neural network model as the target speech deep neural network model according to that the loss function determined by the target sample speech and the predicted target speech satisfies a preset condition.
According to the embodiment provided in the present application, the obtaining unit 81 obtains a mixed speech sample and a target sample speech, where the mixed speech sample includes the target speech and noise speech; the prediction unit 83 inputs the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, where the preset speech deep neural network model includes a progressive extractor, a reconstructor and an encoder, the encoder performs feature extraction on the mixed speech to obtain a first feature, the progressive extractor computes a high-dimensional mapping relationship feature from the first feature, and the reconstructor obtains the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature; and the determining unit 85 determines the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition. By training a speech deep neural network model including a progressive extractor, an encoder and a reconstructor in this way, the technical problem that the target speech cannot be effectively separated from mixed speech in the prior art is solved.
Optionally, the prediction unit 83 may include: an encoding module, configured to input the mixed speech sample into the preset speech deep neural network model and obtain the first feature through the two convolutional network layers, the ReLU activation function and the batch normalization included in the encoder.
Optionally, the prediction unit 83 may be further configured to perform the following operations: the progressive extractor includes a plurality of progressive units, each progressive unit including a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer; and each element in the first feature is input into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature.
Optionally, the prediction unit 83 may be further configured to perform the following operations: in the case where the first feature is denoted H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, the progressive extractor includes M progressive units, denoted J = {j0, …, ji, …, jM-1}; h0 is input into the first progressive unit to obtain the corresponding output p0; the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1; the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2; and so on for each position, until hM-1 and pM-2 are added to obtain the corresponding output pM-1, yielding the high-dimensional mapping relationship feature P = {p0, …, pM-1}.
Optionally, the prediction unit 83 may be further configured to perform the following operations: input the mapping relationship P into the reconstructor, and obtain the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
Optionally, the determining unit 85 may include: a calculation module, configured to calculate the scale-invariant signal-to-noise ratio between the target sample speech and the predicted target speech and determine the loss function from it; an adjusting module, configured to adjust the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function; and a determining module, configured to determine the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech;
S2: inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature;
S3: determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute, through a computer program, the following steps:
S1: acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech;
S2: inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature;
S3: determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A progressive speech deep neural network training method, characterized by comprising the following steps:
acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech;
inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature;
determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition;
wherein the encoder being used for performing feature extraction on the mixed speech to obtain the first feature comprises:
inputting the mixed speech sample into the preset speech deep neural network model, and obtaining the first feature through the two convolutional network layers, the ReLU activation function and the batch normalization included in the encoder.
2. The method of claim 1, wherein the progressive extractor being used for computing the high-dimensional mapping relationship feature from the first feature comprises:
the progressive extractor comprises a plurality of progressive units, each progressive unit comprising: a time-delay neural network, a ReLU activation function, batch normalization, a time-delay neural network, a pooling layer, batch normalization and a graph convolution layer;
inputting each element in the first feature into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature.
3. The method according to claim 2, wherein inputting each element in the first feature into its corresponding progressive unit to obtain the high-dimensional mapping relationship feature comprises:
in the case where the first feature is denoted H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, the progressive extractor comprises M progressive units, denoted J = {j0, …, ji, …, jM-1};
h0 is input into the first progressive unit to obtain the corresponding output p0;
the sum of h1 and p0 is input into the second progressive unit to obtain the output p1 corresponding to the position of h1;
the sum of h2 and p1 is input into the third progressive unit to obtain the output p2 corresponding to the position of h2;
and so on for each position, until hM-1 and pM-2 are added to obtain the corresponding output pM-1, yielding the high-dimensional mapping relationship feature P = {p0, …, pM-1}.
4. The method according to claim 3, wherein the reconstructor being used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature comprises:
inputting the mapping relationship P into the reconstructor, and obtaining the predicted target speech in the mixed speech sample after two convolutional network layers, a ReLU activation function and batch normalization.
5. The method according to claim 1, wherein determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition comprises:
calculating the scale-invariant signal-to-noise ratio between the target sample speech and the predicted target speech, and determining the loss function from the scale-invariant signal-to-noise ratio;
adjusting the weights and biases of the parameters of the preset speech deep neural network model by gradient descent according to the loss value of the loss function;
determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies the preset condition.
6. A progressive speech deep neural network training apparatus, characterized by comprising:
an acquisition unit, used for acquiring a mixed speech sample and a target sample speech, wherein the mixed speech sample comprises the target speech and noise speech;
a prediction unit, used for inputting the mixed speech sample into a preset speech deep neural network model to obtain a predicted target speech, wherein the preset speech deep neural network model comprises a progressive extractor, a reconstructor and an encoder, the encoder is used for performing feature extraction on the mixed speech to obtain a first feature, the progressive extractor is used for computing a high-dimensional mapping relationship feature from the first feature, and the reconstructor is used for obtaining the predicted target speech in the mixed speech sample from the high-dimensional mapping relationship feature;
a determining unit, used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined from the target sample speech and the predicted target speech satisfies a preset condition;
wherein the prediction unit includes:
an encoding module, used for inputting the mixed speech sample into the preset speech deep neural network model and obtaining the first feature through the two convolutional network layers, the ReLU activation function and the batch normalization included in the encoder.
7. The apparatus of claim 6, wherein the prediction unit is further configured to:
the progressive extractor comprises a plurality of progressive units, and each progressive unit comprises: a time-delay neural network, a ReLU activation function, batch normalization processing, a time-delay neural network, a pooling layer, batch normalization processing and a graph convolution layer;
and each element in the first feature is input into the corresponding progressive unit to obtain the high-dimensional mapping relation feature.
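A sketch of one progressive unit as enumerated in claim 7. The time-delay neural network layers are modeled as dilated 1-D convolutions (a common TDNN realization), the pooling layer as stride-1 average pooling so that shapes are preserved for the element-wise additions in claim 8, and the graph convolution as neighbour mixing over an assumed chain graph of frames followed by a learned linear map. All of these concrete choices, and the names `ProgressiveUnit` and `SimpleGraphConv`, are assumptions, since the claim does not specify them.

```python
import torch
import torch.nn as nn


class SimpleGraphConv(nn.Module):
    """Minimal graph-convolution stand-in: each frame is averaged with its
    neighbours on an assumed chain graph, then passed through a learned linear
    map. The patent does not specify the graph structure; this is illustrative."""

    def __init__(self, channels: int):
        super().__init__()
        self.linear = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        b, c, t = x.shape
        adj = torch.eye(t, device=x.device)
        idx = torch.arange(t - 1, device=x.device)
        adj[idx, idx + 1] = 1.0
        adj[idx + 1, idx] = 1.0
        adj = adj / adj.sum(dim=1, keepdim=True)      # row-normalize
        x = torch.matmul(x, adj.t())                  # mix each frame with its neighbours
        return self.linear(x.transpose(1, 2)).transpose(1, 2)


class ProgressiveUnit(nn.Module):
    """TDNN -> ReLU -> BN -> TDNN -> pooling -> BN -> graph convolution, with the
    TDNNs modeled as dilated 1-D convolutions and all shapes preserved so the
    output can be added to the next feature chunk."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.tdnn1 = nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2)
        self.relu = nn.ReLU()
        self.bn1 = nn.BatchNorm1d(channels)
        self.tdnn2 = nn.Conv1d(channels, channels, kernel_size=3, dilation=4, padding=4)
        self.pool = nn.AvgPool1d(kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.gconv = SimpleGraphConv(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn1(self.relu(self.tdnn1(x)))
        x = self.bn2(self.pool(self.tdnn2(x)))
        return self.gconv(x)
```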
8. The apparatus of claim 7, wherein the prediction unit is further configured to:
when the first feature is denoted as H = {h0, …, hi, …, hM-1}, where i = 0 to M-1, there are M progressive units, denoted J = {j0, …, ji, …, jM-1};
h0 is input into the first progressive unit to obtain a corresponding output p0;
the sum of h1 and p0 is input into the second progressive unit for calculation to obtain an output p1 corresponding to the position of h1;
the sum of h2 and p1 is input into the third progressive unit to obtain an output p2 corresponding to the position of h2;
and the calculation proceeds in the same way at each position until the final hM-1 and pM-2 are added to obtain a corresponding output pM-1, yielding the high-dimensional mapping relation feature P = {p0, …, pM-1}.
9. The apparatus of claim 8, wherein the prediction unit is further configured to:
the mapping relation feature P is input into the reconstructor, and the predicted target voice in the mixed voice sample is obtained after two convolutional network layers, ReLU activation functions and batch normalization processing.
10. The apparatus of claim 6, wherein the determining unit comprises:
the calculation module is used for calculating the scale-invariant signal-to-noise ratio (SI-SNR) between the target sample voice and the predicted target voice and determining the loss function according to the scale-invariant signal-to-noise ratio;
the adjusting module is used for adjusting the weights and biases of the parameters of the preset speech deep neural network model by a gradient descent method according to the loss value of the loss function;
and the determining module is used for determining the preset speech deep neural network model as the target speech deep neural network model when the loss function determined by the target sample voice and the predicted target voice meets a preset condition.
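Putting claims 6-10 together, a hypothetical training loop might look like the following. It reuses the `si_snr_loss` sketched after claim 5, the function name `train` and the `loss_threshold` stopping value are invented stand-ins for the unspecified preset condition, and `model` and `loader` are assumed to be any compatible PyTorch model and data loader.

```python
import torch


def train(model, loader, epochs: int = 50, lr: float = 1e-3, loss_threshold: float = -15.0):
    """Hypothetical training loop: plain gradient descent on the SI-SNR loss,
    stopping once the loss value satisfies an assumed preset threshold."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for mixture, target in loader:            # mixed voice sample, target sample voice
            estimate = model(mixture)             # predicted target voice, (batch, 1, samples)
            loss = si_snr_loss(estimate.squeeze(1), target)   # loss sketched after claim 5
            optimizer.zero_grad()
            loss.backward()                       # propagate the loss value
            optimizer.step()                      # adjust weights and biases
        if loss.item() < loss_threshold:          # preset condition met
            return model                          # target speech deep neural network model
    return model
```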
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210116109.6A CN114155883B (en) | 2022-02-07 | 2022-02-07 | Progressive type based speech deep neural network training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114155883A (en) | 2022-03-08
CN114155883B (en) | 2022-12-02
Family
ID=80450374
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
CN202210116109.6A Active CN114155883B (en) | 2022-02-07 | 2022-02-07 | Progressive type based speech deep neural network training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114155883B (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107221320A * | 2017-05-19 | 2017-09-29 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, device, equipment and computer-readable storage medium for training an acoustic feature extraction model |
US11257507B2 (en) * | 2019-01-17 | 2022-02-22 | Deepmind Technologies Limited | Speech coding using content latent embedding vectors and speaker latent embedding vectors |
KR102294639B1 * | 2019-07-16 | 2021-08-27 | Industry-University Cooperation Foundation, Hanyang University | Deep neural network based non-autoregressive speech synthesizer method and system using multiple decoder |
CN112382308A * | 2020-11-02 | 2021-02-19 | Tianjin University | Zero-order voice conversion system and method based on deep learning and simple acoustic features |
Also Published As
Publication number | Publication date |
---|---|
CN114155883A (en) | 2022-03-08 |
Similar Documents
Publication | Title
---|---
JP6671020B2 (en) | Dialogue act estimation method, dialogue act estimation device and program
WO2021143327A1 (en) | Voice recognition method, device, and computer-readable storage medium
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device
US10964337B2 (en) | Method, device, and storage medium for evaluating speech quality
CN110223673B (en) | Voice processing method and device, storage medium and electronic equipment
Pawar et al. | Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients
CN113053407B (en) | Single-channel voice separation method and system for multiple speakers
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction
CN111862951B (en) | Voice endpoint detection method and device, storage medium and electronic equipment
CN108877823A (en) | Sound enhancement method and device
CN113314119B (en) | Voice recognition intelligent household control method and device
CN113241064B (en) | Speech recognition, model training method and device, electronic equipment and storage medium
CN111785288A (en) | Voice enhancement method, device, equipment and storage medium
CN110751960B (en) | Method and device for determining noise data
CN115457938A (en) | Method, device, storage medium and electronic device for identifying awakening words
CN113555007B (en) | Voice splicing point detection method and storage medium
JP6910002B2 (en) | Dialogue estimation method, dialogue activity estimation device and program
WO2018001125A1 (en) | Method and device for audio recognition
CN114155883B (en) | Progressive type based speech deep neural network training method and device
CN106971731B (en) | Correction method for voiceprint recognition
CN111833897B (en) | Voice enhancement method for interactive education
CN110164418B (en) | Automatic speech recognition acceleration method based on convolution grid long-time memory recurrent neural network
CN114067785B (en) | Voice deep neural network training method and device, storage medium and electronic device
CN113707172A (en) | Single-channel voice separation method, system and computer equipment of sparse orthogonal network
CN112735394A (en) | Semantic parsing method and device for voice
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant