CN108322858B

CN108322858B - Multi-microphone Speech Enhancement Method Based on Tensor Decomposition

Info

Publication number: CN108322858B
Application number: CN201810070662.4A
Authority: CN
Inventors: 叶中付; 童仁杰
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2018-01-25
Filing date: 2018-01-25
Publication date: 2019-11-22
Anticipated expiration: 2038-01-25
Also published as: CN108322858A

Abstract

The invention discloses a multi-microphone speech enhancement method based on tensor decomposition, comprising: representing the multi-channel speech signal observed by multiple microphones as a third-order (3D) tensor, and projecting the resulting tensors onto orthogonal bases; and, using the minimum statistical risk criterion, collecting noise segments in real time, tracking the noise covariance, and computing the optimal threshold from the tensor block size. By representing the received multi-channel data as a third-order tensor, the invention preserves the original spatial and temporal information, and can therefore remove background noise and weakly directional noise more effectively while keeping speech distortion as low as possible.

Description

Multi-microphone Speech Enhancement Method Based on Tensor Decomposition

Technical Field

The invention relates to the field of speech noise reduction, and in particular to a multi-microphone speech enhancement method based on tensor decomposition.

Background Art

In speech enhancement, classical single-channel algorithms can remove a large share of the background noise, but they tend to distort the speech and can even introduce "musical" noise, which degrades speech quality. With a microphone array, beamforming algorithms can suppress directional interference more effectively.

Traditional microphone-array speech enhancement algorithms fall into time-domain and frequency-domain noise reduction algorithms. Time-domain algorithms usually concatenate the speech frames output by each microphone and apply optimal linear filtering to the resulting long frame. Frequency-domain algorithms apply a Fourier transform to each microphone's frame, gather the corresponding time-frequency bins into a snapshot vector, and apply optimal linear filtering to that snapshot vector. However, vector-based representations cannot fully exploit the spatial, temporal, and frequency information carried by multi-channel data, so there is room for improvement.

In addition, the noise received by a microphone array is in many cases not purely directional interference, which makes beamforming, being based on spatial filtering, highly susceptible to performance loss. For background noise with weak or no directionality, the spatial filtering of a beamformer is ineffective and leaves considerable residual noise.

Summary of the Invention

The purpose of the present invention is to provide a multi-microphone speech enhancement method based on tensor decomposition. Compared with traditional beamforming methods, the present invention represents the received multi-channel data as a third-order tensor, thereby preserving the original spatial and temporal information, so that background noise and weakly directional noise can be removed more effectively while speech distortion is kept as low as possible.

The purpose of the present invention is achieved through the following technical solution:

A multi-microphone speech enhancement method based on tensor decomposition, comprising:

Step (1): represent the observed multi-channel speech data as a third-order tensor, referred to as the observation tensor, and sparsely reconstruct its three dimensions with three orthogonal square matrices to obtain a core tensor containing the projection coefficients. Specifically, using preset or preselected orthogonal bases as projection matrices, the three dimensions of the observation tensor are projected onto the three orthogonal square matrices, yielding the core tensor of projection coefficients. The input of this step is the observation tensor, and the output is the core tensor containing the projection coefficients.

Step (2): preset a non-negative threshold and set to zero the projection coefficients in the core tensor whose magnitude is below the threshold, thereby suppressing noise and reconstructing the clean speech. Specifically, the optimal threshold is designed according to the minimum statistical risk criterion, and its value is determined by the standard deviation of the noise and the size of the observation tensor. The input of this step is the core tensor, and the output is a tensor representing the clean speech.

Further, in the above multi-microphone speech enhancement method based on tensor decomposition, step (1) comprises:

Step (11): using the signal reception model of the microphone array, represent the observed multi-channel speech data as a third-order tensor, and take this third-order tensor as the observation tensor.

The signal reception model of the microphone array is:

$$\mathcal{Y}(l,k,n) = \mathcal{X}(l,k,n) + \mathcal{N}(l,k,n), \qquad \mathcal{Y}, \mathcal{X}, \mathcal{N} \in \mathbb{R}^{L \times K \times N}$$

where $\mathcal{Y}$ is the observation tensor, $\mathcal{X}$ is the tensor to be estimated, representing the clean speech, and $\mathcal{N}$ is the noise tensor. $\mathcal{Y}(l,k,n)$, $\mathcal{X}(l,k,n)$ and $\mathcal{N}(l,k,n)$ denote the $l$-th element of the $k$-th frame on the $n$-th receiving channel in the observation tensor, the tensor to be estimated, and the noise tensor, respectively. $L$, $K$ and $N$ denote the frame length, the number of frames, and the number of microphones.
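As an illustration of how the observation tensor might be assembled, the following Python sketch stacks the frames of an N-channel recording into an array of shape (L, K, N). The function names and the use of non-overlapping frames are assumptions made for this example; the text does not specify a framing scheme.

```python
import numpy as np


def frames_to_tensor(signals: np.ndarray, frame_len: int) -> np.ndarray:
    """Stack non-overlapping frames of an (N, T) recording into an (L, K, N) tensor.

    Y[l, k, n] is sample l of frame k on microphone n, matching the index
    order of the signal reception model. Trailing samples that do not fill
    a whole frame are dropped (an assumption of this sketch).
    """
    n_mics, n_samples = signals.shape
    n_frames = n_samples // frame_len
    trimmed = signals[:, : n_frames * frame_len]
    # (N, K, L) -> (L, K, N)
    return trimmed.reshape(n_mics, n_frames, frame_len).transpose(2, 1, 0)


def tensor_to_frames(tensor: np.ndarray) -> np.ndarray:
    """Inverse of frames_to_tensor: (L, K, N) tensor back to an (N, K*L) array."""
    frame_len, n_frames, n_mics = tensor.shape
    return tensor.transpose(2, 1, 0).reshape(n_mics, n_frames * frame_len)
```

With this layout, `Y[:, k, n]` is a mode-1 fiber (one frame on one microphone) and `Y[l, k, :]` is a mode-3 fiber (one sample instant across all microphones).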

Step (12): sparsely reconstruct the three dimensions of the observation tensor with three orthogonal square matrices to obtain a core tensor containing the projection coefficients.

The decomposition of the observation tensor generally takes the form:

$$\mathcal{Y} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3, \qquad \Sigma \in \mathbb{R}^{L' \times K' \times N'},\ U_1 \in \mathbb{R}^{L \times L'},\ U_2 \in \mathbb{R}^{K \times K'},\ U_3 \in \mathbb{R}^{N \times N'},$$

$$L' \le L,\quad K' \le K,\quad N' \le N$$

where $\{U_1, U_2, U_3\}$ are the basis matrices and $\Sigma$ is the core tensor. Specifically, $U_1$ is the basis matrix of the mode-1 fibers $\mathcal{Y}(:,k,n)$ of the observation tensor, $U_2$ is the basis matrix of the mode-2 fibers $\mathcal{Y}(l,:,n)$, and $U_3$ is the basis matrix of the mode-3 fibers $\mathcal{Y}(l,k,:)$. $\Sigma$ contains the projection coefficients of the observation tensor on the basis matrices $\{U_1, U_2, U_3\}$, and $L'$, $K'$, $N'$ are the truncated dimensions of the core tensor.
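The mode-n products in this decomposition can be sketched in NumPy as follows. `mode_product`, `tucker_reconstruct` and `tucker_project` are hypothetical helper names, and the projection in `tucker_project` assumes basis matrices with orthonormal columns.

```python
import numpy as np


def mode_product(tensor: np.ndarray, matrix: np.ndarray, mode: int) -> np.ndarray:
    """Mode-`mode` product: multiply every mode-`mode` fiber of `tensor` by `matrix`."""
    moved = np.moveaxis(tensor, mode, 0)                  # bring the mode to the front
    out = np.tensordot(matrix, moved, axes=([1], [0]))    # (J, I) times (I, ...) -> (J, ...)
    return np.moveaxis(out, 0, mode)                      # put the mode back in place


def tucker_reconstruct(core, U1, U2, U3):
    """Y = core x1 U1 x2 U2 x3 U3."""
    return mode_product(mode_product(mode_product(core, U1, 0), U2, 1), U3, 2)


def tucker_project(Y, U1, U2, U3):
    """Core tensor of projection coefficients for orthonormal bases:
    core = Y x1 U1^T x2 U2^T x3 U3^T."""
    return tucker_reconstruct(Y, U1.T, U2.T, U3.T)
```

When the three matrices are square and orthogonal, projection followed by reconstruction returns the original tensor exactly and preserves its energy, which is what makes thresholding in the coefficient domain meaningful.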

With the canonical polyadic (CP) decomposition, the observation tensor is decomposed into its most elementary form, a sum of rank-1 tensors. The CP decomposition is obtained by solving:

$$\min_{\Sigma, U_1, U_2, U_3} \left\| \mathcal{Y} - \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3 \right\|_F^2 \quad \text{s.t.}\quad L' = K' = N' = R,\ \ \Sigma \text{ is diagonal}$$

which yields a superdiagonal core tensor and non-orthogonal basis matrices. Here $R$ denotes the rank of the clean speech tensor.

With the orthogonal decomposition, the observation tensor is decomposed into the product of three orthogonal basis matrices and a core tensor. It is obtained by solving:

$$\min_{\Sigma, U_1, U_2, U_3} \left\| \mathcal{Y} - \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3 \right\|_F^2 \quad \text{s.t.}\quad U_1^{T} U_1 = I,\ \ U_2^{T} U_2 = I,\ \ U_3^{T} U_3 = I$$

which yields a non-diagonal core tensor and orthogonal basis matrices.

Note that if $L' \le L$, $K' \le K$, $N' \le N$, the clean speech tensor can be approximately reconstructed directly as $\hat{\mathcal{X}} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3$, which restores the original speech signal. In the present invention, we choose $L' = L$, $K' = K$, $N' = N$, so that the basis matrices $\{U_1, U_2, U_3\}$ are orthogonal square matrices. A threshold $\lambda$ is then designed, and the projection coefficients in $\Sigma$ whose absolute value is smaller than $\lambda$ are set to zero, which suppresses the noise. In general, a threshold that is too large causes more speech distortion, while a threshold that is too small leaves more residual noise. The choice of the optimal threshold is described in detail below.

Consider the following linear observation model:

$$y(i) = x(i) + n(i), \qquad i = 1, 2, \ldots, Q$$

Here $n(i)$ follows a univariate Gaussian distribution, the $n(i)$ at different times are mutually independent, and $x(i)$ follows a Gaussian distribution. The minimum statistical risk estimate of $x(i)$ can then be written as $H_\lambda(y(i))$, where $H_\lambda(\cdot)$ is the hard-threshold operator, which sets to 0 the components of $y(i)$ whose magnitude is below the threshold $\lambda$. According to the minimum statistical risk criterion, the optimal threshold is

$$\lambda = \delta \sqrt{2 \log Q}$$

where $\delta$ is the standard deviation of the noise.
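A minimal sketch of the hard-threshold estimator for this scalar model. It assumes the optimal threshold has the universal-threshold form λ = δ√(2 log Q); the function names and the `base` parameter are illustrative. The default base 2 follows the logarithm convention stated in this document, whereas the classical universal threshold of Donoho and Johnstone uses the natural logarithm.

```python
import numpy as np


def universal_threshold(noise_std: float, n_samples: int, base: float = 2.0) -> float:
    """lambda = noise_std * sqrt(2 * log_base(n_samples)) (assumed form)."""
    return noise_std * np.sqrt(2.0 * np.log(n_samples) / np.log(base))


def hard_threshold(y: np.ndarray, lam: float) -> np.ndarray:
    """H_lambda: keep entries with |y| >= lam unchanged, set all others to zero."""
    return np.where(np.abs(y) >= lam, y, 0.0)


# Toy usage: a sparse signal observed in unit-variance Gaussian noise.
rng = np.random.default_rng(0)
x = np.zeros(1024)
x[::100] = 20.0                                   # a few large coefficients
y = x + rng.standard_normal(x.size)               # y(i) = x(i) + n(i)
x_hat = hard_threshold(y, universal_threshold(1.0, y.size))
```

The estimator works well only when the clean signal is sparse in the observed coordinates, which is why the method applies it to projection coefficients rather than to raw samples.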

For the multi-channel data reception model, the following relationship holds:

$$\mathcal{Y} = \mathcal{X} + \mathcal{N} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3$$

The elements of the noise tensor $\mathcal{N}$ are assumed to be mutually independent and Gaussian distributed. The formula above is equivalent to:

$$\mathcal{Y}' = \mathcal{X}' + \mathcal{N}', \qquad \mathcal{Y}' = \mathcal{Y} \times_1 U_1^{T} \times_2 U_2^{T} \times_3 U_3^{T} = \Sigma,$$

with $\mathcal{X}'$ and $\mathcal{N}'$ obtained from $\mathcal{X}$ and $\mathcal{N}$ by the same transformation.

Recovering $\mathcal{X}'$ therefore recovers $\mathcal{X}$. Because $U_1$, $U_2$, $U_3$ are all orthogonal matrices, and orthogonal matrices correspond to rotations, the transformation does not change the independent, identically distributed Gaussian nature of $\mathcal{N}'$. Unfolding the tensors in the formula above into vectors, it can be rewritten as

$$\mathbf{y}' = \mathbf{x}' + \mathbf{n}'$$

where the vectors $\mathbf{y}'$, $\mathbf{x}'$, $\mathbf{n}'$ all have the same length $NLK$ (that is, $N \times L \times K$). $\mathbf{x}'$ can be estimated as $\hat{\mathbf{x}}' = H_\lambda(\mathbf{y}')$. By the definition of $\mathcal{X}'$, $\hat{\mathbf{x}}'$ can be folded back into $\mathcal{X}'$, from which $\mathcal{X}$ is restored. The optimal threshold is

$$\lambda = \delta \sqrt{2 \log (NLK)}$$

where $\delta$ is the standard deviation of the noise; $N$, $L$ and $K$ are the number of microphones, the frame length, and the number of frames; and $\log$ denotes the base-2 logarithm.

As can be seen from the technical solution provided above, on the one hand, compared with traditional multi-channel speech enhancement algorithms, the present invention represents the received data as a third-order tensor, which effectively preserves the spatio-temporal correlation of the original signal. On the other hand, the computational cost of the proposed scheme is low compared with traditional multi-channel enhancement algorithms, since effective noise reduction only requires determining the optimal threshold.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the multi-channel speech enhancement algorithm provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the optimal threshold calculation provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a multi-microphone speech enhancement method based on tensor decomposition which, as shown in Fig. 1, mainly comprises the following steps:

Step 11: select the type of orthogonal basis matrices (3D-DCT orthogonal bases, supervised orthogonal bases, or unsupervised orthogonal bases).

Step 12: project the tensor onto the selected basis matrices, and truncate the projection coefficients with the optimal threshold.

An embodiment of the present invention also provides a procedure for computing the optimal threshold operator according to the minimum statistical risk criterion which, as shown in Fig. 2, mainly comprises the following steps:

Step 21: track and extract silent/noise-only segments in real time, and compute the noise variance.

Step 22: select the tensor block size according to factors such as the signal sampling rate, and compute the optimal truncation threshold.
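Steps 21 and 22 do not say how noise-only segments are detected. The sketch below is one simple possibility, not the patented procedure: it assumes the recording starts with a few noise-only frames, flags later frames as noise when their energy stays below a multiple of the running estimate, and smooths the variance recursively. The function name and the parameters `init_frames`, `alpha` and `margin` are hypothetical.

```python
import numpy as np


def track_noise_std(frames: np.ndarray, init_frames: int = 5,
                    alpha: float = 0.9, margin: float = 2.0) -> np.ndarray:
    """Running noise standard deviation for a (K, L) array of frames from one channel.

    Assumptions of this sketch: the first `init_frames` frames contain noise
    only, and a frame is treated as noise when its mean energy is below
    `margin` times the current noise-variance estimate.
    """
    energies = np.mean(frames ** 2, axis=1)
    noise_var = np.mean(energies[:init_frames])       # initial estimate from leading frames
    stds = np.empty(len(frames))
    for k, energy in enumerate(energies):
        if energy < margin * noise_var:               # frame looks like noise: update
            noise_var = alpha * noise_var + (1.0 - alpha) * energy
        stds[k] = np.sqrt(noise_var)                  # speech frames keep the last estimate
    return stds
```

The tracked standard deviation is the quantity δ that, together with the block size, determines the truncation threshold.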

Compared with traditional multi-channel speech enhancement algorithms, the above scheme enhances the multi-channel speech signal through a higher-order tensor representation, which effectively preserves the spatio-temporal correlation of the signal. In addition, compared with the traditional multi-channel Wiener filtering algorithm, the computational cost of the scheme is low: enhancement only requires determining the optimal threshold.

For ease of understanding, the two steps above are described in detail below.

1. Sparse reconstruction of the three dimensions of the tensor using three orthogonal square matrices

The signal reception model of the microphone array is

$$\mathcal{Y}(l,k,n) = \mathcal{X}(l,k,n) + \mathcal{N}(l,k,n), \qquad \mathcal{Y}, \mathcal{X}, \mathcal{N} \in \mathbb{R}^{L \times K \times N}$$

where $\mathcal{Y}(l,k,n)$ is the $l$-th element of the $k$-th frame in the $n$-th receiving channel. The observation tensor is decomposed as

$$\mathcal{Y} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3$$

where $\{U_1, U_2, U_3\}$ are the basis matrices and $\Sigma$ is the core tensor. Specifically, $U_1$ is the basis matrix of the mode-1 fibers $\mathcal{Y}(:,k,n)$ of the observation tensor, $U_2$ is the basis matrix of the mode-2 fibers $\mathcal{Y}(l,:,n)$, and $U_3$ is the basis matrix of the mode-3 fibers $\mathcal{Y}(l,k,:)$. $\Sigma$ contains the projection coefficients of the observation tensor on these basis matrices, and $L'$, $K'$, $N'$ are the dimensions of the core tensor.

The canonical polyadic (CP) decomposition solves:

$$\min_{\Sigma, U_1, U_2, U_3} \left\| \mathcal{Y} - \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3 \right\|_F^2 \quad \text{s.t.}\quad L' = K' = N' = R,\ \ \Sigma \text{ is diagonal}$$

The orthogonal tensor decomposition solves:

$$\min_{\Sigma, U_1, U_2, U_3} \left\| \mathcal{Y} - \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3 \right\|_F^2 \quad \text{s.t.}\quad U_1^{T} U_1 = I,\ \ U_2^{T} U_2 = I,\ \ U_3^{T} U_3 = I$$

Note that if $L' \le L$, $K' \le K$, $N' \le N$, the clean speech tensor can be approximately reconstructed directly as $\hat{\mathcal{X}} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3$, which restores the original speech signal. In the present invention, we choose $L' = L$, $K' = K$, $N' = N$, so that the basis matrices $\{U_1, U_2, U_3\}$ are orthogonal square matrices. A threshold $\lambda$ is then designed, and the projection coefficients in $\Sigma$ whose absolute value is smaller than $\lambda$ are set to zero, which suppresses the noise. In general, a threshold that is too large causes more speech distortion, while a threshold that is too small leaves more residual noise. The choice of the optimal threshold is justified in part 2.

The embodiments of the present invention consider four kinds of basis matrices: $\{U_1, U_2, U_3\}$ may be the three-dimensional discrete cosine transform (3D-DCT) basis matrices, supervised basis matrices, (unsupervised) approximate basis matrices, or (unsupervised) exact basis matrices.

The 3D-DCT basis matrices are defined by the discrete cosine transform and are data-independent, general-purpose basis matrices.
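The defining formula of the 3D-DCT basis is not reproduced in this text. As a hedged illustration, the sketch below builds one orthonormal DCT-II matrix per mode, which is a common convention and an assumption here; `dct_basis` and `dct3_bases` are hypothetical names.

```python
import numpy as np


def dct_basis(size: int) -> np.ndarray:
    """Orthonormal DCT-II basis of the given size.

    Column j holds the j-th cosine atom:
        U[i, j] = c_j * cos(pi * (2*i + 1) * j / (2 * size)),
    with c_0 = sqrt(1/size) and c_j = sqrt(2/size) for j > 0.
    """
    i = np.arange(size)[:, None]          # sample index
    j = np.arange(size)[None, :]          # frequency index
    basis = np.cos(np.pi * (2 * i + 1) * j / (2 * size))
    basis[:, 0] *= np.sqrt(1.0 / size)
    basis[:, 1:] *= np.sqrt(2.0 / size)
    return basis


def dct3_bases(L: int, K: int, N: int):
    """One DCT basis per tensor mode: frame samples, frames, microphones."""
    return dct_basis(L), dct_basis(K), dct_basis(N)
```

Because each matrix is square and orthogonal, the three of them can serve directly as $\{U_1, U_2, U_3\}$ without any training data.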

In practice, clean multi-channel speech data can be collected as training data for a specific problem, in order to obtain the orthogonal basis matrices that are optimal for that problem. For example, the present invention obtains the supervised basis matrices by solving an optimization problem over the training blocks.

Here $\mathcal{X}_i \in \mathbb{R}^{L \times K \times N}$, $i = 1, 2, \ldots, T$, denote the training blocks of clean speech. Because of the latent variables $\Sigma_i$, the problem has no closed-form optimal solution, so an alternating iterative method is used to obtain a locally optimal solution. In the first step, the three basis matrices are initialized as the 3D-DCT matrices. In the second step, given the basis matrices, a soft- or hard-threshold operator is applied to obtain the sparsified $\Sigma_i$. In the third step, given $\Sigma_i$ and the mode-2 and mode-3 basis matrices, the mode-1 basis matrix is updated. In the fourth step, given $\Sigma_i$ and the mode-1 and mode-3 basis matrices, the mode-2 basis matrix is updated. In the fifth step, given $\Sigma_i$ and the mode-1 and mode-2 basis matrices, the mode-3 basis matrix is updated. Steps two to five are repeated until the whole process converges.

For example, the third step requires solving an optimization problem over the mode-1 basis matrix alone, with the core tensors $\Sigma_i$ and the other two basis matrices held fixed. The problem can be rewritten in terms of $X_{i(1)}$, the mode-1 unfolding matrices of the clean speech blocks, and then simplified to an equivalent problem over orthogonal matrices. Taking the SVD of the matrix that appears in the simplified problem, and noting that the diagonal entries of an orthogonal matrix cannot exceed 1, the objective is bounded above, with equality only when the orthogonal matrix involved is the identity. This gives the optimal mode-1 basis matrix for this step in closed form. The mode-2 and mode-3 basis matrices are updated in the same way. The whole process converges after 20 to 30 iterations.
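The alternating procedure can be sketched as follows. This is an illustration under stated assumptions rather than the patent's exact formulation: the sparse coding step uses a hard threshold `lam`, and each basis update is the standard orthogonal Procrustes solution, where the SVD $P \Lambda Q^{T}$ of the accumulated cross-product matrix gives the update $P Q^{T}$. All function names are hypothetical, and the caller supplies the initial bases (3D-DCT matrices in the text).

```python
import numpy as np


def unfold(t, mode):
    """Mode-`mode` unfolding: fibers of that mode become the columns."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)


def mode_product(t, m, mode):
    moved = np.moveaxis(t, mode, 0)
    return np.moveaxis(np.tensordot(m, moved, axes=([1], [0])), 0, mode)


def multi_product(t, mats):
    """Apply one matrix per mode: t x1 mats[0] x2 mats[1] x3 mats[2]."""
    for mode, m in enumerate(mats):
        t = mode_product(t, m, mode)
    return t


def procrustes_update(blocks, cores, bases, mode):
    """Best orthogonal basis for `mode`, with cores and the other bases fixed."""
    size = blocks[0].shape[mode]
    acc = np.zeros((size, size))
    for X, S in zip(blocks, cores):
        others = [np.eye(b.shape[0]) if m == mode else b for m, b in enumerate(bases)]
        partial = multi_product(S, others)            # core expanded along the other modes
        acc += unfold(X, mode) @ unfold(partial, mode).T
    P, _, Qt = np.linalg.svd(acc)
    return P @ Qt                                     # maximizes trace(U^T acc) over orthogonal U


def learn_supervised_bases(blocks, init_bases, lam, n_iter=25):
    bases = [b.copy() for b in init_bases]
    for _ in range(n_iter):
        # Sparse coding: project each clean block and hard-threshold the coefficients.
        cores = []
        for X in blocks:
            S = multi_product(X, [b.T for b in bases])
            cores.append(np.where(np.abs(S) >= lam, S, 0.0))
        # Basis updates, one mode at a time.
        for mode in range(3):
            bases[mode] = procrustes_update(blocks, cores, bases, mode)
    return bases
```

The default of 25 iterations reflects the 20 to 30 iterations mentioned in the text. Each step can only lower the combined cost of fit error plus sparsity penalty, so the loop settles into a local optimum rather than a guaranteed global one.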

The supervised basis matrices above are optimal on the training data. In practice, however, the test data usually do not match the training data exactly, which degrades the performance of the supervised basis matrices. The present invention therefore proposes unsupervised learning, which infers directly from the test data the orthogonal basis matrices best suited to them, by solving an optimization problem over the unsupervised basis matrices and a sparsified tensor of projection coefficients.

On this basis, the present invention provides two kinds of unsupervised basis matrices: approximate basis matrices and exact basis matrices.

The approximate basis matrices are obtained with the higher-order singular value decomposition (HOSVD): the SVDs of the unfoldings $Y_{(1)}$, $Y_{(2)}$ and $Y_{(3)}$ yield the three basis matrices respectively. The exact basis matrices require further optimization starting from this result: each unknown is updated in turn while the others are held fixed, and the whole process is iterated until convergence. The update of a basis matrix, for example, can be transformed into a problem that is solved by a singular value decomposition.

Suppose the singular value decomposition of the matrix appearing in that problem is $M' \Sigma' N'^{T}$; then the solution can be written directly as $M' N'^{T}$. Here $M'$ and $N'$ are the singular-vector matrices and $\Sigma'$ is the diagonal matrix of non-negative singular values.
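The approximate (HOSVD) basis matrices can be sketched as follows; the helper names are hypothetical. The exact basis matrices would start from this result and refine it with alternating SVD-based updates of the form $M' N'^{T}$, which this sketch does not include.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-`mode` unfolding: fibers of that mode become the columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    moved = np.moveaxis(tensor, mode, 0)
    return np.moveaxis(np.tensordot(matrix, moved, axes=([1], [0])), 0, mode)


def hosvd_bases(Y):
    """Approximate unsupervised bases: left singular vectors of each unfolding.

    full_matrices=True returns a square orthogonal matrix for every mode,
    matching the choice L' = L, K' = K, N' = N.
    """
    return [np.linalg.svd(unfold(Y, mode), full_matrices=True)[0]
            for mode in range(Y.ndim)]


def project(Y, bases):
    """Core tensor of projection coefficients: Y x1 U1^T x2 U2^T x3 U3^T."""
    core = Y
    for mode, U in enumerate(bases):
        core = mode_product(core, U.T, mode)
    return core
```

These bases are computed from the noisy block itself, so no clean training data is needed, at the cost of three SVDs per block.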

2. Selecting a non-negative threshold and setting the core tensor coefficients below it to zero

Consider the following linear observation model:

$$y(i) = x(i) + n(i), \qquad i = 1, 2, \ldots, Q$$

Here $n(i)$ follows a univariate Gaussian distribution, the $n(i)$ at different times are mutually independent, and $x(i)$ follows a Gaussian distribution. The minimum statistical risk estimate of $x(i)$ can then be written as $H_\lambda(y(i))$, where $H_\lambda(\cdot)$ is the hard-threshold operator, which sets to 0 the components of $y(i)$ whose magnitude is below the threshold $\lambda$. According to the minimum statistical risk criterion, the optimal threshold is $\lambda = \delta \sqrt{2 \log Q}$, where $\delta$ is the standard deviation of the noise.

For the multi-channel data reception model, the following relationship holds:

$$\mathcal{Y} = \mathcal{X} + \mathcal{N} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3$$

The elements of the noise tensor $\mathcal{N}$ are assumed to be mutually independent and Gaussian distributed. The formula above is equivalent to:

$$\mathcal{Y}' = \mathcal{X}' + \mathcal{N}', \qquad \mathcal{Y}' = \mathcal{Y} \times_1 U_1^{T} \times_2 U_2^{T} \times_3 U_3^{T} = \Sigma,$$

with $\mathcal{X}'$ and $\mathcal{N}'$ obtained from $\mathcal{X}$ and $\mathcal{N}$ by the same transformation.

Here, recovering $\mathcal{X}'$ is enough to recover $\mathcal{X}$. Because $U_1$, $U_2$, $U_3$ are all orthogonal matrices, and orthogonal matrices correspond to rotations, the transformation does not change the independent, identically distributed Gaussian nature of $\mathcal{N}'$. Unfolding the tensors in the formula above into vectors, it can be rewritten as $\mathbf{y}' = \mathbf{x}' + \mathbf{n}'$, where the vectors $\mathbf{y}'$, $\mathbf{x}'$, $\mathbf{n}'$ all have the same length $NLK$ (that is, $N \times L \times K$, where $N$, $L$ and $K$ are the number of microphones, the frame length, and the number of frames). $\mathbf{x}'$ can then be estimated as $\hat{\mathbf{x}}' = H_\lambda(\mathbf{y}')$. By definition, $\hat{\mathbf{x}}'$ can be folded back into $\mathcal{X}'$, from which $\mathcal{X}$ is restored. The optimal threshold is therefore

$$\lambda = \delta \sqrt{2 \log (NLK)}$$
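Putting the pieces together, the sketch below denoises one tensor block: project onto square orthogonal bases, hard-threshold the coefficients, and transform back. It assumes orthonormal DCT-II bases as the default and the threshold form λ = δ√(2 log₂(NLK)); `denoise_block` and the helpers are hypothetical names, and the noise standard deviation is taken as known.

```python
import numpy as np


def dct_basis(size):
    """Orthonormal DCT-II basis (assumed convention for the 3D-DCT bases)."""
    i = np.arange(size)[:, None]
    j = np.arange(size)[None, :]
    basis = np.cos(np.pi * (2 * i + 1) * j / (2 * size))
    basis[:, 0] *= np.sqrt(1.0 / size)
    basis[:, 1:] *= np.sqrt(2.0 / size)
    return basis


def mode_product(tensor, matrix, mode):
    moved = np.moveaxis(tensor, mode, 0)
    return np.moveaxis(np.tensordot(matrix, moved, axes=([1], [0])), 0, mode)


def denoise_block(Y, noise_std, bases=None):
    """Estimate the clean block from an (L, K, N) noisy block Y."""
    if bases is None:
        bases = [dct_basis(d) for d in Y.shape]
    # Projection: core = Y x1 U1^T x2 U2^T x3 U3^T
    core = Y
    for mode, U in enumerate(bases):
        core = mode_product(core, U.T, mode)
    # Threshold from the noise level and the block size N*L*K (base-2 log, as in the text)
    lam = noise_std * np.sqrt(2.0 * np.log2(Y.size))
    core = np.where(np.abs(core) >= lam, core, 0.0)
    # Reconstruction: X_hat = core x1 U1 x2 U2 x3 U3
    X_hat = core
    for mode, U in enumerate(bases):
        X_hat = mode_product(X_hat, U, mode)
    return X_hat
```

Passing learned supervised or unsupervised matrices through `bases` swaps the basis without changing the rest of the pipeline, which mirrors the separation between Step 11 and Step 12.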

From the description of the embodiments above, a person skilled in the art will clearly understand that the embodiments can be implemented in software, or in software combined with a necessary general-purpose hardware platform. On this understanding, the technical solutions of the embodiments can be embodied as a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) and which includes instructions that cause a computer device (such as a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within its protection scope. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.

Claims

1. A multi-microphone speech enhancement method based on tensor decomposition, characterized by comprising the following steps:

step (1): representing the observed multi-channel speech data as a third-order tensor and taking the third-order tensor as the observation tensor; and sparsely reconstructing the three dimensions of the observation tensor with three orthogonal square matrices to obtain a core tensor containing projection coefficients, comprising: using preset or preselected orthogonal bases as projection matrices, projecting the three dimensions of the observation tensor onto the three orthogonal square matrices respectively to obtain the core tensor containing the projection coefficients;

step (2): presetting a non-negative threshold and setting to zero the projection coefficients in the core tensor whose magnitude is below the threshold, thereby suppressing noise and reconstructing the clean speech, comprising: designing the optimal threshold according to the minimum statistical risk criterion, the value of the threshold being determined by the standard deviation of the noise and the size of the observation tensor.

2. The multi-microphone speech enhancement method based on tensor decomposition of claim 1, wherein step (1) comprises:

step (11): representing the observed multi-channel speech data as a third-order tensor through the signal reception model of the microphone array, and taking the third-order tensor as the observation tensor;

the signal reception model of the microphone array being:

$$\mathcal{Y}(l,k,n) = \mathcal{X}(l,k,n) + \mathcal{N}(l,k,n), \qquad \mathcal{Y}, \mathcal{X}, \mathcal{N} \in \mathbb{R}^{L \times K \times N}$$

wherein $\mathcal{Y}$ denotes the observation tensor, $\mathcal{X}$ denotes the tensor to be estimated, representing the clean speech, and $\mathcal{N}$ denotes the noise tensor; $\mathcal{Y}(l,k,n)$, $\mathcal{X}(l,k,n)$ and $\mathcal{N}(l,k,n)$ respectively denote the $l$-th element of the $k$-th frame on the $n$-th receiving channel in the observation tensor, the tensor to be estimated, and the noise tensor; and $L$, $K$ and $N$ respectively denote the frame length, the number of frames, and the number of microphones;

sparsely reconstructing the three dimensions of the observation tensor with three orthogonal square matrices respectively to obtain a core tensor containing projection coefficients;

the decomposition of the observation tensor taking the form:

$$\mathcal{Y} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3, \qquad \Sigma \in \mathbb{R}^{L' \times K' \times N'},\ U_1 \in \mathbb{R}^{L \times L'},\ U_2 \in \mathbb{R}^{K \times K'},\ U_3 \in \mathbb{R}^{N \times N'},$$

$$L' \le L,\quad K' \le K,\quad N' \le N$$

wherein $\{U_1, U_2, U_3\}$ denote the basis matrices and $\Sigma$ denotes the core tensor; specifically, $U_1$ denotes the basis matrix of the mode-1 fibers $\mathcal{Y}(:,k,n)$ of the observation tensor, $U_2$ denotes the basis matrix of the mode-2 fibers $\mathcal{Y}(l,:,n)$, and $U_3$ denotes the basis matrix of the mode-3 fibers $\mathcal{Y}(l,k,:)$; $\Sigma$ contains the projection coefficients of the observation tensor on the basis matrices; $\times_1$, $\times_2$ and $\times_3$ respectively denote the mode-1, mode-2 and mode-3 products of $\Sigma$ with $U_1$, $U_2$ and $U_3$; and $L'$, $K'$ and $N'$ denote the size of the core tensor;

by the canonical polyadic (CP) decomposition, the observation tensor can be decomposed into a sum of a finite number of rank-1 tensors, the CP decomposition of the tensor being obtained by solving:

$$\min_{\Sigma, U_1, U_2, U_3} \left\| \mathcal{Y} - \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3 \right\|_F^2 \quad \text{s.t.}\quad L' = K' = N' = R,$$

obtaining a superdiagonal core tensor and non-orthogonal basis matrices, where $R$ denotes the rank of the clean speech tensor;

by the orthogonal decomposition, the observation tensor can be decomposed into the product of three orthogonal basis matrices and one core tensor, the orthogonal decomposition of the tensor being obtained by solving:

$$\min_{\Sigma, U_1, U_2, U_3} \left\| \mathcal{Y} - \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3 \right\|_F^2 \quad \text{s.t.}\quad U_1^{T} U_1 = I,\ \ U_2^{T} U_2 = I,\ \ U_3^{T} U_3 = I$$

obtaining a non-diagonal core tensor and orthogonal basis matrices.

3. The multi-microphone speech enhancement method based on tensor decomposition of claim 2, wherein in step (2):

for the following linear observation model:

$$y(i) = x(i) + n(i), \qquad i = 1, 2, \ldots, Q$$

where $Q$ denotes the total number of samples, $n(i)$ follows a univariate Gaussian distribution, the $n(i)$ at different times are mutually independent, and $x(i)$ follows a Gaussian distribution, the minimum statistical risk estimate of $x(i)$ can be written as $H_\lambda(y(i))$, wherein $H_\lambda(\cdot)$ is the hard-threshold operator, whose effect is to set to 0 the components of $y(i)$ below the threshold $\lambda$; according to the minimum statistical risk criterion, the optimal threshold is $\lambda = \delta \sqrt{2 \log Q}$;

for the multi-channel data reception model, the following relationship is satisfied:

$$\mathcal{Y} = \mathcal{X} + \mathcal{N} = \Sigma \times_1 U_1 \times_2 U_2 \times_3 U_3$$

where the elements of the noise tensor $\mathcal{N}$ satisfy the assumption of mutually independent Gaussian distributions, the above formula being equivalent to:

$$\mathcal{Y}' = \mathcal{X}' + \mathcal{N}', \qquad \mathcal{Y}' = \mathcal{Y} \times_1 U_1^{T} \times_2 U_2^{T} \times_3 U_3^{T} = \Sigma$$

$\mathcal{X}$ being recovered by recovering $\mathcal{X}'$; because $U_1$, $U_2$ and $U_3$ are all orthogonal matrices, and orthogonal matrices correspond to rotations, the independent, identically distributed Gaussian nature of $\mathcal{N}'$ is not changed; by unfolding the tensors in the above formula into vectors, it can be rewritten as $\mathbf{y}' = \mathbf{x}' + \mathbf{n}'$, where the vectors $\mathbf{y}'$, $\mathbf{x}'$ and $\mathbf{n}'$ all have length $NLK$; $\mathbf{x}'$ can be estimated as $\hat{\mathbf{x}}' = H_\lambda(\mathbf{y}')$; according to the definition of the tensor $\mathcal{X}'$, $\hat{\mathbf{x}}'$ can be folded back into $\mathcal{X}'$, from which $\mathcal{X}$ is restored; and the optimal threshold is $\lambda = \delta \sqrt{2 \log (NLK)}$, where $\delta$ denotes the standard deviation of the noise, $\log$ denotes the base-2 logarithm, and $NLK$ denotes $N \times L \times K$.