CN111239687A - Sound source positioning method and system based on deep neural network - Google Patents

Sound source positioning method and system based on deep neural network

Info

Publication number
CN111239687A
CN111239687A (application CN202010050760.9A)
Authority
CN
China
Prior art keywords
neural network
deep neural
sound source
microphone
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010050760.9A
Other languages
Chinese (zh)
Other versions
CN111239687B (en)
Inventor
张巧灵
唐柔冰
马晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202010050760.9A priority Critical patent/CN111239687B/en
Publication of CN111239687A publication Critical patent/CN111239687A/en
Application granted granted Critical
Publication of CN111239687B publication Critical patent/CN111239687B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a positioning method comprising the following steps: S1, acquiring the speech signals received by the microphones and generating a speech data set; S2, preprocessing the speech signals in the speech data set; S3, calculating the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the speech signals; S4, acquiring the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking it as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to that time delay; S5, combining the TDOA observations and the amplitudes as the input vector, taking the three-dimensional spatial position coordinates corresponding to the sound source signal as the output vector, and combining the input vector and the output vector to generate a feature vector; S6, preprocessing the feature vectors; S7, setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain the trained deep neural network; and S8, feeding the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial coordinates of the sound source signal.

Description

Sound source positioning method and system based on deep neural network
Technical Field
The invention relates to the technical field of indoor sound source positioning, in particular to a sound source positioning method and system based on a deep neural network.
Background
In recent years, intelligent service products (such as smart speakers and smart-home devices) have become widespread in daily life, and to deliver a good user experience the human-computer interaction capability of these products has attracted increasing attention. In human-computer interaction, voice is indispensable: users issue spoken commands directly, and the machine recognizes them and provides the corresponding service without manual operation. At present, in near-field speech recognition scenarios (such as mobile phones), the quality of the speech signals received by the microphone is high and the recognition rate meets practical requirements. However, in far-field scenarios such as smart homes, the quality of the speech signals captured by the microphone is poor, the recognition rate is low, and practical requirements cannot be met. Solving the far-field speech recognition problem has therefore become a research hotspot of institutions at home and abroad in recent years. Estimating the sound source position with a localization algorithm at the front end of speech recognition, enhancing the source signal from that direction, and suppressing interference from other directions improves both speech quality and recognition rate, and can effectively support the practical deployment of far-field speech recognition applications. In particular, effective sound source localization prior to speech recognition is of great practical significance.
Classical localization algorithms are mainly two-dimensional sound source localization algorithms and fall into three categories. The first is algorithms based on Time Difference of Arrival (TDOA). A time delay estimation algorithm, also called a TDOA algorithm, determines the position of a sound source from the difference in the times at which two microphones at different positions receive the same source signal. The delay corresponding to the maximum peak of the Generalized Cross-Correlation (GCC) function of the signals received by the two microphones is taken as the delay estimate, and the geometric constraints of the microphone array then yield the source position estimate. This method is easily affected by environmental noise and indoor reverberation: when the noise is strong or the reverberation severe, many spurious peaks appear in the GCC function, an incorrect TDOA value is easily estimated, and an incorrect source position estimate results. The second is algorithms based on spatial spectrum estimation, whose basic idea is to determine the direction angle and position of the source from the spatial spectrum. Because the estimation of spatial signals is analogous to frequency estimation of time-domain signals, spatial spectrum estimation can be generalized from time-domain nonlinear spectral methods; however, these algorithms presuppose continuously distributed signal sources and a stationary field, which greatly limits their application. A typical family of spatial spectrum algorithms is the eigen-subspace algorithms, divided into subspace decomposition algorithms — mainly the Multiple Signal Classification (MUSIC) algorithm and the rotational-invariance subspace algorithm (ESPRIT) — and subspace fitting algorithms — mainly the Maximum Likelihood (ML) algorithm and the Weighted Subspace Fitting (WSF) algorithm. The third is algorithms based on steered beam response, which search globally over the microphone array's field for the location with maximum energy, i.e., the source location. Typically, the speech signals collected by the microphones are filtered, weighted, and summed to form a beam, and the point maximizing the beam's output power is taken as the source position. Steered-response algorithms can be divided into delay-and-sum beamforming and adaptive beamforming. Delay-and-sum beamforming introduces little signal distortion and has a small computational cost, but its interference resistance is weak and it is easily affected by noise; adaptive beamforming is computationally expensive and introduces some signal distortion, but its interference resistance is strong.
Multi-modal fusion algorithms are currently used for sound source localization in three-dimensional space; a representative example is the audio-visual fusion algorithm. The source position is usually estimated jointly from face position information collected by a camera and direction-of-arrival (DOA) estimates obtained from the microphones. This approach avoids the limitations of conventional image tracking (number of cameras, illumination intensity) as well as those of conventional acoustic tracking (background noise, indoor reverberation), greatly reducing the influence of environmental factors. However, multi-modal fusion still requires many parameters to be set, and when the environment changes the robustness of the algorithm degrades.
In recent years, sound source localization using neural networks has been a popular research direction, especially since the development of deep learning. Neural-network localization studies usually extract feature vectors from the speech signals and feed them into a network for training. The common speech feature vector consists of the TDOAs of multiple microphone pairs and does not use the amplitude information associated with those TDOAs, even though the amplitude corresponding to a TDOA reflects, to some extent, its reliability.
In general, sound source localization based on deep neural networks is a research hotspot within the indoor localization problem, and this research is of great significance to the practical deployment of many current audio applications, such as intelligent voice interaction. However, deep-neural-network localization has not yet been studied thoroughly, and existing results are insufficient in one respect or another.
Disclosure of Invention
The invention aims to provide, in view of the defects of the prior art, a sound source positioning method and system based on a deep neural network which use the estimated time delays τ̂_m and the corresponding amplitudes R̂_m as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
In order to achieve the purpose, the invention adopts the following technical scheme:
a sound source positioning method based on a deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input feature vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
Further, in step S1, the set of microphone nodes is V = {1, 2, …, M}; each microphone node m comprises two microphones, wherein m ∈ V; M denotes the total number of microphone nodes (microphone pairs).
Further, the step S2 is specifically to perform a first preprocessing on the speech signals received by the two microphones in the microphone node m, where the first preprocessing includes framing, windowing, and pre-emphasis.
Further, the step S3 is specifically to calculate the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals in the preprocessed microphone node m, expressed as:
R_m(τ) = ∫ [X_m^(1)(ω) (X_m^(2)(ω))^* / |X_m^(1)(ω) (X_m^(2)(ω))^*|] e^(jωτ) dω
wherein m ∈ V; X_m^(1)(ω) and X_m^(2)(ω) are the frequency-domain representations of the time-domain microphone signals x_m^(1)(t) and x_m^(2)(t) at node m; the symbol * denotes complex conjugation.
Further, the step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:
τ̂_m = arg max_τ R_m(τ)
and obtains the amplitude R̂_m = R_m(τ̂_m) corresponding to the time delay τ̂_m.
Further, the step S5 is specifically:
combining the time delays τ̂_m of all nodes and the corresponding amplitudes R̂_m as the input vector I of the deep neural network:
I = (τ̂_1, …, τ̂_M, R̂_1, …, R̂_M)^T
taking the three-dimensional spatial position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:
Q = (q_x, q_y, q_z)^T
and combining the input vector I and the output vector Q to generate the feature vector G:
G = (I, Q)^T
further, the second preprocessing in step S6 includes data cleaning, data disordering, and data normalization.
Further, the cross-validation employed in step S8 includes leave-one-out validation.
Correspondingly, a sound source positioning system based on a deep neural network is also provided, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the method also comprises the following steps:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
Compared with the prior art, the method uses the estimated time delays τ̂_m and the corresponding amplitudes R̂_m as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
Drawings
FIG. 1 is a flowchart of a sound source localization method based on a deep neural network according to an embodiment;
FIG. 2 is a schematic top view of a simulation environment provided by an embodiment, wherein a circle represents a position of a microphone;
FIG. 3 is a flow chart of a training phase of the deep neural network provided in one embodiment;
FIG. 4 is a flowchart illustrating a testing phase of the deep neural network according to an embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
The invention aims to provide a sound source positioning method and system based on a deep neural network, aiming at the defects of the prior art.
Example one
The embodiment provides a sound source localization method based on a deep neural network, which includes a training phase of the deep neural network and a testing phase of the deep neural network, as shown in fig. 1-2, and includes the steps of:
s11, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s12, performing first preprocessing on the voice signals in the generated voice data set;
s13, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s14, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s15, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s16, performing second preprocessing on the generated feature vectors;
s17, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S18, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
In the present embodiment, a distributed microphone array is specifically described:
the specific simulation settings are as follows: the simulated environment is a typical conference room of size 4.1m x 3.1m x 3m with a total of L-12 randomly distributed microphones. The distance between two microphones in each microphone node is Dm-0.6 m. For simplicity, the microphone is positioned in a plane having a height of 1.75 m. The sound propagation speed is c 343 m/s. In this embodiment, the original non-reverberant speech signal is a single-channel pure male english pronunciation with a sampling frequency of 16kHz, and the frame length of the speech signal is 120 ms. The room reverberation time T60 is 0.1s, the SNR is 20dB, and the number of monte carlo experiments is 50. The distributed microphone array has M microphone nodes in total, i.e. the set V of microphone nodes is {1,2, …, M }. Each microphone node m contains two microphones, where m ∈ V.
In step S11, the speech signals received by the microphones are acquired, and a speech data set is generated from the acquired signals; the speech data set comprises a training data set and a testing data set.
In the present embodiment, the sound source positions are set in a plane with a height of 1.5 m to 1.7 m, and 24000 position samples are uniformly acquired as the data set for the neural network. In the MATLAB simulation environment, the Image model is first used to simulate the room impulse response, then the original non-reverberant speech signal is convolved with the room impulse response and white Gaussian noise is added, finally simulating the signals received by the microphones.
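To make this step concrete, the following is a minimal sketch of the microphone-signal simulation, using a direct-path delay-and-attenuation approximation with additive white Gaussian noise in place of the full Image-model room impulse response used in the embodiment; the function and variable names are illustrative, not from the patent.

```python
# Minimal sketch of the microphone-signal simulation (direct path only,
# assumed stand-in for the Image-model room impulse response).
import numpy as np

FS = 16_000          # sampling frequency (Hz), as in the embodiment
C = 343.0            # sound propagation speed (m/s)

def simulate_mic_signal(src_signal, src_pos, mic_pos, snr_db=20.0):
    """Delay/attenuate the dry source to one microphone and add white noise."""
    dist = np.linalg.norm(np.asarray(src_pos) - np.asarray(mic_pos))
    delay = int(round(dist / C * FS))            # integer-sample propagation delay
    sig = np.zeros(len(src_signal) + delay)
    sig[delay:] = src_signal / max(dist, 1e-3)   # 1/r amplitude decay
    noise_power = np.mean(sig ** 2) / (10 ** (snr_db / 10))
    sig += np.random.randn(len(sig)) * np.sqrt(noise_power)
    return sig
```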
In step S12, a first pre-processing is performed on the speech signal within the generated speech data set.
Specifically, the method comprises the steps of performing first preprocessing on voice signals received by two microphones in a microphone node m, wherein the first preprocessing comprises framing, windowing and pre-emphasis.
A rectangular window is used to window the speech signal; the window function ω(n) of the rectangular window is:
ω(n) = 1 for 0 ≤ n ≤ N−1, and ω(n) = 0 otherwise
where N represents the length of the window function.
The formula for pre-emphasis is:
H(z) = 1 − αz^(−1)
where α denotes the pre-emphasis coefficient, in the range 0.9 < α < 1.0. In the present embodiment, the length of the window function equals the frame length, and the pre-emphasis coefficient is α = 0.97.
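The first preprocessing can be sketched as follows. The non-overlapping framing and the function name are assumptions; the patent only fixes the frame length (120 ms, i.e. N = 1920 samples at 16 kHz), the rectangular window, and α = 0.97.

```python
import numpy as np

def preprocess(signal, frame_len=1920, alpha=0.97):
    """First preprocessing: pre-emphasis, framing, rectangular windowing."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha z^-1
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into non-overlapping frames of length N = frame_len (assumed hop)
    n_frames = len(emphasized) // frame_len
    frames = emphasized[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Rectangular window: w(n) = 1 for 0 <= n <= N-1, so windowing is identity
    window = np.ones(frame_len)
    return frames * window
```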
In step S13, a phase weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signal is calculated.
Specifically, the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals in the preprocessed microphone node m is calculated, expressed as:
R_m(τ) = ∫ [X_m^(1)(ω) (X_m^(2)(ω))^* / |X_m^(1)(ω) (X_m^(2)(ω))^*|] e^(jωτ) dω
wherein m ∈ V; X_m^(1)(ω) and X_m^(2)(ω) are the frequency-domain representations of the time-domain microphone signals x_m^(1)(t) and x_m^(2)(t) at node m; the symbol * denotes complex conjugation. In the present embodiment, M = 6.
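A minimal sketch of the GCC-PHAT computation for one frame pair is given below; the FFT length and the small regularization constant in the denominator are implementation assumptions not specified in the patent.

```python
import numpy as np

def gcc_phat(x1, x2, fs=16_000):
    """Phase-weighted generalized cross-correlation (GCC-PHAT) of one frame pair."""
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)              # X1(w) * X2*(w)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    r = np.fft.irfft(cross, n=n)          # back to the lag (tau) domain
    r = np.concatenate((r[-(n // 2):], r[: n // 2 + 1]))   # center lag 0
    lags = np.arange(-(n // 2), n // 2 + 1) / fs           # lags in seconds
    return lags, r
```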
In step S14, acquiring delay information corresponding to the peak of the phase-weighted generalized cross-correlation function, and taking the acquired delay information as a TDOA observation of the arrival of the sound source signal at the microphone; and obtaining the amplitude corresponding to the time delay information.
The time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ) is obtained, and τ̂_m is taken as the TDOA observation of the arrival of the sound source signal S at microphone node m, expressed as:
τ̂_m = arg max_τ R_m(τ)
wherein τ ∈ [−τ_max, τ_max], and τ_max = D_m / c represents the theoretical maximum time delay (TDOA) with which the sound source signal S can arrive at microphone node m; the true delay is (||s − m_1|| − ||s − m_2||) / c, where ||s − m_1|| and ||s − m_2|| represent the distances from the two microphones contained at node m to the sound source S, c represents the sound propagation speed, and ||·|| represents the Euclidean norm. The amplitude R̂_m = R_m(τ̂_m) corresponding to the time delay τ̂_m (i.e., the TDOA observation) is then obtained.
TDOA localization is a positioning method based on time differences. By measuring the time at which a signal arrives at a monitoring station, the distance from the signal source to that station can be determined, and the source position can then be found from the distances to several stations (drawing circles centered on the stations with those distances as radii). Absolute arrival times are, however, difficult to measure; instead, comparing the differences in arrival times at pairs of monitoring stations yields hyperbolas with the stations as foci and the measured range difference as parameter, and the intersection of these hyperbolas is the position of the source.
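Building on the gcc_phat sketch above, the peak search restricted to the physically feasible lag range [−τ_max, τ_max] might look as follows; the helper name and the masking strategy are illustrative assumptions, with D_m = 0.6 m taken from the embodiment.

```python
import numpy as np

def tdoa_and_amplitude(lags, r, mic_spacing=0.6, c=343.0):
    """Pick the GCC-PHAT peak inside the physically feasible lag range."""
    tau_max = mic_spacing / c                      # theoretical maximum TDOA
    valid = np.abs(lags) <= tau_max                # restrict to [-tau_max, tau_max]
    idx = np.argmax(np.where(valid, r, -np.inf))   # peak within the valid range
    tau_hat = lags[idx]                            # TDOA observation (seconds)
    r_hat = r[idx]                                 # amplitude at the peak
    return tau_hat, r_hat
```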
In step S15, the TDOA observations are combined with the amplitudes as input vectors for a deep neural network, the three-dimensional spatial location coordinates corresponding to the acoustic source signal are used as output vectors for the neural network, and the input vectors and the output vectors are combined to generate feature vectors.
The method specifically comprises: combining the time delays τ̂_m (i.e., the TDOA observations) and their corresponding amplitudes R̂_m as the input vector I of the deep neural network:
I = (τ̂_1, …, τ̂_M, R̂_1, …, R̂_M)^T
taking the three-dimensional spatial position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:
Q = (q_x, q_y, q_z)^T
and combining the input vector I and the output vector Q to generate the feature vector G:
G = (I, Q)^T
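A sketch of the feature-vector assembly is shown below; the ordering of delays before amplitudes within I is an assumption, since the patent only states that the two are combined (with M = 6 the input has 2M = 12 entries).

```python
import numpy as np

def build_feature_vector(tau_hats, r_hats, source_pos):
    """Combine TDOAs and amplitudes into I, pair with coordinates Q, form G."""
    I = np.concatenate([tau_hats, r_hats])   # input vector, length 2M (12 for M = 6)
    Q = np.asarray(source_pos)               # output vector (qx, qy, qz)
    return np.concatenate([I, Q])            # feature vector G = (I, Q)^T
```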
In step S16, a second preprocessing is performed on the generated feature vectors; the second preprocessing comprises data cleaning, data shuffling, and data normalization.
The normalization adopts the min-max normalization method; the transformation function is:
g̃ = (g − g_min) / (g_max − g_min)
where g_min and g_max represent the minimum and maximum values in the sample feature vector G, and g̃ represents the normalized sample data. After the neural network is trained, the predicted values must be inverse-normalized to recover physical data values, which in this embodiment are the three-dimensional spatial position of the sound source point.
The transformation function of the inverse normalization is:
g = g̃ · (g_max − g_min) + g_min
where g_min and g_max represent the minimum and maximum values in the sample feature vector G, g̃ is the normalized sample data, and g is the recovered value.
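The min-max normalization and its inverse can be sketched as follows; computing the statistics per feature dimension on the training set only is an assumption about good practice, not something the patent prescribes.

```python
import numpy as np

def minmax_normalize(g, g_min, g_max):
    """Map sample values into [0, 1]: g~ = (g - g_min) / (g_max - g_min)."""
    return (g - g_min) / (g_max - g_min)

def minmax_denormalize(g_tilde, g_min, g_max):
    """Inverse transform: recover physical values from normalized predictions."""
    return g_tilde * (g_max - g_min) + g_min

# Assumed usage: per-dimension statistics taken from the training set only,
# then reused on the test set and on the network's predictions.
# G_train: (num_samples, num_features) array of feature vectors
# g_min, g_max = G_train.min(axis=0), G_train.max(axis=0)
```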
In step S17, in the training phase of the deep neural network, parameters related to the deep neural network are set, and the deep neural network is trained by using the feature vectors of the training set, so as to obtain a trained deep neural network.
In this embodiment, the number of input-layer neurons of the deep neural network (DNN) is set to 12 and the number of output-layer neurons to 3. Three hidden layers are used: the first hidden layer has 12 neurons, the second 15, and the third 3, each with a tanh activation function.
In this embodiment, the loss function of the neural network is set as the mean squared error (MSE) between the true spatial position vector Q and the predicted estimate vector P of the neural network, expressed as:
MSE = (1/U) Σ_{u=1}^{U} ||Q_u − P_u||²
where U is the total number of samples in the current neural network iteration's data set.
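A sketch of this network and loss in PyTorch is given below, matching the stated layer sizes and tanh activations; the linear output layer, the Adam optimizer, and the learning rate are assumptions the patent does not specify.

```python
import torch
import torch.nn as nn

# 12 -> 12 (tanh) -> 15 (tanh) -> 3 (tanh) -> 3, per the embodiment's layer sizes
model = nn.Sequential(
    nn.Linear(12, 12), nn.Tanh(),   # hidden layer 1
    nn.Linear(12, 15), nn.Tanh(),   # hidden layer 2
    nn.Linear(15, 3),  nn.Tanh(),   # hidden layer 3
    nn.Linear(3, 3),                # output layer: (px, py, pz)
)
loss_fn = nn.MSELoss()              # mean squared error between Q and P
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer

def train_step(I_batch, Q_batch):
    """One gradient step on a batch of (input, target) feature pairs."""
    optimizer.zero_grad()
    P = model(I_batch)              # predicted positions
    loss = loss_fn(P, Q_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```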
In step S18, in the testing stage of the deep neural network, the input vectors of the test set are transmitted into the trained deep neural network for prediction, so as to obtain the three-dimensional spatial position coordinates of the sound source signal, and the performance of the deep neural network model is evaluated by using cross validation.
The input vectors of the test set are fed into the trained deep neural network, which predicts the three-dimensional spatial position coordinates P = [p_x, p_y, p_z]^T of the sound source signal; the performance of the deep neural network model is then evaluated by cross-validation.
In this embodiment, the data set contains 24000 samples in total, and the performance of the neural network is tested by cross-validation using a leave-one-out-style rotating hold-out: 4000 sample points are held out as the test set and the remaining 20000 samples form the training set; the tested data become part of the training set in the next round, and the process is repeated until every sample has been predicted once.
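The rotating hold-out scheme described here can be sketched as follows; the random shuffling and seed are illustrative assumptions.

```python
import numpy as np

def rotating_folds(num_samples=24_000, fold_size=4_000, seed=0):
    """Yield (train_idx, test_idx) so every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)            # shuffled sample indices
    for start in range(0, num_samples, fold_size):
        test_idx = order[start:start + fold_size]   # 4000 held-out samples
        train_idx = np.setdiff1d(order, test_idx)   # remaining 20000 samples
        yield train_idx, test_idx
```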
The sound source positioning method based on the deep neural network comprises a training stage of the deep neural network and a testing stage of the deep neural network.
As shown in FIG. 3, in the training phase of the deep neural network, steps S11-S17 are included.
As shown in FIG. 4, in the testing stage of the deep neural network, steps S11-S16, S18 are included.
It should be noted that, in the testing phase of this embodiment, a trained deep neural network is obtained based on the training phase, and then test positioning is performed.
Compared with the prior art, this embodiment uses the estimated time delays τ̂_m and the amplitudes R̂_m corresponding to the maximum peaks of R_m(τ) as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
Example two
The embodiment provides a sound source positioning system based on a deep neural network, which comprises:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
a training module for setting the parameters related to the deep neural network and training the deep neural network with the feature vectors of the training set to obtain the trained deep neural network.
Further, the method also comprises the following steps:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
It should be noted that the sound source localization system based on a deep neural network in this embodiment is similar to Embodiment One and will not be described again here.
Compared with the prior art, this embodiment uses the estimated time delays τ̂_m and the amplitudes R̂_m corresponding to the maximum peaks of R_m(τ) as the input vector of the deep neural network and the three-dimensional space coordinates as the output vector, so that the method is suitable for indoor sound source positioning and has good expandability and algorithm robustness.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A sound source positioning method based on a deep neural network is characterized by comprising a training stage of the deep neural network and a testing stage of the deep neural network, and comprises the following steps:
s1, acquiring a voice signal received by a microphone, and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
s2, performing first preprocessing on the voice signals in the generated voice data set;
s3, calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
s4, acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function, and taking the acquired time delay information as a TDOA observed value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
s5, combining the TDOA observation value with the amplitude value to serve as an input vector of a deep neural network, taking a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
s6, performing second preprocessing on the generated feature vectors;
s7, in the training stage of the deep neural network, setting parameters related to the deep neural network, and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network;
and S8, in the testing stage of the deep neural network, transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional space position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by adopting cross validation.
2. The method as claimed in claim 1, wherein the set of microphone nodes in step S1 is V = {1, 2, …, M}; each microphone node m comprises two microphones, wherein m ∈ V; M denotes the total number of microphone nodes.
3. The method for sound source localization based on deep neural network as claimed in claim 2, wherein the step S2 is specifically to perform a first pre-processing on the speech signals received by two microphones in the microphone node m, and the first pre-processing includes framing, windowing and pre-emphasis.
4. The method for sound source localization according to claim 2, wherein step S3 is specifically to calculate the phase-weighted generalized cross-correlation function R_m(τ) of the two microphone speech signals within the preprocessed microphone node m, expressed as:
R_m(τ) = ∫ [X_m^(1)(ω) (X_m^(2)(ω))^* / |X_m^(1)(ω) (X_m^(2)(ω))^*|] e^(jωτ) dω
wherein m ∈ V; X_m^(1)(ω) and X_m^(2)(ω) are the frequency-domain representations of the time-domain microphone signals x_m^(1)(t) and x_m^(2)(t) at node m; the symbol * denotes complex conjugation.
5. The sound source localization method based on the deep neural network as claimed in claim 4, wherein step S4 obtains the time delay τ̂_m corresponding to the peak of the phase-weighted generalized cross-correlation function R_m(τ), expressed as:
τ̂_m = arg max_τ R_m(τ)
and obtains the amplitude R̂_m = R_m(τ̂_m) corresponding to the time delay τ̂_m.
6. The sound source localization method based on the deep neural network as claimed in claim 5, wherein step S5 specifically comprises:
combining the time delays τ̂_m and the corresponding amplitudes R̂_m as the input vector I of the deep neural network:
I = (τ̂_1, …, τ̂_M, R̂_1, …, R̂_M)^T
taking the three-dimensional spatial position coordinates Q corresponding to the sound source signal S as the output vector of the neural network:
Q = (q_x, q_y, q_z)^T
and combining the input vector I and the output vector Q to generate the feature vector G:
G = (I, Q)^T
7. The method for sound source localization based on a deep neural network of claim 6, wherein the second preprocessing in step S6 includes data cleaning, data shuffling, and data normalization.
8. The method for sound source localization based on deep neural network of claim 7, wherein the cross-validation employed in step S8 comprises leave-one-out validation.
9. A sound source localization system based on a deep neural network, comprising:
the first acquisition module is used for acquiring the voice signal received by the microphone and generating a voice data set from the acquired voice signal; wherein the speech data set comprises a training data set and a testing data set;
a first preprocessing module for performing a first preprocessing on the speech signal within the generated speech data set;
the calculation module is used for calculating a phase weighted generalized cross-correlation function of a sound source signal corresponding to the preprocessed voice signal;
the second acquisition module is used for acquiring time delay information corresponding to the peak of the phase weighted generalized cross-correlation function and taking the acquired time delay information as a TDOA observation value of a sound source signal reaching a microphone; obtaining the amplitude corresponding to the time delay information;
the generating module is used for combining the TDOA observed value and the amplitude value to serve as an input vector of a deep neural network, using a three-dimensional space position coordinate corresponding to a sound source signal as an output vector of the neural network, and combining the input vector and the output vector to generate a feature vector;
the second preprocessing module is used for carrying out second preprocessing on the generated feature vectors;
and the training module is used for setting parameters related to the deep neural network and training the deep neural network by using the feature vectors of the training set to obtain the trained deep neural network.
10. The deep neural network-based sound source localization system according to claim 9, further comprising:
and the test module is used for transmitting the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial position coordinates of the sound source signal and evaluating the performance of the deep neural network model by adopting cross validation.
CN202010050760.9A 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network Active CN111239687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN111239687A true CN111239687A (en) 2020-06-05
CN111239687B CN111239687B (en) 2021-12-14

Family

ID=70872716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010050760.9A Active CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN111239687B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHARATH ADAVANNE et al.: "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks", IEEE Journal of Selected Topics in Signal Processing *
WANG YIYUAN (王义圆): "Research on target detection and signal enhancement technology based on microphone arrays", China Master's Theses Full-text Database, Information Science and Technology *
ZU LINAN (祖丽楠) et al.: "A design of a generalized cross-correlation time delay estimation method based on neural network filtering", Control and Instruments in Chemical Industry *
LI CHANGJIANG (黎长江) et al.: "Research on phoneme recognition based on recurrent neural networks", Microelectronics & Computer *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN111981644B (en) * 2020-08-26 2021-09-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source direction-of-arrival estimation model training and sound source direction-of-arrival estimation method
CN112180318B (en) * 2020-09-28 2023-06-27 深圳大学 Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN114545332B (en) * 2022-02-18 2024-05-03 桂林电子科技大学 Random array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Also Published As

Publication number Publication date
CN111239687B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111239687B (en) Sound source positioning method and system based on deep neural network
CN111025233B (en) Sound source direction positioning method and device, voice equipment and system
Salvati et al. Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions
Aarabi et al. Robust sound localization using multi-source audiovisual information fusion
Nakadai et al. Improvement of recognition of simultaneous speech signals using av integration and scattering theory for humanoid robots
Vesperini et al. Localizing speakers in multiple rooms by using deep neural networks
Liu et al. Continuous sound source localization based on microphone array for mobile robots
Hu et al. Unsupervised multiple source localization using relative harmonic coefficients
Raykar et al. Speaker localization using excitation source information in speech
WO2020024816A1 (en) Audio signal processing method and apparatus, device, and storage medium
CN113870893B (en) Multichannel double-speaker separation method and system
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN114171041A (en) Voice noise reduction method, device and equipment based on environment detection and storage medium
CN113514801A (en) Microphone array sound source positioning method and sound source identification method based on deep learning
CN103901400B (en) A kind of based on delay compensation and ears conforming binaural sound source of sound localization method
CN112712818A (en) Voice enhancement method, device and equipment
Zhang et al. AcousticFusion: Fusing sound source localization to visual SLAM in dynamic environments
Yang et al. Srp-dnn: Learning direct-path phase difference for multiple moving sound source localization
Rascon et al. Lightweight multi-DOA tracking of mobile speech sources
Huang et al. A time-domain unsupervised learning based sound source localization method
Pertilä et al. Time Difference of Arrival Estimation with Deep Learning–From Acoustic Simulations to Recorded Data
Liu et al. Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System
Zhao et al. Accelerated steered response power method for sound source localization via clustering search
Dwivedi et al. Long-term temporal audio source localization using sh-crnn
Liu et al. Deep learning based two-dimensional speaker localization with large ad-hoc microphone arrays

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant