CN111239687A - Sound source positioning method and system based on deep neural network - Google Patents

Sound source positioning method and system based on deep neural network

Info

Publication number
CN111239687A
Authority
CN
China
Prior art keywords
neural network
deep neural
sound source
microphone
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010050760.9A
Other languages
Chinese (zh)
Other versions
CN111239687B (en)
Inventor
张巧灵
唐柔冰
马晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202010050760.9A priority Critical patent/CN111239687B/en
Publication of CN111239687A publication Critical patent/CN111239687A/en
Application granted granted Critical
Publication of CN111239687B publication Critical patent/CN111239687B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a positioning method, comprising: S1. acquiring the speech signals received by the microphones and generating a speech data set; S2. preprocessing the speech signals in the speech data set; S3. computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the speech signals; S4. obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay; S5. combining the TDOA observations and amplitudes as the input vector, taking the three-dimensional position coordinates corresponding to the sound source signal as the output vector, and combining the input and output vectors to generate a feature vector; S6. preprocessing the feature vectors; S7. setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network; S8. feeding the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional spatial coordinates of the sound source signal.

Description

A sound source localization method and system based on a deep neural network

TECHNICAL FIELD

The invention relates to the technical field of indoor sound source localization, and in particular to a sound source localization method and system based on a deep neural network.

BACKGROUND ART

In recent years, intelligent service products (such as smart speakers and smart home devices) have come into wide use in daily life, and to deliver a good user experience, the human-computer interaction capability of such products has attracted growing attention. Voice communication is an indispensable part of human-computer interaction: a user can issue a spoken command directly, and the machine recognizes it and provides the corresponding service without any manual operation. At present, in near-field speech recognition scenarios (such as mobile phones), the speech signal received by the microphone is of high quality and the recognition rate already meets practical requirements. In far-field scenarios such as smart homes, however, the speech signal captured by the microphone is of poor quality and the recognition rate is too low for practical use. Bringing far-field speech recognition into practical deployment has therefore become a research focus of institutions at home and abroad in recent years. At present, estimating the sound source position with a localization algorithm at the front end of speech recognition, enhancing the source signal from that direction while attenuating interference from other directions, improves speech quality and raises the recognition rate, and thus effectively supports the deployment of far-field speech recognition. In particular, effective sound source localization prior to speech recognition is of great practical significance.

Classical localization algorithms are mainly two-dimensional sound source localization algorithms and fall into three broad categories.

The first is algorithms based on time delay estimation (Time Difference of Arrival, TDOA). A TDOA algorithm determines the source position from the difference in the times at which two microphones at different positions receive the same source signal. It takes the delay corresponding to the maximum peak of the Generalized Cross Correlation (GCC) function of the two microphone signals as the time delay estimate, and then uses the geometric constraints of the microphone array to estimate the source position. Such methods are easily affected by environmental noise and room reverberation: when noise or reverberation is severe, the GCC function exhibits multiple spurious peaks, an incorrect TDOA value is easily estimated, and an incorrect source position estimate follows.

The second is algorithms based on spatial spectrum estimation. Their basic idea is to determine the direction angle and the source position from the spatial spectrum. Since the estimation of spatial signals resembles frequency estimation of time-domain signals, spatial spectrum estimation methods can be generalized from time-domain nonlinear spectral methods; however, these algorithms presuppose that the signal sources are continuously distributed and spatially stationary, which greatly restricts their application. A representative family is the eigen-subspace algorithms, which divide into subspace decomposition algorithms and subspace fitting algorithms. The main algorithms of the former are Multiple Signal Classification (MUSIC) and Estimating Signal Parameters via Rotational Invariance Techniques (ESPRIT); the latter mainly comprise the Maximum Likelihood (ML) algorithm and Weighted Subspace Fitting (WSF).

The third is algorithms based on steerable beam response. These search the whole microphone array output space for the position of maximum energy, which is taken as the source position. Usually the speech signals collected by the microphones are filtered, weighted, and summed to form a beam, and the point maximizing the beam output power is the source position. Such algorithms divide into delay-and-sum beamforming and adaptive beamforming. Delay-and-sum introduces little signal distortion and is computationally cheap, but its interference rejection is weak and it is easily affected by noise; adaptive beamforming is computationally expensive and introduces some signal distortion, but has strong interference rejection.

Sound source localization algorithms in three-dimensional space now mostly adopt multi-modal fusion, of which the representative algorithm is audio-visual fusion. Usually the face position information collected by cameras and the direction-of-arrival (DOA) estimates obtained from the microphones are combined to estimate the source position. This avoids the limitation of traditional image tracking by the number of cameras and the lighting conditions, as well as the susceptibility of traditional acoustic source tracking to background noise and room reverberation, greatly reducing the influence of environmental factors. However, multi-modal fusion still requires many parameters to be set, and the robustness of the algorithm degrades when the environment changes.

In recent years, sound source localization with neural networks has become a popular research direction, especially since the development of deep learning. Such work usually first extracts a feature vector from the speech signals and then feeds it to the neural network for training. The common speech feature vector consists of the TDOAs of multiple microphone pairs and does not use the corresponding amplitude information, even though the amplitude associated with a TDOA reflects, to a certain extent, how reliable that TDOA is.

In general, sound source localization based on deep neural networks is a research hotspot of indoor sound source localization, and this research is of great significance for the practical deployment of many current audio applications, such as intelligent voice interaction. However, research on such methods is still immature, and existing results have certain deficiencies.

SUMMARY OF THE INVENTION

The purpose of the present invention is to address the defects of the prior art by providing a sound source localization method and system based on a deep neural network, in which the time delay estimates $\hat{\tau}_m$ and their corresponding amplitudes $R_m(\hat{\tau}_m)$ together form the input vector of the deep neural network and the three-dimensional space coordinates form its output vector. The method is suitable for indoor sound source localization and has good scalability and algorithmic robustness.

To achieve the above purpose, the present invention adopts the following technical solutions:

A sound source localization method based on a deep neural network, comprising a training phase and a testing phase of the deep neural network, and comprising the steps of:

S1. acquiring the speech signals received by the microphones and generating a speech data set from them, the speech data set comprising a training data set and a test data set;

S2. performing first preprocessing on the speech signals in the generated speech data set;

S3. computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals;

S4. obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the obtained time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay;

S5. combining the TDOA observations and amplitudes as the input vector of the deep neural network, taking the three-dimensional position coordinates of the sound source signal as the output vector of the network, and combining the input and output vectors to generate a feature vector;

S6. performing second preprocessing on the generated feature vectors;

S7. in the training phase, setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network;

S8. in the testing phase, feeding the input feature vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by cross-validation.

Further, in step S1 the set of microphone nodes is V = {1, 2, …, M}; each microphone node m contains two microphones, where m ∈ V, and M denotes the total number of microphone nodes (microphone pairs).

Further, step S2 specifically performs first preprocessing on the speech signals received by the two microphones in microphone node m, the first preprocessing comprising framing, windowing, and pre-emphasis.

Further, step S3 specifically computes the phase-weighted generalized cross-correlation function $R_m(\tau)$ of the two microphone speech signals in the preprocessed microphone node m, expressed as:

$$R_m(\tau) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{X_{m,1}(\omega)\, X_{m,2}^{*}(\omega)}{\left| X_{m,1}(\omega)\, X_{m,2}^{*}(\omega) \right|}\, e^{j\omega\tau}\, d\omega$$

where m ∈ V; $X_{m,1}(\omega)$ and $X_{m,2}(\omega)$ denote the frequency-domain representations of the time-domain microphone signals $x_{m,1}(t)$ and $x_{m,2}(t)$ at node m; and the symbol * denotes complex conjugation.

Further, step S4 obtains the time delay $\hat{\tau}_m$ corresponding to the peak of the phase-weighted generalized cross-correlation function $R_m(\tau)$, expressed as:

$$\hat{\tau}_m = \arg\max_{\tau} R_m(\tau)$$

and obtains the amplitude $R_m(\hat{\tau}_m)$ corresponding to the time delay $\hat{\tau}_m$.

Further, step S5 is specifically:

combining the time delays $\hat{\tau}_m$ computed at all nodes and their corresponding amplitudes $R_m(\hat{\tau}_m)$ as the input vector I of the deep neural network:

$$I = \left[\hat{\tau}_1, \ldots, \hat{\tau}_M,\; R_1(\hat{\tau}_1), \ldots, R_M(\hat{\tau}_M)\right]^T$$

taking the three-dimensional position coordinates Q corresponding to the sound source signal S as the output vector of the network:

$$Q = \left[q_x, q_y, q_z\right]^T$$

and combining the input vector I and the output vector Q to generate the feature vector G:

$$G = (I, Q)^T.$$

Further, the second preprocessing in step S6 comprises data cleaning, data shuffling, and data normalization.

Further, the cross-validation used in step S8 includes the leave-one-out method.

Correspondingly, a sound source localization system based on a deep neural network is also provided, comprising:

a first acquisition module for acquiring the speech signals received by the microphones and generating a speech data set from them, the speech data set comprising a training data set and a test data set;

a first preprocessing module for performing first preprocessing on the speech signals in the generated speech data set;

a calculation module for computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals;

a second acquisition module for obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the obtained time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay;

a generation module for combining the TDOA observations and amplitudes as the input vector of the deep neural network, taking the three-dimensional position coordinates of the sound source signal as the output vector of the network, and combining the input and output vectors to generate a feature vector;

a second preprocessing module for performing second preprocessing on the generated feature vectors;

a training module for setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network.

Further, the system also comprises:

a test module for feeding the input vectors of the test set into the trained deep neural network for prediction, obtaining the three-dimensional position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by cross-validation.

Compared with the prior art, the present invention uses the time delay estimates $\hat{\tau}_m$ and their corresponding amplitudes $R_m(\hat{\tau}_m)$ together as the input vector of the deep neural network and the three-dimensional space coordinates as its output vector. The method is suitable for indoor sound source localization and has good scalability and algorithmic robustness.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a sound source localization method based on a deep neural network provided by Embodiment 1;

FIG. 2 is a schematic top view of the simulation environment provided by Embodiment 1, in which the circles represent the positions of the microphones;

FIG. 3 is a flowchart of the training phase of the deep neural network provided by Embodiment 1;

FIG. 4 is a flowchart of the testing phase of the deep neural network provided by Embodiment 1.

DETAILED DESCRIPTION OF EMBODIMENTS

The embodiments of the present invention are described below through specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways, based on different viewpoints and applications, without departing from the spirit of the present invention. It should be noted that, where no conflict arises, the following embodiments and the features in the embodiments may be combined with one another.

The purpose of the present invention is to provide, in view of the defects of the prior art, a sound source localization method and system based on a deep neural network.

Embodiment 1

Embodiment 1 provides a sound source localization method based on a deep neural network, comprising a training phase and a testing phase of the deep neural network. As shown in FIGS. 1-2, the method comprises the steps of:

S11. acquiring the speech signals received by the microphones and generating a speech data set from them, the speech data set comprising a training data set and a test data set;

S12. performing first preprocessing on the speech signals in the generated speech data set;

S13. computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals;

S14. obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the obtained time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay;

S15. combining the TDOA observations and amplitudes as the input vector of the deep neural network, taking the three-dimensional position coordinates of the sound source signal as the output vector of the network, and combining the input and output vectors to generate a feature vector;

S16. performing second preprocessing on the generated feature vectors;

S17. in the training phase, setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network;

S18. in the testing phase, feeding the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by cross-validation.

In this embodiment, a distributed microphone array is used for the concrete description:

The specific simulation settings are as follows. The simulation environment is a typical conference room of size 4.1 m × 3.1 m × 3 m containing a total of L = 12 randomly distributed microphones. The distance between the two microphones in each microphone node is Dm = 0.6 m. For simplicity, the microphones are placed in a plane at a height of 1.75 m. The speed of sound is c = 343 m/s. In this embodiment, the original non-reverberant speech signal is a single-channel clean male English utterance sampled at 16 kHz, and the frame length of the speech signal is 120 ms. The room reverberation time is T60 = 0.1 s, the signal-to-noise ratio is SNR = 20 dB, and the number of Monte Carlo runs is 50. The distributed microphone array has M microphone nodes in total, i.e. the set of microphone nodes is V = {1, 2, …, M}, and each microphone node m contains two microphones, where m ∈ V.

In step S11, the speech signals received by the microphones are acquired and a speech data set is generated from them; the speech data set comprises a training data set and a test data set.

In this embodiment, the sound source positions are set in a plane at heights of 1.5 m to 1.7 m, and H = 24000 position samples are collected uniformly as the data set of the neural network. In the MATLAB simulation environment, the room impulse response is first simulated with the image model; the original non-reverberant speech signal is then convolved with the room impulse response and white Gaussian noise is added, finally simulating the signals received by the microphones.
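For illustration, the following Python sketch reproduces this data-generation step with the pyroomacoustics package as a stand-in for the MATLAB image-model implementation; the package choice, its calls, and the random microphone layout are assumptions made for the sketch, not part of the embodiment:

```python
# Sketch of the data-generation step (image model + additive noise), assuming
# pyroomacoustics as a stand-in for the MATLAB implementation.
import numpy as np
import pyroomacoustics as pra

fs = 16000                                   # sampling frequency (Hz)
room_dim = [4.1, 3.1, 3.0]                   # conference room size (m)
rt60 = 0.1                                   # reverberation time T60 (s)

# Derive wall absorption and image order from the target T60 (Sabine formula).
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)

speech = np.random.randn(fs)                 # placeholder for the clean male utterance

room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# 12 microphones (6 pairs, 0.6 m spacing per pair) on the plane z = 1.75 m.
mic_xy = np.random.uniform([0.3, 0.3], [3.8, 2.8], size=(6, 2))
mics = []
for x, y in mic_xy:
    mics += [[x - 0.3, y, 1.75], [x + 0.3, y, 1.75]]
room.add_microphone_array(np.array(mics).T)

# One source position drawn from the 1.5-1.7 m height band.
src = [np.random.uniform(0.3, 3.8), np.random.uniform(0.3, 2.8),
       np.random.uniform(1.5, 1.7)]
room.add_source(src, signal=speech)

room.simulate(snr=20)                        # convolve RIRs, add noise at 20 dB SNR
signals = room.mic_array.signals             # (12, n_samples) microphone signals
```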

In step S12, first preprocessing is performed on the speech signals in the generated speech data set.

Specifically, first preprocessing is performed on the speech signals received by the two microphones in microphone node m; the first preprocessing comprises framing, windowing, and pre-emphasis.

A rectangular window is used to window the speech signal; the window function ω(n) of the rectangular window is:

$$\omega(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

where N denotes the length of the window function.

The pre-emphasis filter is expressed as:

$$H(z) = 1 - \alpha z^{-1}$$

where α denotes the pre-emphasis coefficient, with 0.9 < α < 1.0. In this embodiment, the window length equals the frame length and the pre-emphasis coefficient is α = 0.97.
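A minimal sketch of this first preprocessing in Python/NumPy follows; non-overlapping frames (hop equal to the 120 ms frame length) are an assumption, and the rectangular window is written out only for completeness since it multiplies by ones:

```python
import numpy as np

def preprocess(x, fs=16000, frame_ms=120, alpha=0.97):
    """Pre-emphasis, framing, and rectangular windowing of one microphone signal."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]  (i.e. H(z) = 1 - alpha z^-1)
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Framing: non-overlapping 120 ms frames (hop = frame length is an assumption)
    n = int(fs * frame_ms / 1000)
    frames = y[: len(y) // n * n].reshape(-1, n)
    # Rectangular window of length N = frame length (all ones)
    return frames * np.ones(n)
```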

In step S13, the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals is computed.

Specifically, the phase-weighted generalized cross-correlation function $R_m(\tau)$ of the two microphone speech signals in the preprocessed microphone node m is computed, expressed as:

$$R_m(\tau) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{X_{m,1}(\omega)\, X_{m,2}^{*}(\omega)}{\left| X_{m,1}(\omega)\, X_{m,2}^{*}(\omega) \right|}\, e^{j\omega\tau}\, d\omega$$

where m ∈ V; $X_{m,1}(\omega)$ and $X_{m,2}(\omega)$ denote the frequency-domain representations of the time-domain microphone signals $x_{m,1}(t)$ and $x_{m,2}(t)$ at node m; and the symbol * denotes complex conjugation. In this embodiment, M = 6.
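In discrete time this integral is evaluated efficiently with FFTs. A minimal sketch of the computation (the FFT length, the scipy.fft routines, and the small stabilizing constant are implementation choices, not specified by the patent):

```python
import numpy as np
from scipy.fft import rfft, irfft

def gcc_phat(x1, x2, nfft=None):
    """Phase-weighted (PHAT) generalized cross-correlation of one frame pair.

    Returns R_m on a lag grid with zero lag centered in the middle.
    """
    n = len(x1) + len(x2)
    if nfft is None:
        nfft = 1 << (n - 1).bit_length()       # next power of two for the FFT
    X1 = rfft(x1, nfft)
    X2 = rfft(x2, nfft)
    cross = X1 * np.conj(X2)                   # X_{m,1}(w) X*_{m,2}(w)
    cross /= np.abs(cross) + 1e-12             # PHAT weighting (eps for stability)
    r = irfft(cross, nfft)                     # back to the lag domain
    return np.fft.fftshift(r)                  # center lag 0 in the middle
```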

In step S14, the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function is obtained and taken as the TDOA observation of the sound source signal arriving at the microphones, and the amplitude corresponding to the time delay is obtained.

The time delay $\hat{\tau}_m$ corresponding to the peak of the phase-weighted generalized cross-correlation function $R_m(\tau)$ is obtained and taken as the TDOA observation of the sound source signal S arriving at microphone node m, expressed as:

$$\hat{\tau}_m = \arg\max_{\tau \in [-\tau_{\max},\, \tau_{\max}]} R_m(\tau)$$

where $\tau_{\max}$ denotes the theoretical maximum time delay (TDOA) of the sound source signal S arriving at microphone node m, i.e. $\tau_{\max} = D_m / c$, where $D_m$ denotes the distance between the pair of microphones contained at node m, c denotes the speed of sound propagation, and ‖·‖ denotes the Euclidean norm. The amplitude $R_m(\hat{\tau}_m)$ corresponding to the time delay $\hat{\tau}_m$ (i.e. the TDOA observation) is then obtained.

TDOA positioning locates a source using time differences. By measuring the time at which a signal arrives at a monitoring station, the distance to the signal source can be determined; using the distances from the source to the individual stations (circles centered on the stations with those distances as radii), the position of the signal can be fixed. Absolute arrival times, however, are generally hard to measure. By comparing the differences between the absolute arrival times at the stations, one can instead draw hyperbolas with the monitoring stations as foci, since the locus of points whose distances to the two stations differ by a constant is a hyperbola branch; the intersection of the hyperbolas is the position of the signal.

In step S15, the TDOA observations and amplitudes are combined as the input vector of the deep neural network, the three-dimensional position coordinates of the sound source signal are taken as the output vector of the network, and the input and output vectors are combined to generate the feature vector.

Specifically, the time delays $\hat{\tau}_m$ (i.e. the TDOA observations) and their corresponding amplitudes $R_m(\hat{\tau}_m)$ are combined as the input vector I of the deep neural network:

$$I = \left[\hat{\tau}_1, \ldots, \hat{\tau}_M,\; R_1(\hat{\tau}_1), \ldots, R_M(\hat{\tau}_M)\right]^T$$

the three-dimensional position coordinates Q corresponding to the sound source signal S are taken as the output vector of the network:

$$Q = \left[q_x, q_y, q_z\right]^T$$

and the input vector I and the output vector Q are combined to generate the feature vector G:

$$G = (I, Q)^T.$$
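Assembling one training sample from the helpers sketched above might look as follows (an illustrative sketch; that consecutive channel pairs form a node, and the helper names, are assumptions):

```python
import numpy as np

def build_feature(signals, source_xyz, fs=16000):
    """Build one sample G = (I, Q): 12 inputs (6 TDOAs + 6 amplitudes), 3 outputs."""
    taus, amps = [], []
    for m in range(6):                        # M = 6 nodes, channels (2m, 2m+1)
        r = gcc_phat(signals[2 * m], signals[2 * m + 1])
        tau, amp = tdoa_and_amplitude(r, fs=fs)
        taus.append(tau)
        amps.append(amp)
    I = np.array(taus + amps)                 # input vector I (length 12)
    Q = np.asarray(source_xyz)                # output vector Q = [qx, qy, qz]
    return I, Q
```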

In step S16, second preprocessing is performed on the generated feature vectors; the second preprocessing comprises data cleaning, data shuffling, and data normalization.

Normalization uses min-max scaling, whose transformation function is:

$$\tilde{g} = \frac{g - g_{\min}}{g_{\max} - g_{\min}}$$

where $g_{\min}$ and $g_{\max}$ denote the minimum and maximum values in the sample feature vector G, respectively, and $\tilde{g}$ denotes the normalized sample value. After the neural network has been trained, the data values are recovered by inverse normalization; in this embodiment these are the three-dimensional positions of the sound source points.

The inverse-normalization transformation function is:

$$g = \tilde{g}\,(g_{\max} - g_{\min}) + g_{\min}$$

where $g_{\min}$ and $g_{\max}$ denote the minimum and maximum values in the sample feature vector G, $\tilde{g}$ denotes the normalized sample value, and g is the de-normalized result.

In step S17, in the training phase of the deep neural network, the parameters of the network are set and the network is trained with the feature vectors of the training set to obtain a trained deep neural network.

In this embodiment, the number of input-layer neurons of the deep neural network (DNN) is set to 12 and the number of output-layer neurons to 3. Three hidden layers are used: the first hidden layer has 12 neurons, the second 15, and the third 3, each with the tanh activation function.

In this embodiment, the loss function of the neural network is set to the mean squared error (MSE) between the true spatial position vector Q and the network's predicted estimate vector P, expressed as:

$$\mathrm{MSE} = \frac{1}{U} \sum_{u=1}^{U} \left\| Q_u - P_u \right\|^2$$

where U is the total number of samples in the current iteration of the data set.
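A sketch of this network and loss in PyTorch (the linear output layer, the Adam optimizer, and the learning rate are assumptions; the patent fixes only the layer sizes, the tanh activations, and the MSE loss):

```python
import torch
import torch.nn as nn

# 12 inputs (6 TDOAs + 6 amplitudes) -> hidden layers 12, 15, 3 (tanh) -> 3 coords
model = nn.Sequential(
    nn.Linear(12, 12), nn.Tanh(),
    nn.Linear(12, 15), nn.Tanh(),
    nn.Linear(15, 3), nn.Tanh(),
    nn.Linear(3, 3),                 # output layer; linear activation is an assumption
)
loss_fn = nn.MSELoss()               # mean squared error between Q and prediction P
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_epoch(loader):
    """One pass over normalized (I, Q) pairs; loader yields float32 tensors."""
    for I_batch, Q_batch in loader:
        optimizer.zero_grad()
        P_batch = model(I_batch)     # predicted (normalized) positions
        loss = loss_fn(P_batch, Q_batch)
        loss.backward()
        optimizer.step()
```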

In step S18, in the testing phase of the deep neural network, the input vectors of the test set are fed into the trained deep neural network for prediction, the three-dimensional position coordinates of the sound source signal are obtained, and the performance of the deep neural network model is evaluated by cross-validation.

Feeding the input vectors of the test set into the trained deep neural network predicts the three-dimensional position coordinates of the sound source signal, P = [px, py, pz]^T, and cross-validation is used to evaluate the performance of the deep neural network model.

In this embodiment, the data set contains 24000 samples in total, and the performance of the neural network is tested by cross-validation in a hold-one-fold-out manner: 4000 sample points are held out as the test set and the remaining 20000 samples form the training set; the tested data then become part of the training set in the next round, and this process is repeated until no new sample data remain to be predicted.
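This rotation amounts to 6-fold cross-validation over the 24000 samples; a sketch (the shuffling seed and the callback interface are assumptions):

```python
import numpy as np

def cross_validate(I_all, Q_all, train_and_eval, n_folds=6, seed=0):
    """Rotate a 4000-sample test fold over the 24000-sample set (6 folds).

    train_and_eval(I_tr, Q_tr, I_te, Q_te) -> error is assumed to train a fresh
    DNN and return its localization error on the held-out fold.
    """
    n = len(I_all)                                  # 24000 in the embodiment
    idx = np.random.default_rng(seed).permutation(n)
    fold_size = n // n_folds                        # 4000 test samples per fold
    errors = []
    for k in range(n_folds):
        test = idx[k * fold_size:(k + 1) * fold_size]
        train = np.setdiff1d(idx, test)             # remaining 20000 samples
        errors.append(train_and_eval(I_all[train], Q_all[train],
                                     I_all[test], Q_all[test]))
    return np.mean(errors)
```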

The sound source localization method based on a deep neural network of this embodiment comprises a training phase and a testing phase of the deep neural network.

As shown in FIG. 3, the training phase of the deep neural network comprises steps S11-S17.

As shown in FIG. 4, the testing phase of the deep neural network comprises steps S11-S16 and S18.

It should be noted that the testing phase of this embodiment uses the deep neural network trained in the training phase and then performs test localization.

Compared with the prior art, this embodiment uses the time delay estimates $\hat{\tau}_m$ and the amplitudes $R_m(\hat{\tau}_m)$ corresponding to the maximum peaks of $R_m(\tau)$ together as the input vector of the deep neural network and the three-dimensional space coordinates as its output vector. The method is suitable for indoor sound source localization and has good scalability and algorithmic robustness.

Embodiment 2

This embodiment provides a sound source localization system based on a deep neural network, comprising:

a first acquisition module for acquiring the speech signals received by the microphones and generating a speech data set from them, the speech data set comprising a training data set and a test data set;

a first preprocessing module for performing first preprocessing on the speech signals in the generated speech data set;

a calculation module for computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals;

a second acquisition module for obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the obtained time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay;

a generation module for combining the TDOA observations and amplitudes as the input vector of the deep neural network, taking the three-dimensional position coordinates of the sound source signal as the output vector of the network, and combining the input and output vectors to generate a feature vector;

a second preprocessing module for performing second preprocessing on the generated feature vectors;

a training module for setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network.

Further, the system also comprises:

a test module for feeding the input vectors of the test set into the trained deep neural network for prediction, obtaining the three-dimensional position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by cross-validation.

It should be noted that the sound source localization system based on a deep neural network of this embodiment is similar to Embodiment 1 and is not described here in further detail.

Compared with the prior art, this embodiment uses the time delay estimates $\hat{\tau}_m$ and the amplitudes $R_m(\hat{\tau}_m)$ corresponding to the maximum peaks of $R_m(\tau)$ together as the input vector of the deep neural network and the three-dimensional space coordinates as its output vector. The method is suitable for indoor sound source localization and has good scalability and algorithmic robustness.

The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art to which the present invention pertains can make various modifications or additions to the described specific embodiments or substitute them in similar ways, without departing from the spirit of the present invention or going beyond the scope defined by the appended claims.

Claims (10)

1. A sound source localization method based on a deep neural network, characterized in that it comprises a training phase and a testing phase of the deep neural network and comprises the steps of:

S1. acquiring the speech signals received by the microphones and generating a speech data set from them, the speech data set comprising a training data set and a test data set;

S2. performing first preprocessing on the speech signals in the generated speech data set;

S3. computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals;

S4. obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the obtained time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay;

S5. combining the TDOA observations and amplitudes as the input vector of the deep neural network, taking the three-dimensional position coordinates of the sound source signal as the output vector of the network, and combining the input and output vectors to generate a feature vector;

S6. performing second preprocessing on the generated feature vectors;

S7. in the training phase, setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network;

S8. in the testing phase, feeding the input vectors of the test set into the trained deep neural network for prediction to obtain the three-dimensional position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by cross-validation.

2. The sound source localization method based on a deep neural network according to claim 1, characterized in that in step S1 the set of microphone nodes is V = {1, 2, …, M}; each microphone node m contains two microphones, where m ∈ V, and M denotes the total number of microphone nodes.

3. The sound source localization method based on a deep neural network according to claim 2, characterized in that step S2 specifically performs first preprocessing on the speech signals received by the two microphones in microphone node m, the first preprocessing comprising framing, windowing, and pre-emphasis.

4. The sound source localization method based on a deep neural network according to claim 2, characterized in that step S3 specifically computes the phase-weighted generalized cross-correlation function $R_m(\tau)$ of the two microphone speech signals in the preprocessed microphone node m, expressed as:

$$R_m(\tau) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{X_{m,1}(\omega)\, X_{m,2}^{*}(\omega)}{\left| X_{m,1}(\omega)\, X_{m,2}^{*}(\omega) \right|}\, e^{j\omega\tau}\, d\omega$$

where m ∈ V; $X_{m,1}(\omega)$ and $X_{m,2}(\omega)$ denote the frequency-domain representations of the time-domain microphone signals $x_{m,1}(t)$ and $x_{m,2}(t)$ at node m; and the symbol * denotes complex conjugation.
5. The sound source localization method based on a deep neural network according to claim 4, characterized in that step S4 obtains the time delay $\hat{\tau}_m$ corresponding to the peak of the phase-weighted generalized cross-correlation function $R_m(\tau)$, expressed as:

$$\hat{\tau}_m = \arg\max_{\tau} R_m(\tau)$$

and obtains the amplitude $R_m(\hat{\tau}_m)$ corresponding to the time delay $\hat{\tau}_m$.
6. The sound source localization method based on a deep neural network according to claim 5, characterized in that step S5 is specifically:

combining the time delays $\hat{\tau}_m$ and their corresponding amplitudes $R_m(\hat{\tau}_m)$ as the input vector I of the deep neural network:

$$I = \left[\hat{\tau}_1, \ldots, \hat{\tau}_M,\; R_1(\hat{\tau}_1), \ldots, R_M(\hat{\tau}_M)\right]^T$$

taking the three-dimensional position coordinates Q corresponding to the sound source signal S as the output vector of the network:

$$Q = \left[q_x, q_y, q_z\right]^T$$

and combining the input vector I and the output vector Q to generate the feature vector G:

$$G = (I, Q)^T.$$
7. The sound source localization method based on a deep neural network according to claim 6, characterized in that the second preprocessing in step S6 comprises data cleaning, data shuffling, and data normalization.

8. The sound source localization method based on a deep neural network according to claim 7, characterized in that the cross-validation used in step S8 includes the leave-one-out method.

9. A sound source localization system based on a deep neural network, characterized in that it comprises:

a first acquisition module for acquiring the speech signals received by the microphones and generating a speech data set from them, the speech data set comprising a training data set and a test data set;

a first preprocessing module for performing first preprocessing on the speech signals in the generated speech data set;

a calculation module for computing the phase-weighted generalized cross-correlation function of the sound source signal corresponding to the preprocessed speech signals;

a second acquisition module for obtaining the time delay corresponding to the peak of the phase-weighted generalized cross-correlation function, taking the obtained time delay as the TDOA observation of the sound source signal arriving at the microphones, and obtaining the amplitude corresponding to the time delay;

a generation module for combining the TDOA observations and amplitudes as the input vector of the deep neural network, taking the three-dimensional position coordinates of the sound source signal as the output vector of the network, and combining the input and output vectors to generate a feature vector;

a second preprocessing module for performing second preprocessing on the generated feature vectors;

a training module for setting the parameters of the deep neural network and training it with the feature vectors of the training set to obtain a trained deep neural network.

10. The sound source localization system based on a deep neural network according to claim 9, characterized in that it further comprises:

a test module for feeding the input vectors of the test set into the trained deep neural network for prediction, obtaining the three-dimensional position coordinates of the sound source signal, and evaluating the performance of the deep neural network model by cross-validation.
CN202010050760.9A 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network Expired - Fee Related CN111239687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010050760.9A CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Publications (2)

Publication Number Publication Date
CN111239687A true CN111239687A (en) 2020-06-05
CN111239687B CN111239687B (en) 2021-12-14

Family

ID=70872716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010050760.9A Expired - Fee Related CN111239687B (en) 2020-01-17 2020-01-17 Sound source positioning method and system based on deep neural network

Country Status (1)

Country Link
CN (1) CN111239687B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103576126A (en) * 2012-07-27 2014-02-12 姜楠 Four-channel array sound source positioning system based on neural network
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN108318862A (en) * 2017-12-26 2018-07-24 北京大学 A kind of sound localization method based on neural network
CN109839612A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Sounnd source direction estimation method based on time-frequency masking and deep neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHARATH ADAVANNE et al.: "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks", IEEE Journal of Selected Topics in Signal Processing *
王义圆: "Research on target detection and signal enhancement technology based on microphone arrays" (基于麦克风阵列的目标探测与信号增强技术研究), China Master's Theses Full-text Database, Information Science and Technology *
祖丽楠 et al.: "Design of a generalized cross-correlation time delay estimation method based on neural network filtering" (一种基于神经网络滤波的广义互相关时延估计方法的设计), Control and Instruments in Chemical Industry *
黎长江 et al.: "Research on phoneme recognition based on recurrent neural networks" (基于循环神经网络的音素识别研究), Microelectronics & Computer *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN111965600A (en) * 2020-08-14 2020-11-20 长安大学 Indoor positioning method based on sound fingerprints in strong shielding environment
CN111981644A (en) * 2020-08-26 2020-11-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN111981644B (en) * 2020-08-26 2021-09-24 北京声智科技有限公司 Air conditioner control method and device and electronic equipment
CN112180318A (en) * 2020-09-28 2021-01-05 深圳大学 Sound source DOA estimation model training and sound source DOA estimation method
CN112180318B (en) * 2020-09-28 2023-06-27 深圳大学 Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN113111765A (en) * 2021-04-08 2021-07-13 浙江大学 Multi-voice source counting and positioning method based on deep learning
WO2022263710A1 (en) * 2021-06-17 2022-12-22 Nokia Technologies Oy Apparatus, methods and computer programs for obtaining spatial metadata
CN113589230A (en) * 2021-09-29 2021-11-02 广东省科学院智能制造研究所 Target sound source positioning method and system based on joint optimization network
CN114545332A (en) * 2022-02-18 2022-05-27 桂林电子科技大学 Arbitrary array sound source positioning method based on cross-correlation sequence and neural network
CN114545332B (en) * 2022-02-18 2024-05-03 桂林电子科技大学 Random array sound source positioning method based on cross-correlation sequence and neural network
CN115267671A (en) * 2022-06-29 2022-11-01 金茂云科技服务(北京)有限公司 Distributed voice interaction terminal equipment and sound source positioning method and device thereof
CN115980668A (en) * 2023-01-29 2023-04-18 桂林电子科技大学 Sound source localization method based on generalized cross correlation of wide neural network
CN116304639A (en) * 2023-05-05 2023-06-23 上海玫克生储能科技有限公司 Identification model generation method, identification system, identification device and identification medium

Also Published As

Publication number Publication date
CN111239687B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111239687B (en) Sound source positioning method and system based on deep neural network
Salvati et al. Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions
CN103901401B (en) A kind of binaural sound source of sound localization method based on ears matched filtering device
CN107172018A (en) The vocal print cryptosecurity control method and system of activation type under common background noise
CN105976827B (en) An Indoor Sound Source Localization Method Based on Ensemble Learning
WO2020024816A1 (en) Audio signal processing method and apparatus, device, and storage medium
Hu et al. Unsupervised multiple source localization using relative harmonic coefficients
CN113870893B (en) Multichannel double-speaker separation method and system
CN112180318B (en) Sound source direction of arrival estimation model training and sound source direction of arrival estimation method
CN113466793A (en) Sound source positioning method and device based on microphone array and storage medium
Jiang et al. Deep and CNN fusion method for binaural sound source localisation
CN113111765A (en) Multi-voice source counting and positioning method based on deep learning
Yang et al. SRP-DNN: Learning direct-path phase difference for multiple moving sound source localization
Zhang et al. AcousticFusion: Fusing sound source localization to visual SLAM in dynamic environments
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN112394324A (en) Microphone array-based remote sound source positioning method and system
CN113707136B (en) Audio and video mixed voice front-end processing method for voice interaction of service robot
Pertilä et al. Time difference of arrival estimation with Deep learning–from acoustic simulations to recorded data
CN112216301B (en) Deep clustering speech separation method based on logarithmic magnitude spectrum and interaural phase difference
CN116859336B (en) High-precision implementation method for sound source localization
Yang et al. A review of sound source localization research in Three-Dimensional space
Dwivedi et al. Learning based method for near field acoustic range estimation in spherical harmonics domain using intensity vectors
CN115038014B (en) Audio signal processing method and device, electronic equipment and storage medium
Hu et al. Evaluation and comparison of three source direction-of-arrival estimators using relative harmonic coefficients
Zhu et al. Speaker localization based on audio-visual bimodal fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211214