CN103811020B - An intelligent speech processing method - Google Patents

An intelligent speech processing method

Info

Publication number
CN103811020B
CN103811020B (application CN201410081493.6A)
Authority
CN
China
Prior art keywords
sound
sound source
microphone array
signal
sound pressure
Prior art date
Legal status
Active
Application number
CN201410081493.6A
Other languages
Chinese (zh)
Other versions
CN103811020A (en)
Inventor
王义
魏阳杰
陈瑶
关楠
Current Assignee
Northeastern University China
Original Assignee
Northeastern University China
Priority date
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201410081493.6A
Publication of CN103811020A
Application granted
Publication of CN103811020B

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an intelligent speech processing method in the field of information processing. By building a library of interlocutor voice models, the method intelligently identifies the identities of multiple interlocutors in a multi-speaker environment while separating the mixed speech into an independent voice for each interlocutor; according to the user's needs, it amplifies the voice of the interlocutor the user wants to hear and suppresses the voices of the others. Unlike a traditional hearing aid, the method automatically provides the user with the desired sound according to the user's personal needs, reducing interference from non-target voices beyond mere noise and making the approach personalized, interactive, and intelligent.

Description

An intelligent speech processing method

Technical Field

The invention belongs to the technical field of information processing, and in particular relates to an intelligent speech processing method.

Background Art

According to the latest assessment data released by the World Health Organization (WHO) in 2013, 360 million people worldwide have a hearing impairment of some degree, 5% of the global population. Hearing aid products can effectively compensate for the hearing loss of hearing-impaired patients and improve their quality of life and work. However, current research on hearing aid technology still concentrates on noise suppression and source amplitude amplification, and rarely involves modeling based on voice characteristics or automatic separation of multiple sound sources. In genuinely complex scenes, for example a party where several speakers talk at once over background sounds such as music, a hearing aid cannot isolate the sound object of interest from the mixed input; simple amplification only increases the user's listening burden, or even causes harm, without delivering intelligible input. It is therefore of great significance to address these technical shortcomings by designing a new, more intelligent and personalized hearing aid system capable of recognizing a specific sound object.

Summary of the Invention

To remedy the deficiencies of the prior art, the present invention proposes an intelligent speech processing method that lets users receive and amplify a clean version of the sound they want, making the hearing aid system intelligent, interactive, and personalized.

An intelligent speech processing method comprises the following steps:

Step 1. Collect sample speech segments to build a sample speech library, extract features from the sample speech to obtain feature parameters, and train the feature parameters.

The specific process is as follows:

Step 1-1. Collect sample speech segments, discretize them, extract the Mel-frequency cepstral coefficients (MFCCs) of the speech signal as its feature parameters, and establish a Gaussian mixture model.

The model formula is as follows:

p(X|G) = \sum_{i=1}^{I} p_i b_i(X) \quad (1)

where p(X|G) denotes the probability of the sample speech feature parameters X under the model with parameter set G;

G denotes the Gaussian mixture model parameter set, G = {p_i, μ_i, Σ_i}, i = 1, 2, ..., I;

I denotes the number of single Gaussian models in the mixture;

p_i denotes the weight coefficient of the i-th single Gaussian model, with \sum_{i=1}^{I} p_i = 1;

μ_i denotes the mean vector of the i-th single Gaussian model;

Σ_i denotes the covariance matrix of the i-th single Gaussian model;

X denotes the sample speech feature parameters, X = {x_1, x_2, ..., x_T}, where T is the number of feature vectors;

b_i(X) denotes the density function of the i-th single Gaussian model, b_i(X) = N(μ_i, Σ_i), where N(·) is the density function of a standard Gaussian distribution;

Step 1-2. Train the Gaussian mixture model with the speech signal feature parameters.

That is, the k-means clustering algorithm is used to cluster the speech feature parameters, giving the initial Gaussian mixture parameter set G^0 = {p_i^0, μ_i^0, Σ_i^0}, i = 1, 2, ..., I; starting from these initial values, the expectation-maximization (EM) algorithm estimates the model and yields the final Gaussian mixture parameters, which completes the training of the feature parameters.
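
As a concrete illustration of steps 1-1 and 1-2, the following is a minimal sketch assuming librosa for MFCC extraction and scikit-learn for the Gaussian mixture (k-means initialization followed by EM); the file name, the 13 MFCC dimensions, and the 16-component choice are assumptions drawn from or added to the embodiment below, not part of the claims.

```python
# Minimal sketch of steps 1-1 and 1-2; librosa and scikit-learn are assumed.
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_model(wav_path, n_components=16):
    signal, sr = librosa.load(wav_path, sr=None)                # discretized speech segment
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T   # T x 13 feature vectors X
    # k-means initialization of G0, then EM re-estimation (step 1-2)
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          init_params='kmeans', max_iter=200)
    gmm.fit(mfcc)
    return gmm  # parameter set G = {p_i, mu_i, Sigma_i}

# Hypothetical sample speech library: one trained model per interlocutor.
model_library = {"speaker1": train_speaker_model("speaker1_sample.wav")}
```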

Step 2. Use a microphone array of M microphones to collect the audio signal of the measured environment, and determine the number of sound sources in the environment and the direction of arrival (DOA) of each source's beam, i.e., the incident angle from the source to the array.

The specific process is as follows:

Step 2-1. Use the microphone array of M microphones to collect the mixed audio signal of the measured environment, and discretize the collected signal to obtain the amplitude at each sampling point.

Step 2-2. Arrange the amplitudes of the sampling points into a matrix, giving the mixed audio matrix collected by each microphone; this matrix has one column, as many rows as there are sampling points, and its elements are the amplitudes of the sampling points.

Step 2-3. From the mixed audio matrix collected by each microphone and the number of microphones, obtain the estimate of the vector covariance matrix of the mixed audio signal of the measured environment.

The estimate of the vector covariance matrix is given by:

R_{xx} = \frac{1}{M} \sum_{m=1}^{M} X(m) X^H(m) \quad (2)

where R_xx denotes the estimate of the vector covariance matrix of the mixed audio signal of the measured environment;

X(m) denotes the mixed audio matrix collected by the m-th microphone;

X^H(m) denotes the conjugate transpose of the mixed audio matrix collected by the m-th microphone;

Step 2-4. Perform eigenvalue decomposition on the covariance estimate, sort the eigenvalues from large to small, and count the eigenvalues greater than a threshold; this count is the number of sound sources.

Step 2-5. Subtract the number of sound sources from the number of microphones to obtain the number of noise sources, and correspondingly obtain the noise matrix.

Step 2-6. From the distance of each microphone to the array center, the wavelength of the mixed audio signal, each microphone's direction angle with respect to the array center, and the beam arrival direction of the sound source, obtain the steering vector of the microphone array; then obtain the angular spectrum function of the mixed audio signal from the noise matrix and the steering vector.

The angular spectrum function of the mixed audio signal is:

P(θ) = \frac{1}{α^H(θ) V_u V_u^H α(θ)} \quad (3)

where P(θ) denotes the angular spectrum function of the mixed audio signal;

α(θ) denotes the steering vector of the microphone array, α(θ) = (α_1(θ), ..., α_m(θ), ..., α_M(θ)), with α_m(θ) = e^{jk d_m \cos(φ_m − θ)}, where j is the imaginary unit, k = 2π/λ, λ is the wavelength of the mixed audio signal, d_m is the distance from the m-th microphone to the array center, and φ_m is the direction angle of the m-th microphone with respect to the array center;

θ denotes the beam arrival direction of the sound source;

α^H(θ) denotes the conjugate transpose of the steering vector of the microphone array;

V_u denotes the noise matrix;

V_u^H denotes the conjugate transpose of the noise matrix;

Step 2-7. From the waveform of the angular spectrum function, pick its peaks from largest to smallest; the number of selected peaks equals the number of sound sources.

Step 2-8. Determine the angle corresponding to each selected peak, which gives the beam arrival direction of each sound source.
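
Steps 2-3 through 2-8 amount to a MUSIC-style search; the sketch below, assuming numpy and scipy, shows one way it could look. The snapshot-averaged covariance, the 1° search grid, and the default threshold are illustrative assumptions rather than the patent's exact formulation.

```python
# MUSIC-style sketch of steps 2-3 to 2-8; numpy/scipy assumed. X is an M x K
# matrix of discretized samples, one row per microphone.
import numpy as np
from scipy.signal import find_peaks

def doa_music(X, mic_dist, mic_angles_deg, wavelength, eig_threshold=1e-7):
    M, K = X.shape
    Rxx = (X @ X.conj().T) / K                    # covariance estimate, cf. eq. (2)
    w, V = np.linalg.eigh(Rxx)                    # eigenvalues in ascending order
    n_src = int(np.sum(w > eig_threshold))        # step 2-4: count the sources
    Vu = V[:, :M - n_src]                         # step 2-5: noise subspace
    k = 2 * np.pi / wavelength
    phi = np.deg2rad(np.asarray(mic_angles_deg))
    spectrum = np.empty(360)
    for deg in range(360):                        # step 2-6: angular spectrum, eq. (3)
        a = np.exp(1j * k * mic_dist * np.cos(phi - np.deg2rad(deg)))
        spectrum[deg] = 1.0 / np.real(a.conj() @ Vu @ Vu.conj().T @ a)
    peaks, _ = find_peaks(spectrum)               # steps 2-7/2-8: peaks give the DOAs
    doas = sorted(peaks[np.argsort(spectrum[peaks])[-n_src:]])
    return n_src, doas
```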

Step 3. From the audio signal of each sound source and the conversion relation between sources and microphones, obtain the array sound pressure received by the microphones, the sound pressure gradient of the array in the horizontal direction, and the sound pressure gradient of the array in the vertical direction.

The microphone array sound pressure signal is:

p_w(t) = \sum_{n=1}^{N} 0.5 \sum_{m=1}^{M} h_{mn}(t) s_n(t) \quad (4)

where p_w(t) denotes the microphone array sound pressure at time t;

N denotes the number of sound sources;

t denotes time;

s_n(t) denotes the audio signal of the n-th sound source;

h_mn(t) denotes the conversion matrix between the n-th sound source and the m-th microphone, h_mn(t) = p_0(t) α_m(θ_n(t)), where p_0(t) denotes the sound pressure at the array center caused by the sound wave at time t, and α_m(θ_n(t)) denotes the steering vector of the m-th microphone with respect to the n-th sound source at time t, θ_n(t) being the beam arrival direction of the n-th sound source at time t;

The horizontal sound pressure gradient of the microphone array is given by equation (5), where p_x(t) denotes the sound pressure gradient of the array in the horizontal direction;

The vertical sound pressure gradient of the microphone array is given by equation (6), where p_y(t) denotes the sound pressure gradient of the array in the vertical direction;

Step 4. Use the Fourier transform to convert the array center sound pressure, the horizontal sound pressure gradient, and the vertical sound pressure gradient of the microphone array from the time domain to the frequency domain.

Step 5. From the array sound pressure, the horizontal gradient, and the vertical gradient in the frequency domain, obtain the intensity vector of the sound pressure signal in the frequency domain, and from it derive the intensity vector direction.

The intensity vector of the sound pressure signal in the frequency domain is:

I(ω,t) = \frac{1}{ρ_0 c} \left[ \mathrm{Re}\{p_w^*(ω,t) p_x(ω,t)\} u_x + \mathrm{Re}\{p_w^*(ω,t) p_y(ω,t)\} u_y \right] \quad (7)

where I(ω,t) denotes the intensity vector of the sound pressure signal in the frequency domain;

ρ_0 denotes the air density of the measured environment;

c denotes the speed of sound;

Re{·} denotes taking the real part of a complex number;

p_w^*(ω,t) denotes the conjugate of the array sound pressure in the frequency domain;

p_x(ω,t) denotes the horizontal sound pressure gradient of the array in the frequency domain;

p_y(ω,t) denotes the vertical sound pressure gradient of the array in the frequency domain;

u_x denotes the unit vector along the horizontal axis;

u_y denotes the unit vector along the vertical axis;

The intensity vector direction is:

γ(ω,t) = \tan^{-1} \left[ \frac{\mathrm{Re}\{p_w^*(ω,t) p_y(ω,t)\}}{\mathrm{Re}\{p_w^*(ω,t) p_x(ω,t)\}} \right] \quad (8)

where γ(ω,t) denotes the intensity vector direction of the sound pressure signal of the mixed sound received by the microphone array;
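
A short sketch of steps 4 and 5, assuming the pressure and gradient signals from step 3 are available as time-domain arrays and letting scipy's STFT stand in for the Fourier transform:

```python
# Sketch of steps 4 and 5: STFT of the array pressure and gradient signals,
# then the intensity vector direction of eq. (8). p_w, p_x, p_y are the
# time-domain signals from step 3; the window length is an assumption.
import numpy as np
from scipy.signal import stft

def intensity_direction(p_w, p_x, p_y, fs, nperseg=512):
    _, _, Pw = stft(p_w, fs=fs, nperseg=nperseg)
    _, _, Px = stft(p_x, fs=fs, nperseg=nperseg)
    _, _, Py = stft(p_y, fs=fs, nperseg=nperseg)
    # eq. (8): direction of the intensity vector in each (frequency, time) bin
    gamma = np.arctan2(np.real(np.conj(Pw) * Py),
                       np.real(np.conj(Pw) * Px))
    return gamma  # angles in (-pi, pi], one per bin
```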

Step 6. Collect statistics of the intensity vector directions to obtain their probability density distribution, fit it with a mixture of von Mises distributions to obtain the model parameters of the mixture that the speech intensity vector directions obey, and from these obtain the intensity vector direction function of each sound pressure signal.

The specific process is as follows:

Step 6-1. Collect statistics of the intensity vector directions to obtain their probability density distribution, and fit it with a mixture of von Mises distributions to obtain the parameter set of the mixture that the speech intensity vector directions obey.

The mixed von Mises distribution model is:

g(θ) = \sum_{n=1}^{N} α_n f(θ; k_n) \quad (10)

where g(θ) denotes the probability density of the mixed von Mises distribution;

θ denotes the direction angle of the mixed sound;

α_n denotes the weight of the intensity vector direction function of the sound pressure signal of the n-th sound source;

f(θ; k_n) = e^{k_n \cos(θ − θ_n)} / (2π I_0(k_n)), where I_0(k_n) denotes the zeroth-order modified Bessel function of the first kind for the n-th sound source, θ_n is the arrival direction of the n-th source, and k_n denotes the concentration parameter of the single von Mises distribution obeyed by the intensity vector direction of the n-th source's sound pressure signal, i.e., the reciprocal of the variance of the von Mises distribution;

The parameter set of the mixed von Mises distribution function is:

Γ = {α_n, k_n}, n = 1, ..., N \quad (11)

Step 6-2. Initialize the model parameters to obtain the initial function parameter set.

Step 6-3. Starting from the initial model parameters, estimate the parameters of the mixed von Mises distribution model with the expectation-maximization algorithm.

Step 6-4. From the estimated mixed von Mises model parameters, obtain the intensity vector direction function of each sound pressure signal.

The intensity vector direction function of the sound pressure signal is:

I_n(θ; ω, t) = α_n f(θ; k_n) \quad (12)

where I_n(θ; ω, t) denotes the intensity vector direction function of the n-th sound source;
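
The EM fit of steps 6-1 through 6-3 could be sketched as follows; holding the component mean directions at the DOAs found in step 2 is an assumption suggested by the embodiment, and the concentration update uses a standard closed-form approximation to the inverse of I_1/I_0 rather than the patent's derivative equations.

```python
# EM sketch of steps 6-1 to 6-3 for a mixture of von Mises distributions.
# mu holds the fixed mean directions (the DOAs from step 2, in radians).
import numpy as np
from scipy.special import i0

def fit_von_mises_mixture(gamma, mu, n_iter=100):
    theta = gamma.ravel()                       # all (frequency, time) direction samples
    N = len(mu)
    alpha = np.full(N, 1.0 / N)                 # initial weights (step 6-2)
    kappa = np.full(N, 5.0)                     # initial concentrations (step 6-2)
    for _ in range(n_iter):
        # E-step: responsibility of each source for each direction sample
        dens = np.stack([a * np.exp(k * np.cos(theta - m)) / (2 * np.pi * i0(k))
                         for a, k, m in zip(alpha, kappa, mu)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate weights and concentrations
        alpha = resp.mean(axis=1)
        r = np.array([np.sum(resp[n] * np.cos(theta - mu[n])) / resp[n].sum()
                      for n in range(N)])
        kappa = r * (2.0 - r**2) / (1.0 - r**2)  # approximate inverse of I1/I0
    return alpha, kappa
```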

Step 7. From the intensity vector direction function of each sound pressure signal and the array sound pressure, obtain each sound source's signal in the frequency domain, and convert each frequency-domain source signal to a time-domain source signal with the inverse Fourier transform.

The signal of each sound source in the frequency domain is:

\tilde{s}_n(ω,t) = p_w(ω,t) I_n(θ; ω, t) \quad (13)

where \tilde{s}_n(ω,t) denotes the frequency-domain signal of the n-th source obtained after separating the mixed speech;

the time-domain signal \tilde{s}_n(t) is then obtained from \tilde{s}_n(ω,t) by the inverse Fourier transform;
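
A sketch of step 7 under the same assumptions: Pw is the STFT of the array pressure p_w(t) and gamma the direction map from the earlier intensity_direction sketch, so the von Mises component of source n acts as a soft mask on the spectrogram.

```python
# Sketch of step 7: weight the array pressure spectrogram by each source's
# intensity-direction function (eq. 12) and resynthesize with the inverse STFT.
import numpy as np
from scipy.signal import istft
from scipy.special import i0

def separate_source(Pw, gamma, alpha_n, kappa_n, mu_n, fs, nperseg=512):
    mask = alpha_n * np.exp(kappa_n * np.cos(gamma - mu_n)) / (2 * np.pi * i0(kappa_n))
    S_n = Pw * mask                          # eq. (13): frequency-domain source signal
    _, s_n = istft(S_n, fs=fs, nperseg=nperseg)
    return s_n                               # time-domain separated signal
```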

Step 8. Compute the matching probability between each separated source signal and the specified sound source in the sample speech library, select the source with the largest probability as the target source, keep its signal, and discard the other, non-target sources.

The matching probability between each source signal and the specified source in the sample speech library is:

C(\tilde{X}_n) = \log [ P(\tilde{X}_n | G_c) ] \quad (14)

where \tilde{X}_n denotes the speech feature parameters extracted from the separated speech \tilde{s}_n(t), i.e., its Mel-frequency cepstral coefficients;

C(\tilde{X}_n) denotes the matching probability between the n-th source signal and the specified source in the sample speech library;

G_c denotes the voice model parameters of the person specified by the user;

P(\tilde{X}_n | G_c) denotes the probability that the separated speech belongs to the voice of the user-specified person;

Step 9. Amplify the retained source signal, which completes the amplification of the specified sound source in the measured environment.
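
Steps 8 and 9 could then be sketched as below, reusing the hypothetical helpers from the earlier sketches; the gain value is purely illustrative, since the patent does not specify an amplification factor.

```python
# Sketch of steps 8 and 9: score each separated signal against the
# user-specified speaker model G_c and amplify the best match.
import numpy as np
import librosa

def pick_and_amplify(separated, sr, target_gmm, gain=4.0):
    scores = []
    for s_n in separated:                        # eq. (14): log-likelihood match
        mfcc = librosa.feature.mfcc(y=s_n, sr=sr, n_mfcc=13).T
        scores.append(target_gmm.score(mfcc))    # average log P(X_n | G_c)
    best = int(np.argmax(scores))                # step 8: keep the target source
    return gain * separated[best]                # step 9: amplify (gain illustrative)
```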

The threshold in step 2-4 ranges from 10^{-2} to 10^{-16}.

In step 6-1, α_n is a random number in (0, 1) satisfying \sum_{n=1}^{N} α_n = 1, and k_n is a random number in [1, 700].

Advantages of the present invention:

The intelligent speech processing method of the present invention builds a library of interlocutor voice models, intelligently identifies the identities of multiple interlocutors in a multi-speaker environment while separating the mixed speech into an independent voice for each interlocutor, and, according to the user's needs, amplifies the voice of the interlocutor the user wants to hear while suppressing the voices of the others. Unlike a traditional hearing aid, the method automatically provides the user with the desired sound according to the user's personal needs, reducing interference from non-target voices beyond mere noise and making the approach personalized, interactive, and intelligent.

Brief Description of the Drawings

Fig. 1 is the flow chart of the intelligent speech processing method of an embodiment of the present invention;

Fig. 2 shows the sound source data used for modeling in an embodiment of the present invention, where panel (a) shows the voice data of the first person, panel (b) the voice data of the second person, and panel (c) the voice data of the third person;

Fig. 3 shows the sound source data used for sound mixing in an embodiment of the present invention, where panel (a) shows the data of the first sound source, panel (b) the data of the second sound source, and panel (c) the data of the third sound source;

Fig. 4 is a schematic diagram of the microphone array of an embodiment of the present invention;

Fig. 5 shows the data received by the four microphones in an embodiment of the present invention, where panels (a) through (d) show the mixed sound signals received by the first through fourth microphones, respectively;

Fig. 6 shows the data received by the four microphones after sampling, where panels (a) through (d) show the sampled mixed sound signals of the first through fourth microphones, respectively;

Fig. 7 is a schematic diagram of the spatial spectrum estimate of the mixed signal in an embodiment of the present invention;

Fig. 8 is the probability density plot of the direction distribution of the mixed sound vectors in an embodiment of the present invention;

Fig. 9 is a schematic diagram of the mixed von Mises model obtained by maximum likelihood estimation in an embodiment of the present invention;

Fig. 10 compares the ideal speech with the speech obtained after separation in an embodiment of the present invention, where panel (a) is the original signal of the first sound source, panel (b) the separated signal of the first source, panel (c) the original signal of the second source, panel (d) the separated signal of the second source, panel (e) the original signal of the third source, and panel (f) the separated signal of the third source.

Detailed Description

An embodiment of the present invention is further described below with reference to the accompanying drawings.

In this embodiment, the system consists of two modules: a speech modeling module and a dynamic real-time speech processing module. The speech modeling module builds the speaker voice models; the dynamic real-time module handles, in a complex speech environment, the direction localization and separation of mixed voices and the recognition and extraction of mixed speech (i.e., extraction and amplification of the target sound and suppression of the remaining sounds).

An intelligent speech processing method, whose flow chart is shown in Fig. 1, comprises the following steps:

Step 1. Collect sample speech segments to build a sample speech library, extract features from the sample speech to obtain feature parameters, and train the feature parameters. The specific process is as follows:

Step 1-1. Record sample speech segments in a quiet indoor environment, discretize the collected segments, extract the Mel-frequency cepstral coefficients (MFCCs) of the speech signal as its feature parameters, and establish a Gaussian mixture model.

In this embodiment, the voices of three people were recorded with the Windows built-in sound recorder, two segments per person: one segment for sound separation and recognition, the other for speaker voice modeling. The target sound source is set to sound source No. 1. As shown in panels (a) through (c) of Fig. 2, one speech segment is taken from each of the three people, a Gaussian mixture model is built for it, and the resulting model parameters are stored in the model library.

The model formula is as follows:

p(X|G) = \sum_{i=1}^{I} p_i b_i(X) \quad (1)

where p(X|G) denotes the probability of the sample speech feature parameters X under the model with parameter set G;

G denotes the Gaussian mixture model parameter set, G = {p_i, μ_i, Σ_i}, i = 1, 2, ..., I;

I denotes the number of single Gaussian models in the mixture;

p_i denotes the weight coefficient of the i-th single Gaussian model, with \sum_{i=1}^{I} p_i = 1;

μ_i denotes the mean vector of the i-th single Gaussian model;

Σ_i denotes the covariance matrix of the i-th single Gaussian model;

X denotes the sample speech feature parameters, X = {x_1, x_2, ..., x_T}, where T is the number of feature vectors;

b_i(X) denotes the density function of the i-th single Gaussian model, b_i(X) = N(μ_i, Σ_i), where N(·) is the density function of a standard Gaussian distribution;

Step 1-2. Train the Gaussian mixture model with the speech signal feature parameters.

That is, the k-means clustering algorithm is used to cluster the speech feature parameters, giving the initial Gaussian mixture parameter set G^0 = {p_i^0, μ_i^0, Σ_i^0}, i = 1, 2, ..., I.

In this example the Gaussian mixture model consists of 16 single Gaussian models. Sixteen vectors, each of length equal to the number of speech frames, are generated at random as cluster centers; the feature parameters of each frame are assigned to one of the 16 cluster centers by the minimum-distance criterion, and each cluster center vector is then recomputed as the center of its cluster, repeating until the algorithm converges. The resulting cluster centers are the initial mean parameters μ_i^0 of the Gaussian mixture model; the initial Σ_i^0 are obtained from the covariance of the feature parameters, and the p_i^0 are all initialized to 1/16.

The model is estimated with the expectation-maximization algorithm, whose principle is to maximize the probability of the observations: re-estimates of the parameters p_i, μ_i, Σ_i are computed by setting the derivatives of the model function with respect to p_i^0, μ_i^0, Σ_i^0 to zero, iterating until the algorithm converges, at which point the training of the feature parameters is complete.

Step 2. Use a microphone array of four microphones to collect the audio signal of the measured environment, and determine the number of sound sources and the beam arrival direction of each source, i.e., the incident angle from the source to the array. The specific process is as follows:

Step 2-1. Use the array of four microphones to collect the audio signal of the measured environment, and discretize the collected mixed signal to obtain the amplitude at each sampling point.

In this embodiment, as shown in panels (a) through (c) of Fig. 3, another speech segment from each of the three people serves as the sound data source for the mixed audio, and four microphones are used. The array formed by the four microphones is shown in Fig. 4: microphones No. 1 and No. 2 are placed symmetrically about the array center on the two sides of the horizontal axis, and microphones No. 3 and No. 4 symmetrically on the two sides of the vertical axis. The mixed data received by the four microphones are shown in panels (a) through (d) of Fig. 5; the speech received by the four microphones is discretized at a sampling frequency of 12500 Hz, and the amplitude of each sampling point is determined, as shown in panels (a) through (d) of Fig. 6.

Step 2-2. Arrange the amplitudes of the sampling points into a matrix, giving the mixed audio matrix collected by each microphone; this matrix has one column, as many rows as there are sampling points, and its elements are the amplitudes of the sampling points.

Step 2-3. From the mixed audio matrix collected by each microphone and the number of microphones, obtain the estimate of the vector covariance matrix of the mixed audio signal of the measured environment.

The estimate of the vector covariance matrix is given by:

R_{xx} = \frac{1}{4} \sum_{m=1}^{4} X(m) X^H(m) \quad (2)

where R_xx denotes the estimate of the vector covariance matrix of the mixed audio signal of the measured environment;

X(m) denotes the mixed audio matrix collected by the m-th microphone;

X^H(m) denotes the conjugate transpose of the mixed audio matrix collected by the m-th microphone;

Step 2-4. In this example, eigenvalue decomposition of the covariance estimate yields the eigenvalues [0.0000, 0.0190, 0.0363, 0.1128]; sorting them from large to small and comparing against the threshold 10^{-7} leaves 3 eigenvalues above the threshold, so the number of sound sources is 3.

Step 2-5. Subtract the number of sound sources from the number of microphones to obtain the number of noise sources, and correspondingly obtain the noise matrix.

In this embodiment, the eigenvalues and eigenvectors matching the 3 sound sources are regarded as the signal subspace, and the remaining 4 − 3 = 1 eigenvalue and eigenvector as the noise subspace, i.e., there is one noise source. From the elements corresponding to the noise eigenvalue, the noise matrix is

V_u = [−0.1218 − 0.4761i, −0.1564 + 0.4659i, −0.5070 − 0.0374i, −0.5084];

Step 2-6. From the distance of each microphone to the array center, the wavelength of the mixed audio signal, each microphone's direction angle with respect to the array center, and the beam arrival direction of the sound source, obtain the steering vector of the microphone array; then obtain the angular spectrum function of the mixed audio signal from the noise matrix and the steering vector.

As shown in Fig. 4, each microphone is 0.02 m from the array center; in this embodiment the wavelength of the mixed audio signal is 30000. The direction angle of microphone No. 1 with respect to the array center is 0°, that of microphone No. 2 is 180°, that of microphone No. 3 is 90°, and that of microphone No. 4 is 270°.

The angular spectrum function of the mixed audio signal is:

P(θ) = \frac{1}{α^H(θ) V_u V_u^H α(θ)} \quad (3)

where P(θ) denotes the angular spectrum function of the mixed audio signal;

α(θ) denotes the steering vector of the microphone array, α(θ) = (α_1(θ), α_2(θ), α_3(θ), α_4(θ)), where α_1(θ) = e^{jk·0.02·\cos(0° − θ)}, α_2(θ) = e^{jk·0.02·\cos(180° − θ)}, α_3(θ) = e^{jk·0.02·\cos(90° − θ)}, α_4(θ) = e^{jk·0.02·\cos(270° − θ)}, j denotes the imaginary unit, k = 2π/λ, and λ denotes the wavelength of the mixed audio signal;

θ denotes the beam arrival direction of the sound source;

α^H(θ) denotes the conjugate transpose of the steering vector of the microphone array;

V_u denotes the noise matrix;

V_u^H denotes the conjugate transpose of the noise matrix;

Step 2-7. From the waveform of the angular spectrum function, pick its peaks from largest to smallest; the number of selected peaks equals the number of sound sources.

Step 2-8. Determine the angle corresponding to each selected peak, which gives the beam arrival direction of each sound source.

Fig. 7 shows the waveform of the angular spectrum function P(θ) of the mixed audio signal; the beam arrival directions of the 3 sound sources present in the mixed sound are [50°, 200°, 300°].
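
Plugging this embodiment's geometry into the earlier doa_music sketch would look roughly as follows; X_mix stands for the hypothetical 4 × K sample matrix of Fig. 6.

```python
# Hypothetical call matching this embodiment: four microphones 0.02 m from the
# array center at 0, 180, 90, 270 degrees, eigenvalue threshold 10^-7.
n_src, doas = doa_music(X_mix, mic_dist=0.02,
                        mic_angles_deg=[0, 180, 90, 270],
                        wavelength=30000, eig_threshold=1e-7)
# Per the embodiment, this should report n_src == 3 and doas near [50, 200, 300].
```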

Step 3. From the audio signal of each sound source and the conversion relation between sources and microphones, obtain the array sound pressure received by the microphones, the sound pressure gradient of the array in the horizontal direction, and the sound pressure gradient of the array in the vertical direction.

The microphone array sound pressure is:

p_w(t) = \sum_{n=1}^{3} 0.5 \sum_{m=1}^{4} h_{mn}(t) s_n(t) \quad (4)

where p_w(t) denotes the microphone array sound pressure at time t;

N denotes the number of sound sources;

t denotes time;

s_n(t) denotes the audio signal of the n-th sound source;

h_mn(t) denotes the conversion matrix between the n-th sound source and the m-th microphone, h_mn(t) = p_0(t) α_m(θ_n(t)), where p_0(t) denotes the sound pressure at the array center caused by the sound wave at time t, and α_m(θ_n(t)) denotes the steering vector of the m-th microphone with respect to the n-th sound source at time t, θ_n(t) being the beam arrival direction of the n-th sound source at time t;

The horizontal sound pressure gradient of the microphone array is given by equation (5), where p_x(t) denotes the sound pressure gradient of the array in the horizontal direction;

The vertical sound pressure gradient of the microphone array is given by equation (6), where p_y(t) denotes the sound pressure gradient of the array in the vertical direction;

Step 4. Use the Fourier transform to convert the array center sound pressure, the horizontal sound pressure gradient, and the vertical sound pressure gradient of the microphone array from the time domain to the frequency domain.

Step 5. From the array sound pressure, the horizontal gradient, and the vertical gradient in the frequency domain, obtain the intensity vector of the sound pressure signal in the frequency domain, and from it the intensity vector direction.

The intensity vector of the sound pressure signal in the frequency domain is:

I(ω,t) = \frac{1}{ρ_0 c} \left[ \mathrm{Re}\{p_w^*(ω,t) p_x(ω,t)\} u_x + \mathrm{Re}\{p_w^*(ω,t) p_y(ω,t)\} u_y \right] \quad (7)

where I(ω,t) denotes the intensity vector of the sound pressure signal in the frequency domain;

ρ_0 denotes the air density of the measured environment;

c denotes the speed of sound;

Re{·} denotes taking the real part of a complex number;

p_w^*(ω,t) denotes the conjugate of the array sound pressure in the frequency domain;

p_x(ω,t) denotes the horizontal sound pressure gradient of the array in the frequency domain;

p_y(ω,t) denotes the vertical sound pressure gradient of the array in the frequency domain;

u_x denotes the unit vector along the horizontal axis;

u_y denotes the unit vector along the vertical axis;

The intensity vector direction is:

γ(ω,t) = \tan^{-1} \left[ \frac{\mathrm{Re}\{p_w^*(ω,t) p_y(ω,t)\}}{\mathrm{Re}\{p_w^*(ω,t) p_x(ω,t)\}} \right] \quad (8)

where γ(ω,t) denotes the intensity vector direction of the sound pressure signal of the mixed sound received by the microphone array;

Step 6. Collect statistics of the intensity vector directions to obtain their probability density distribution, fit it with a mixture of von Mises distributions to obtain the model parameters of the mixture that the speech intensity vector directions obey, and from these obtain the intensity vector direction function of each sound pressure signal. The specific process is as follows:

Step 6-1. Collect statistics of the intensity vector directions to obtain their probability density distribution, and fit it with a mixture of von Mises distributions to obtain the parameter set of the mixture that the speech intensity vector directions obey.

In this embodiment, Fig. 8 shows the probability density plot of the distribution of γ(ω,t). From the number of sound sources and the angles found above, the mixed von Mises distribution matching this probability density consists of 3 single von Mises distributions whose center angles are [50°, 200°, 300°].

The mixed von Mises distribution model is:

g(θ) = \sum_{n=1}^{N} α_n f(θ; k_n) \quad (10)

where g(θ) denotes the probability density of the mixed von Mises distribution;

θ denotes the direction angle of the mixed sound;

α_n denotes the weight of the intensity vector direction function of the sound pressure signal of the n-th sound source;

f(θ; k_n) = e^{k_n \cos(θ − θ_n)} / (2π I_0(k_n)), where I_0(k_n) denotes the zeroth-order modified Bessel function of the first kind for the n-th sound source, θ_n is the arrival direction of the n-th source, and k_n denotes the concentration parameter of the single von Mises distribution obeyed by the intensity vector direction of the n-th source's sound pressure signal, i.e., the reciprocal of the variance of the von Mises distribution;

The parameter set of the mixed von Mises distribution function is:

Γ = {α_n, k_n}, n = 1, 2, 3 \quad (11)

Step 6-2. Initialize the model parameters to obtain the initial function parameter set.

In this embodiment, α is initialized to [1/3, 1/3, 1/3] and k to [8, 6, 3].

Step 6-3. From the initial model parameters, establish the initial mixed von Mises distribution function.

The parameters of the mixed von Mises distribution model are then estimated with the expectation-maximization algorithm, whose principle is to maximize the probability of the observations; re-estimates of the parameters α and k are computed by setting the derivatives of the model function with respect to α and k to zero.

Substituting γ(ω,t) into the mixture density and taking the logarithm gives an initial log-likelihood of −3.0249e+004. Computing the proportion of each current single von Mises distribution within the mixture gives the re-estimated α parameters [0.2267, 0.2817, 0.4516], while the derivative-based update for k gives re-estimated values [5.1498, 4.0061, 3.1277]. The new log-likelihood is then −2.9887e+004; the difference between the new and old likelihoods, 362.3362, is far greater than the chosen threshold 0.1, so the new likelihood replaces the old one and the step is repeated with the two newly re-estimated parameter sets until the change in likelihood falls below the threshold, at which point the algorithm is considered to have converged. In this example the final α parameters are [0.2689, 0.2811, 0.4500] and the final k values are [4.3508, 3.3601, 2.8332], giving the mixed von Mises distribution function that fits the intensity vector direction distribution, shown in Fig. 9.
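
The embodiment's stopping rule, iterating until the log-likelihood improves by less than 0.1, could replace the fixed iteration count in the earlier fit_von_mises_mixture sketch; a hypothetical variant:

```python
# Hypothetical convergence test matching the embodiment's threshold of 0.1:
# iterate the EM updates until the log-likelihood improves by less than 0.1.
import numpy as np
from scipy.special import i0

def log_likelihood(theta, alpha, kappa, mu):
    dens = sum(a * np.exp(k * np.cos(theta - m)) / (2 * np.pi * i0(k))
               for a, k, m in zip(alpha, kappa, mu))
    return np.sum(np.log(dens))

# Inside the EM loop of fit_von_mises_mixture, after each M-step:
#     ll_new = log_likelihood(theta, alpha, kappa, mu)
#     if abs(ll_new - ll_old) < 0.1:
#         break            # converged, as in the embodiment
#     ll_old = ll_new
```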

Step 6-4. From the estimated mixed von Mises model parameters, obtain the intensity vector direction function of each sound pressure signal.

The intensity vector direction function of the sound pressure signal is:

I_n(θ; ω, t) = α_n f(θ; k_n) \quad (12)

where I_n(θ; ω, t) denotes the intensity vector direction function of the n-th sound source;

Step 7. From the intensity vector direction function of each sound pressure signal and the array sound pressure, obtain each sound source's signal in the frequency domain, and convert each frequency-domain source signal to a time-domain source signal with the inverse Fourier transform.

The signal of each sound source in the frequency domain is:

\tilde{s}_n(ω,t) = p_w(ω,t) I_n(θ; ω, t) \quad (13)

where \tilde{s}_n(ω,t) denotes the frequency-domain signal of the n-th source obtained after separating the mixed speech;

the time-domain signal \tilde{s}_n(t) is then obtained from \tilde{s}_n(ω,t) by the inverse Fourier transform;

Step 8. Compute the matching probability between each separated source signal and the specified sound source in the sample speech library; the source with the largest probability is taken as the target source, its signal is retained, and the other, non-target sources are discarded.

In this embodiment, with the first person assumed to be the target sound source, the log matching probabilities between the three finally separated voices and the target voice model are [−2.0850, −2.8807, −3.5084] × 10^4; the best match is separated voice No. 1, i.e., the target sound source has been found.

The matching probability between each source signal and the specified source in the sample speech library is:

C(\tilde{X}_n) = \log [ P(\tilde{X}_n | G_c) ] \quad (14)

where \tilde{X}_n denotes the speech feature parameters extracted from the separated speech \tilde{s}_n(t), i.e., its Mel-frequency cepstral coefficients;

C(\tilde{X}_n) denotes the matching probability between the n-th source signal and the specified source in the sample speech library;

G_c denotes the voice model parameters of the person specified by the user;

P(\tilde{X}_n | G_c) denotes the probability that the separated speech belongs to the voice of the user-specified person;

Step 9. Amplify the retained source signal, which completes the amplification of the specified sound source in the measured environment.

In this embodiment, the direction function of each sound source is finally obtained from the estimated mixed von Mises distribution parameters, and the original sounds are then separated out. Panels (a) through (f) of Fig. 10 compare the ideal data with the data obtained after separation; the similarity is extremely high.

Claims (3)

1. An intelligent speech processing method, comprising the steps of:
step 1, collecting sample voice sections to construct a sample voice library, performing feature extraction on the sample voice to obtain feature parameters, and training the feature parameters;
the specific process is as follows:
step 1-1, collecting sample voice sections, carrying out discretization processing on the collected voice sections, extracting Mel frequency cepstrum coefficients of voice signals as voice signal characteristic parameters, and establishing a Gaussian mixture model;
the model formula is as follows:
p(X|G) = \sum_{i=1}^{I} p_i b_i(X) \quad (1)
wherein p(X|G) represents the probability of the sample speech feature parameters in the model with model parameter set G;
G denotes the Gaussian mixture model parameter set, G = {p_i, μ_i, Σ_i}, i = 1, 2, ..., I;
I represents the number of single Gaussian models in the Gaussian mixture model;
p_i represents the weighting coefficient of the i-th single Gaussian model, with \sum_{i=1}^{I} p_i = 1;
μ_i represents the mean vector of the i-th single Gaussian model;
Σ_i represents the covariance matrix of the i-th single Gaussian model;
X represents the sample speech feature parameters, X = {x_1, x_2, ..., x_T}, where T represents the number of feature vectors;
b_i(X) represents the density function of the i-th single Gaussian model, b_i(X) = N(μ_i, Σ_i), where N(·) represents the density function of a standard Gaussian distribution;
step 1-2, training the Gaussian mixture model by using the speech signal feature parameters;
that is, a k-means clustering algorithm is adopted to cluster the speech signal feature parameters to obtain an initial value G^0 = {p_i^0, μ_i^0, Σ_i^0}, i = 1, 2, ..., I, of the Gaussian mixture model parameter set; the model is estimated by an expectation-maximization algorithm according to the obtained initial values, thereby obtaining the Gaussian mixture model parameters, namely completing the training of the feature parameters;
step 2, collecting the audio signal of the detected environment by adopting a microphone array consisting of M microphones, and determining the number of the environmental sound sources and the arrival direction of each sound source beam, namely the incident angle from the sound source to the microphone array;
the specific process is as follows:
step 2-1, collecting mixed audio signals of the tested environment by using a microphone array consisting of M microphones, and carrying out discretization processing on the collected mixed audio signals to obtain the amplitude of each sampling point;
step 2-2, performing matrixing on the amplitude of each sampling point to obtain a mixed audio matrix collected by each microphone; the number of columns of the mixed audio matrix is one, the number of rows is the number of sampling points, and elements in the matrix are the amplitude of each sampling point;
step 2-3, obtaining an estimated value of a vector covariance matrix of the mixed audio signal of the measured environment according to the mixed audio matrix collected by each microphone and the number of the microphones;
the estimated value of the vector covariance matrix is expressed as follows:
R_{xx} = \frac{1}{M} \sum_{m=1}^{M} X(m) X^H(m) \quad (2)
wherein R_xx represents an estimate of the vector covariance matrix of the mixed audio signal of the measured environment;
X(m) represents the mixed audio matrix collected by the m-th microphone;
X^H(m) represents the conjugate transpose of the mixed audio matrix collected by the m-th microphone;
step 2-4, performing eigenvalue decomposition on the estimated value of the vector covariance matrix to obtain eigenvalues, sequencing the eigenvalues from large to small, and determining the number of the eigenvalues larger than a threshold, namely the number of the sound sources;
2-5, subtracting the number of the sound sources from the number of the microphones to obtain the number of the noise sources, and further correspondingly obtaining a noise matrix;
step 2-6, obtaining a steering vector of the microphone array according to the distance between each microphone and the array center, the wavelength of the mixed audio signal, the direction angle of the microphone to the array center and the arrival direction of the sound source beam, and obtaining an angle spectrum function of the mixed audio signal according to the noise matrix and the steering vector of the microphone array;
the angle spectral function formula of the mixed audio signal is as follows:
$P(\theta) = \frac{1}{\alpha^{H}(\theta)\,V_u V_u^{H}\,\alpha(\theta)}$    (3)
where P(θ) represents the angular spectrum function of the mixed audio signal;
α(θ) represents the steering vector of the microphone array, α(θ) = (α_1(θ), ..., α_m(θ), ..., α_M(θ)), with $\alpha_m(\theta) = e^{\,jkd_m\cos(\theta-\phi_m)}$, where j denotes the imaginary unit, k = 2π/λ, λ denotes the wavelength of the mixed audio signal, d_m denotes the distance of the m-th microphone from the array center, and φ_m denotes the direction angle of the m-th microphone relative to the array center;
θ represents the beam arrival direction of a sound source;
α^H(θ) represents the conjugate transpose of the steering vector of the microphone array;
V_u represents the noise matrix;
V_u^H represents the conjugate transpose of the noise matrix;
step 2-7, selecting peaks of the angular spectrum function waveform in descending order of magnitude, the number of selected peaks being equal to the number of sound sources;
step 2-8, determining an angle value corresponding to the selected peak value, namely obtaining the arrival direction of the wave beam of each sound source;
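A compact sketch tying steps 2-3 through 2-8 together for a circular array, assuming the steering-vector element reconstructed above. It uses the common snapshot-averaged covariance estimate (which differs in normalisation from formula (2)); music_doa, eps and the half-degree search grid are illustrative choices rather than claim requirements.

```python
import numpy as np
from scipy.signal import find_peaks

def music_doa(X, d, phi, wavelength, eps=1e-9):
    """X: (M, T) complex array of microphone signals; d, phi: per-mic
    distance and direction angle to the array centre; returns the source
    count and their beam arrival directions in radians."""
    M, T = X.shape
    Rxx = (X @ X.conj().T) / T                      # covariance estimate (cf. eq. 2)
    w, V = np.linalg.eigh(Rxx)                      # eigenvalues, ascending order
    n_src = int(np.sum(w > eps * w.max()))          # step 2-4: eigenvalues above threshold
    Vu = V[:, : M - n_src]                          # step 2-5: noise subspace
    k = 2 * np.pi / wavelength
    grid = np.deg2rad(np.arange(0.0, 360.0, 0.5))   # candidate beam directions
    P = np.empty(grid.size)
    for i, th in enumerate(grid):
        a = np.exp(1j * k * d * np.cos(th - phi))   # steering vector alpha(theta)
        P[i] = 1.0 / np.real(a.conj() @ Vu @ Vu.conj().T @ a)   # eq. (3)
    idx, _ = find_peaks(P)                          # step 2-7: spectrum peaks
    top = idx[np.argsort(P[idx])[-n_src:]]          # keep the n_src largest peaks
    return n_src, np.sort(grid[top])                # step 2-8: their angles
```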
step 3, obtaining the sound pressure received by the microphone array, the sound pressure gradient in the horizontal direction of the array and the sound pressure gradient in the vertical direction of the array, from the audio signal of each sound source and the conversion relation between each sound source and each microphone;
the microphone array sound pressure signal formula is as follows:
$p_w(t) = \sum_{n=1}^{N} 0.5 \sum_{m=1}^{M} h_{mn}(t)\,s_n(t)$    (4)
where p_w(t) represents the microphone array sound pressure at time t;
N represents the number of sound sources;
t represents time;
s_n(t) represents the audio signal of the n-th sound source;
h_mn(t) denotes the conversion matrix between the n-th sound source and the m-th microphone, h_mn(t) = p_0(t) α_m(θ_n(t)), where p_0(t) represents the sound pressure at the center of the microphone array caused by the sound wave at time t, α_m(θ_n(t)) represents the steering vector of the m-th microphone with respect to the n-th sound source at time t, and θ_n(t) represents the beam arrival direction of the n-th sound source at time t;
the sound pressure gradient in the horizontal direction of the microphone array is given by formula (5), where p_x(t) represents the sound pressure gradient in the horizontal direction of the microphone array;
the sound pressure gradient in the vertical direction of the microphone array is given by formula (6), where p_y(t) represents the sound pressure gradient in the vertical direction of the microphone array;
step 4, converting the central sound pressure of the microphone array, the sound pressure gradient in the horizontal direction of the microphone array and the sound pressure gradient in the vertical direction of the microphone array from a time domain to a frequency domain by adopting Fourier transform;
step 5, obtaining the intensity vector of the sound pressure signal in the frequency domain from the microphone array sound pressure, the horizontal-direction sound pressure gradient and the vertical-direction sound pressure gradient in the frequency domain, and further deriving the intensity vector direction;
the intensity vector formula of the sound pressure signal in the frequency domain is:
$I(\omega,t) = \frac{1}{\rho_0 c}\left[\operatorname{Re}\{p_w^{*}(\omega,t)\,p_x(\omega,t)\}\,u_x + \operatorname{Re}\{p_w^{*}(\omega,t)\,p_y(\omega,t)\}\,u_y\right]$    (7)
where I(ω, t) represents the intensity vector of the sound pressure signal in the frequency domain;
ρ_0 represents the air density of the tested environment;
c represents the speed of sound;
Re{·} denotes taking the real part of a complex number;
p_w^*(ω, t) represents the complex conjugate of the microphone array sound pressure in the frequency domain;
p_x(ω, t) represents the horizontal-direction sound pressure gradient of the microphone array in the frequency domain;
p_y(ω, t) represents the vertical-direction sound pressure gradient of the microphone array in the frequency domain;
u_x represents the unit vector in the direction of the abscissa axis;
u_y represents the unit vector in the direction of the ordinate axis;
the intensity vector direction formula is as follows:
$\gamma(\omega,t) = \tan^{-1}\!\left[\frac{\operatorname{Re}\{p_w^{*}(\omega,t)\,p_y(\omega,t)\}}{\operatorname{Re}\{p_w^{*}(\omega,t)\,p_x(\omega,t)\}}\right]$    (8)
wherein γ (ω, t) represents the intensity vector direction of the sound pressure signal of the mixed sound received by the microphone array;
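Steps 4 and 5 reduce to an STFT followed by a per-bin evaluation of formulas (7)-(8). In the sketch below, ρ_0 and c are omitted because they only scale the intensity vector and cancel out of the direction; np.arctan2 is used as the quadrant-safe form of tan⁻¹. Function and parameter names are illustrative.

```python
import numpy as np
from scipy.signal import stft

def intensity_direction(pw, px, py, fs, nperseg=1024):
    """Return gamma(omega, t) of eq. (8) from the time-domain signals
    p_w(t), p_x(t), p_y(t) sampled at rate fs."""
    _, _, Pw = stft(pw, fs=fs, nperseg=nperseg)   # step 4: time -> frequency domain
    _, _, Px = stft(px, fs=fs, nperseg=nperseg)
    _, _, Py = stft(py, fs=fs, nperseg=nperseg)
    Ix = np.real(np.conj(Pw) * Px)                # eq. (7), x component (up to 1/(rho0*c))
    Iy = np.real(np.conj(Pw) * Py)                # eq. (7), y component
    return np.arctan2(Iy, Ix)                     # eq. (8), direction per (omega, t) bin
```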
step 6, forming statistics of the intensity vector directions to obtain their probability density distribution, fitting it with a mixed von Mises distribution to obtain the model parameters of the mixed von Mises distribution obeyed by the voice intensity vector direction, and further obtaining the intensity vector direction function of each sound pressure signal;
the specific process is as follows:
step 6-1, forming statistics of the intensity vector directions to obtain the probability density distribution of the intensity vector direction, and fitting it with a mixed von Mises distribution to obtain the model parameter set of the mixed von Mises distribution obeyed by the intensity vector direction of the voice;
the mixed von Mises distribution model is as follows:
$p(\gamma;\Phi) = \sum_{n=1}^{N} \alpha_n\, f(\gamma;\theta_n,k_n)$    (9)
where p(γ; Φ) represents the mixed von Mises distribution probability density;
γ represents the mixed sound direction angle;
α_n represents the weight of the intensity vector direction function of the sound pressure signal of the n-th sound source;
f(γ; θ_n, k_n) represents the single von Mises density of the n-th component, whose mean direction θ_n is the beam arrival direction of the n-th sound source obtained in step 2:
$f(\gamma;\theta_n,k_n) = \frac{e^{\,k_n\cos(\gamma-\theta_n)}}{2\pi I_0(k_n)}$    (10)
where I_0(k_n) represents the modified Bessel function of the first kind of order zero evaluated for the n-th sound source, and k_n represents the concentration parameter of the single von Mises distribution obeyed by the intensity vector direction of the n-th sound source's sound pressure signal, namely the reciprocal of the variance of the von Mises distribution;
the mixed von Mises distribution function parameter set is as follows:
$\Phi = \{\alpha_n, k_n\},\quad n = 1, \ldots, N$    (11)
step 6-2, initializing the model parameters to obtain an initial function parameter set;
step 6-3, estimating the parameters of the mixed von Mises distribution model with the expectation-maximization algorithm, starting from the obtained initial model parameters (see the sketch after formula (12));
step 6-4, solving the intensity vector direction function of each sound pressure signal from the estimated mixed von Mises distribution model parameters;
the intensity vector direction function of the sound pressure signal is as follows:
$f_n(\gamma) = \frac{\alpha_n\, f(\gamma;\theta_n,k_n)}{\sum_{n'=1}^{N} \alpha_{n'}\, f(\gamma;\theta_{n'},k_{n'})}$    (12)
where f_n(γ) represents the intensity vector direction function of the n-th sound source;
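An expectation-maximization sketch for steps 6-2 to 6-4, assuming (as the parameter set (11) suggests) that the component mean directions stay fixed at the DOAs θ_n from step 2, so only α_n and k_n are re-estimated. The closed-form concentration update uses a standard approximation to the inverse of I₁(k)/I₀(k); the claim does not specify this detail.

```python
import numpy as np
from scipy.special import i0e

def fit_von_mises_mixture(gamma, thetas, alpha0, kappa0, n_iter=50):
    """gamma: observed intensity-vector directions (any shape); thetas: the
    fixed mean directions (DOAs from step 2); alpha0, kappa0: initial
    values, e.g. drawn as in claim 3."""
    g = np.ravel(gamma)[:, None]                     # (S, 1) observations
    mu = np.asarray(thetas)[None, :]                 # (1, N) fixed means
    alpha = np.asarray(alpha0, dtype=float)
    kappa = np.asarray(kappa0, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibilities; e^{k(cos-1)} / (2*pi*i0e(k)) is the
        # von Mises density written in an overflow-safe form.
        pdf = np.exp(kappa * (np.cos(g - mu) - 1.0)) / (2 * np.pi * i0e(kappa))
        resp = alpha * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, then concentrations from the mean
        # resultant length via an approximate inverse of I1(k)/I0(k).
        alpha = resp.mean(axis=0)
        r = (resp * np.cos(g - mu)).sum(axis=0) / resp.sum(axis=0)
        kappa = np.clip(r * (2 - r**2) / (1 - r**2 + 1e-12), 1e-3, 700.0)
    return alpha, kappa
```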
step 7, obtaining the signal of each sound source in the frequency domain from the obtained intensity vector direction function of each sound pressure signal and the microphone array sound pressure, and converting each frequency-domain sound source signal into a time-domain sound source signal by the inverse Fourier transform;
the signal of each sound source in the frequency domain is as follows:
$\tilde{S}_n(\omega,t) = f_n\big(\gamma(\omega,t)\big)\,p_w(\omega,t)$    (13)
where S̃_n(ω, t) represents the frequency-domain signal of the n-th sound source obtained after separation of the mixed speech;
the time-domain signal s̃_n(t) is then obtained from S̃_n(ω, t) by the inverse Fourier transform;
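A sketch of step 7 under the soft-mask reading of formulas (12)-(13) above: each time-frequency bin of the array pressure is apportioned to the source whose fitted von Mises component best explains its direction, then taken back to the time domain. The normalisation guarantees the separated signals sum back to p_w(ω, t).

```python
import numpy as np
from scipy.signal import istft
from scipy.special import i0e

def separate_sources(Pw, gamma, alpha, kappa, thetas, fs, nperseg=1024):
    """Pw: (F, T) STFT of p_w; gamma: (F, T) directions from eq. (8);
    alpha, kappa, thetas: fitted mixture parameters and DOAs."""
    comps = [a * np.exp(k * (np.cos(gamma - th) - 1.0)) / (2 * np.pi * i0e(k))
             for a, k, th in zip(alpha, kappa, thetas)]   # weighted components
    total = np.sum(comps, axis=0) + 1e-12
    sources = []
    for comp in comps:
        mask = comp / total                               # f_n(gamma(w,t)), eq. (12)
        _, s_n = istft(mask * Pw, fs=fs, nperseg=nperseg) # eq. (13) + inverse transform
        sources.append(s_n)
    return sources
```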
step 8, calculating the matching probability of each separated sound source signal with the designated sound source in the sample sound library, selecting the sound source with the maximum probability value as the target sound source, retaining that sound source signal, and deleting the other, non-target sound sources;
the matching probability of each sound source signal with the designated sound source in the sample voice library is given by:
$C(\tilde{X}_n) = \log\big[P(\tilde{X}_n \mid G_c)\big]$    (14)
where X̃_n represents the speech feature parameters extracted from the separated speech s̃_n(t), i.e. the Mel-frequency cepstrum coefficient (MFCC) features of s̃_n(t);
C(X̃_n) represents the matching probability of the n-th sound source signal with the designated sound source in the sample voice library;
G_c represents the acoustic model parameters of the user-specified person;
P(X̃_n | G_c) represents the probability that the separated speech belongs to the voice of the user-specified person;
step 9, amplifying the retained sound source signal, thereby completing the amplification of the designated sound source in the tested environment.
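Steps 8 and 9 amount to scoring each separated signal against the designated speaker's GMM from step 1-2 and amplifying the winner. In this sketch, librosa's MFCC extractor stands in for the claim's Mel-frequency cepstrum coefficients, GaussianMixture.score supplies the average log-likelihood of formula (14), and the gain value is an illustrative placeholder.

```python
import numpy as np
import librosa

def pick_and_amplify(sources, fs, gmm_c, gain=4.0):
    """sources: separated time-domain signals from step 7; gmm_c: the model
    G_c of the user-specified person (e.g. a fitted GaussianMixture)."""
    scores = []
    for s in sources:
        mfcc = librosa.feature.mfcc(y=np.asarray(s, dtype=float),
                                    sr=fs, n_mfcc=13).T   # X~n: (frames, 13)
        scores.append(gmm_c.score(mfcc))                  # mean log P(X~n | G_c), eq. (14)
    n_target = int(np.argmax(scores))                     # step 8: best-matching source
    return gain * sources[n_target]                       # step 9: amplify, drop the rest
```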
2. The intelligent speech processing method according to claim 1, wherein the threshold in step 2-4 ranges from 10^-2 to 10^-16.
3. The intelligent speech processing method according to claim 1, wherein in step 6-1 each α_n takes a random number within 0-1 satisfying $\sum_{n=1}^{N} \alpha_n = 1$, and each k_n takes a random number within 1-700.
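A possible initialisation routine consistent with claim 3 (the function name and the Dirichlet draw are illustrative assumptions): weights in 0-1 normalised to sum to 1, and concentrations uniform on 1-700.

```python
import numpy as np

def init_params(n_sources, seed=None):
    rng = np.random.default_rng(seed)
    alpha0 = rng.dirichlet(np.ones(n_sources))     # alpha_n in (0, 1), summing to 1
    kappa0 = rng.uniform(1.0, 700.0, n_sources)    # k_n drawn from [1, 700)
    return alpha0, kappa0
```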
CN201410081493.6A 2014-03-05 2014-03-05 A kind of intelligent sound processing method Active CN103811020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410081493.6A CN103811020B (en) 2014-03-05 2014-03-05 A kind of intelligent sound processing method


Publications (2)

Publication Number Publication Date
CN103811020A CN103811020A (en) 2014-05-21
CN103811020B true CN103811020B (en) 2016-06-22

Family

ID=50707692


Country Status (1)

Country Link
CN (1) CN103811020B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104200813B (en) * 2014-07-01 2017-05-10 东北大学 Dynamic blind signal separation method based on real-time prediction and tracking on sound source direction
CN105609099A (en) * 2015-12-25 2016-05-25 重庆邮电大学 Speech recognition pretreatment method based on human auditory characteristic
CN105933820A (en) * 2016-04-28 2016-09-07 冠捷显示科技(中国)有限公司 Automatic positioning method of external wireless sound boxes
CN106205610B (en) * 2016-06-29 2019-11-26 联想(北京)有限公司 A kind of voice information identification method and equipment
CN106128472A (en) * 2016-07-12 2016-11-16 乐视控股(北京)有限公司 The processing method and processing device of singer's sound
CN106448722B (en) * 2016-09-14 2019-01-18 讯飞智元信息科技有限公司 The way of recording, device and system
CN108630193B (en) * 2017-03-21 2020-10-02 北京嘀嘀无限科技发展有限公司 Voice recognition method and device
CN107220021B (en) * 2017-05-16 2021-03-23 北京小鸟看看科技有限公司 Voice input recognition method and device and head-mounted equipment
CN107274895B (en) * 2017-08-18 2020-04-17 京东方科技集团股份有限公司 Voice recognition device and method
CN107527626A (en) * 2017-08-30 2017-12-29 北京嘉楠捷思信息技术有限公司 Audio identification system
CN108198569B (en) * 2017-12-28 2021-07-16 北京搜狗科技发展有限公司 Audio processing method, device and equipment and readable storage medium
CN108520756B (en) * 2018-03-20 2020-09-01 北京时代拓灵科技有限公司 Method and device for separating speaker voice
CN110310642B (en) * 2018-03-20 2023-12-26 阿里巴巴集团控股有限公司 Voice processing method, system, client, equipment and storage medium
CN108694950B (en) * 2018-05-16 2021-10-01 清华大学 A Speaker Confirmation Method Based on Deep Mixture Model
CN108766459B (en) * 2018-06-13 2020-07-17 北京联合大学 Target speaker estimation method and system in multi-user voice mixing
CN108735227B (en) * 2018-06-22 2020-05-19 北京三听科技有限公司 Method and system for separating sound source of voice signal picked up by microphone array
CN110867191B (en) * 2018-08-28 2024-06-25 洞见未来科技股份有限公司 Speech processing method, information device and computer program product
CN109505741B (en) * 2018-12-20 2020-07-10 浙江大学 A method and device for detecting damaged blades of a wind turbine based on a rectangular microphone array
CN110335626A (en) * 2019-07-09 2019-10-15 北京字节跳动网络技术有限公司 Age recognition methods and device, storage medium based on audio
CN110288996A (en) * 2019-07-22 2019-09-27 厦门钛尚人工智能科技有限公司 A kind of speech recognition equipment and audio recognition method
CN112289335B (en) * 2019-07-24 2024-11-12 阿里巴巴集团控股有限公司 Voice signal processing method, device and sound pickup device
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
GB2586126A (en) * 2019-08-02 2021-02-10 Nokia Technologies Oy MASA with embedded near-far stereo for mobile devices
CN110706688B (en) * 2019-11-11 2022-06-17 广州国音智能科技有限公司 Construction method, system, terminal and readable storage medium of speech recognition model
CN111028857B (en) * 2019-12-27 2024-01-19 宁波蛙声科技有限公司 Method and system for reducing noise of multichannel audio-video conference based on deep learning
CN111816185A (en) * 2020-07-07 2020-10-23 广东工业大学 A method and device for identifying speakers in mixed speech
CN111696570B (en) * 2020-08-17 2020-11-24 北京声智科技有限公司 Voice signal processing method, device, equipment and storage medium
CN111899756B (en) * 2020-09-29 2021-04-09 北京清微智能科技有限公司 Single-channel voice separation method and device
CN114093382A (en) * 2021-11-23 2022-02-25 广东电网有限责任公司 Intelligent interaction method suitable for voice information
CN114242072A (en) * 2021-12-21 2022-03-25 上海帝图信息科技有限公司 A speech recognition system for intelligent robots
CN114613385A (en) * 2022-05-07 2022-06-10 广州易而达科技股份有限公司 Far-field voice noise reduction method, cloud server and audio acquisition equipment
CN115240689B (en) * 2022-09-15 2022-12-02 深圳市水世界信息有限公司 Target sound determination method, target sound determination device, computer equipment and medium
CN118574049B (en) * 2024-08-01 2024-11-08 罗普特科技集团股份有限公司 Microphone calibration method and system of multi-mode intelligent terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1653519A (en) * 2002-03-20 2005-08-10 高通股份有限公司 Method for robust voice recognition by analyzing redundant features of source signal
JP2012211768A (en) * 2011-03-30 2012-11-01 Advanced Telecommunication Research Institute International Sound source positioning apparatus
CN103426434A (en) * 2012-05-04 2013-12-04 索尼电脑娱乐公司 Source separation by independent component analysis in conjunction with source direction information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8249867B2 (en) * 2007-12-11 2012-08-21 Electronics And Telecommunications Research Institute Microphone array based speech recognition system and target speech extracting method of the system



Similar Documents

Publication Publication Date Title
CN103811020B (en) A kind of intelligent sound processing method
CN109830245B (en) A method and system for multi-speaker speech separation based on beamforming
Zhang et al. Deep learning based binaural speech separation in reverberant environments
CN112116920B (en) Multi-channel voice separation method with unknown speaker number
CN111429939B (en) Sound signal separation method of double sound sources and pickup
CN112634935B (en) Voice separation method and device, electronic equipment and readable storage medium
Brutti et al. Oriented global coherence field for the estimation of the head orientation in smart rooms equipped with distributed microphone arrays.
CN102388416A (en) Signal processing apparatus and signal processing method
JP4964204B2 (en) Multiple signal section estimation device, multiple signal section estimation method, program thereof, and recording medium
Wan et al. Sound source localization based on discrimination of cross-correlation functions
CN109859749A (en) A kind of voice signal recognition methods and device
CN106019230B (en) A kind of sound localization method based on i-vector Speaker Identification
Enzinger et al. Mismatched distances from speakers to telephone in a forensic-voice-comparison case
Xia et al. Ava: An adaptive audio filtering architecture for enhancing mobile, embedded, and cyber-physical systems
Talagala et al. Binaural localization of speech sources in the median plane using cepstral hrtf extraction
Krijnders et al. Tone-fit and MFCC scene classification compared to human recognition
Kothapally et al. Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments.
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
Oualil et al. Joint detection and localization of multiple speakers using a probabilistic interpretation of the steered response power
CN103544953A (en) Sound environment recognition method based on background noise minimum statistic feature
Ghalamiosgouei et al. Robust Speaker Identification Based on Binaural Masks
Habib et al. Auditory inspired methods for localization of multiple concurrent speakers
Venkatesan et al. Analysis of monaural and binaural statistical properties for the estimation of distance of a target speaker
Guzewich et al. Cross-Corpora Convolutional Deep Neural Network Dereverberation Preprocessing for Speaker Verification and Speech Enhancement.
Nguyen et al. Location Estimation of Receivers in an Audio Room using Deep Learning with a Convolution Neural Network.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant