A Speaker Recognition Method Based on Feature-Space Trajectories of Speech Samples
Technical Field
The present invention relates to the field of biometric recognition, and in particular to a speaker recognition method based on the trajectory of speech samples in a speech feature space.
Background Art
With the development of artificial intelligence, audio perception has become a focus of audio processing research, and audio classification (audio recognition) is its core problem. In engineering applications, audio classification takes the form of speaker recognition, audio event recognition, audio event detection, and so on. Speaker recognition is a form of identity verification, specifically a biometric recognition technology. Biometric recognition automatically identifies individuals from their biological characteristics and includes fingerprint, iris, genetic, and face recognition. Compared with other identity verification technologies, speaker recognition is more convenient and natural and is less intrusive to the user. Because it identifies a person from the speech signal, it offers natural human-computer interaction, easy signal acquisition, and the possibility of remote identification.
An existing speaker recognition system comprises two stages: a training stage and a recognition stage. In the training stage, the system builds a model for each speaker from collected speech; in the recognition stage, it matches input speech against the speaker models to reach a decision. Such a system must extract features from the speech signal that reflect the individuality of the speaker and build a model accurate enough to distinguish that speaker from all others. Current audio classification techniques fall into two main families: generative statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), and deep neural network methods, such as DNN, RNN, and LSTM. Both families require large numbers of labeled training samples, and deep neural networks demand the largest sample sizes to achieve good recognition performance. GMM- and HMM-based methods give no particular consideration to the discriminative information between audio classes, nor to sharing sample data across classes; the method of Reynolds et al., "Speaker Verification Using Adapted Gaussian Mixture Models" (Digital Signal Processing 10 (2000), 19-41), for example, has high computational complexity. With large training sets, deep neural networks perform very well; Google's "End-to-End Text-Dependent Speaker Verification" (2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115-5119) uses a neural network to extract features from speech and train on them. But training a neural network requires a great deal of labeled speech, which is very expensive to collect, and deep neural network methods lack interpretability: they amount to black boxes.
Existing speaker recognition techniques are thus often computationally complex and require large amounts of labeled speaker speech to train their models, yet collecting that much labeled speech is an enormous undertaking. A more convenient and effective speaker recognition method and system is therefore needed.
Summary of the Invention
The object of the present invention is to address the deficiencies of the prior art by providing a speaker recognition method based on the trajectory of speech samples in a speech feature space. The speech feature space is independent of speaker, text, and language, so it can be built from any qualified speech data, allowing speech data to be shared; and a speaker's speech trajectory can be constructed from even a single sample, so no large labeled corpus is needed, overcoming the prior art's dependence on collecting large amounts of labeled speech data.
The object of the present invention is achieved by the following technical solution:
A speaker recognition method based on the feature-space trajectory of speech samples, in which a speech sample is regarded as one movement through the speech feature space, characterized by the space it occupies and its trajectory within that space. The method comprises the following steps:
Step 1) Construct the speech feature space Ω: cluster unlabeled speech samples in the feature space, and from the subclass data produced by clustering generate a representation of each subclass; these representations form the expression of the speech feature space, Ω = {g_k, k = 1, 2, ..., K};
Step 2) Construct speaker knowledge: from clean speech samples labeled with speaker identity, obtain each speaker's distribution information over the speech feature space Ω and the trajectory information of the speaker's samples;
Step 3) Speaker recognition: for a speech sample to be recognized, first obtain its distribution over the speech feature space and its trajectory; then, using the speakers' feature-space distribution information, compute the difference between the sample distribution and each prior distribution, together with the accumulated local distribution difference along the trajectory, and use these as the basis for the recognition decision.
Further, in constructing the speech feature space Ω in step 1), any clean speech samples can be used, with no constraints whatsoever on speaker or language.
Further, in step 1), K-means or another clustering method clusters the speech samples in the feature space. The expression of the speech feature space, Ω = {g_k, k = 1, 2, ..., K}, can be any representation with positioning capability: the distribution function of the subclass data (for example, a Gaussian distribution function), the cluster center vector (centroid), or a generative model (for example, a hidden Markov model or a neural network). These representations are called feature-space identifiers. The number of identifiers K determines the granularity of the spatial expression: the larger K is, the finer the expression. The accuracy of the spatial expression also depends on the data: the richer the data, the more complete the expression; and the more targeted the data used to construct the space, the more precise the expression becomes for a particular problem.
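Step 1) can be illustrated with a short sketch. This is a hedged toy example, not the patented implementation: `build_feature_space` is a hypothetical name, K-means is written out directly in NumPy, and each subclass identifier g_k is reduced to a (mean vector, diagonal variance) pair.

```python
import numpy as np

def build_feature_space(frames, K, iters=20, seed=0):
    """Cluster unlabeled frame features with K-means and keep one
    identifier per subclass: a (mean, per-dimension variance) pair
    standing in for a Gaussian g_k."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), size=K, replace=False)].copy()
    for _ in range(iters):
        # assign every frame to its nearest centroid
        d = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            members = frames[labels == k]
            if len(members):
                centers[k] = members.mean(0)
    space = []
    for k in range(K):
        members = frames[labels == k]
        if len(members) == 0:                 # guard against an empty subclass
            members = centers[k][None, :]
        m_k = members.mean(0)
        u_k = members.var(0) + 1e-6           # regularized diagonal variance
        space.append((m_k, u_k))
    return space

# toy usage: 500 random 12-dimensional "MFCC" frames, K = 8 identifiers
frames = np.random.default_rng(1).normal(size=(500, 12))
space = build_feature_space(frames, K=8)
```

In a real system the frames would be MFCC features from many unlabeled speakers and K would be far larger; the structure of the result is the same.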
Further, in step 2), clean speech samples labeled with speaker identity are used to annotate the speech feature space. When a Gaussian distribution g_k(m_k, U_k) is used as the spatial identifier, the speaker's feature-space distribution information is obtained as follows:
1. Compute the positional correlation between each feature f_t of the speech sample and the spatial identifier g_k(m_k, U_k), defined as

c_tk = g_k(f_t) / Σ_{j=1}^{K} g_j(f_t),

where each spatial identifier is represented by a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the covariance matrix of the k-th multidimensional Gaussian distribution;
2. Compute the expected value of the positional correlation between the speaker's sample set and the spatial identifier g_k(m_k, U_k):

c_k^s = (1 / Σ_{n=1}^{N} T_n) Σ_{n=1}^{N} Σ_{t=1}^{T_n} c_tk^(n),
where c_tk^(n) denotes the positional correlation between the t-th frame feature of the n-th sample and the spatial identifier g_k(m_k, U_k);
3. Compute the speaker's feature-space distribution as P^s = (c_1^s, c_2^s, ..., c_K^s).
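The three steps above can be sketched as follows. Because the original formulas do not survive in this text, the positional correlation c_tk is assumed here to be the identifiers' normalized Gaussian likelihoods for each frame, and diagonal covariances are assumed; all function names are illustrative only.

```python
import numpy as np

def log_gauss_diag(f, m, u):
    # log N(f; m, diag(u)) of one frame under one identifier g_k(m_k, U_k)
    return -0.5 * (np.log(2 * np.pi * u) + (f - m) ** 2 / u).sum()

def positional_correlation(frame, space):
    """c_tk for one frame: the identifiers' normalized likelihoods
    (an assumed form of the positional correlation; sums to 1 over k)."""
    logs = np.array([log_gauss_diag(frame, m, u) for m, u in space])
    logs -= logs.max()                      # numerical stability
    w = np.exp(logs)
    return w / w.sum()

def speaker_distribution(samples, space):
    """P^s: expected positional correlation over a speaker's labeled samples."""
    rows = [positional_correlation(f, space)
            for sample in samples for f in sample]
    return np.mean(rows, axis=0)

# toy usage: K = 6 identifiers in a 4-dimensional feature space
rng = np.random.default_rng(0)
space = [(rng.normal(size=4), np.ones(4)) for _ in range(6)]
samples = [rng.normal(size=(30, 4)), rng.normal(size=(25, 4))]
P_s = speaker_distribution(samples, space)
```

Since each frame's correlations sum to one, P^s is itself a probability-like vector over the K identifiers, matching its use as a "distribution" in step 3).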
Further, in step 2), the temporal trajectory information of a speaker's speech sample over the speech feature space Ω is expressed as the neighborhood sequence Ψ_1 Ψ_2 ... Ψ_T of the sample's features, where the δ-neighborhood of feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the speech sample feature and the identifier's distribution, i.e.

d_tk = ( (f_t - m_k)^T U_k^{-1} (f_t - m_k) )^{1/2}.
Further, the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} is chosen with reference to the properties of the normal distribution, taking 2 < δ < 3.
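The δ-neighborhood and trajectory construction can be sketched as below, again assuming diagonal covariance matrices U_k; `neighborhood` and `trajectory` are illustrative names.

```python
import numpy as np

def mahalanobis_diag(f, m, u):
    # d_tk between frame f and identifier g_k(m_k, U_k) with diagonal U_k
    return float(np.sqrt(((f - m) ** 2 / u).sum()))

def neighborhood(frame, space, delta=2.5):
    """Psi_t = {k : d_tk < delta}; 2 < delta < 3 per the normal-law rule."""
    return {k for k, (m, u) in enumerate(space)
            if mahalanobis_diag(frame, m, u) < delta}

def trajectory(frames, space, delta=2.5):
    # neighborhood sequence Psi_1 Psi_2 ... Psi_T of one speech sample
    return [neighborhood(f, space, delta) for f in frames]

# toy usage: two well-separated identifiers, two frames sitting on them
space = [(np.zeros(3), np.ones(3)), (np.full(3, 10.0), np.ones(3))]
frames = np.array([[0.1, 0.0, -0.1], [10.0, 9.9, 10.1]])
traj = trajectory(frames, space)
```

Each frame lands in the neighborhood of the identifier it sits on and of no other, so the trajectory is the sequence of identifier sets the sample visits.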
Further, in step 3), the speaker recognition process for a speech sample f = {f_1, f_2, ..., f_T} comprises the following steps:
1. Compute the distribution P = (p_1, p_2, ..., p_K) of the speech sample f = {f_1, f_2, ..., f_T} over the speech feature space Ω, where p_k = (1/T) Σ_{t=1}^{T} c_tk;
2. Determine the trajectory Ψ_1 Ψ_2 ... Ψ_T of the speech sample f = {f_1, f_2, ..., f_T} in the speech feature space Ω, with Ψ_t = {g_k | d_tk < δ};
3. Compute the distance between the sample distribution P = (p_1, p_2, ..., p_K) and the prior feature-space distribution P^s of each speaker s,

D(P, P^s) = ( Σ_{k=1}^{K} |p_k - p_k^s|^α )^{1/α},

and then screen out the candidate solution set S_p containing the true solution;
4. Compute the distance measure along the trajectory Ψ_1 Ψ_2 ... Ψ_T, i.e. the local distribution difference accumulated over the neighborhoods, and from S_p select the most likely solution, completing the speaker recognition.
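The screening stage of recognition can be sketched as follows, assuming the distribution distance takes the α-norm form with the α = 2 setting of Embodiment 1; the indices returned play the role of the candidate set S_p, and `screen_candidates` is a hypothetical name.

```python
import numpy as np

def distribution_distance(P, P_s, alpha=2):
    # D(P, P^s) = (sum_k |p_k - p_k^s|^alpha)^(1/alpha); alpha = 2 in Embodiment 1
    return float((np.abs(P - P_s) ** alpha).sum() ** (1.0 / alpha))

def screen_candidates(P, speaker_dists, top=3):
    """Keep the `top` speakers whose prior feature-space distribution is
    closest to the sample distribution P; the full method then re-ranks
    these candidates by the trajectory distance."""
    d = np.array([distribution_distance(P, P_s) for P_s in speaker_dists])
    return d.argsort()[:top], d

# toy usage: the query distribution matches speaker 1 exactly
speaker_dists = [np.array([0.7, 0.2, 0.1]),
                 np.array([0.1, 0.1, 0.8]),
                 np.array([0.3, 0.4, 0.3])]
P = np.array([0.1, 0.1, 0.8])
candidates, dists = screen_candidates(P, speaker_dists, top=2)
```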
Specifically, in step 3), very good speaker recognition performance can be obtained using either the sample's spatial distribution information P = (p_1, p_2, ..., p_K) alone or the trajectory distance alone.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. In the speaker recognition method provided by the present invention, the speech feature space is built by clustering a large number of speech features, which requires no labeled data; the data samples need only come from different speakers, with no particular requirements on spoken content, speaker age, or language. This overcomes the neural network methods' need for large amounts of labeled speech, and the data collection for building the speech space is easy to carry out.
2. The method is based on the localization and trajectory of a speaker's features within the speech feature space. Unlike generative source models such as the hidden Markov model (HMM), which are absolute, this localization is relative; and unlike deep neural network methods, it is interpretable: every piece of knowledge has physical semantics. For example, the correlation distribution P = (p_1, p_2, ..., p_K) of a sample over Ω expresses both the extent of the sample's activity space (the space represented by the subset of identifiers with non-zero entries) and the sample's distribution within that space.
3. In essence, the method localizes speech features in a space: the features of different speakers are positioned in the constructed speech feature space, and correlation degrees represent each speaker's localization information. The distinctions between speakers are thus expressed with little computation; compared with GMM or HMM approaches, which must fit a generative model to every speaker, the method has lower computational complexity.
4. The set of feature-space identifiers is a reference system for locating a speaker's features; the relation is relative, with no strict requirement linking it to the samples to be recognized. The feature space is therefore shareable, and a constructed speech feature space can be transferred to other speaker data sets for recognition. For example, a speech feature space built for one language can serve as the feature space for speaker recognition in another language, realizing data sharing.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the speaker recognition method in Embodiment 1 of the present invention.
FIG. 2 is a flowchart of the steps for establishing the speech feature space in Embodiment 1 of the present invention.
FIG. 3 is a flowchart of the steps for generating the speakers' feature-space distribution information and trajectory information in Embodiment 1 of the present invention.
FIG. 4 is a flowchart of the steps for recognizing a speech sample in Embodiment 1 of the present invention.
Detailed Description
The present invention is described in further detail below with reference to embodiments and drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1:
This embodiment provides a speaker recognition method based on the feature-space trajectory of speech samples. As shown in the schematic flowchart of FIG. 1, it comprises the following three steps:
1) Establish the speech feature space Ω. Any clean speech samples can be used, with no constraints on speaker, language, or other factors; the samples are then clustered in the feature space, and the resulting subclass data are expressed as the speech feature space {g_k, k = 1, 2, ..., K};
2) Construct speaker knowledge, comprising each speaker's distribution information and trajectory information in the speech feature space;
3) For a speech sample to be recognized, perform recognition using the speakers' feature-space distribution information and the sample's trajectory information.
Referring to FIG. 2, a flowchart of the steps for establishing the speech feature space in this embodiment. The speaker speech data of the AISHELL Chinese corpus serve as the unlabeled speech sample set. AISHELL contains 400 speakers in total; 60 wav files per speaker are selected to train the speech feature space. From the unlabeled sample set X = {x_1, x_2, ..., x_N}, 12-dimensional MFCC features are extracted, yielding the feature set F_x = {f_i^x, i = 1, 2, ..., t_N}, where f_i^x is a short-time frame feature and t_N is the total number of frames over all samples.
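For illustration only, a deliberately simplified MFCC front end is sketched below (framing, power spectrum, mel filterbank, log, DCT, keeping 12 coefficients). A production system would use a standard, well-tested implementation; all parameter values here are illustrative defaults rather than the exact ones used in the embodiment.

```python
import numpy as np

def toy_mfcc(signal, sr=16000, n_fft=400, hop=160, n_mel=20, n_mfcc=12):
    """Minimal MFCC sketch: frame -> |FFT|^2 -> mel filterbank -> log -> DCT."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filterbank between 0 and sr/2
    hz2mel = lambda h: 2595 * np.log10(1 + h / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = mel2hz(np.linspace(hz2mel(0), hz2mel(sr / 2), n_mel + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mel, power.shape[1]))
    for j in range(n_mel):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        if c > l:
            fb[j, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[j, c:r] = (r - np.arange(c, r)) / (r - c)
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II to decorrelate; drop c0 and keep coefficients 1..12
    n = np.arange(n_mel)
    dct = np.cos(np.pi * np.outer(n, 2 * n + 1) / (2 * n_mel))
    return (logmel @ dct.T)[:, 1:n_mfcc + 1]

# toy usage: one second of noise standing in for real speech
feats = toy_mfcc(np.random.default_rng(0).normal(size=16000))
```

With a 400-sample window and 160-sample hop, one second of 16 kHz audio yields 98 frames of 12-dimensional features, i.e. one row per short-time frame f_i^x.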
The feature sequence F_x = {f_i^x, i = 1, 2, ..., t_N} is then used to train a GMM with K mixture components; the GMM's weights are discarded, and each Gaussian component is kept as one identifier of the speech feature space. Here K, the number of identifiers of the speech feature space, is set to 4096 so as to describe the space with high precision.
The speech feature space identifiers are expressed as Ω = {g_k, k = 1, 2, ..., K}, where g = N(m, U) is a multidimensional Gaussian distribution function.
Referring to FIG. 3, a flowchart of the steps for generating the speakers' feature-space distribution information in this embodiment. For each speaker in AISHELL, 20 wav files are used to annotate the speech feature space. The target speaker sample set is Y = {(y_1, s_1), (y_2, s_2), ..., (y_M, s_M)} with s_i ∈ S = {S_l, l = 1, 2, ..., L} (the speaker set); the samples of speaker S_l are Y_l = {y_m | s_m = S_l, m = 1, 2, ..., M}, and their audio feature sequences are extracted. The positional correlation between every feature f_t of a speech sample and the spatial identifier g_k(m_k, U_k) is then computed.
Next, the expected value of the positional correlation between the speaker's sample set and the spatial identifier g_k(m_k, U_k) is computed, where c_tk^(n) is the positional correlation between the t-th frame feature of the n-th sample and the identifier g_k(m_k, U_k).
The speaker's feature-space distribution is then computed from these expected correlation values.
The registered speech of every speaker in the target set is processed in this way to obtain each speaker's speech feature distribution information.
The temporal trajectory information of a speech sample is expressed as the neighborhood sequence Ψ_1 Ψ_2 ... Ψ_T of the sample's features; the δ-neighborhood of a sample feature f_t is Ψ_t = {g_k | d_tk < δ}, where d_tk is the Mahalanobis distance between the feature and the identifier's distribution.
Referring to FIG. 4, a flowchart of the steps for recognizing a speech sample in this embodiment. The features of the speech sample to be recognized are f = {f_1, f_2, ..., f_T}, and the recognition process is as follows:
Compute the distribution P = (p_1, p_2, ..., p_K) of the speech f = {f_1, f_2, ..., f_T} over the feature space Ω;
Here the positional correlation between each feature f_t and the spatial identifier g_k(m_k, U_k), and the positional correlation of the sample to be recognized with g_k(m_k, U_k), are computed exactly as in the registration stage.
Determine the trajectory Ψ_1 Ψ_2 ... Ψ_T of the speech features f = {f_1, f_2, ..., f_T} in the feature space Ω, where Ψ_t = {g_k | d_tk < δ};
Compute the distance between the sample distribution P = (p_1, p_2, ..., p_K) and the prior feature-space distribution P^s of each speaker s, with α taken as 2, and screen the candidate solution set S_p containing the true solution by selecting the 10 speakers at the smallest distance as the candidate recognition results;
Compute the distance measure along the trajectory Ψ_1 Ψ_2 ... Ψ_T, again with α taken as 2, and from the 10 candidates select the speaker with the smallest trajectory distance as the recognition result.
Embodiment 2:
This embodiment provides a speaker recognition method based on the feature-space trajectory of speech samples, comprising the following steps:
Step 1: use speech data from the English TIMIT corpus to establish the set of speech feature space identifiers;
Step 2: register the target speaker set using speech data from the AISHELL corpus, as in Embodiment 1;
Step 3: recognize the speech samples to be identified, as in Embodiment 1.
The recognition performance differs only slightly from that of Embodiment 1, demonstrating that a speech feature space built for one language can serve as the feature space for speaker recognition in another language, realizing data sharing.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or change made by a person skilled in the art within the scope disclosed by the present invention, according to its technical solution and inventive concept, falls within the scope of protection of the present invention.