WO2020143263A1 - Speaker identification method based on speech sample feature space trajectory - Google Patents

Speaker identification method based on speech sample feature space trajectory

Info

Publication number
WO2020143263A1
WO2020143263A1 (PCT/CN2019/111530)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
speaker
feature space
sample
voice
Prior art date
Application number
PCT/CN2019/111530
Other languages
French (fr)
Chinese (zh)
Inventor
贺前华
吴克乾
谢伟
庞文丰
Original Assignee
South China University of Technology (华南理工大学)
Priority date
Filing date
Publication date
Application filed by South China University of Technology (华南理工大学)
Priority to SG11202103091XA
Publication of WO2020143263A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates


Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A speaker identification method based on trajectories of speech samples in a feature space. The method comprises: clustering the features of unlabeled speech data to obtain a set of identifiers that represents the speech feature space; registering speakers with labeled speech samples to obtain each speaker's distribution information and motion trajectory information in the speech feature space; and identifying a speech sample to be recognized using the speakers' distribution information in the feature space together with the sample's motion trajectory information. Because the method relies on positioning speech features within the feature space, the complexity of identifying a speaker is low, addressing the high computational complexity of GMM-UBM; moreover, a speech feature space built from speakers of one language can serve as the feature space for speaker identification in another language, realizing data sharing.

Description

A speaker recognition method based on trajectories in the feature space of speech samples

Technical Field
The invention relates to the field of biometric recognition, and in particular to a speaker recognition method based on trajectories of speech samples in a feature space.
Background Art
With the development of artificial intelligence, audio perception has become a hotspot of audio processing research, and audio classification (audio recognition) is its core problem. In engineering applications, audio classification takes the form of speaker recognition, audio event recognition, audio event detection, and so on. Speaker recognition is a kind of identity verification technology, namely a biometric recognition technology. Biometric recognition automatically identifies individuals from their biological characteristics and includes fingerprint recognition, iris recognition, gene recognition, and face recognition. Compared with other identity verification technologies, speaker recognition is more convenient and natural and is less intrusive to the user. It performs identity recognition from the speech signal, offering natural human-computer interaction, easy signal acquisition, and the possibility of remote recognition.
An existing speaker recognition system comprises two stages: a training stage and a recognition stage. In the training stage, the system builds a model for each speaker from collected speech; in the recognition stage, the system matches input speech against the speaker models to make a decision. The system must extract features that reflect the speaker's individuality from the speech signal and build an accurate model that distinguishes the speaker from all others. Currently there are two main families of audio classification technology: generative statistical models, such as the Gaussian mixture model (GMM) and the hidden Markov model (HMM), and deep neural network methods, such as DNN, RNN, or LSTM. Either family requires a large number of labeled training samples, and the deep neural network methods demand even larger sample sizes to reach good recognition performance. GMM- or HMM-based methods give no special consideration to the discriminative information between different audio classes, nor to sharing sample data across classes; for example, the method in "Speaker Verification Using Adapted Gaussian Mixture Models" by Reynolds et al. (Digital Signal Processing 10 (2000), 19–41) has high computational complexity. With large-sample support, deep neural network methods have shown very good performance, for example Google's paper "End-to-End Text-Dependent Speaker Verification" (2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115–5119), which uses a neural network to extract features and train on speech. But training a neural network requires a large amount of labeled speech, acquiring such samples is very costly, and deep neural network methods lack interpretability; they are effectively black boxes.
Existing speaker recognition technologies tend to have high computational complexity and require large amounts of annotated speaker speech data to train their models, and collecting such data takes enormous effort. A more convenient and effective speaker recognition method and system is therefore needed.
Summary of the Invention
The purpose of the present invention is to address the deficiencies of the prior art by providing a speaker recognition method based on trajectories of speech samples in a feature space. The speech feature space is independent of speaker, text, and language, so it can be constructed from any qualified speech data, enabling sharing of speech data; and a speaker's speech trajectory can be constructed from even a single sample, so no large amount of annotated speech data is needed, overcoming the prior-art requirement of collecting large amounts of annotated speech data.
The object of the present invention can be achieved by the following technical solution:
A speaker recognition method based on trajectories in the feature space of speech samples, in which a speech sample can be regarded as one motion through the speech feature space, characterized by an activity region and a trajectory within that space. The method comprises the following steps:
Step 1) Construct the speech feature space Ω: cluster unlabeled speech samples in the feature space using a clustering method, and from the sub-class data obtained by clustering generate some expression of that sub-class data as the expression of the speech feature space, Ω = {g_k, k = 1, 2, …, K};
Step 2) Construct speaker knowledge: use clean speech samples annotated with speaker attributes to obtain their distribution information and motion trajectory information on the speech feature space Ω;
Step 3) Speaker recognition: for a speech sample to be recognized, first obtain the sample's spatial distribution expression and trajectory in the speech feature space, then use the speakers' feature space distribution information to compute the difference between the sample distribution and each prior distribution, together with the accumulated local distribution differences along the trajectory, as the basis for the speaker recognition decision.
Further, in the process of constructing the speech feature space Ω in step 1), any clean speech samples can be used, with no constraints on speaker or language.
Further, in step 1), K-means or another clustering method is used to cluster the speech samples in the feature space. The speech feature space expression Ω = {g_k, k = 1, 2, …, K} can consist of sub-class data distribution functions (such as Gaussian distribution functions), cluster center vectors (centroids), or generative models (such as hidden Markov models or neural networks); such identifiers with positioning capability are called feature space identifiers. The number K of class identifiers determines the granularity of the speech feature space expression: the larger K is, the finer the expression. On the other hand, the accuracy of the spatial expression is related to the size of the data: the richer the data, the more complete the expression; likewise, the more targeted the data used to build the space, the more precise the expression will be for a particular problem.
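As a concrete illustration of step 1), the following is a minimal Python sketch, assuming the frame-level features have already been pooled into one matrix and using scikit-learn's KMeans so that the identifiers are centroid vectors; the function name and the library choice are assumptions, not part of the patent, which equally allows Gaussian or generative-model identifiers.

```python
# Minimal sketch of step 1): clustering unlabeled frame features into K
# feature-space identifiers. Centroid-vector identifiers via K-means are
# one of the identifier types the text allows.
import numpy as np
from sklearn.cluster import KMeans

def build_feature_space(frame_features: np.ndarray, K: int = 1024) -> np.ndarray:
    """frame_features: (t_N, d) matrix of pooled short-time frame features.

    Returns a (K, d) matrix whose k-th row is the identifier g_k (a centroid).
    """
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(frame_features)
    return km.cluster_centers_  # Omega = {g_k, k = 1..K}
```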
Further, in step 2), the speech feature space is annotated using clean speech samples carrying speaker attribute labels. When a Gaussian distribution g_k(m_k, U_k) is used as the spatial identifier, the speaker feature space distribution information is obtained as follows:
1. Compute the positional correlation degree between each feature f_t of the speech sample and the spatial identifier g_k(m_k, U_k), defined as:

[Equation (1); reproduced only as an image in the original publication]
where the spatial identifier is represented by a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the covariance matrix of the k-th multidimensional Gaussian distribution;
2. Compute the expected value of the positional correlation degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k):

[Equation (2); reproduced only as an image in the original publication]

where the summand of Equation (2) denotes the correlation degree between the t-th frame feature of the n-th sample and the spatial identifier g_k(m_k, U_k);
3. Compute the speaker feature space distribution:

[Equation (3); reproduced only as an image in the original publication]
Further, in step 2), the time-ordered trajectory information of a speaker's speech sample on the speech feature space Ω is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of a speech feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the speech feature and the identifier distribution:

$$d_{tk} = \sqrt{(f_t - m_k)^{\top} U_k^{-1} (f_t - m_k)}$$
Further, the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} of a speech feature f_t is chosen with reference to the properties of the normal distribution as 2 < δ < 3.
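A sketch of the trajectory representation follows; the Mahalanobis distance is as defined above, and δ = 2.5 is one admissible value inside the recommended interval (2, 3). The function name and argument layout are assumptions.

```python
# Sketch of the neighborhood sequence Psi_1 ... Psi_T. Each Psi_t holds the
# indices of the identifiers whose Mahalanobis distance to frame f_t is
# below delta; delta = 2.5 sits inside the patent's 2 < delta < 3 range.
import numpy as np

def trajectory(frames, means, inv_covs, delta=2.5):
    """frames: (T, d); means: (K, d); inv_covs: (K, d, d) inverse covariances."""
    psi = []
    for f in frames:
        diff = f - means                                   # (K, d)
        d2 = np.einsum('kd,kde,ke->k', diff, inv_covs, diff)
        psi.append(set(np.flatnonzero(np.sqrt(d2) < delta).tolist()))
    return psi  # list of T identifier index sets
```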
Further, in step 3), the speaker recognition process for a speech sample f = {f_1, f_2, …, f_T} comprises the following steps:

1. Compute the distribution P = (p_1, p_2, …, p_K) of the sample f = {f_1, f_2, …, f_T} in the speech feature space Ω, where each component p_k is given by:

[Equation (4); reproduced only as an image in the original publication]

2. Determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the sample in Ω, with Ψ_t = {g_k | d_tk < δ};

3. Compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^(s) of each speaker s:

[Equation (5); reproduced only as an image in the original publication]

then screen the possible solution set S_p expected to contain the true solution:

[Equation (6); reproduced only as an image in the original publication]

4. Compute the distance metric along the motion trajectory Ψ_1 Ψ_2 … Ψ_T:

[Equation (7); reproduced only as an image in the original publication]

and select from S_p the solution with the smallest trajectory distance, completing speaker recognition.
Specifically, in step 3), very good speaker recognition performance can be obtained using either the spatial distribution information P = (p_1, p_2, …, p_K) of the speech sample alone or the motion trajectory distance alone.
Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. In the speaker recognition method provided by the present invention, the speech feature space is built by clustering a large number of speech features, which requires no annotated data; the data samples used to build the space merely need to come from different speakers, with no particular requirements on spoken content, speaker age, or language. This overcomes the neural network methods' need for large amounts of labeled speech, and the data collection for building the speech space is easy to carry out.

2. The method is based on the positions and trajectories of a speaker's speech features within the speech feature space. Unlike signal-source generative model methods such as the hidden Markov model (HMM), positioning is relative, whereas a generative model is absolute. Compared with deep neural network methods, the method is interpretable: every piece of knowledge data has definite physical semantics. For example, the correlation degree distribution P = (p_1, p_2, …, p_K) of a sample's features over the space Ω expresses both the sample's activity region (the space represented by the identifier subset corresponding to the non-zero elements) and the sample's distribution within that region.

3. The essence of the method is locating speech features within a space. The speech features of different speakers are located in the established speech feature space, and correlation degrees represent each speaker's feature positioning information, so the distinction between speakers is expressed with little computation. Compared with GMM or HMM methods, which must fit a generative model for every speaker, the method has lower computational complexity.

4. The speech feature space identifier set is a reference system for locating a speaker's speech features; the relationship is relative, with no strict requirement tying it to the samples to be recognized. The feature space is therefore shareable, and an established space can be transferred to other speaker data sets for recognition. For example, a speech feature space built from speakers of one language can serve as the speech feature space for speaker recognition in another language, realizing data sharing.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the speaker recognition method in Embodiment 1 of the present invention.

FIG. 2 is a flowchart of the steps for establishing the speech feature space in Embodiment 1 of the present invention.

FIG. 3 is a flowchart of the steps for generating the spatial distribution information and trajectory information of a speaker's speech features in Embodiment 1 of the present invention.

FIG. 4 is a flowchart of the steps for recognizing a speech sample to be recognized in Embodiment 1 of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the embodiments and the drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1:
This embodiment provides a speaker recognition method based on trajectories in the feature space of speech samples. The schematic flowchart is shown in FIG. 1 and comprises the following three steps:

1) Establish the speech feature space Ω. Any clean speech samples can be used, with no constraints on speaker, language, or other factors; a clustering method is then applied to cluster the speech samples in the feature space, and the resulting sub-class data are expressed as the speech feature space {g_k, k = 1, 2, …, K};

2) Construct speaker knowledge, consisting of the speaker's distribution information in the speech feature space and motion trajectory information;

3) For a speech sample to be recognized, perform recognition using the speakers' feature space distribution information and the sample's motion trajectory information.
Referring to FIG. 2, a flowchart of the steps for establishing the speech feature space in this embodiment: the speaker speech data of the aishell Chinese corpus are used as the unlabeled speech sample set. aishell contains 400 speakers in total, and 60 wav files per speaker are selected to train the speech feature space. The 12-dimensional MFCC features of the unlabeled speech sample set X = {x_1, x_2, …, x_N} are extracted, giving the feature set F_x = {f_i^x, i = 1, 2, …, t_N}, where f_i^x is a short-time frame feature and t_N is the total number of frames over all samples;
The feature sequence F_x = {f_i^x, i = 1, 2, …, t_N} is then used to train a GMM with K mixture components; the GMM's weights are discarded, and each Gaussian component is retained as an identifier of the speech feature space. Here K, the number of identifiers of the audio feature space, is set to 4096 so as to describe the audio feature space with high precision;
The speech feature space identifiers are expressed as Ω = {g_k, k = 1, 2, …, K}, where g = N(m, U) is a multidimensional Gaussian distribution function;
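A sketch of this construction follows, assuming librosa for the 12-dimensional MFCC extraction and scikit-learn's GaussianMixture (with diagonal covariances for tractability at large K) in place of whatever GMM trainer the inventors used; all of these library and parameter choices are assumptions.

```python
# Sketch of Embodiment 1's space construction: pool 12-dim MFCC frames over
# the unlabeled corpus, fit a K-component GMM, discard the mixture weights,
# and keep each Gaussian component (mean, covariance) as an identifier.
# librosa + scikit-learn and diagonal covariances are assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=12):
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (T, 12)

def train_identifiers(wav_paths, K=4096):
    F = np.vstack([mfcc_frames(p) for p in wav_paths])  # feature set F_x
    gmm = GaussianMixture(n_components=K, covariance_type='diag',
                          max_iter=50).fit(F)
    return gmm.means_, gmm.covariances_  # weights deliberately discarded
```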
Referring to FIG. 3, a flowchart of the steps for generating the speaker feature space distribution information in this embodiment: for each speaker in aishell, 20 wav files are used to annotate the speech feature space. The target speaker speech sample set is Y = {(y_1, s_1), (y_2, s_2), …, (y_M, s_M)}, where s_i ∈ S = {S_l, l = 1, 2, …, L} (the speaker set); the samples of speaker S_l are Y_l = {y_m | s_m = S_l, m = 1, 2, …, M}, and their audio feature sequences are extracted.

The positional correlation degree between every feature f_t of a speech sample and the spatial identifier g_k(m_k, U_k) is computed as in Equation (1).
The expected value of the positional correlation degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k) is computed as in Equation (2), whose summand is the positional correlation degree between the t-th frame feature of the n-th sample and the identifier g_k(m_k, U_k); the speaker feature space distribution is then computed as in Equation (3).
The registered speech of every speaker in the target speaker set is processed in this way, yielding each speaker's speech feature distribution information.
The time-ordered trajectory information of a speech sample is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of a sample feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the feature and the identifier distribution, as defined above.
Referring to FIG. 4, a flowchart of the steps for recognizing a speech sample in this embodiment: the features of the speech sample to be recognized are f = {f_1, f_2, …, f_T}, and the recognition process is as follows:
Compute the distribution P = (p_1, p_2, …, p_K) of the speech f = {f_1, f_2, …, f_T} in the feature space Ω.
Here the positional correlation degree between a feature f_t and the spatial identifier g_k(m_k, U_k) is computed as in Equation (1), and the positional correlation degree between the sample of the speaker to be recognized and g_k(m_k, U_k) as in Equation (2).
Determine the trajectory Ψ_1 Ψ_2 … Ψ_T of the speech features f = {f_1, f_2, …, f_T} in the feature space Ω, where Ψ_t = {g_k | d_tk < δ}.
Compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^(s) of each speaker s as in Equation (5), with α taken as 2, and screen the possible solution set S_p containing the true solution as in Equation (6): the 10 speakers with the smallest distances are selected as the candidate recognition results.
Compute the distance metric along the trajectory Ψ_1 Ψ_2 … Ψ_T as in Equation (7), with α taken as 2, and select from the 10 candidates the speaker with the smallest trajectory distance as the recognition result.
Embodiment 2:
This embodiment provides a speaker recognition method based on trajectories in the feature space of speech samples, comprising the following steps:
Step 1: use the speech data of the English corpus TIMIT to establish the speech feature space identifier set;
Step 2: use the speech data in the aishell corpus to register the target speaker set, as in Embodiment 1;
Step 3: recognize the speech samples to be recognized, as in Embodiment 1.
The recognition performance obtained differs only slightly from that of Embodiment 1, which demonstrates that a speech feature space built from speakers of one language can serve as the speech feature space for speaker recognition in another language, realizing data sharing.
The above is only a preferred embodiment of the present invention, but the protection scope of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the scope disclosed by this invention, in accordance with its technical solution and inventive concept, falls within the protection scope of this invention.

Claims (7)

  1. A speaker recognition method based on trajectories in the feature space of speech samples, in which a speech sample can be regarded as one motion through the speech feature space, having an activity region and a trajectory within that space, characterized in that the method comprises the following steps:
    Step 1) Construct the speech feature space Ω: cluster unlabeled speech samples in the feature space using a clustering method, and from the sub-class data obtained by clustering generate some expression of that sub-class data as the expression of the speech feature space, Ω = {g_k, k = 1, 2, …, K};
    Step 2) Construct speaker knowledge: use clean speech samples annotated with speaker attributes to obtain their distribution information and motion trajectory information on the speech feature space Ω;
    Step 3) Speaker recognition: for a speech sample to be recognized, first obtain the sample's spatial distribution expression and trajectory in the speech feature space, then use the speakers' feature space distribution information to compute the difference between the sample distribution and each prior distribution, together with the accumulated local distribution differences along the trajectory, as the basis for the speaker recognition decision.
  2. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 1, characterized in that in the process of constructing the speech feature space Ω in step 1), any clean speech samples can be used, with no constraints on speaker or language.
  3. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 1, characterized in that the speech feature space expression Ω = {g_k, k = 1, 2, …, K} can consist of sub-class data distribution functions, cluster center vectors, or generative models; such identifiers with positioning capability are called feature space identifiers, and the number K of class identifiers used by the speech feature space determines the granularity of the speech feature space expression: the larger K is, the finer the expression.
  4. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 1, characterized in that in step 2) the speech feature space is annotated using clean speech samples carrying speaker attribute labels, and when a Gaussian distribution g_k(m_k, U_k) is used as the spatial identifier, the speaker feature space distribution information is obtained as follows:

    1. compute the positional correlation degree between each feature f_t of the speech sample and the spatial identifier g_k(m_k, U_k), as defined by Equation (1) of the description, where the spatial identifier is represented by a multidimensional Gaussian distribution, m_k denotes the mean vector of the k-th Gaussian distribution, and U_k denotes the covariance matrix of the k-th multidimensional Gaussian distribution;

    2. compute the expected value of the positional correlation degree between the speaker's sample set and the spatial identifier g_k(m_k, U_k), as in Equation (2) of the description, whose summand denotes the correlation degree between the t-th frame feature of the n-th sample and g_k(m_k, U_k);

    3. compute the speaker feature space distribution, as in Equation (3) of the description.
  5. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 4, characterized in that in step 2) the time-ordered trajectory information of a speaker's speech sample on the speech feature space Ω is represented as the neighborhood sequence Ψ_1 Ψ_2 … Ψ_T of the sample's features, where the δ-neighborhood of a speech feature f_t is Ψ_t = {g_k | d_tk < δ} and d_tk is the Mahalanobis distance between the speech feature and the identifier distribution:

    $$d_{tk} = \sqrt{(f_t - m_k)^{\top} U_k^{-1} (f_t - m_k)}$$
  6. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 5, characterized in that the decision threshold δ of the neighborhood Ψ_t = {g_k | d_tk < δ} of a speech feature f_t is chosen with reference to the properties of the normal distribution as 2 < δ < 3.
  7. The speaker recognition method based on trajectories in the feature space of speech samples according to claim 5, characterized in that in step 3) the speaker recognition process for a speech sample f = {f_1, f_2, …, f_T} comprises the following steps:

    1. compute the distribution P = (p_1, p_2, …, p_K) of the sample f = {f_1, f_2, …, f_T} in the speech feature space Ω, each component being given by Equation (4) of the description;

    2. determine the motion trajectory Ψ_1 Ψ_2 … Ψ_T of the sample in Ω, with Ψ_t = {g_k | d_tk < δ};

    3. compute the distance between the sample distribution P = (p_1, p_2, …, p_K) and the prior feature space distribution P^(s) of each speaker s, as in Equation (5) of the description, then screen the possible solution set S_p containing the true solution, as in Equation (6);

    4. compute the distance metric along the motion trajectory Ψ_1 Ψ_2 … Ψ_T, as in Equation (7) of the description, and select from S_p the solution with the smallest trajectory distance, completing speaker recognition.
PCT/CN2019/111530 2019-01-11 2019-10-16 Speaker identification method based on speech sample feature space trajectory WO2020143263A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG11202103091XA (en) 2019-01-11 2019-10-16 A Speaker Recognition Method Based on Trajectories in Feature Spaces of Voice Samples

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910027145.3A CN109545229B (en) 2019-01-11 2019-01-11 Speaker recognition method based on voice sample characteristic space track
CN201910027145.3 2019-01-11

Publications (1)

Publication Number Publication Date
WO2020143263A1

Family

ID=65835222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111530 WO2020143263A1 (en) 2019-01-11 2019-10-16 Speaker identification method based on speech sample feature space trajectory

Country Status (3)

Country Link
CN (1) CN109545229B (en)
SG (1) SG11202103091XA (en)
WO (1) WO2020143263A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545229B (en) * 2019-01-11 2023-04-21 华南理工大学 Speaker recognition method based on voice sample characteristic space track
CN111081261B (en) * 2019-12-25 2023-04-21 华南理工大学 Text-independent voiceprint recognition method based on LDA
CN111128128B (en) * 2019-12-26 2023-05-23 华南理工大学 Voice keyword detection method based on complementary model scoring fusion
CN111933156B (en) * 2020-09-25 2021-01-19 广州佰锐网络科技有限公司 High-fidelity audio processing method and device based on multiple feature recognition


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067517A (en) * 1996-02-02 2000-05-23 International Business Machines Corporation Transcription of speech data with segments from acoustically dissimilar environments
CN1302456C (en) * 2005-04-01 2007-02-28 郑方 Sound veins identifying method
JP4901657B2 (en) * 2007-09-05 2012-03-21 日本電信電話株式会社 Voice recognition apparatus, method thereof, program thereof, and recording medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5598507A (en) * 1994-04-12 1997-01-28 Xerox Corporation Method of speaker clustering for unknown speakers in conversational audio data
CN102024455A (en) * 2009-09-10 2011-04-20 索尼株式会社 Speaker recognition system and method
CN102479511A (en) * 2010-11-23 2012-05-30 盛乐信息技术(上海)有限公司 Large-scale voiceprint authentication method and system
CN105845141A (en) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 Speaker confirmation model, speaker confirmation method and speaker confirmation device based on channel robustness
US20180342250A1 (en) * 2017-05-24 2018-11-29 AffectLayer, Inc. Automatic speaker identification in calls
CN109065028A (en) * 2018-06-11 2018-12-21 平安科技(深圳)有限公司 Speaker clustering method, device, computer equipment and storage medium
CN109065059A (en) * 2018-09-26 2018-12-21 新巴特(安徽)智能科技有限公司 The method for identifying speaker with the voice cluster that audio frequency characteristics principal component is established
CN109545229A (en) * 2019-01-11 2019-03-29 华南理工大学 A kind of method for distinguishing speek person based on speech samples Feature space trace

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487978A (en) * 2020-11-30 2021-03-12 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN112487978B (en) * 2020-11-30 2024-04-16 清华珠三角研究院 Method and device for positioning speaker in video and computer storage medium
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN113611285B (en) * 2021-09-03 2023-11-24 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN117235435A (en) * 2023-11-15 2023-12-15 世优(北京)科技有限公司 Method and device for determining audio signal loss function
CN117235435B (en) * 2023-11-15 2024-02-20 世优(北京)科技有限公司 Method and device for determining audio signal loss function

Also Published As

Publication number Publication date
CN109545229A (en) 2019-03-29
CN109545229B (en) 2023-04-21
SG11202103091XA (en) 2021-04-29

Similar Documents

Publication Publication Date Title
WO2020143263A1 (en) Speaker identification method based on speech sample feature space trajectory
Gao et al. Transition movement models for large vocabulary continuous sign language recognition
Zhuang et al. Real-world acoustic event detection
Ryu et al. Out-of-domain detection based on generative adversarial network
Wang et al. Recognizing human emotional state from audiovisual signals
Tang et al. Partially supervised speaker clustering
CN107301858B (en) Audio classification method based on audio characteristic space hierarchical description
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN109800308B (en) Short text classification method based on part-of-speech and fuzzy pattern recognition combination
CN108520752A (en) A kind of method for recognizing sound-groove and device
US20230089308A1 (en) Speaker-Turn-Based Online Speaker Diarization with Constrained Spectral Clustering
Fang et al. A novel approach to automatically extracting basic units from chinese sign language
Van Leeuwen Speaker linking in large data sets
Bhati et al. Unsupervised Speech Signal to Symbol Transformation for Zero Resource Speech Applications.
Han et al. Boosted subunits: a framework for recognising sign language from videos
Debnath et al. RETRACTED ARTICLE: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities
Chen et al. Self-lifting: A novel framework for unsupervised voice-face association learning
Nguyen et al. Joint deep cross-domain transfer learning for emotion recognition
Yao et al. Real time large vocabulary continuous sign language recognition based on OP/Viterbi algorithm
Lang et al. Study of face detection algorithm for real-time face detection system
CN116050419A (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
CN113658582B (en) Lip language identification method and system for audio-visual collaboration
Polat et al. Unsupervised term discovery for continuous sign language
CN110807370B (en) Conference speaker identity noninductive confirmation method based on multiple modes
Qi-Rong et al. A novel hierarchical speech emotion recognition method based on improved DDAGSVM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19908700

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19908700

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2021)
