WO2019136811A1 - Voice comparison method, terminal, and computer-readable storage medium - Google Patents

Voice comparison method, terminal, and computer-readable storage medium

Info

Publication number
WO2019136811A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
vector
feature
feature speech
class
Prior art date
Application number
PCT/CN2018/077626
Other languages
English (en)
French (fr)
Inventor
王健宗
黄章成
吴天博
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019136811A1 publication Critical patent/WO2019136811A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • The present application relates to the field of communications technologies, and in particular to a voice comparison method, a terminal, and a computer-readable storage medium.
  • The present application provides a voice comparison method, a terminal, and a computer-readable storage medium.
  • The present application provides a terminal. The terminal includes a memory and a processor, and the memory stores a voice comparison program that can run on the processor. When executed by the processor, the voice comparison program performs the following steps: calculating a first feature speech vector of each registered speech; clustering the first feature speech vectors into K classes by K-means clustering; obtaining the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs; acquiring a verification speech of the user and calculating a second feature speech vector of the verification speech; comparing the second feature speech vector with the centroid of each class; determining, according to the comparison results, the class to which the verification speech belongs; after the class of the verification speech is determined, comparing the second feature speech vector with all of the first feature speech vectors in that class; and outputting the comparison result.
  • The present application further provides a voice comparison method applied to a terminal. The method includes: calculating a first feature speech vector of each registered speech; clustering the first feature speech vectors into K classes by K-means clustering; obtaining the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs; acquiring a verification speech of the user and calculating a second feature speech vector of the verification speech; comparing the second feature speech vector with the centroid of each class; determining, according to the comparison results, the class to which the verification speech belongs; after the class of the verification speech is determined, comparing the second feature speech vector with all of the first feature speech vectors in that class; and outputting the comparison result.
  • The voice comparison method, terminal, and computer-readable storage medium proposed by the present application first calculate a first feature speech vector for each registered speech; the first feature speech vectors are then clustered into K classes by K-means clustering; next, the centroid of each of the K classes is obtained, the centroid being one of the first feature speech vectors in the class to which it belongs; the user's verification speech is acquired and its second feature speech vector is calculated; finally, the second feature speech vector is compared with the centroid of each class, the class to which the verification speech belongs is determined from the comparison results, and, once that class is determined, the second feature speech vector is compared with all of the first feature speech vectors in that class and the comparison result is output.
  • This avoids the drawback of existing voiceprint systems, which must compare the verification speech against all N enrolled speakers in the voiceprint library and therefore take a lot of time; it thereby improves the efficiency of voice recognition and promotes the popularization and industrialization of voice recognition technology. Moreover, the computation time is greatly reduced, enabling the terminal to return results in real time.
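  • As an illustrative back-of-the-envelope count (the figures below are assumptions for illustration, not taken from the application), suppose the K classes end up roughly balanced. One verification then needs about

$$\underbrace{K}_{\text{centroid comparisons}} \;+\; \underbrace{N/K}_{\text{within the chosen class}} \;\ge\; 2\sqrt{N} \qquad (\text{equality at } K=\sqrt{N})$$

pairwise comparisons instead of N. For example, with N = 10,000 registered speakers and K = 100, roughly 200 comparisons suffice instead of 10,000.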
  • FIG. 1 is a schematic structural diagram of hardware of a terminal that implements various embodiments of the present application
  • FIG. 2 is a structural diagram of a communication network system according to an embodiment of the present application.
  • FIG. 3 is a block diagram of a program of an embodiment of a voice comparison program of the present application.
  • FIG. 4 is a flow chart of an embodiment of a voice comparison method of the present application.
  • The terminal can be implemented in various forms. For example, the terminal described in the present application may include mobile terminals such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a personal digital assistant (PDA), a portable media player (PMP), a navigation device, a wearable device, a smart bracelet, or a pedometer, as well as fixed terminals such as a digital TV or a desktop computer.
  • FIG. 1 is a schematic diagram of a hardware structure of a terminal 100 for implementing various embodiments of the present application.
  • The terminal 100 may include components such as an RF (Radio Frequency) unit 101, a WiFi module 102, an audio output unit 103, an A/V (audio/video) input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111.
  • Although not shown in FIG. 1, the terminal 100 may further include a Bluetooth module and the like, and details are not described herein again.
  • FIG. 2 is a structural diagram of a communication network system according to an embodiment of the present application.
  • The communication network system is an LTE system of universal mobile telecommunication technology. The LTE system includes, communicatively connected in sequence, a UE (User Equipment) 201, an E-UTRAN (Evolved UMTS Terrestrial Radio Access Network) 202, an EPC (Evolved Packet Core) 203, and an operator's IP service 204.
  • Referring to FIG. 3, it is a program module diagram of the first embodiment of the voice comparison program 300 of the present application.
  • The calculating module 301 is configured to calculate a first feature speech vector of each registered speech. The registered speech can be captured with a voice input device on the terminal 100, such as a microphone, or obtained from a remote voice acquisition device over the communication network; this application does not limit how it is obtained.
  • In this embodiment, the step in which the calculating module 301 calculates the first feature speech vector of a registered speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix; and using a UBM and an i-vector extractor to filter out the most essential features of the first matrix, which form the first feature speech vector.
  • MFCC stands for Mel-Frequency Cepstral Coefficients and involves two key steps: conversion to the Mel frequency scale, followed by cepstral analysis. In this embodiment, each speech sample is first split into frames and the spectrum of each frame is obtained; each spectrum is then passed through a Mel filter bank to obtain the Mel spectrum, the Mel filter bank mapping the non-uniform frequency axis onto a uniform scale; finally, cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the features of that frame of speech. Cepstral analysis here means taking the logarithm of the Mel spectrum and then applying an inverse transform; in practice the inverse transform is generally implemented with the DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
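  • The framing, Mel filter bank, log, and DCT steps just described can be sketched as follows. This is an illustrative Python/NumPy sketch, not the applicant's implementation; the frame length, hop size, FFT size, and number of Mel filters are assumed values, and the input is assumed to be a NumPy array at least one frame long.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=12):
    """Per-frame MFCCs (n_frames x 12), keeping DCT coefficients 2-13 as described above."""
    signal = np.asarray(signal, dtype=float)

    # 1. Split the speech into overlapping, windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Power spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 3. Mel filter bank: map the non-uniform Hz axis onto the uniform Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        fbank[m - 1, bins[m - 1]:bins[m]] = np.linspace(0.0, 1.0, bins[m] - bins[m - 1], endpoint=False)
        fbank[m - 1, bins[m]:bins[m + 1]] = np.linspace(1.0, 0.0, bins[m + 1] - bins[m], endpoint=False)
    mel_spec = np.maximum(power @ fbank.T, 1e-10)

    # 4. Cepstral analysis: log of the Mel spectrum followed by a DCT;
    #    the 2nd through 13th coefficients are this frame's MFCC features.
    return dct(np.log(mel_spec), type=2, axis=1, norm='ortho')[:, 1:1 + n_ceps]
```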
  • In this way, the MFCCs of all frames of a speech sample are assembled into a vector matrix, and a universal background model (UBM) together with a feature speech vector (i-vector) extractor is used to filter out the most essential vector of that matrix; this vector serves as the feature speech vector of the speech. Filtering the most essential vector out of the matrix with a UBM and an i-vector extractor is an existing algorithm for vector-matrix computation and is not described further here.
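  • The application treats the UBM and the i-vector extractor as existing algorithms and does not detail them, so no faithful implementation is attempted here. To keep the later sketches runnable end to end, the snippet below uses a deliberately crude stand-in (mean and standard deviation of the MFCC frames) in place of a real i-vector; it is purely illustrative and is not the extractor the application describes.

```python
import numpy as np

def utterance_vector(mfcc_matrix: np.ndarray) -> np.ndarray:
    """Crude stand-in for the UBM + i-vector extractor: collapse the per-frame
    MFCC matrix into one fixed-length utterance vector. A real system would
    instead return the i-vector computed from a trained UBM and a
    total-variability matrix."""
    return np.concatenate([mfcc_matrix.mean(axis=0), mfcc_matrix.std(axis=0)])
```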
  • The clustering module 302 is configured to cluster the first feature speech vectors into K classes by K-means clustering.
  • In this embodiment, the step in which the clustering module 302 clusters the first feature speech vectors into K classes by K-means clustering specifically includes: selecting K of the first feature speech vectors as the samples for K-means clustering; taking those K first feature speech vectors as the cluster centers; and grouping each of the first feature speech vectors with whichever cluster center it is closest to, so that all of the first feature speech vectors are clustered into K classes.
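  • A sketch of the clustering step, assuming the enrolled first feature speech vectors are already stacked in a NumPy array with one row per registered speech. Because the application's centroid is an actual enrolled vector, the sketch runs K-means and then picks, for each class, the enrolled vector closest to the K-means mean; scikit-learn's KMeans is used here for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_enrolled(enrolled: np.ndarray, k: int):
    """Cluster enrolled vectors into k classes and pick one enrolled vector per
    class as its centroid (the member nearest the K-means mean)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(enrolled)
    labels = km.labels_                      # class index of every enrolled vector
    centroid_idx = np.empty(k, dtype=int)
    for c in range(k):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(enrolled[members] - km.cluster_centers_[c], axis=1)
        centroid_idx[c] = members[dists.argmin()]   # enrolled vector representing class c
    return labels, centroid_idx
```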
  • The centroid acquisition module 303 is configured to obtain the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs.
  • The calculating module 301 is further configured to acquire a verification speech of the user and calculate a second feature speech vector of the verification speech.
  • In this embodiment, the centroid acquisition module 303 obtains the centroid of each of the K classes; the so-called centroid is in essence one of the first feature speech vectors in the class it belongs to, that is, the class can be represented by this single centroid.
  • The step in which the calculating module 301 calculates the second feature speech vector of the verification speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and using the UBM (universal background model) and the i-vector extractor to filter out the most essential features of the second matrix, which form the second feature speech vector.
  • The comparison module 304 is configured to compare the second feature speech vector with the centroid of each class.
  • In this embodiment, the comparison module 304 scores each pairing of the second feature speech vector and a class centroid using the dot-product (vector inner product) algorithm and the PLDA algorithm. Both the vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described further here.
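  • The application names vector dot-product and PLDA scoring, both as existing algorithms. As a hedged stand-in, the sketch below scores a pair with a cosine-based distance, so that, as in the text, the lowest score marks the best match; a production system would substitute a trained PLDA scorer here.

```python
import numpy as np

def pair_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine-based distance between two feature speech vectors (lower = more
    similar). Illustrative stand-in for the dot-product + PLDA scoring named
    in the application."""
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return 1.0 - cos
```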
  • The determining module 305 is configured to determine, according to the comparison results, the class to which the verification speech belongs.
  • In this embodiment, the step in which the determining module 305 determines the class of the verification speech from the comparison results specifically includes: selecting, from the comparison results, the pairwise comparison with the lowest score; and assigning the verification speech to the class corresponding to that lowest-scoring pairwise comparison, i.e., the class of the centroid whose pairwise score is lowest.
  • The comparison module 304 is further configured to, after the class of the verification speech has been determined, compare the second feature speech vector with all of the first feature speech vectors in that class and output the comparison result.
  • In this embodiment, the centroid is itself a first feature speech vector (i-vector). The second feature speech vector (i-vector) of the speech to be recognized is compared with the K centroid i-vectors, and the closest class is selected, i.e., the class whose centroid produced the lowest pairwise comparison score; the speech to be recognized is then judged to belong to that class. Only K comparisons are needed at this point, saving a great deal of time. After the class of the verification speech has been selected, the second feature speech vector is compared one by one with every other first feature speech vector in that class; the first feature speech vector at the smallest distance is the one most similar to the speech being recognized, and the two are most likely to come from the same speaker.
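  • Putting the pieces together, here is a sketch of the two-stage comparison: first against the K class centroids, then against every enrolled vector inside the winning class. It assumes the labels and centroid_idx produced by the clustering sketch above and reuses the same cosine-based stand-in score.

```python
import numpy as np

def cosine_distance(a, b):
    # Same cosine-based stand-in score as in the earlier sketch (lower = more similar).
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def compare(verify_vec, enrolled, labels, centroid_idx, speaker_ids):
    """Two-stage comparison: K centroid comparisons, then an exhaustive search
    of the winning class only, instead of all N enrolled vectors."""
    # Stage 1: class whose centroid scores lowest against the verification vector.
    best_class = int(np.argmin([cosine_distance(verify_vec, enrolled[i]) for i in centroid_idx]))

    # Stage 2: compare with every first feature speech vector inside that class.
    members = np.where(labels == best_class)[0]
    scores = np.array([cosine_distance(verify_vec, enrolled[i]) for i in members])
    best = members[int(scores.argmin())]
    return speaker_ids[best], float(scores.min())   # most likely same-speaker match
```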
  • Through the above program modules 301-305, the voice comparison program 300 proposed by the present application first calculates a first feature speech vector for each registered speech; it then clusters the first feature speech vectors into K classes by K-means clustering; next, it obtains the centroid of each of the K classes, the centroid being one of the first feature speech vectors in its class; it acquires the user's verification speech and calculates its second feature speech vector; finally, it compares the second feature speech vector with the centroid of each class, determines from the comparison results the class to which the verification speech belongs, and, once that class is determined, compares the second feature speech vector with all of the first feature speech vectors in that class and outputs the comparison result. This avoids the drawback of existing voiceprint systems, which must compare the verification speech against all N enrolled speakers in the voiceprint library and therefore take a lot of time, thereby improving the efficiency of voice recognition and promoting the popularization and industrialization of voice recognition technology.
  • In addition, the present application also proposes a voice comparison method.
  • Referring to FIG. 4, it is a schematic flowchart of the implementation of the first embodiment of the voice comparison method of the present application.
  • In this embodiment, the order of execution of the steps in the flowchart shown in FIG. 4 may be changed according to different requirements, and some steps may be omitted.
  • Step S401: calculate a first feature speech vector of each registered speech.
  • In this embodiment, the step in which the terminal 100 calculates the first feature speech vector of a registered speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix; and using a UBM and an i-vector extractor to filter out the most essential features of the first matrix, which form the first feature speech vector.
  • MFCC stands for Mel-Frequency Cepstral Coefficients and involves two key steps: conversion to the Mel frequency scale, followed by cepstral analysis. In this embodiment, each speech sample is first split into frames and the spectrum of each frame is obtained; each spectrum is then passed through a Mel filter bank to obtain the Mel spectrum, the Mel filter bank mapping the non-uniform frequency axis onto a uniform scale; finally, cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the features of that frame of speech. Cepstral analysis here means taking the logarithm of the Mel spectrum and then applying an inverse transform; in practice the inverse transform is generally implemented with the DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
  • In this way, the MFCCs of all frames of a speech sample are assembled into a vector matrix, and a universal background model (UBM) together with a feature speech vector (i-vector) extractor is used to filter out the most essential vector of that matrix; this vector serves as the feature speech vector of the speech. Filtering the most essential vector out of the matrix with a UBM and an i-vector extractor is an existing algorithm for vector-matrix computation and is not described further here.
  • Step S402: cluster the first feature speech vectors into K classes by K-means clustering.
  • In this embodiment, the step in which the terminal 100 clusters the first feature speech vectors into K classes by K-means clustering specifically includes: selecting K of the first feature speech vectors as the samples for K-means clustering; taking those K first feature speech vectors as the cluster centers; and grouping each of the first feature speech vectors with whichever cluster center it is closest to, so that all of the first feature speech vectors are clustered into K classes.
  • Step S403: obtain the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs.
  • In this embodiment, the terminal 100 obtains the centroid of each of the K classes; the so-called centroid is in essence one of the first feature speech vectors in the class it belongs to, that is, the class can be represented by a single centroid.
  • Step S404: acquire a verification speech of the user and calculate a second feature speech vector of the verification speech.
  • In this embodiment, the step in which the terminal 100 calculates the second feature speech vector of the verification speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and using the UBM (universal background model) and the i-vector extractor to filter out the most essential features of the second matrix, which form the second feature speech vector.
  • Step S405: compare the second feature speech vector with the centroid of each class.
  • In this embodiment, the terminal 100 scores each pairing of the second feature speech vector and a class centroid using the dot-product (vector inner product) algorithm and the PLDA algorithm. Both the vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described further here.
  • Step S406: determine, according to the comparison results, the class to which the verification speech belongs. In this embodiment, the step in which the terminal 100 determines the class of the verification speech from the comparison results specifically includes: selecting, from the comparison results, the pairwise comparison with the lowest score; and assigning the verification speech to the class corresponding to that lowest-scoring pairwise comparison, i.e., the class of the centroid whose pairwise score is lowest.
  • Step S407: after the class to which the verification speech belongs has been determined, compare the second feature speech vector with all of the first feature speech vectors in that class, and output the comparison result.
  • In this embodiment, the centroid is itself a first feature speech vector (i-vector). The second feature speech vector (i-vector) of the speech to be recognized is compared with the K centroid i-vectors, and the closest class is selected, i.e., the class whose centroid produced the lowest pairwise comparison score; the speech to be recognized is then judged to belong to that class. Only K comparisons are needed at this point, saving a great deal of time. After the class of the verification speech has been selected, the second feature speech vector is compared one by one with every other first feature speech vector in that class; the first feature speech vector at the smallest distance is the one most similar to the speech being recognized, and the two are most likely to come from the same speaker.
  • Through the above steps S401-S407, the voice comparison method proposed by the present application first calculates a first feature speech vector for each registered speech; it then clusters the first feature speech vectors into K classes by K-means clustering; next, it obtains the centroid of each of the K classes, the centroid being one of the first feature speech vectors in its class; it acquires the user's verification speech and calculates its second feature speech vector; finally, it compares the second feature speech vector with the centroid of each class, determines from the comparison results the class to which the verification speech belongs, and, once that class is determined, compares the second feature speech vector with all of the first feature speech vectors in that class and outputs the comparison result.
  • This avoids the drawback of existing voiceprint systems, which must compare the verification speech against all N enrolled speakers in the voiceprint library and therefore take a lot of time, thereby improving the efficiency of voice recognition and promoting the popularization and industrialization of voice recognition technology.
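  • For orientation only, the sketches above can be chained to realize steps S401-S407 end to end. The names registered_signals, verify_signal, speaker_ids, and the sample rate sr are hypothetical placeholders, and the choice K near the square root of N is an assumed heuristic rather than something the application specifies.

```python
import numpy as np

# S401: first feature speech vectors of the registered speeches (sketch stand-ins).
enrolled = np.stack([utterance_vector(mfcc(sig, sr)) for sig in registered_signals])

# S402-S403: cluster into K classes and keep one enrolled vector per class as centroid.
K = max(1, int(np.sqrt(len(enrolled))))          # assumed heuristic, not from the application
labels, centroid_idx = cluster_enrolled(enrolled, K)

# S404: second feature speech vector of the verification speech.
verify_vec = utterance_vector(mfcc(verify_signal, sr))

# S405-S407: compare with the K centroids, then with every vector in the chosen class.
match_id, score = compare(verify_vec, enrolled, labels, centroid_idx, speaker_ids)
print(match_id, score)
```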
  • The present application further provides another embodiment, namely a computer-readable storage medium storing a voice comparison program, the voice comparison program being executable by at least one processor to cause the at least one processor to perform the steps of the voice comparison method described above.
  • From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

A voice comparison method, applied to a terminal, includes the steps of: calculating a first feature speech vector of each registered speech (S401); clustering the first feature speech vectors into K classes by K-means clustering (S402); obtaining the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs (S403); acquiring a verification speech of a user and calculating a second feature speech vector of the verification speech (S404); comparing the second feature speech vector with the centroid of each class (S405); determining, according to the comparison results, the class to which the verification speech belongs (S406); and, after the class of the verification speech is determined, comparing the second feature speech vector with all of the first feature speech vectors in that class and outputting the comparison result (S407). A terminal and a computer-readable storage medium are also provided. In this way, the computation time of voice comparison can be greatly reduced, enabling the terminal to return results in real time.

Description

Voice comparison method, terminal, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201810019441.4, entitled "Voice comparison method, terminal and computer-readable storage medium" (语音对比方法、终端及计算机可读存储介质), filed with the Chinese Patent Office on January 9, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communications technologies, and in particular to a voice comparison method, a terminal, and a computer-readable storage medium.
Background
With the continuous development of speech recognition technology, more and more applications support speech recognition, such as voice unlocking and voice payment. A very important step in applying speech recognition technology is comparing the verification speech against the N registered speeches in the voiceprint library. The traditional voice comparison currently uses a 1-to-N system; however, when a 1-to-N system performs recognition, the i-vector of the speech to be recognized must be compared separately with the i-vectors (feature speech vectors) of the N speeches in the voiceprint library, requiring N computations in total. This consumes a great deal of time and makes the comparison inefficient, which in turn hinders the popularization and application of speech recognition technology.
Summary of the Invention
In view of the above, the present application proposes a voice comparison method, a terminal, and a computer-readable storage medium. By implementing the approach described below, the drawback of existing voiceprint systems, which must compare the verification speech against each of the N enrolled speakers in the voiceprint library and therefore take a lot of time, can be overcome, thereby improving the efficiency of voice recognition and promoting the popularization and industrialization of voice recognition technology.
First, to achieve the above object, the present application proposes a terminal. The terminal includes a memory and a processor, the memory storing a voice comparison program that can run on the processor. When executed by the processor, the voice comparison program implements the following steps: calculating a first feature speech vector of each registered speech; clustering the first feature speech vectors into K classes by K-means clustering; obtaining the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs; acquiring a verification speech of a user and calculating a second feature speech vector of the verification speech; comparing the second feature speech vector with the centroid of each class; determining, according to the comparison results, the class to which the verification speech belongs; after the class of the verification speech is determined, comparing the second feature speech vector with all of the first feature speech vectors in that class; and outputting the comparison result.
In addition, to achieve the above object, the present application further provides a voice comparison method applied to a terminal. The method includes: calculating a first feature speech vector of each registered speech; clustering the first feature speech vectors into K classes by K-means clustering; obtaining the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs; acquiring a verification speech of a user and calculating a second feature speech vector of the verification speech; comparing the second feature speech vector with the centroid of each class; determining, according to the comparison results, the class to which the verification speech belongs; after the class of the verification speech is determined, comparing the second feature speech vector with all of the first feature speech vectors in that class; and outputting the comparison result.
Further, to achieve the above object, the present application also provides a computer-readable storage medium storing a voice comparison program, the voice comparison program being executable by at least one processor to cause the at least one processor to perform the steps of the voice comparison method described above.
Compared with the prior art, the voice comparison method, terminal, and computer-readable storage medium proposed by the present application first calculate a first feature speech vector for each registered speech; then cluster the first feature speech vectors into K classes by K-means clustering; next obtain the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs; acquire the user's verification speech and calculate its second feature speech vector; and finally compare the second feature speech vector with the centroid of each class, determine from the comparison results the class to which the verification speech belongs, and, once that class is determined, compare the second feature speech vector with all of the first feature speech vectors in that class and output the comparison result. In this way, the drawback of existing voiceprint systems, which must compare the verification speech against each of the N enrolled speakers in the voiceprint library and therefore take a lot of time, can be resolved, thereby improving the efficiency of voice recognition and promoting the popularization and industrialization of voice recognition technology. Moreover, the computation time is greatly reduced, enabling the terminal to return results in real time.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the hardware structure of a terminal implementing various embodiments of the present application;
FIG. 2 is an architecture diagram of a communication network system according to an embodiment of the present application;
FIG. 3 is a program module diagram of an embodiment of the voice comparison program of the present application;
FIG. 4 is a flowchart of an embodiment of the voice comparison method of the present application.
The realization of the objects, the functional features, and the advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the description of the present application and have no specific meaning in themselves. Therefore, "module", "component", and "unit" can be used interchangeably.
The terminal can be implemented in various forms. For example, the terminal described in the present application may include mobile terminals such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a personal digital assistant (PDA), a portable media player (PMP), a navigation device, a wearable device, a smart bracelet, or a pedometer, as well as fixed terminals such as a digital TV or a desktop computer.
The following description takes a mobile terminal as an example. Those skilled in the art will understand that, apart from elements used specifically for mobile purposes, the constructions according to the embodiments of the present application can also be applied to fixed-type terminals.
Referring to FIG. 1, which is a schematic diagram of the hardware structure of a terminal 100 implementing various embodiments of the present application, the terminal 100 may include components such as an RF (Radio Frequency) unit 101, a WiFi module 102, an audio output unit 103, an A/V (audio/video) input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and a power supply 111. Those skilled in the art will understand that the structure of the terminal 100 shown in FIG. 1 does not constitute a limitation on the terminal 100; the terminal 100 may include more or fewer components than shown, combine certain components, or arrange the components differently.
Although not shown in FIG. 1, the terminal 100 may further include a Bluetooth module and the like, which will not be described here.
To facilitate understanding of the embodiments of the present application, the communication network system on which the terminal 100 of the present application is based is described below.
Referring to FIG. 2, FIG. 2 is an architecture diagram of a communication network system according to an embodiment of the present application. The communication network system is an LTE system of universal mobile telecommunication technology, and the LTE system includes, communicatively connected in sequence, a UE (User Equipment) 201, an E-UTRAN (Evolved UMTS Terrestrial Radio Access Network) 202, an EPC (Evolved Packet Core) 203, and an operator's IP service 204.
Based on the above hardware structure of the terminal 100 and the communication network system, various embodiments of the method of the present application are proposed.
First, the present application proposes a voice comparison program 300; the voice comparison program 300 is executed by the terminal 100 described above with reference to FIG. 1, possibly in conjunction with the communication network, so as to realize the corresponding functions.
Referring to FIG. 3, it is a program module diagram of the first embodiment of the voice comparison program 300 of the present application.
In this embodiment, the voice comparison program 300 includes a series of computer program instructions stored in the memory 109; when these computer program instructions are executed by the processor 110, the voice comparison operations of the embodiments of the present application can be realized. In some embodiments, based on the specific operations implemented by the respective parts of the computer program instructions, the voice comparison program 300 may be divided into one or more modules. For example, in FIG. 3, the voice comparison program 300 may be divided into a calculating module 301, a clustering module 302, a centroid acquisition module 303, a comparison module 304, and a determining module 305, where:
The calculating module 301 is configured to calculate a first feature speech vector of each registered speech. The registered speech may be captured with a voice input device on the terminal 100, such as a microphone, or of course obtained from a remote voice acquisition device over the communication network; this application does not limit how it is obtained.
In this embodiment, the step in which the calculating module 301 calculates the first feature speech vector of a registered speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix; and using a UBM and a speech vector extractor (i-vector extractor) to filter out the most essential features of the first matrix, which form the first feature speech vector.
Here, MFCC stands for Mel-Frequency Cepstral Coefficients and involves two key steps: conversion to the Mel frequency scale, followed by cepstral analysis. In this embodiment, each speech sample is first split into frames and the spectra of the frames are obtained; the spectra are then passed through a Mel filter bank to obtain the Mel spectrum, the Mel filter bank mapping the non-uniform frequency axis onto a uniform scale; finally, cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the features of that frame of speech. Cepstral analysis means taking the logarithm of the Mel spectrum and then applying an inverse transform; in practice the inverse transform is generally implemented with the DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients. In this way, the MFCCs of all frames of a speech sample are assembled into a vector matrix, and a universal background model (UBM) together with a feature speech vector (i-vector) extractor is used to filter out the most essential vector of the matrix; this vector serves as the feature speech vector of the speech. Filtering the most essential vector out of the matrix with a UBM and an i-vector extractor is an existing algorithm for vector-matrix computation and is not described further here.
The clustering module 302 is configured to cluster the first feature speech vectors into K classes by K-means clustering.
In this embodiment, the step in which the clustering module 302 clusters the first feature speech vectors into K classes by K-means clustering specifically includes: selecting K of the first feature speech vectors as the samples for K-means clustering; taking those K first feature speech vectors as the cluster centers; and grouping each of the first feature speech vectors with whichever cluster center it is closest to, so that all of the first feature speech vectors are clustered into K classes.
The centroid acquisition module 303 is configured to obtain the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs. The calculating module 301 is further configured to acquire a verification speech of the user and calculate a second feature speech vector of the verification speech.
In this embodiment, the centroid acquisition module 303 obtains the centroid of each of the K classes; the so-called centroid is in essence one of the first feature speech vectors in the class it belongs to, that is, the class can be represented by this single centroid. The step in which the calculating module 301 calculates the second feature speech vector of the verification speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and using the UBM (universal background model) and the speech vector extractor (i-vector extractor) to filter out the most essential features of the second matrix, which form the second feature speech vector.
The comparison module 304 is configured to compare the second feature speech vector with the centroid of each class.
In this embodiment, the comparison module 304 scores each pairing of the second feature speech vector and a class centroid using the dot-product (vector inner product) algorithm and the PLDA algorithm. The vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described further here.
The determining module 305 is configured to determine, according to the comparison results, the class to which the verification speech belongs.
In this embodiment, the step in which the determining module 305 determines the class of the verification speech from the comparison results specifically includes: selecting, from the comparison results, the pairwise comparison with the lowest score; and assigning the verification speech to the class corresponding to that lowest-scoring pairwise comparison, i.e., the class of the centroid whose pairwise score is lowest.
The comparison module 304 is further configured to, after the class of the verification speech has been determined, compare the second feature speech vector with all of the first feature speech vectors in that class and output the comparison result.
In this embodiment, the centroid described above is itself a first feature speech vector (i-vector). The second feature speech vector (i-vector) of the speech to be recognized is compared with the K centroid i-vectors, and the closest class is selected, i.e., the class whose centroid produced the lowest pairwise comparison score; the speech to be recognized is then judged to belong to that class. Only K comparisons are needed at this point, saving a great deal of time. Of course, after the class of the verification speech has been selected, the second feature speech vector of the speech to be recognized is compared one by one with every other first feature speech vector in that class; the closest vector obtained is the one most similar to the speech being recognized, and the two are most likely to come from the same speaker.
Through the above program modules 301-305, the voice comparison program 300 proposed by the present application first calculates a first feature speech vector for each registered speech; it then clusters the first feature speech vectors into K classes by K-means clustering; next, it obtains the centroid of each of the K classes, the centroid being one of the first feature speech vectors in its class; it acquires the user's verification speech and calculates its second feature speech vector; finally, it compares the second feature speech vector with the centroid of each class, determines from the comparison results the class to which the verification speech belongs, and, once that class is determined, compares the second feature speech vector with all of the first feature speech vectors in that class and outputs the comparison result. In this way, the drawback of existing voiceprint systems, which must compare the verification speech against each of the N enrolled speakers in the voiceprint library and therefore take a lot of time, can be resolved, thereby improving the efficiency of voice recognition and promoting the popularization and industrialization of voice recognition technology.
In addition, the present application also proposes a voice comparison method.
Referring to FIG. 4, it is a schematic flowchart of the implementation of the first embodiment of the voice comparison method of the present application. In this embodiment, the order of the steps in the flowchart shown in FIG. 4 may be changed according to different requirements, and some steps may be omitted.
Step S401: calculate a first feature speech vector of each registered speech.
In this embodiment, the step in which the terminal 100 calculates the first feature speech vector of a registered speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix; and using a UBM and a speech vector extractor (i-vector extractor) to filter out the most essential features of the first matrix, which form the first feature speech vector.
Here, MFCC stands for Mel-Frequency Cepstral Coefficients and involves two key steps: conversion to the Mel frequency scale, followed by cepstral analysis. In this embodiment, each speech sample is first split into frames and the spectra of the frames are obtained; the spectra are then passed through a Mel filter bank to obtain the Mel spectrum, the Mel filter bank mapping the non-uniform frequency axis onto a uniform scale; finally, cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the features of that frame of speech. Cepstral analysis means taking the logarithm of the Mel spectrum and then applying an inverse transform; in practice the inverse transform is generally implemented with the DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients. In this way, the MFCCs of all frames of a speech sample are assembled into a vector matrix, and a universal background model (UBM) together with a feature speech vector (i-vector) extractor is used to filter out the most essential vector of the matrix; this vector serves as the feature speech vector of the speech. Filtering the most essential vector out of the matrix with a UBM and an i-vector extractor is an existing algorithm for vector-matrix computation and is not described further here.
Step S402: cluster the first feature speech vectors into K classes by K-means clustering.
In this embodiment, the step in which the terminal 100 clusters the first feature speech vectors into K classes by K-means clustering specifically includes: selecting K of the first feature speech vectors as the samples for K-means clustering; taking those K first feature speech vectors as the cluster centers; and grouping each of the first feature speech vectors with whichever cluster center it is closest to, so that all of the first feature speech vectors are clustered into K classes.
Step S403: obtain the centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which it belongs. In this embodiment, the terminal 100 obtains the centroid of each of the K classes; the so-called centroid is in essence one of the first feature speech vectors in the class it belongs to, that is, the class can be represented by a single centroid.
Step S404: acquire a verification speech of the user and calculate a second feature speech vector of the verification speech.
The step in which the terminal 100 calculates the second feature speech vector of the verification speech specifically includes: extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and using the UBM (universal background model) and the speech vector extractor (i-vector extractor) to filter out the most essential features of the second matrix, which form the second feature speech vector.
Step S405: compare the second feature speech vector with the centroid of each class.
In this embodiment, the terminal 100 scores each pairing of the second feature speech vector and a class centroid using the dot-product (vector inner product) algorithm and the PLDA algorithm. The vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described further here.
Step S406: determine, according to the comparison results, the class to which the verification speech belongs.
In this embodiment, the step in which the terminal 100 determines the class of the verification speech from the comparison results specifically includes: selecting, from the comparison results, the pairwise comparison with the lowest score; and assigning the verification speech to the class corresponding to that lowest-scoring pairwise comparison, i.e., the class of the centroid whose pairwise score is lowest.
Step S407: after the class of the verification speech has been determined, compare the second feature speech vector with all of the first feature speech vectors in that class, and output the comparison result.
In this embodiment, the centroid described above is itself a first feature speech vector (i-vector). The second feature speech vector (i-vector) of the speech to be recognized is compared with the K centroid i-vectors, and the closest class is selected, i.e., the class whose centroid produced the lowest pairwise comparison score; the speech to be recognized is then judged to belong to that class. Only K comparisons are needed at this point, saving a great deal of time. Of course, after the class of the verification speech has been selected, the second feature speech vector of the speech to be recognized is compared one by one with every other first feature speech vector in that class; the closest vector obtained is the one most similar to the speech being recognized, and the two are most likely to come from the same speaker.
Through the above steps S401-S407, the voice comparison method proposed by the present application first calculates a first feature speech vector for each registered speech; it then clusters the first feature speech vectors into K classes by K-means clustering; next, it obtains the centroid of each of the K classes, the centroid being one of the first feature speech vectors in its class; it acquires the user's verification speech and calculates its second feature speech vector; finally, it compares the second feature speech vector with the centroid of each class, determines from the comparison results the class to which the verification speech belongs, and, once that class is determined, compares the second feature speech vector with all of the first feature speech vectors in that class and outputs the comparison result. In this way, the drawback of existing voiceprint systems, which must compare the verification speech against each of the N enrolled speakers in the voiceprint library and therefore take a lot of time, can be resolved, thereby improving the efficiency of voice recognition and promoting the popularization and industrialization of voice recognition technology.
The present application also provides another embodiment, namely a computer-readable storage medium storing a voice comparison program, the voice comparison program being executable by at least one processor to cause the at least one processor to perform the steps of the voice comparison method described above.
The serial numbers of the above embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A voice comparison method applied to a terminal, wherein the method comprises the steps of:
    calculating a first feature speech vector of each registered speech;
    clustering the first feature speech vectors into K classes by K-means clustering;
    obtaining a centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which the centroid belongs;
    acquiring a verification speech of a user and calculating a second feature speech vector of the verification speech;
    comparing the second feature speech vector with the centroid of each class;
    determining, according to the comparison results, the class to which the verification speech belongs;
    after the class to which the verification speech belongs is determined, comparing the second feature speech vector with all of the first feature speech vectors in the class to which the verification speech belongs; and
    outputting the comparison result.
  2. The voice comparison method according to claim 1, wherein the step of calculating a first feature speech vector of each registered speech comprises:
    extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix;
    filtering out the most essential features of the first matrix with a UBM universal background model and a speech vector extractor to form the first feature speech vector;
    and the step of calculating the second feature speech vector of the verification speech comprises:
    extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and
    filtering out the most essential features of the second matrix with the UBM universal background model and the speech vector extractor to form the second feature speech vector.
  3. The voice comparison method according to claim 1, wherein the step of clustering the first feature speech vectors into K classes by K-means clustering comprises:
    selecting K of the first feature speech vectors as samples for K-means clustering;
    taking the K first feature speech vectors in the K-means clustering samples as cluster centers; and
    grouping each of the first feature speech vectors with whichever of the cluster centers it is closest to, thereby clustering all of the first feature speech vectors into K classes.
  4. The voice comparison method according to claim 1, wherein the step of comparing the second feature speech vector with the centroid of each class comprises:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  5. The voice comparison method according to claim 2, wherein the step of comparing the second feature speech vector with the centroid of each class comprises:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  6. The voice comparison method according to claim 3, wherein the step of comparing the second feature speech vector with the centroid of each class comprises:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  7. The voice comparison method according to claim 4, wherein the step of determining, according to the comparison results, the class to which the verification speech belongs comprises:
    selecting, from the comparison results, the pairwise comparison score that is lowest; and
    classifying the verification speech into the class corresponding to the lowest pairwise comparison score.
  8. A terminal, wherein the terminal comprises a memory and a processor, the memory storing a voice comparison program operable on the processor, and the voice comparison program, when executed by the processor, implements the following steps:
    calculating a first feature speech vector of each registered speech;
    clustering the first feature speech vectors into K classes by K-means clustering;
    obtaining a centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which the centroid belongs;
    acquiring a verification speech of a user and calculating a second feature speech vector of the verification speech;
    comparing the second feature speech vector with the centroid of each class;
    determining, according to the comparison results, the class to which the verification speech belongs;
    after the class to which the verification speech belongs is determined, comparing the second feature speech vector with all of the first feature speech vectors in the class to which the verification speech belongs; and
    outputting the comparison result.
  9. The terminal according to claim 8, wherein, when performing the step of calculating a first feature speech vector of each registered speech, the processor performs the following steps:
    extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix;
    filtering out the most essential features of the first matrix with a UBM universal background model and a speech vector extractor to form the first feature speech vector;
    and the step of calculating the second feature speech vector of the verification speech comprises:
    extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and
    filtering out the most essential features of the second matrix with the UBM universal background model and the speech vector extractor to form the second feature speech vector.
  10. The terminal according to claim 8, wherein, when performing the step of clustering the first feature speech vectors into K classes by K-means clustering, the processor performs the following steps:
    selecting K of the first feature speech vectors as samples for K-means clustering;
    taking the K first feature speech vectors in the K-means clustering samples as cluster centers; and
    grouping each of the first feature speech vectors with whichever of the cluster centers it is closest to, thereby clustering all of the first feature speech vectors into K classes.
  11. The terminal according to claim 8, wherein, when performing the step of comparing the second feature speech vector with the centroid of each class, the processor performs the following step:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  12. The terminal according to claim 9, wherein, when performing the step of comparing the second feature speech vector with the centroid of each class, the processor performs the following step:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  13. The terminal according to claim 10, wherein, when performing the step of comparing the second feature speech vector with the centroid of each class, the processor performs the following step:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  14. The terminal according to claim 8, wherein the step of determining, according to the comparison results, the class to which the verification speech belongs comprises:
    selecting, from the comparison results, the pairwise comparison score that is lowest; and
    classifying the verification speech into the class corresponding to the lowest pairwise comparison score.
  15. A computer-readable storage medium storing a voice comparison program, the voice comparison program being executable by at least one processor to cause the at least one processor to perform the following steps:
    calculating a first feature speech vector of each registered speech;
    clustering the first feature speech vectors into K classes by K-means clustering;
    obtaining a centroid of each of the K classes, the centroid being one of the first feature speech vectors in the class to which the centroid belongs;
    acquiring a verification speech of a user and calculating a second feature speech vector of the verification speech;
    comparing the second feature speech vector with the centroid of each class;
    determining, according to the comparison results, the class to which the verification speech belongs;
    after the class to which the verification speech belongs is determined, comparing the second feature speech vector with all of the first feature speech vectors in the class to which the verification speech belongs; and
    outputting the comparison result.
  16. The computer-readable storage medium according to claim 15, wherein, when performing the step of calculating a first feature speech vector of each registered speech, the processor performs the following steps:
    extracting, with the MFCC method, the MFCC features of each frame of the registered speech and assembling them into a first matrix;
    filtering out the most essential features of the first matrix with a UBM universal background model and a speech vector extractor to form the first feature speech vector;
    and the step of calculating the second feature speech vector of the verification speech comprises:
    extracting, with the MFCC method, the MFCC features of each frame of the verification speech and assembling them into a second matrix; and
    filtering out the most essential features of the second matrix with the UBM universal background model and the speech vector extractor to form the second feature speech vector.
  17. The computer-readable storage medium according to claim 15, wherein, when performing the step of clustering the first feature speech vectors into K classes by K-means clustering, the processor performs the following steps:
    selecting K of the first feature speech vectors as samples for K-means clustering;
    taking the K first feature speech vectors in the K-means clustering samples as cluster centers; and
    grouping each of the first feature speech vectors with whichever of the cluster centers it is closest to, thereby clustering all of the first feature speech vectors into K classes.
  18. The computer-readable storage medium according to claim 15, wherein, when performing the step of comparing the second feature speech vector with the centroid of each class, the processor performs the following step:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  19. The computer-readable storage medium according to claim 16, wherein, when performing the step of comparing the second feature speech vector with the centroid of each class, the processor performs the following step:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
  20. The computer-readable storage medium according to claim 17, wherein, when performing the step of comparing the second feature speech vector with the centroid of each class, the processor performs the following step:
    scoring each pairing of the second feature speech vector and the centroid of each class using the vector dot-product algorithm and the PLDA algorithm.
PCT/CN2018/077626 2018-01-09 2018-02-28 Voice comparison method, terminal and computer-readable storage medium WO2019136811A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810019441.4 2018-01-09
CN201810019441.4A CN108417226A (zh) 2018-01-09 2018-08-17 Voice comparison method, terminal and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2019136811A1 true WO2019136811A1 (zh) 2019-07-18

Family

ID=63125809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077626 WO2019136811A1 (zh) 2018-01-09 2018-02-28 Voice comparison method, terminal and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN108417226A (zh)
WO (1) WO2019136811A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11341185B1 (en) * 2018-06-19 2022-05-24 Amazon Technologies, Inc. Systems and methods for content-based indexing of videos at web-scale

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986698B (zh) 2019-05-24 2023-06-30 腾讯科技(深圳)有限公司 音频片段的匹配方法、装置、计算机可读介质及电子设备
CN110648670B (zh) * 2019-10-22 2021-11-26 中信银行股份有限公司 欺诈识别方法、装置、电子设备及计算机可读存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1932974A (zh) * 2005-09-13 2007-03-21 东芝泰格有限公司 说话者识别设备、说话者识别程序、和说话者识别方法
CN102024455A (zh) * 2009-09-10 2011-04-20 索尼株式会社 说话人识别系统及其方法
CN102201236A (zh) * 2011-04-06 2011-09-28 中国人民解放军理工大学 一种高斯混合模型和量子神经网络联合的说话人识别方法
CN102324232A (zh) * 2011-09-12 2012-01-18 辽宁工业大学 基于高斯混合模型的声纹识别方法及系统
CN102509547A (zh) * 2011-12-29 2012-06-20 辽宁工业大学 基于矢量量化的声纹识别方法及系统
CN104464738A (zh) * 2014-10-31 2015-03-25 北京航空航天大学 一种面向智能移动设备的声纹识别方法
CN105845140A (zh) * 2016-03-23 2016-08-10 广州势必可赢网络科技有限公司 应用于短语音条件下的说话人确认方法和装置
CN106782564A (zh) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 用于处理语音数据的方法和装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9043207B2 (en) * 2009-11-12 2015-05-26 Agnitio S.L. Speaker recognition from telephone calls
CN103258535A (zh) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 基于声纹识别的身份识别方法及系统
CN105469784B (zh) * 2014-09-10 2019-01-08 中国科学院声学研究所 一种基于概率线性鉴别分析模型的说话人聚类方法及系统
CN105161093B (zh) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 一种判断说话人数目的方法及系统
CN105632502A (zh) * 2015-12-10 2016-06-01 江西师范大学 一种基于加权成对约束度量学习算法的说话人识别方法
CN106531170B (zh) * 2016-12-12 2019-09-17 姜卫武 基于说话人识别技术的口语测评身份认证方法
CN107452403B (zh) * 2017-09-12 2020-07-07 清华大学 一种说话人标记方法


Also Published As

Publication number Publication date
CN108417226A (zh) 2018-08-17

Similar Documents

Publication Publication Date Title
WO2019134247A1 (zh) 基于声纹识别模型的声纹注册方法、终端装置及存储介质
US10957339B2 (en) Speaker recognition method and apparatus, computer device and computer-readable medium
WO2020073694A1 (zh) 一种声纹识别的方法、模型训练的方法以及服务器
WO2020155584A1 (zh) 声纹特征的融合方法及装置,语音识别方法,系统及存储介质
Shum et al. On the use of spectral and iterative methods for speaker diarization
CN108269575B (zh) 更新声纹数据的语音识别方法、终端装置及存储介质
US20110320202A1 (en) Location verification system using sound templates
WO2019136801A1 (zh) 语音数据库创建方法、声纹注册方法、装置、设备及介质
CN101894548B (zh) 一种用于语种识别的建模方法及装置
WO2019136811A1 (zh) 语音对比方法、终端及计算机可读存储介质
WO2021051608A1 (zh) 一种基于深度学习的声纹识别方法、装置及设备
CN108520752A (zh) 一种声纹识别方法和装置
Liu et al. A Spearman correlation coefficient ranking for matching-score fusion on speaker recognition
WO2016119604A1 (zh) 一种语音信息搜索方法、装置及服务器
TW202018696A (zh) 語音識別方法、裝置及計算設備
WO2021072893A1 (zh) 一种声纹聚类方法、装置、处理设备以及计算机存储介质
Chin et al. Speaker identification using discriminative features and sparse representation
WO2020140609A1 (zh) 一种语音识别方法、设备及计算机可读存储介质
CN111640438A (zh) 音频数据处理方法、装置、存储介质及电子设备
JP6996627B2 (ja) 情報処理装置、制御方法、及びプログラム
JP2003535376A (ja) 分類システムの反復訓練用の方法と装置
TWI778234B (zh) 語者驗證系統
CN109920408B (zh) 基于语音识别的字典项设置方法、装置、设备和存储介质
WO2021051533A1 (zh) 基于地址信息的黑名单识别方法、装置、设备及存储介质
CN113035230A (zh) 认证模型的训练方法、装置及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899831

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899831

Country of ref document: EP

Kind code of ref document: A1