WO2020001182A1 - Voiceprint recognition method, electronic device, and computer readable storage medium - Google Patents

Voiceprint recognition method, electronic device, and computer readable storage medium

Info

Publication number
WO2020001182A1
Authority
WO
WIPO (PCT)
Prior art keywords
subsystem
voice data
subsystems
weight
change factor
Prior art date
Application number
PCT/CN2019/086767
Other languages
French (fr)
Chinese (zh)
Inventor
郑能恒
林�吉
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Publication of WO2020001182A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present application relates to the field of electronic technology, and in particular, to a voiceprint recognition method, an electronic device, and a computer-readable storage medium.
  • a voiceprint is the unique sound characteristic of each person; in real life, each person's speaking voice has its own characteristics. Generally speaking, voiceprint recognition is divided into the following types: emotion recognition, age recognition, language recognition, gender recognition, speaker recognition, and so on.
  • a fusion strategy using linear logistic regression: the central idea of this strategy is, for a hybrid system with N subsystems, to normalize the score of each subsystem to a common interval, and then use a development set to train a fusion weight for each subsystem i while training an overall offset.
  • the embodiments of the present application provide a voiceprint recognition method, an electronic device, and a computer-readable storage medium, which are used to improve the accuracy of voiceprint recognition by setting appropriate voiceprint recognition weights.
  • a first aspect of the embodiments of the present application provides a voiceprint recognition method, including:
  • extracting a change factor feature from the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information;
  • a weight is applied to the recognition results of the respective subsystems through the final fusion weights, and a comprehensive recognition result of the voice data is obtained from the weighted recognition results of the respective subsystems.
  • the method for training the error-prone classifier includes:
  • a short-duration speech data set is used as the test data set of each subsystem, and all misjudged speech segments in the test process are labeled with N different labels according to the different subsystems, as a training database, where N is an integer greater than zero;
  • training a universal background model from the extracted MFCC features and training an overall change matrix; obtaining the change factor feature of the short-duration speech data from the overall change matrix;
  • an error-prone classifier capable of performing N-class classification is trained.
  • before the training of the error-prone classifier capable of N-class classification from the change factor features and their corresponding labels, the method includes:
  • linear discriminant analysis is used to perform channel compensation on the change factor features, to obtain dimensionality-reduced change factor features.
  • the sum of the relative misjudgment probabilities corresponding to the K subsystems is one.
  • the calculating of the final fusion weight of the corresponding subsystem according to the offset uses a formula that survives only as an image in the original; its variables (symbols reconstructed from context) are C_i(x), the final fusion weight of each of the K subsystems; x, the input speech; α_i(x), the initial fusion weight of each subsystem when the input speech is x; and μ, a relationship coefficient of C_i(x).
  • a second aspect of the embodiments of the present application provides another electronic device, including:
  • K subsystems and a dynamic weight sub-module, where K is an integer greater than zero;
  • the dynamic weight sub-module is configured to: obtain voice data to be analyzed; extract a change factor feature from the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information; misjudgment-classify the voice data according to the change factor feature through an error-prone classifier, obtaining the relative misjudgment probability that the voice data is misjudged in each of the K subsystems; determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset; and weight the recognition results of the respective subsystems by the final fusion weights, obtaining a comprehensive recognition result of the voice data from the weighted recognition results;
  • the subsystem is configured to perform preliminary voiceprint recognition on the voice data, and obtain a recognition result of the voice data.
  • the dynamic weight sub-module includes: a feature extraction unit, an error-prone classifier, a weight calculation unit, and a comprehensive calculation unit;
  • the feature extraction unit is configured to extract a change factor feature in the voice data
  • the error-prone point classifier is configured to misclassify the voice data according to the characteristics of the change factor, and obtain a relative misjudgement probability that the voice data is misjudged in the K subsystems;
  • the weight calculation unit is configured to determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and to calculate the final fusion weight of the corresponding subsystem according to the offset;
  • the comprehensive calculation unit is configured to weight the recognition results of the respective subsystems by using the final fusion weight, and obtain the comprehensive recognition results of the voice data according to the recognition results of the subsystems after weighting.
  • the weight calculation unit is further configured to:
  • the calculating of the final fusion weight of the corresponding subsystem according to the offset uses the same image-only formula, where C_i(x) is the final fusion weight of each of the K subsystems, x is the input speech, α_i(x) is the initial fusion weight of each subsystem when the input speech is x, and μ is the relationship coefficient of C_i(x).
  • a third aspect of the embodiments of the present application provides another electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the voiceprint recognition method provided in the first aspect of the embodiments of the present application is implemented.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the voiceprint recognition method provided in the first aspect of the embodiments of the present application is implemented.
  • the scheme of this application classifies the speech segments on which each subsystem has a high error rate according to the change factor features, dividing them into K classes of error-prone points, and trains a corresponding classification model; each piece of speech data to be analyzed is then classified, the prediction weight of the subsystem corresponding to the resulting label is reduced, and the final result is thereby optimized, achieving real-time evaluation and dynamic adjustment of the misjudgment rate of each subsystem.
  • FIG. 1-a is a schematic flowchart of the voiceprint recognition method provided by an embodiment of the present application;
  • FIG. 1-b is an architecture diagram of the voiceprint recognition system provided by an embodiment of the present application;
  • FIG. 1-c is a schematic flowchart of the method for training the error-prone classifier according to an embodiment of the present application;
  • FIG. 1-d is an operation flowchart of the dynamic weight sub-module provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
  • FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the present application.
  • the voiceprint recognition method mainly includes the following steps:
  • the embodiment of the present invention is applied to a voiceprint recognition system, where the voiceprint recognition system includes K subsystems, and K is an integer greater than zero.
  • the system architecture of the voiceprint recognition system according to the embodiment of the present invention may refer to FIG. 1-b.
  • each subsystem in the voiceprint recognition system may correspond to a different type of voiceprint recognition, the types including emotion recognition, age recognition, and language recognition. Further, each subsystem may also correspond to a sub-category within one recognition scenario; for example, in language recognition, one subsystem corresponds to one language (such as Chinese, English, or French). It can be understood that, in practical applications, the correspondence between subsystems and voiceprint recognition categories can be determined according to the actual situation, and is not specifically limited here.
  • the voiceprint recognition method is mainly applied to the dynamic weight sub-module in the system architecture, that is, the voice data to be analyzed can be input into the dynamic weight sub-module for weight analysis first.
  • the dynamic weight sub-module extracts a change factor feature from the voice data; the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information.
  • exemplarily, the ivector (identity vector) feature used in building the change factor feature model characterizes a large amount of information about the speaking object, such as transmission channel information, acoustic environment information, and speaker information.
  • the dynamic weighting sub-module classifies the voice data according to the characteristics of the change factor by using the error-prone classifier, and obtains the relative misjudgment probability that the voice data is misjudged in the K subsystems.
  • the classification result output by the error-prone classifier can be shown in the following table (reconstructed from the description), where x is the input voice data and P_f(S_i | x) is the relative misjudgment probability of subsystem S_i:

    S_1 | S_2 | ... | S_K
    P_f(S_1 | x) | P_f(S_2 | x) | ... | P_f(S_K | x)

  • when each subsystem's probability of misjudging a given utterance is equal, the relative misjudgment probability of each subsystem is 1/K, which is the average relative misjudgment probability.
  • the dynamic weight sub-module determines the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculates the final fusion weight of the corresponding subsystem according to the offset.
  • the relative misjudgment probability expresses the relative magnitude relationship between the misjudgment probabilities of different subsystems. For example, if the relative misjudgment probability of subsystem a is 0.1 and the relative misjudgment probability of subsystem b is 0.5, the significance is that subsystem b's misjudgment probability is greater than subsystem a's, not that subsystem b's misjudgment probability is 0.5.
  • the central idea is to use the offset between the relative misjudgment probability and the average probability as the calculation parameter of the fusion weight. At the same time, to adjust the influence of the dynamic weight sub-module on the final fused probability values, the weight values can be fine-tuned by adjusting the standard deviation of the weight array while keeping the relative ordering of the weight values unchanged.
  • the dynamic weight sub-module acquires each subsystem's recognition result for the voice data.
  • step 105 and step 101 are two branches that can be executed in parallel, that is, there is no strict timing relationship between them: before step 106, step 101 may be performed first, step 105 may be performed first, or steps 105 and 101 may be performed at the same time, which is not specifically limited here.
  • the voiceprint recognition system weights the recognition results of the respective subsystems by using the final fusion weight, and obtains the comprehensive recognition result of the voice data according to the recognition results of the subsystems after weighting.
  • exemplarily, the relative misjudgment probability P_f(S_i | x) can be passed through a function (which survives only as an image in the original) to obtain each subsystem's final fusion weight;
  • C_i(x) denotes the final fusion weight of each of the K subsystems, where x is the input speech;
  • in a general hybrid system, the number of subsystems K is fixed, so K can be regarded as a constant. As μ increases or decreases, the standard deviation of the weight array also increases or decreases non-linearly. Under normal circumstances, μ defaults to 0 and needs no adjustment; if adjustment is needed, it is recommended to keep it within [-1, 1], since values that are too large or too small may adversely affect the final fusion score. In addition, a large adjustment of μ may lead to negative probability values, but this does not affect the decision process on the fusion score.
  • the solution of this application classifies the speech segments on which each subsystem has a high error rate according to the change factor features, dividing them into K classes of error-prone points, and trains a corresponding classification model; each piece of speech data to be analyzed is then classified, the prediction weight of the subsystem corresponding to the resulting label is reduced, and the final result is thereby optimized, achieving real-time evaluation and dynamic adjustment of the misjudgment rate of each subsystem.
  • the method includes:
  • a short-duration speech data set is used as the test data set of each subsystem, and all misjudged speech segments in the test process are labeled with N different labels according to the different subsystems, as a training database, where N is an integer greater than zero.
  • the change factor feature of the short-duration speech data is obtained from the overall change matrix T, exemplarily according to the formula M = m + Tw, where m is the supervector of the background model, related to the acoustic and channel commonality of all speakers; M is the mean supervector, obtained by adaptive training on the basis of the background model's supervector; T is the overall change matrix; and w is the change factor feature vector.
  • linear discriminant analysis (LDA) is used to perform channel compensation on the change factor features, to weaken the influence of redundant information such as channel effects and at the same time achieve dimensionality reduction.
  • an error-prone classifier capable of performing N-class classification is trained.
  • an SVM classifier is used here, with two schemes to choose from: (1) binary SVMs combined with a one-vs-rest strategy; (2) binary SVMs combined with a one-vs-one strategy.
  • the error-prone classifier in the embodiment of the present invention can detect the misjudgment probability of different subsystems according to different application scenarios or voiceprint characteristics, making full use of the advantages of each subsystem while avoiding points of high misjudgment, and can thus assign more suitable fusion weights, maximizing the effectiveness of the hybrid system and enhancing its robustness.
  • the embodiment of the present invention takes a hybrid system for language recognition as an example to describe the voiceprint recognition method of the embodiment in detail, including:
  • each sub-system independently gives probability values of N different languages.
  • the ivector features are input into the error-prone classifier, and the classification results it outputs take the same tabular form as above: a relative misjudgment probability P_f(S_i | x) for each subsystem;
  • when each subsystem's probability of misjudging a given utterance is equal, the relative misjudgment probability of each subsystem is 1/K, the average relative misjudgment probability.
  • the true significance of the relative misjudgment probability lies in its offset from the average misjudgment probability;
  • α_i(x) denotes the initial fusion weight of each subsystem when the input speech is x; the offset between the relative misjudgment probability and the average probability is used as a calculation parameter for the fusion weight, and the weight values can be fine-tuned by adjusting the standard deviation of the weight array while keeping their relative ordering unchanged.
  • C_i(x) (symbols reconstructed from context) is the final fusion weight of each of the K subsystems, x is the input speech, α_i(x) is the initial fusion weight of each subsystem when the input speech is x, and μ is the relationship coefficient of C_i(x);
  • in a general hybrid system, the number of subsystems K is fixed, so K can be regarded as a constant. As μ increases or decreases, the standard deviation of the weight array also increases or decreases non-linearly. Under normal circumstances, μ defaults to 0 and needs no adjustment; if adjustment is needed, it is recommended to keep it within [-1, 1], since values that are too large or too small may adversely affect the final fusion score. In addition, a large adjustment of μ may lead to negative probability values, but this does not affect the decision process on the fusion score.
  • in the fusion equation, the first matrix on the left side is the fusion weight matrix, the second matrix on the left side is the probability matrix of all languages given by the K subsystems, and the matrix on the right side is the fused probability matrix obtained after applying the fusion weights.
  • FIG. 2 shows an electronic device provided by an embodiment of the present application.
  • the electronic device can be used to implement the voiceprint recognition method provided by the embodiment shown in FIG. 1-a.
  • the electronic device mainly includes:
  • the dynamic weight sub-module 210 is configured to obtain voice data to be analyzed; extract a change factor feature from the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information; misjudgment-classify the voice data according to the change factor feature through an error-prone classifier, obtaining the relative misjudgment probability that the voice data is misjudged in the K subsystems 220; determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset; and weight the recognition results of the respective subsystems by the final fusion weights, obtaining a comprehensive recognition result of the voice data from the weighted recognition results;
  • the subsystem 220 is configured to perform preliminary voiceprint recognition on the voice data, and obtain a recognition result of the voice data.
  • the dynamic weight sub-module includes: a feature extraction unit, an error-prone classifier, a weight calculation unit, and a comprehensive calculation unit;
  • the feature extraction unit is configured to extract a change factor feature in the voice data
  • the error-prone point classifier is configured to misclassify the voice data according to the characteristics of the change factor, and obtain a relative misjudgement probability that the voice data is misjudged in the K subsystems;
  • the weight calculation unit is configured to determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and to calculate the final fusion weight of the corresponding subsystem according to the offset;
  • the comprehensive calculation unit is configured to weight the recognition results of the respective subsystems by using the final fusion weight, and obtain the comprehensive recognition results of the voice data according to the recognition results of the subsystems after weighting.
  • the weight calculation unit is further configured to:
  • the calculating of the final fusion weight of the corresponding subsystem according to the offset uses the image-only formula described above, where C_i(x) is the final fusion weight of each of the K subsystems, x is the input speech, α_i(x) is the initial fusion weight of each subsystem when the input speech is x, and μ is the relationship coefficient of C_i(x).
  • the division into functional modules described above is merely an example; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above.
  • the functional modules in this embodiment may be implemented by corresponding hardware, or by corresponding hardware executing corresponding software; the embodiments described in this specification follow this principle, which is not repeated below.
  • the electronic device includes: a memory 301, a processor 302, and a computer program stored in the memory 301 and operable on the processor 302.
  • the electronic device further includes at least one input device 303 and at least one output device 304.
  • the memory 301, the processor 302, the input device 303, and the output device 304 are connected through a bus 305.
  • the input device 303 may be a camera, a touch panel, a physical button, a mouse, or the like.
  • the output device 304 may be a display screen.
  • the memory 301 may be a high-speed random access memory (RAM), or a non-volatile memory such as a disk memory.
  • the memory 301 is configured to store a set of executable program code, and the processor 302 is coupled to the memory 301.
  • an embodiment of the present application further provides a computer-readable storage medium, which may be provided in the electronic device of the foregoing embodiments, and which may be the memory in the foregoing embodiment shown in FIG. 3.
  • a computer program is stored on the computer-readable storage medium, and when the program is executed by a processor, the voiceprint recognition method described in the foregoing embodiment shown in FIG. 1-a is implemented.
  • the computer-readable storage medium may also be any of various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disc.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the modules is only a logical function division; in actual implementation there may be other division manners, for example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling, direct coupling, or communication connection may be implemented through some interfaces; the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on multiple network modules. Some or all of these modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software functional modules.
  • when the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned readable storage media include any media that can store program code, such as USB flash drives, removable hard disks, ROMs, RAMs, magnetic disks, or optical discs.

Abstract

A voiceprint recognition method, an electronic device, and a computer-readable storage medium. The voiceprint recognition method comprises: acquiring voice data to be analyzed; extracting a change factor feature from the voice data; using a fallible-point classifier to perform miscalculation classification on the voice data according to the change factor feature, so as to obtain the relative miscalculation probabilities of the voice data in K subsystems; determining the offset of the relative miscalculation probability corresponding to any subsystem with respect to the average relative miscalculation probability of the K subsystems, and calculating a final fusion weight of the corresponding subsystem according to the offset; and weighting the recognition result of each subsystem by means of the final fusion weight, and obtaining a comprehensive recognition result of the voice data from the weighted recognition results.

Description

Voiceprint recognition method, electronic device, and computer-readable storage medium

Technical field

The present application relates to the field of electronic technology, and in particular, to a voiceprint recognition method, an electronic device, and a computer-readable storage medium.

Background art

With the popularity of smart devices and related hardware, voice interaction has become an integral part of human-computer interaction, and voiceprint-related application scenarios in voice interaction are increasingly common, including but not limited to voiceprint attendance check-in, software login, bank transfer and account-opening verification, wake-up of virtual voice assistants, and personalized interaction for different user groups; all of these systems make use of voiceprints. A voiceprint is the unique sound characteristic of each person: in real life, each person's speaking voice has its own characteristics. Generally speaking, voiceprint recognition is divided into the following types: emotion recognition, age recognition, language recognition, gender recognition, speaker recognition, and so on.
In the prior art, to improve the accuracy of voiceprint recognition, multiple types of voiceprint systems are usually combined: the systems are assigned different weights in the score domain for weighted fusion, from which the final decision is derived. One example is a fusion strategy using linear logistic regression. Its central idea is, for a hybrid system with N subsystems, to normalize the score of each subsystem to a common interval and then use a development set to train a fusion weight for each subsystem i while training an overall offset (the original formula survives only as an image and is not reproduced here).

Due to the complexity of real situations, different types of recognition subsystems in the prior art do not necessarily fit the initially set weights; therefore, the fixed-weight approach keeps the accuracy of voiceprint recognition low.
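As an illustration of this fixed-weight baseline, the sketch below trains fusion weights w_i and an overall offset w_0 by logistic regression on a development set and then fuses subsystem scores linearly. The form s(x) = w_0 + Σ_i w_i s_i(x) is the standard linear-logistic-regression fusion and is assumed here, since the patent's own formula is not reproduced; all data and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Development set: each row holds the N subsystem scores for one trial;
# y_dev holds the true accept (1) / reject (0) labels.
rng = np.random.default_rng(0)
dev_scores = rng.random((1000, 3))            # placeholder data, N = 3 subsystems
y_dev = rng.integers(0, 2, size=1000)

# Train fusion weights w_i and overall offset w_0 on the development set.
fusion = LogisticRegression().fit(dev_scores, y_dev)
w = fusion.coef_[0]                           # one fixed fusion weight per subsystem
w0 = fusion.intercept_[0]                     # overall offset

def fuse(scores: np.ndarray) -> float:
    """Fixed-weight fused score: s(x) = w0 + sum_i w_i * s_i(x)."""
    return w0 + scores @ w

print(fuse(np.array([0.2, 0.7, 0.5])))
```

Because w and w0 are fixed after training, this baseline cannot react to utterances on which a particular subsystem is unreliable, which is exactly the weakness the present application addresses.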
Technical problem

The embodiments of the present application provide a voiceprint recognition method, an electronic device, and a computer-readable storage medium, which improve the accuracy of voiceprint recognition by setting appropriate voiceprint recognition weights.

Technical solution

A first aspect of the embodiments of the present application provides a voiceprint recognition method, including:
obtaining voice data to be analyzed;

extracting a change factor feature from the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information;

misjudgment-classifying the voice data according to the change factor feature through an error-prone classifier, to obtain the relative misjudgment probability that the voice data is misjudged in each of the K subsystems;

determining the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculating the final fusion weight of the corresponding subsystem according to the offset;

acquiring each subsystem's recognition result for the voice data;

weighting the recognition results of the respective subsystems by the final fusion weights, and obtaining a comprehensive recognition result of the voice data from the weighted recognition results of the respective subsystems.
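Taken together, the claimed steps form a pipeline like the hedged sketch below; `extract_ivector`, `error_prone_classifier`, `compute_weights`, and the `subsystems` objects are hypothetical stand-ins for the components defined later in this description.

```python
import numpy as np

def recognize(x, subsystems, extract_ivector, error_prone_classifier, compute_weights):
    """One pass of the claimed method for input utterance x (illustrative only)."""
    # Step 102: change factor (i-vector) feature of the utterance.
    w_vec = extract_ivector(x)
    # Step 103: relative misjudgment probability per subsystem (sums to 1).
    p_f = error_prone_classifier.predict_proba(w_vec.reshape(1, -1))[0]
    # Step 104: offsets against the 1/K average, then final fusion weights.
    weights = compute_weights(p_f)
    # Step 105: each subsystem's own recognition scores for x (this branch can
    # run in parallel with steps 101-104 in the described architecture).
    scores = np.stack([s.recognize(x) for s in subsystems])   # shape (K, N)
    # Step 106: weighted fusion of the K recognition results.
    return weights @ scores
```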
Optionally, the training method for the error-prone classifier includes:

using a short-duration speech data set as the test data set of each subsystem, and labeling all misjudged speech segments in the test process with N different labels according to the different subsystems, as a training database, where N is an integer greater than zero;

extracting MFCC (Mel-frequency cepstral coefficient) features for each piece of short-duration speech data in the training database;

training a universal background model from the extracted MFCC features, and training an overall change matrix; obtaining the change factor feature of the short-duration speech data from the overall change matrix;

training, from the change factor features and their corresponding labels, an error-prone classifier capable of N-class classification.

Optionally, before the training of the error-prone classifier capable of N-class classification from the change factor features and their corresponding labels, the method includes:

performing channel compensation on the change factor features by linear discriminant analysis, to obtain dimensionality-reduced change factor features.

Optionally, the sum of the relative misjudgment probabilities corresponding to the K subsystems is one.
Optionally, the calculating of the final fusion weight of the corresponding subsystem according to the offset is performed by a formula that survives only as an image in the original. Its variables (symbols reconstructed from context) are: C_i(x), the final fusion weight of each of the K subsystems; x, the input speech; α_i(x), the initial fusion weight of each subsystem when the input speech is x; and μ, a relationship coefficient of C_i(x).
A second aspect of the embodiments of the present application provides another electronic device, including:

K subsystems and a dynamic weight sub-module, where K is an integer greater than zero;

the dynamic weight sub-module is configured to obtain voice data to be analyzed; extract a change factor feature from the voice data, where the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information; misjudgment-classify the voice data according to the change factor feature through an error-prone classifier, obtaining the relative misjudgment probability that the voice data is misjudged in each of the K subsystems; determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset; and weight the recognition results of the respective subsystems by the final fusion weights, obtaining a comprehensive recognition result of the voice data from the weighted recognition results;

the subsystem is configured to perform preliminary voiceprint recognition on the voice data and obtain a recognition result of the voice data.
Optionally, the dynamic weight sub-module includes: a feature extraction unit, an error-prone classifier, a weight calculation unit, and a comprehensive calculation unit;

the feature extraction unit is configured to extract the change factor feature from the voice data;

the error-prone classifier is configured to misjudgment-classify the voice data according to the change factor feature, obtaining the relative misjudgment probability that the voice data is misjudged in each of the K subsystems;

the weight calculation unit is configured to determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and to calculate the final fusion weight of the corresponding subsystem according to the offset;

the comprehensive calculation unit is configured to weight the recognition results of the respective subsystems by the final fusion weights, and to obtain the comprehensive recognition result of the voice data from the weighted recognition results.
Optionally, the weight calculation unit is specifically further configured to calculate the final fusion weight of the corresponding subsystem according to the offset by the image-only formula described above, where C_i(x) is the final fusion weight of each of the K subsystems, x is the input speech, α_i(x) is the initial fusion weight of each subsystem when the input speech is x, and μ is the relationship coefficient of C_i(x).
A third aspect of the embodiments of the present application provides another electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the voiceprint recognition method provided in the first aspect of the embodiments of the present application is implemented.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the voiceprint recognition method provided in the first aspect of the embodiments of the present application is implemented.

Beneficial effects

As can be seen from the above, the scheme of this application classifies the speech segments on which each subsystem has a high error rate according to the change factor features, dividing them into K classes of error-prone points, and trains a corresponding classification model; each piece of speech data to be analyzed is then classified, the prediction weight of the subsystem corresponding to the resulting label is reduced, and the final result is thereby optimized, achieving real-time evaluation and dynamic adjustment of the misjudgment rate of each subsystem.
Brief description of the drawings

FIG. 1-a is a schematic flowchart of the voiceprint recognition method provided by an embodiment of the present application;

FIG. 1-b is an architecture diagram of the voiceprint recognition system provided by an embodiment of the present application;

FIG. 1-c is a schematic flowchart of the method for training the error-prone classifier according to an embodiment of the present application;

FIG. 1-d is an operation flowchart of the dynamic weight sub-module provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the present application.

Embodiments of the invention
To make the objects, features, and advantages of the present application more apparent and understandable, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them; based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of this application.

Embodiment 1

An embodiment of the present application provides a voiceprint recognition method; referring to FIG. 1-a, the voiceprint recognition method mainly includes the following steps.
101. Acquire the voice data to be analyzed.

The embodiment of the present invention is applied to a voiceprint recognition system, where the voiceprint recognition system includes K subsystems and K is an integer greater than zero. For the system architecture of the voiceprint recognition system according to the embodiment of the present invention, reference may be made to FIG. 1-b.

Each subsystem in the voiceprint recognition system may correspond to a different type of voiceprint recognition, the types including emotion recognition, age recognition, and language recognition. Further, each subsystem may also correspond to a sub-category within one recognition scenario; for example, in language recognition, one subsystem corresponds to one language (such as Chinese, English, or French). It can be understood that, in practical applications, the correspondence between subsystems and voiceprint recognition categories can be determined according to the actual situation, and is not specifically limited here.

In the embodiment of the present invention, the voiceprint recognition method is mainly applied to the dynamic weight sub-module in the system architecture, that is, the voice data to be analyzed can first be input into the dynamic weight sub-module for weight analysis.
102. Extract the change factor feature from the voice data.

The dynamic weight sub-module extracts the change factor feature from the voice data; the change factor feature is used to characterize comprehensive information related to the voice data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information.

Exemplarily, the ivector (identity vector) feature used in building the change factor feature model characterizes a large amount of information about the speaking object, such as transmission channel information, acoustic environment information, and speaker information.
103. Misjudgment-classify the voice data according to the change factor feature through the error-prone classifier.

The dynamic weight sub-module misjudgment-classifies the voice data according to the change factor feature through the error-prone classifier, obtaining the relative misjudgment probability that the voice data is misjudged in each of the K subsystems.

Exemplarily, the classification result output by the error-prone classifier can be shown in the following table (reconstructed; the original survives only as an image):

    S_1 | S_2 | ... | S_K
    P_f(S_1 | x) | P_f(S_2 | x) | ... | P_f(S_K | x)

where x is the input voice data and P_f(S_i | x) (i = 1, 2, ..., K) is the relative misjudgment probability of being misjudged by subsystem S_i given that the input speech is x; a higher value means a higher probability that the utterance is misjudged (including false acceptance / false rejection) under the corresponding subsystem. The relative misjudgment probabilities of all subsystems sum to 1:

$$\sum_{i=1}^{K} P_f(S_i \mid x) = 1$$

When each subsystem's probability of misjudging a given utterance is equal, the relative misjudgment probability of each subsystem is 1/K, which is the average relative misjudgment probability.
104. Determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset.

The dynamic weight sub-module determines the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculates the final fusion weight of the corresponding subsystem according to the offset.

The true significance of the relative misjudgment probability lies in its offset from the average misjudgment probability; the relative misjudgment probability expresses the relative magnitude relationship between the misjudgment probabilities of different subsystems. For example, if the relative misjudgment probability of subsystem a is 0.1 and that of subsystem b is 0.5, the meaning is that subsystem b's misjudgment probability is greater than subsystem a's, not that subsystem b's misjudgment probability is 0.5.

Exemplarily, the offset is defined (the original formula survives only as an image; reconstructed from the surrounding text) as

$$U_i(x) = P_f(S_i \mid x) - \frac{1}{K}$$

The higher a subsystem's relative misjudgment probability for a given utterance, the larger its offset, meaning the higher the probability that the utterance is misjudged in that subsystem; in that case the fusion weight of that subsystem should be reduced. The central idea is to use the offset between the relative misjudgment probability and the average probability as the calculation parameter of the fusion weight. At the same time, to adjust the influence of the dynamic weight sub-module on the final fused probability values, the weight values can be fine-tuned by adjusting the standard deviation of the weight array while keeping the relative ordering of the weight values unchanged.
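A minimal sketch of the offset computation, assuming the reconstructed definition U_i(x) = P_f(S_i | x) − 1/K given above:

```python
import numpy as np

def offsets(p_f: np.ndarray) -> np.ndarray:
    """Offset of each subsystem's relative misjudgment probability
    from the 1/K average; p_f is expected to sum to 1."""
    K = len(p_f)
    return p_f - 1.0 / K

p_f = np.array([0.1, 0.5, 0.4])   # example relative misjudgment probabilities, K = 3
print(offsets(p_f))               # a positive offset means that subsystem's weight should drop
```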
105. Acquire each subsystem's recognition result for the voice data.

The dynamic weight sub-module acquires each subsystem's recognition result for the voice data. In the embodiment of the present invention, as shown in the system architecture of FIG. 1-b, the voice data to be analyzed may be input into each subsystem separately for voiceprint recognition, obtaining each subsystem's recognition result. Step 105 and step 101 are two branches that can be executed in parallel, that is, there is no strict timing relationship between them: before step 106, step 101 may be performed first, step 105 may be performed first, or both may be performed at the same time, which is not specifically limited here.

106. Weight the recognition results of the respective subsystems by the final fusion weights, and obtain the comprehensive recognition result of the voice data from the weighted recognition results.

The voiceprint recognition system weights the recognition results of the respective subsystems by the final fusion weights, and obtains the comprehensive recognition result of the voice data from the weighted recognition results of the subsystems.
Exemplarily, the relative misjudgment probability P_f(S_i | x) can be passed through a function (the formula survives only as an image in the original) to obtain each subsystem's final fusion weight C_i(x), where x is the input speech, α_i(x) is the initial fusion weight of each subsystem when the input speech is x, and μ is the relationship coefficient of C_i(x).

For any weight array C_i (i = 1, 2, ..., K), the relationship with μ satisfies the following: (1) the smaller μ is, the smaller the standard deviation of the array C_i; (2) the larger μ is, the larger the standard deviation of the array C_i.

In a general hybrid system, the number of subsystems K is fixed, so K can be regarded as a constant. As μ increases or decreases, the standard deviation σ of the weight array also increases or decreases non-linearly. Under normal circumstances, μ defaults to 0 and needs no adjustment; if adjustment is needed, it is recommended to keep it within [-1, 1], since values that are too large or too small may adversely affect the final fusion score. In addition, a large adjustment of μ may lead to negative probability values, but this does not affect the decision process on the fusion score.
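The exact weight formula cannot be recovered from this text, so the sketch below adopts one assumed form that is consistent with every property stated above: weights start from the 1/K average, subtract the offset scaled by exp(μ), always sum to 1, spread non-linearly as μ grows, and can go negative for large μ. Treat it as an illustration, not as the patent's actual formula.

```python
import numpy as np

def fusion_weights(p_f: np.ndarray, mu: float = 0.0) -> np.ndarray:
    """Assumed weight form: C_i(x) = 1/K - exp(mu) * U_i(x)."""
    K = len(p_f)
    U = p_f - 1.0 / K                 # offsets from the average
    return 1.0 / K - np.exp(mu) * U   # larger offset -> smaller weight

p_f = np.array([0.1, 0.5, 0.4])       # relative misjudgment probabilities, K = 3
for mu in (-1.0, 0.0, 1.0):           # recommended adjustment range is [-1, 1]
    C = fusion_weights(p_f, mu)
    print(mu, C, C.sum(), C.std())    # the sum stays 1; the spread grows with mu
```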
The solution of this application classifies the speech segments on which each subsystem has a high error rate according to the change factor features, dividing them into K classes of error-prone points, and trains a corresponding classification model; each piece of speech data to be analyzed is then classified, the prediction weight of the subsystem corresponding to the resulting label is reduced, and the final result is thereby optimized, achieving real-time evaluation and dynamic adjustment of the misjudgment rate of each subsystem.
Embodiment 2

In the embodiment of the present invention, the error-prone classifier needs to be constructed; referring to FIG. 1-c, the method includes:

201. Establish a training database.

A short-duration speech data set is used as the test data set of each subsystem, and all misjudged speech segments in the test process are labeled with N different labels according to the different subsystems, as a training database, where N is an integer greater than zero.

202. Extract MFCC (Mel-frequency cepstral coefficient) features.

For each piece of short-duration speech data in the training database, Mel-frequency cepstral coefficient (MFCC) features are extracted.
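A minimal MFCC-extraction sketch using librosa; the patent does not name a toolkit, and the sampling rate and number of coefficients are illustrative choices:

```python
import librosa

def extract_mfcc(path: str, n_mfcc: int = 20):
    """Load one short utterance and return its MFCC frame matrix."""
    y, sr = librosa.load(path, sr=16000)                     # mono, 16 kHz
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
```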
203. Train the overall change matrix.

A universal background model (UBM) is trained from the extracted MFCC features, and the overall change matrix T is trained.
204、 获得所述短时语音数据的变化因子特征;  204. Obtain a change factor characteristic of the short-term speech data.
根据所述总体变化矩阵 T获得所述短时语音数据的变化因子特征。  A change factor characteristic of the short-term speech data is obtained according to the overall change matrix T.
示例性的, 可以根据以下公式求取变化因子特征:  Exemplarily, the characteristics of the change factor can be obtained according to the following formula:
M = m + Tw ;  M = m + Tw;
其中, m是背景模型的超向量, 它与所有说话人的声学与信道共性有关; Where m is the supervector of the background model, which is related to the acoustics and channel commonality of all speakers;
M为均值超矢量, 是在背景模型的超向量的基础上进行自适应训练所得; T为 总体变化矩阵 T; w是变化因子特征特征向量。 M is the mean supervector, which is obtained by adaptive training based on the supervector of the background model; T is the overall change matrix T; w is the feature factor vector of the change factor.
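As a simplified illustration of the relation M = m + Tw: given the background supervector m, the adapted mean supervector M, and the matrix T, a least-squares point estimate of w can be read off directly. A production i-vector extractor instead computes the posterior mean of w from Baum-Welch statistics; this sketch only shows the algebra of the model, with toy dimensions.

```python
import numpy as np

def estimate_w(M: np.ndarray, m: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Least-squares estimate of the change factor vector w in M = m + T w."""
    w, *_ = np.linalg.lstsq(T, M - m, rcond=None)
    return w

rng = np.random.default_rng(0)
T = rng.standard_normal((1024, 100))      # supervector dim 1024, factor dim 100
m = rng.standard_normal(1024)
w_true = rng.standard_normal(100)
M = m + T @ w_true
print(np.allclose(estimate_w(M, m, T), w_true))   # True
```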
205. Perform dimensionality reduction on the change factor features;
Linear Discriminant Analysis (LDA) is used to perform channel compensation on the change factor features, weakening the influence of redundant information such as the channel in the change factor features while at the same time achieving dimensionality reduction.
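Exemplarily, this step can be sketched with a standard LDA implementation; the variable names ivectors and labels are assumptions of this sketch.

```python
# A minimal sketch of step 205: LDA channel compensation, which also reduces
# the feature dimension to at most N-1 for N error-prone-point classes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
ivectors_lda = lda.fit_transform(ivectors, labels)
```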
206. Train an error-prone-point classifier capable of N-class classification.
According to the change factor features and their corresponding labels, an error-prone-point classifier capable of N-class classification is trained. An SVM classifier is used here, with two optional schemes: (1) binary SVMs combined under a one-vs-rest strategy; (2) binary SVMs combined under a one-vs-one strategy.
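Exemplarily, the two optional schemes can be sketched as follows; reading the relative misjudgment probabilities from the classifier's probability output is an assumption of this sketch.

```python
# A minimal sketch of step 206: an N-class error-prone-point classifier built
# from binary SVMs under a one-vs-rest or a one-vs-one strategy.
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

clf_ovr = OneVsRestClassifier(SVC(kernel="rbf", probability=True))  # scheme 1
clf_ovo = OneVsOneClassifier(SVC(kernel="rbf"))                     # scheme 2
clf_ovr.fit(ivectors_lda, labels)
p_f = clf_ovr.predict_proba(ivectors_lda[:1])   # each row sums to 1
```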
The error-prone-point classifier in this embodiment of the present invention can detect the misjudgment probability of the different subsystems for different application scenarios or voiceprint characteristics, making full use of the strengths of each subsystem while avoiding points of high misjudgment. It thus yields more suitable fusion weights, maximizing the effectiveness of the hybrid system and enhancing robustness.
Embodiment 3
This embodiment of the present invention takes a hybrid language identification system as an example to describe the voiceprint recognition method of the embodiments in detail, including:
1. The architecture of the hybrid language identification system of this embodiment can be seen in FIG. 1-b; each subsystem independently gives probability values for N different languages.
2. Let x be an input speech utterance. The output of each subsystem is shown in the following table:

            z_1          z_2          ...    z_N
  S_1       P_1(z_1|x)   P_1(z_2|x)   ...    P_1(z_N|x)
  S_2       P_2(z_1|x)   P_2(z_2|x)   ...    P_2(z_N|x)
  ...
  S_K       P_K(z_1|x)   P_K(z_2|x)   ...    P_K(z_N|x)
Each subsystem S_i independently gives the probability P_i(z_j|x) (j = 1, 2, ..., N) that the input speech belongs to language z_j, and these probabilities sum to 1, that is:

Σ_{j=1}^{N} P_i(z_j|x) = 1  (i = 1, 2, ..., K)
3. The operation flow of the dynamic weight sub-module is executed; see FIG. 1-d.
After the ivector features of the speech data are extracted, they are input into the error-prone-point classifier, whose classification output can be as shown in the following table:

  S_1           S_2           ...    S_K
  P_f(S_1|x)    P_f(S_2|x)    ...    P_f(S_K|x)
where x is the input speech data and P_f(S_i|x) (i = 1, 2, ..., K) is the relative misjudgment probability that the input speech x is misjudged by subsystem S_i; the higher the value, the higher the probability that the speech is misjudged (by false acceptance or false rejection) under the corresponding subsystem. The relative misjudgment probabilities of all subsystems sum to 1, that is:

Σ_{i=1}^{K} P_f(S_i|x) = 1

When the subsystems are equally likely to misjudge a given utterance, the relative misjudgment probability of each subsystem is 1/K, which is the average relative misjudgment probability.
The true significance of the relative misjudgment probability lies in its offset from the average relative misjudgment probability. The higher the relative misjudgment probability of a subsystem for a given utterance, the larger the offset, which means the higher the probability that the utterance is misjudged in that subsystem; the fusion weight of that subsystem should then be reduced. Based on this idea, exemplarily, the following calculation formula can be derived:

Q_{S_i,x} = 1/K + (1/K - P_f(S_i|x))  (i = 1, 2, ..., K)
where Q_{S_i,x} is the initial fusion weight of each subsystem S_i when the input speech is x. The central idea is to use the offset between the relative misjudgment probability and the average relative misjudgment probability as the calculation parameter for the fusion weight. At the same time, in order to adjust the influence of the dynamic weight sub-module when the final probability values are fused, the weight values can be fine-tuned by adjusting the standard deviation of the weight array while keeping the relative ordering of the weight values unchanged.
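Exemplarily, the behaviour of the initial fusion weights can be checked numerically, assuming the offset-based formula given above; the probability values are illustrative.

```python
# Sketch: a subsystem with an above-average relative misjudgment probability
# receives a below-average weight; the weights still sum to 1 because the K
# offsets sum to zero.
import numpy as np

p_f = np.array([0.5, 0.3, 0.2])   # relative misjudgment probabilities, K = 3
K = len(p_f)
Q = 1.0 / K + (1.0 / K - p_f)     # initial weights: [0.1667, 0.3667, 0.4667]
assert np.isclose(Q.sum(), 1.0)
```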
The final fusion weights of the system are:

C_i = Q_{S_i,x} + μ (Q_{S_i,x} - 1/K)  (i = 1, 2, ..., K)

where C_i is the final fusion weight of each of the K subsystems, x is the input speech, Q_{S_i,x} is the initial fusion weight of each subsystem S_i when the input speech is x, and μ is the relation coefficient of C_i. The relation coefficient μ has the following properties:
① the smaller μ is, the smaller the standard deviation of the array C_i (i = 1, 2, ..., K);
② the larger μ is, the larger the standard deviation of the array C_i (i = 1, 2, ..., K);
③ when μ = 0, the standard deviation of the array C_i (i = 1, 2, ..., K) is equal to that of the array Q_{S_i,x} (i = 1, 2, ..., K).

Here the standard deviation of the weight array is

σ = √( (1/K) Σ_{i=1}^{K} (C_i - C̄)² )

where C̄ is the mean of the array C_i. In a typical hybrid system the number of subsystems K is fixed, so K can be regarded here as a constant. It can be seen that as μ increases or decreases, σ increases or decreases non-linearly with it. In the general case μ defaults to 0 and need not be adjusted. If adjustment is needed, it is recommended to keep it within [-1, 1]; a value that is too large or too small may have an adverse effect on the final fusion score. In addition, a large adjustment of μ may produce negative probability values, but this does not affect the decision process based on the fusion score.
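Exemplarily, the three properties of μ can be verified numerically with the final-weight formula; the initial weights below are illustrative.

```python
# Sketch: effect of the relation coefficient mu on the spread of the final
# fusion weights C_i = Q_i + mu * (Q_i - 1/K).
import numpy as np

Q = np.array([0.1667, 0.3667, 0.4667])   # initial fusion weights, K = 3
K = len(Q)
for mu in (-0.5, 0.0, 0.5):
    C = Q + mu * (Q - 1.0 / K)
    # The standard deviation shrinks for mu < 0, equals Q.std() at mu = 0,
    # and grows for mu > 0, matching properties (1) to (3) above.
    print(mu, round(float(C.std()), 4))
```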
4. The fusion weight array C_i (i = 1, 2, ..., K) is integrated into the final scoring matrix in the following form to obtain the language output by the hybrid system:

diag(C_1, C_2, ..., C_K) · [P_i(z_j|x)]_{K×N} = [C_i · P_i(z_j|x)]_{K×N}
Here the first matrix on the left side of the equation is the fusion weight matrix, the second matrix on the left side is the probability matrix over all languages given by the K subsystems, and the matrix on the right side is the fused probability matrix obtained by assigning the fusion weights to the second matrix on the left. Finally, the entries in each column of the right-hand matrix are summed to obtain the probability that the utterance belongs to each language:
P(z_j|x) = Σ_{i=1}^{K} C_i · P_i(z_j|x)  (j = 1, 2, ..., N)
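Exemplarily, the weighting and column summation can be sketched as follows; the probability matrix is illustrative.

```python
# Sketch of step 4: scale each subsystem's row of language probabilities by
# its fusion weight, then sum each column to obtain the hybrid system's
# probability for every language.
import numpy as np

P = np.array([[0.6, 0.3, 0.1],   # subsystem S1 over N = 3 languages
              [0.5, 0.4, 0.1],   # subsystem S2
              [0.2, 0.5, 0.3]])  # subsystem S3
C = np.array([0.1667, 0.3667, 0.4667])   # final fusion weights
fused = C[:, None] * P                   # fused probability matrix (K x N)
p_lang = fused.sum(axis=0)               # probability of each language
print(int(p_lang.argmax()))              # index of the recognized language
```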
Embodiment 4
Referring to FIG. 2, an embodiment of the present application provides an electronic device. The electronic device can be used to implement the voiceprint recognition method provided by the embodiment shown in FIG. 1-a above. As shown in FIG. 2, the electronic device mainly includes:
The dynamic weight sub-module 210 is configured to: obtain the speech data to be analyzed; extract the change factor features from the speech data, the change factor features being used to characterize comprehensive information related to the speech data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information; misjudgment-classify the speech data according to the change factor features by means of the error-prone-point classifier, to obtain the relative misjudgment probabilities that the speech data is misjudged in the K subsystems 220; determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculate the final fusion weight of the corresponding subsystem according to the offset; and weight the recognition results of the corresponding subsystems by the final fusion weights, to obtain the comprehensive recognition result of the speech data from the weighted recognition results of the subsystems.
The subsystems 220 are configured to perform preliminary voiceprint recognition on the speech data and to obtain recognition results of the speech data.
Optionally, the dynamic weight sub-module includes: a feature extraction unit, an error-prone-point classifier, a weight calculation unit, and a comprehensive calculation unit;
the feature extraction unit is configured to extract the change factor features from the speech data;
the error-prone-point classifier is configured to misjudgment-classify the speech data according to the change factor features, to obtain the relative misjudgment probabilities that the speech data is misjudged in the K subsystems;
the weight calculation unit is configured to determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and to calculate the final fusion weight of the corresponding subsystem according to the offset;
the comprehensive calculation unit is configured to weight the recognition results of the corresponding subsystems by the final fusion weights, and to obtain the comprehensive recognition result of the speech data from the weighted recognition results of the subsystems.
Optionally, the weight calculation unit is specifically further configured to:
calculate the final fusion weight of the corresponding subsystem according to the offset by calculating the initial fusion weight of the corresponding subsystem according to the offset, specifically through the following formula:

Q_{S_i,x} = 1/K + (1/K - P_f(S_i|x))  (i = 1, 2, ..., K)
The final fusion weight is then calculated from the initial fusion weight through the following formula:

Q_i = Q_{S_i,x} + μ (Q_{S_i,x} - 1/K)  (i = 1, 2, ..., K)
where Q_i is the final fusion weight of each of the K subsystems, x is the input speech, Q_{S_i,x} is the initial fusion weight of each subsystem S_i when the input speech is x, and μ is the relation coefficient of Q_i.
It should be noted that, in the implementation of the electronic device illustrated in FIG. 2 above, the division into functional modules is merely an example. In practical applications, the above functions may be allocated to different functional modules as required, for example according to the configuration requirements of the corresponding hardware or considerations of convenience in software implementation; that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. Moreover, in practical applications, a corresponding functional module in this embodiment may be implemented by corresponding hardware, or by corresponding hardware executing corresponding software. The above principles of description apply to all embodiments provided in this specification and are not repeated below.
For the specific process by which each functional module of the electronic device provided in this embodiment implements its function, refer to the detailed description of the embodiment shown in FIG. 1-a above; details are not repeated here.
Embodiment 5
An embodiment of the present application provides an electronic device. Referring to FIG. 3, the electronic device includes: a memory 301, a processor 302, and a computer program stored in the memory 301 and executable on the processor 302. When the processor 302 executes the computer program, the voiceprint recognition method described in the embodiment shown in FIG. 1-a above is implemented.
Further, the electronic device also includes:
at least one input device 303 and at least one output device 304.
The memory 301, the processor 302, the input device 303, and the output device 304 are connected by a bus 305.
The input device 303 may specifically be a camera, a touch panel, a physical button, a mouse, or the like. The output device 304 may specifically be a display screen.
The memory 301 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 301 is configured to store a set of executable program code, and the processor 302 is coupled to the memory 301.
Further, an embodiment of the present application also provides a computer-readable storage medium, which may be provided in the electronic device of the foregoing embodiments and may be the memory in the embodiment shown in FIG. 3 above. A computer program is stored on the computer-readable storage medium, and when the program is executed by a processor, the voiceprint recognition method described in the embodiment shown in FIG. 1-a is implemented. Further, the computer-storable medium may also be any of various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disc.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into modules is only a division by logical function; in actual implementation there may be other divisions, for example multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned readable storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that, for ease of description, the foregoing method embodiments are all expressed as a series of action combinations, but a person skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. Furthermore, a person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by this application.
In the above embodiments, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
The above is a description of the voiceprint recognition method, electronic device, and computer-readable storage medium provided by this application. A person skilled in the art may, following the ideas of the embodiments of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims

1. A voiceprint recognition method applied to a voiceprint recognition system, the voiceprint recognition system comprising K subsystems, K being an integer greater than zero, the method comprising:
obtaining speech data to be analyzed;
extracting change factor features from the speech data, the change factor features being used to characterize comprehensive information related to the speech data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information;
misjudgment-classifying the speech data according to the change factor features by means of an error-prone-point classifier, to obtain relative misjudgment probabilities that the speech data is misjudged in the K subsystems;
determining an offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculating a final fusion weight of the corresponding subsystem according to the offset;
obtaining a recognition result of the speech data from each subsystem; and
weighting the recognition results of the corresponding subsystems by the final fusion weights, and obtaining a comprehensive recognition result of the speech data from the weighted recognition results of the subsystems.
2. The method according to claim 1, wherein the training method of the error-prone-point classifier comprises:
using a short-term speech data set as the test data set of each subsystem, and labeling all speech segments misjudged during testing with N different labels according to the subsystem concerned, as a training database, N being an integer greater than zero;
extracting MFCC (Mel Frequency Cepstrum Coefficient) features for each piece of short-term speech data in the training database;
training a Universal Background Model from the extracted MFCC features, and training an overall change matrix;
obtaining change factor features of the short-term speech data according to the overall change matrix; and
training, according to the change factor features and their corresponding labels, an error-prone-point classifier capable of N-class classification.
3. The method according to claim 2, wherein before the training, according to the change factor features and their corresponding labels, of the error-prone-point classifier capable of N-class classification, the method comprises:
performing channel compensation on the change factor features by linear discriminant analysis, to obtain dimension-reduced change factor features.
4. The method according to claim 1, wherein the sum of the relative misjudgment probabilities corresponding to the K subsystems is one.
5. The method according to claim 1, wherein
the calculating the final fusion weight of the corresponding subsystem according to the offset comprises:
calculating the initial fusion weight of the corresponding subsystem according to the offset, specifically through the following formula:

Q_{S_i,x} = 1/K + (1/K - P_f(S_i|x))  (i = 1, 2, ..., K);

and calculating the final fusion weight from the initial fusion weight through the following formula:

Q_i = Q_{S_i,x} + μ (Q_{S_i,x} - 1/K)  (i = 1, 2, ..., K);

wherein Q_i is the final fusion weight of each of the K subsystems, x is the input speech, Q_{S_i,x} is the initial fusion weight of each subsystem S_i when the input speech is x, and μ is the relation coefficient of Q_i.
6. A voiceprint recognition system, comprising:
K subsystems and a dynamic weight sub-module, K being an integer greater than zero;
wherein the dynamic weight sub-module is configured to: obtain speech data to be analyzed; extract change factor features from the speech data, the change factor features being used to characterize comprehensive information related to the speech data, the comprehensive information including at least sound transmission channel information, sound environment information, and sounding object information; misjudgment-classify the speech data according to the change factor features by means of an error-prone-point classifier, to obtain relative misjudgment probabilities that the speech data is misjudged in the K subsystems; determine an offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and calculate a final fusion weight of the corresponding subsystem according to the offset; and weight the recognition results of the corresponding subsystems by the final fusion weights, and obtain a comprehensive recognition result of the speech data from the weighted recognition results of the subsystems;
and wherein the subsystems are configured to perform preliminary voiceprint recognition on the speech data and to obtain recognition results of the speech data.
7. The system according to claim 6, wherein the dynamic weight sub-module comprises: a feature extraction unit, an error-prone-point classifier, a weight calculation unit, and a comprehensive calculation unit;
the feature extraction unit is configured to extract the change factor features from the speech data;
the error-prone-point classifier is configured to misjudgment-classify the speech data according to the change factor features, to obtain the relative misjudgment probabilities that the speech data is misjudged in the K subsystems;
the weight calculation unit is configured to determine the offset between the relative misjudgment probability corresponding to any subsystem and the average relative misjudgment probability of the K subsystems, and to calculate the final fusion weight of the corresponding subsystem according to the offset; and
the comprehensive calculation unit is configured to weight the recognition results of the corresponding subsystems by the final fusion weights, and to obtain the comprehensive recognition result of the speech data from the weighted recognition results of the subsystems.
8. The system according to claim 6, wherein the weight calculation unit is specifically further configured to:
calculate the final fusion weight of the corresponding subsystem according to the offset by:
calculating the initial fusion weight of the corresponding subsystem according to the offset, specifically through the following formula:

Q_{S_i,x} = 1/K + (1/K - P_f(S_i|x))  (i = 1, 2, ..., K),

wherein Q_{S_i,x} is the initial fusion weight of subsystem S_i and 1/K - P_f(S_i|x) represents the offset;

and calculating the final fusion weight from the initial fusion weight through the following formula:

Q_i = Q_{S_i,x} + μ (Q_{S_i,x} - 1/K)  (i = 1, 2, ..., K),

wherein Q_i is the final fusion weight of each of the K subsystems, x is the input speech, and μ is the relation coefficient of Q_i.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the method according to any one of claims 1 to 4 is implemented.
10. A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 4 is implemented.
PCT/CN2019/086767 2018-06-28 2019-05-14 Voiceprint recognition method, electronic device, and computer readable storage medium WO2020001182A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810688682.8 2018-06-28
CN201810688682.8A CN108831487B (en) 2018-06-28 2018-06-28 Voiceprint recognition method, electronic device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020001182A1 true WO2020001182A1 (en) 2020-01-02

Family

ID=64133507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/086767 WO2020001182A1 (en) 2018-06-28 2019-05-14 Voiceprint recognition method, electronic device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108831487B (en)
WO (1) WO2020001182A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831487B (en) * 2018-06-28 2020-08-18 深圳大学 Voiceprint recognition method, electronic device and computer-readable storage medium
CN110970036B (en) * 2019-12-24 2022-07-12 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7409338B1 (en) * 2004-11-10 2008-08-05 Mediatek Incorporation Softbit speech decoder and related method for performing speech loss concealment
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
CN107680600A (en) * 2017-09-11 2018-02-09 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
CN108022589A (en) * 2017-10-31 2018-05-11 努比亚技术有限公司 Aiming field classifier training method, specimen discerning method, terminal and storage medium
CN108831487A (en) * 2018-06-28 2018-11-16 深圳大学 Method for recognizing sound-groove, electronic device and computer readable storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0902163D0 (en) * 2009-02-10 2009-03-25 Airmax Group Plc A method and system for vehicle monitoring
WO2011114520A1 (en) * 2010-03-19 2011-09-22 富士通株式会社 Identification device, identification method and program
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN103065631B (en) * 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device
US9502038B2 (en) * 2013-01-28 2016-11-22 Tencent Technology (Shenzhen) Company Limited Method and device for voiceprint recognition
US9237232B1 (en) * 2013-03-14 2016-01-12 Verint Americas Inc. Recording infrastructure having biometrics engine and analytics service
US9396730B2 (en) * 2013-09-30 2016-07-19 Bank Of America Corporation Customer identification through voice biometrics
CN107274905B (en) * 2016-04-08 2019-09-27 腾讯科技(深圳)有限公司 A kind of method for recognizing sound-groove and system
CN107492382B (en) * 2016-06-13 2020-12-18 阿里巴巴集团控股有限公司 Voiceprint information extraction method and device based on neural network
CN106710599A (en) * 2016-12-02 2017-05-24 深圳撒哈拉数据科技有限公司 Particular sound source detection method and particular sound source detection system based on deep neural network
CN107610708B (en) * 2017-06-09 2018-06-19 平安科技(深圳)有限公司 Identify the method and apparatus of vocal print
CN107507612B (en) * 2017-06-30 2020-08-28 百度在线网络技术(北京)有限公司 Voiceprint recognition method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7409338B1 (en) * 2004-11-10 2008-08-05 Mediatek Incorporation Softbit speech decoder and related method for performing speech loss concealment
CN105895087A (en) * 2016-03-24 2016-08-24 海信集团有限公司 Voice recognition method and apparatus
CN107680600A (en) * 2017-09-11 2018-02-09 平安科技(深圳)有限公司 Sound-groove model training method, audio recognition method, device, equipment and medium
CN108022589A (en) * 2017-10-31 2018-05-11 努比亚技术有限公司 Aiming field classifier training method, specimen discerning method, terminal and storage medium
CN108831487A (en) * 2018-06-28 2018-11-16 深圳大学 Method for recognizing sound-groove, electronic device and computer readable storage medium

Also Published As

Publication number Publication date
CN108831487B (en) 2020-08-18
CN108831487A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
WO2020073694A1 (en) Voiceprint identification method, model training method and server
CN110265040B (en) Voiceprint model training method and device, storage medium and electronic equipment
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
TWI527023B (en) A voiceprint recognition method and apparatus
WO2021082420A1 (en) Voiceprint authentication method and device, medium and electronic device
WO2021135438A1 (en) Multilingual speech recognition model training method, apparatus, device, and storage medium
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN109767787A (en) Emotion identification method, equipment and readable storage medium storing program for executing
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN109119069B (en) Specific crowd identification method, electronic device and computer readable storage medium
CN111199741A (en) Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
US11756572B2 (en) Self-supervised speech representations for fake audio detection
WO2018000271A1 (en) Intention scene recognition method and system based on user portrait
CN111583906A (en) Role recognition method, device and terminal for voice conversation
Wang et al. A network model of speaker identification with new feature extraction methods and asymmetric BLSTM
WO2020001182A1 (en) Voiceprint recognition method, electronic device, and computer readable storage medium
Wu et al. Dilated residual networks with multi-level attention for speaker verification
Zhang et al. I-vector based physical task stress detection with different fusion strategies
JP2021081713A (en) Method, device, apparatus, and media for processing voice signal
Ding et al. Enhancing GMM speaker identification by incorporating SVM speaker verification for intelligent web-based speech applications
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
Xia et al. Learning salient segments for speech emotion recognition using attentive temporal pooling
Gao et al. Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition.
CN110853669A (en) Audio identification method, device and equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19825279

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04.05.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19825279

Country of ref document: EP

Kind code of ref document: A1