WO2019136911A1 - Voice recognition method for updating voiceprint data, terminal device, and storage medium - Google Patents

Voice recognition method for updating voiceprint data, terminal device, and storage medium Download PDF

Info

Publication number
WO2019136911A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
speech
threshold
verification
feature
Prior art date
Application number
PCT/CN2018/089415
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
郑斯奇
于夕畔
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019136911A1 publication Critical patent/WO2019136911A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/12Score normalisation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the present application relates to the field of voice recognition, and in particular, to a voice recognition method, a terminal device, and a storage medium for updating voiceprint data.
  • The conventional voiceprint registration and recognition method generally includes the following steps: 1. Feature extraction: after the user's registration voice data is obtained, acoustic features are extracted from it. 2. Authentication-vector generation. 3. Comparison and verification: the feature vector (i-vector) from the user's registration is retained in the voiceprint library, and at each verification the i-vector extracted from the verification voice is compared with the registration-time i-vector; that is, a predetermined cosine-distance formula computes the distance between the current authentication vector and the user's registration vector. If the distance is within the set threshold range, the two i-vectors are considered to be produced by the same person's voice and verification succeeds; otherwise, failure is returned.
  • However, under this conventional registration verification, every comparison uses the authentication vector generated when the user first registered a voice. Over time, the user's voice may change due to factors such as age, physical condition, and environment, so performing each verification against the registration-time i-vector may cause verification to fail.
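The conventional comparison step described above can be sketched as follows. This is a minimal illustration only: cosine similarity is assumed as the distance measure, and the vectors and threshold are made-up placeholders, not values from the application.

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    # Cosine similarity between the enrollment i-vector kept in the
    # voiceprint library and the i-vector extracted from the verification voice.
    a, b = np.asarray(enroll_ivec, float), np.asarray(test_ivec, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enroll_ivec, test_ivec, threshold=0.8):
    # Verification succeeds when the score falls within the set threshold
    # range; the threshold value here is an arbitrary placeholder.
    return cosine_score(enroll_ivec, test_ivec) >= threshold

accepted = verify([0.2, 0.9, 0.1], [0.25, 0.85, 0.12])
```

Note the drawback the text points out: `enroll_ivec` here never changes after the first registration, so a gradually drifting voice can fall below the threshold.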
  • the present application proposes a voice recognition method, a terminal device, and a storage medium for updating voiceprint data.
  • the present application provides a terminal device, where the terminal device includes a memory and a processor, and the memory stores a voice recognition program for updating voiceprint data that can be run on the processor.
  • the speech recognition program for updating the voiceprint data is executed by the processor to implement the following steps:
  • the registration voice is updated according to the verification voice.
  • the present application further provides a voice recognition method for updating voiceprint data, which is applied to a terminal device, and the method includes:
  • the registration voice is updated according to the verification voice.
  • the present application further provides a storage medium storing a voice recognition program for updating voiceprint data, and the voice recognition program for updating voiceprint data may be executed by at least one processor.
  • The voice recognition method, terminal device, and storage medium for updating voiceprint data proposed by the present application first register a preset number of user registration voices and calculate the feature speech vector of each of them; next, the feature speech vectors of the registration voices are scored pairwise against one another, and the first score average is taken as a first threshold; then a verification voice is acquired and its feature speech vector is calculated; the feature speech vector of the verification voice is then scored against the feature speech vector of each registration voice, and a second score average is obtained; it is further determined whether the second score average is greater than a second threshold, where the second threshold is greater than the first threshold; finally, if the second score average is greater than the second threshold, the preset number of user registration voices is updated according to the verification voice.
  • In existing speech recognition methods, the user's voice may change over time due to factors such as age, physical condition, and environment, and performing each verification against the registration-time i-vector may therefore cause verification to fail.
  • With the present method, each verification by the user follows a compare-and-update process: whenever a verification meets the requirements, the user's registration information in the voiceprint library is updated. This improves the accuracy of subsequent voiceprint verification and adapts to changes in the registrant's voice over time.
  • FIG. 1 is a diagram showing an operating environment of a terminal device according to a preferred embodiment of the present application.
  • FIG. 2 is a program block diagram of an embodiment of a speech recognition program for updating voiceprint data according to the present application.
  • FIG. 3 is a flow chart of an embodiment of a voice recognition method for updating voiceprint data according to the present application.
  • the terminal device can be implemented in various forms.
  • The terminal device described in the present application may include mobile terminals such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a personal digital assistant (PDA), a portable media player (PMP), a navigation device, wearable devices, smart bracelets, and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
  • FIG. 1 is a diagram showing an operating environment of a terminal device 100 according to a preferred embodiment of the present application.
  • The terminal device 100 also includes a voice recognition program 300 that updates voiceprint data, a memory 20, a processor 30, a sensing unit 40, and the like.
  • the sensing unit 40 may be various sensors that sense user voices, and are mainly used to acquire verification voices.
  • The memory 20 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the processor 30 can be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chip.
  • the present application proposes a speech recognition program 300 for updating voiceprint data.
  • FIG. 2 is a program module diagram of the first embodiment of the speech recognition program 300 for updating voiceprint data in the present application.
  • The voice recognition program 300 for updating voiceprint data comprises a series of computer program instructions stored in the memory 20; when these computer program instructions are executed by the processor 30, the embodiments of the present application can be implemented.
  • The speech recognition program 300 that updates the voiceprint data can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the voice recognition program 300 for updating voiceprint data may be divided into a registration module 301, a first comparison module 302, an acquisition module 303, a second comparison module 304, a determination module 305, and an update module 306, wherein:
  • the registration module 301 is configured to register a preset number of registered voices, and calculate a feature voice vector of each registered voice in the preset number of registered voices. For example, the registration module 301 registers N registered voices.
  • the voice recognition program 300 for updating the voiceprint data is stored in the terminal device 100.
  • The terminal device 100 of this embodiment may be any terminal having a voice recognition function, such as a mobile phone, a portable computer, a personal digital assistant, a bank payment terminal, or an access control device; these devices can implement specific functions and applications through voice recognition technology.
  • The terminal device 100 acquires an effective voice when the user performs voice registration; acquisition can start when the user clicks to begin voice input and continue until the user stops the voice input, thereby avoiding unnecessary noise interference and improving the purity of the voice sample to be processed.
  • The above N registered voices are preferably three; of course, N may be set to another suitable positive integer as needed.
  • the registration module 301 separately calculates a feature speech vector of each registered voice in the preset number of registered voices by:
  • The registration module 301 extracts the MFCC features of each frame of each voice using the Mel-frequency cepstral coefficient (MFCC) method to form a matrix, and then applies a universal background model (UBM) and a feature speech vector (i-vector) extractor.
  • MFCC is an abbreviation of Mel-Frequency Cepstral Coefficients, and its computation involves two key steps: conversion to the Mel frequency scale, followed by cepstrum analysis.
  • First, each voice is segmented into frames to obtain a multi-frame speech spectrum; the spectrum is then passed through a Mel filter bank to obtain the Mel spectrum, whereby the non-uniformly spaced (linear) frequencies are converted to a uniform spacing on the Mel scale. Finally, cepstrum analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the features of that frame of speech.
  • Cepstrum analysis consists of taking the logarithm of the Mel spectrum and then applying an inverse transform; in practice the inverse transform is generally implemented as a discrete cosine transform (DCT), and the second through thirteenth DCT coefficients are taken as the MFCC coefficients.
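The pipeline just described (Mel filter bank, logarithm, DCT, coefficients two through thirteen) can be sketched in NumPy as follows. This is a simplified illustration: framing, pre-emphasis, and windowing are omitted, and the sample rate, FFT size, and filter count are illustrative assumptions rather than values taken from the application.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced uniformly on the Mel scale (non-uniform in Hz).
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frame, sample_rate=16000, n_filters=26, n_coeffs=12):
    # MFCC of one speech frame: power spectrum -> Mel filter bank -> log -> DCT,
    # keeping the second through thirteenth coefficients as described above.
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    mel_energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_mel = np.log(mel_energies + 1e-10)          # cepstrum step 1: logarithm
    # cepstrum step 2: DCT-II implemented directly as a cosine basis
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_filters), 2 * n + 1) / (2 * n_filters))
    cepstrum = basis @ log_mel
    return cepstrum[1:1 + n_coeffs]                 # coefficients 2..13

frame = np.sin(2 * np.pi * 300 * np.arange(512) / 16000)  # toy 512-sample frame
features = mfcc(frame)
```

Per the text, the per-frame vectors produced this way are stacked into the matrix that the UBM/i-vector extractor then operates on.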
  • The MFCCs of all frames of a voice are assembled into a vector matrix, and the most representative vector in this matrix is extracted with the universal background model (UBM) and the feature speech vector (i-vector) extractor; this vector serves as the feature speech vector of that voice. The UBM/i-vector extraction itself is an existing vector-matrix computation and is not repeated here.
  • The first comparison module 302 is configured to score the feature speech vectors of the registered voices pairwise against one another and to take the first score average as the first threshold. Specifically, the first comparison module 302 performs the pairwise scoring of the feature speech vectors using a vector dot-product algorithm and the PLDA algorithm.
  • The vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described in detail herein.
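As a sketch of this step, the code below uses a length-normalised dot product (cosine similarity) as a stand-in for the vector dot-product/PLDA scoring named in the text; the enrollment vectors are made-up examples.

```python
import numpy as np
from itertools import combinations

def score(u, v):
    # Cosine similarity as an assumed stand-in for the dot-product/PLDA backend.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def first_threshold(enroll_vectors):
    # Score every pair of enrollment feature vectors; their mean is the
    # first threshold used later as the baseline for verification.
    pair_scores = [score(u, v) for u, v in combinations(enroll_vectors, 2)]
    return sum(pair_scores) / len(pair_scores)

enrolled = [[0.9, 0.1], [0.85, 0.2], [0.95, 0.05]]  # e.g. N = 3 registered voices
t1 = first_threshold(enrolled)
```

With three registered voices this averages the three pairwise scores, matching the pairwise comparison the module performs.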
  • the obtaining module 303 is configured to acquire a verification voice, and calculate a feature voice vector of the verification voice. In the present embodiment, after acquiring the verification speech, the acquisition module 303 also calculates the feature speech vector of the verification speech by using the MFCC algorithm and the UBM model and the vector extractor.
  • The second comparison module 304 is configured to score the feature speech vector of the verification voice against the feature speech vector of each registered voice and to obtain a second score average.
  • Specifically, the feature speech vector of the verification voice is scored against each of the feature speech vectors of the plurality of registered voices to obtain a plurality of corresponding second score values, which are then averaged to obtain the second score average.
  • For example, the registration module 301 registers a preset number of voices, say three, whose corresponding feature speech vectors form a feature-speech-vector group containing three vectors.
  • The second comparison module 304 then scores the feature speech vector of the single acquired verification voice separately against each of the three vectors in the group, obtaining three score values, which are averaged to yield the second score average.
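Continuing the sketch under the same assumptions (cosine similarity in place of the actual scoring backend, made-up vectors), the second score average is the mean of the verification vector's scores against each registered vector:

```python
import numpy as np

def score(u, v):
    # Cosine similarity as an assumed stand-in for the actual scoring backend.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def second_average(enroll_vectors, verify_vector):
    # Score the single verification vector against each of the registered
    # feature vectors and average the resulting scores.
    scores = [score(e, verify_vector) for e in enroll_vectors]
    return sum(scores) / len(scores)

enrolled = [[0.9, 0.1], [0.85, 0.2], [0.95, 0.05]]  # three registered voices
avg = second_average(enrolled, [0.88, 0.12])        # one verification voice
```

The resulting `avg` is what the determination module compares against the second threshold.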
  • the determining module 305 is configured to determine whether the second score average is greater than a second threshold.
  • the second threshold is the first threshold plus a preset value.
  • The second threshold is greater than the first threshold; the preset value by which they differ can be tuned empirically through repeated experiments, and the present application does not limit this preset value.
  • the update module 306 is configured to update the registration voice according to the verification voice when the second score average is greater than the second threshold. In this embodiment, the update module 306 updates the registration voice in the following manner:
  • the update module 306 further determines whether the second score average is greater than a third threshold, and the third threshold is greater than the second threshold.
  • the update module 306 updates the verification voice whose second score average value is greater than the third threshold value into the registration voice, and registers according to the updated registration voice.
  • the third threshold can be determined by the following steps:
  • The update module 306 first selects all verification voices whose second score average is higher than the second threshold, counting them as N; it then sorts the second score averages of the selected verification voices from high to low; finally, it selects the (N/3)-th of these averages as the third threshold.
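A sketch of this selection step follows, under the assumption that "the N/3" means the (N/3)-th value of the descending sort; the exact indexing convention and the behaviour with very little history are not spelled out in the text and are assumptions here.

```python
def third_threshold(second_averages, second_threshold):
    # Select verification voices whose second score average exceeded the
    # second threshold, sort those averages from high to low, and take
    # the (N/3)-th one as the third threshold.
    passed = sorted((s for s in second_averages if s > second_threshold), reverse=True)
    n = len(passed)
    if n < 3:
        return None            # assumption: too little history, keep prior threshold
    return passed[n // 3 - 1]  # one reading of "the (N/3)-th" (1-based)

t3 = third_threshold([0.99, 0.95, 0.97, 0.91, 0.96, 0.93], 0.90)
```

Because the history of passed verifications grows over time, recomputing this value keeps the third threshold dynamic, as the next paragraph describes.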
  • The third threshold is thus set dynamically from the second score averages observed over successive verifications, so that the registration voice is updated dynamically according to the third threshold and can track the user's voice as it changes across different periods.
  • The magnitude relationship among the first, second, and third thresholds is: the first threshold is smaller than the second threshold, and the second threshold is smaller than the third threshold. The first threshold, being the first score average, essentially represents the average difference among the registered voices, while the second score average essentially reflects the difference between the verification voice and the registered voices; if the difference between the verification voice and the registered voices exceeds that average difference by more than the preset value, the verification voice differs considerably from the earlier registration voices, and the registration voice needs to be updated.
  • Setting a third threshold above the second threshold further filters out the verification voices that differ significantly from the registered voices, and the registration voice is then updated according to those verification voices.
  • the present application also proposes a voice recognition method for updating voiceprint data.
  • FIG. 3 is a schematic flowchart of the implementation of the first embodiment of the voice recognition method for updating voiceprint data in the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
  • Step S401 Register a preset number of registered voices, and calculate a feature voice vector of each registered voice in the preset number of registered voices.
  • the terminal device 100 registers N registered voices.
  • the voice recognition method for updating the voiceprint data is stored in the terminal device 100.
  • The terminal device 100 in this embodiment may be any terminal having a voice recognition function, such as a mobile phone, a portable computer, a personal digital assistant, a bank payment terminal, or an access control device; these devices can implement specific functions and applications through voice recognition technology.
  • The terminal device 100 acquires an effective voice when the user performs voice registration; acquisition can start when the user clicks to begin voice input and continue until the user stops the voice input, thereby avoiding unnecessary noise interference and improving the purity of the voice sample to be processed.
  • The above N registered voices are preferably three; of course, N may be set to another suitable positive integer as needed.
  • the terminal device 100 separately calculates a feature speech vector of each registered voice in the preset number of registered voices by:
  • The terminal device 100 extracts the MFCC features of each frame of each voice using the Mel-frequency cepstral coefficient (MFCC) method to form a matrix, and then applies a universal background model (UBM) and a feature speech vector (i-vector) extractor.
  • MFCC is an abbreviation of Mel-Frequency Cepstral Coefficients, and its computation involves two key steps: conversion to the Mel frequency scale, followed by cepstrum analysis.
  • First, each voice is segmented into frames to obtain a multi-frame speech spectrum; the spectrum is then passed through a Mel filter bank to obtain the Mel spectrum, whereby the non-uniformly spaced (linear) frequencies are converted to a uniform spacing on the Mel scale. Finally, cepstrum analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the features of that frame of speech.
  • The so-called cepstrum analysis consists of taking the logarithm of the Mel spectrum and then applying an inverse transform; in practice the inverse transform is generally implemented as a discrete cosine transform (DCT), and the second through thirteenth DCT coefficients are taken as the MFCC coefficients.
  • The MFCCs of all frames of a voice are assembled into a vector matrix, and the most representative vector in this matrix is extracted with the universal background model (UBM) and the feature speech vector (i-vector) extractor; this vector serves as the feature speech vector of that voice. The UBM/i-vector extraction itself is an existing vector-matrix computation and is not repeated here.
  • Step S402: scoring the feature speech vectors of the registered voices pairwise against one another, and taking the first score average as the first threshold.
  • Specifically, the terminal device 100 performs the pairwise scoring of the feature speech vectors using a vector dot-product algorithm and the PLDA algorithm.
  • The vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described in detail herein.
  • Step S403: acquiring a verification voice, and calculating a feature speech vector of the verification voice.
  • In this embodiment, after acquiring the verification voice, the terminal device 100 likewise calculates its feature speech vector using the MFCC algorithm together with the UBM model and the i-vector extractor.
  • Step S404: scoring the feature speech vector of the verification voice against the feature speech vector of each registered voice, and obtaining a second score average.
  • Specifically, the feature speech vector of the verification voice is scored against each of the feature speech vectors of the plurality of registered voices to obtain a plurality of corresponding second score values, which are then averaged to obtain the second score average.
  • For example, when a preset number of voices is registered, say three, the corresponding feature speech vectors form a feature-speech-vector group containing three vectors; during verification there is a single acquired verification voice with one feature speech vector.
  • This vector is scored separately against each of the three vectors in the group, yielding three score values, which are averaged to obtain the second score average.
  • Step S405: determining whether the second score average is greater than a second threshold, where the second threshold is the first threshold plus a preset value. If it is, step S406 is performed; otherwise, the flow ends.
  • The second threshold is greater than the first threshold; the preset value by which they differ can be tuned empirically through repeated experiments, and the present application does not limit this preset value.
  • Step S406 updating the registration voice according to the verification voice.
  • the terminal device 100 further updates the registration voice by:
  • the terminal device 100 first determines whether the second scoring average is greater than a third threshold, wherein the third threshold is greater than the second threshold. Then, the terminal device 100 updates the verification voice whose second score average value is greater than the third threshold value into the registration voice, and registers according to the updated registration voice.
  • the third threshold is determined by the following steps:
  • The terminal device 100 first selects all verification voices whose second score average is higher than the second threshold, counting them as N; it then sorts the second score averages of the selected verification voices from high to low; finally, it selects the (N/3)-th of these averages as the third threshold.
  • The third threshold is thus set dynamically from the second score averages observed over successive verifications, so that the registration voice is updated dynamically according to the third threshold and can track the user's voice as it changes across different periods.
  • The magnitude relationship among the first, second, and third thresholds is: the first threshold is smaller than the second threshold, and the second threshold is smaller than the third threshold. The first threshold, being the first score average, essentially represents the average difference among the registered voices, while the second score average essentially reflects the difference between the verification voice and the registered voices; if the difference between the verification voice and the registered voices exceeds that average difference by more than the preset value, the verification voice differs considerably from the earlier registration voices, and the registration voice needs to be updated.
  • Setting a third threshold above the second threshold further filters out the verification voices that differ significantly from the registered voices, and the registration voice is then updated according to those verification voices.
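Steps S401 through S406 described in this section can be drawn together into one sketch. All earlier caveats apply: cosine similarity stands in for the dot-product/PLDA scoring, the preset margin between the first and second thresholds is an arbitrary placeholder, and the replacement policy when refreshing enrollment (dropping the oldest registered vector) is an assumption not specified in the text.

```python
import numpy as np

class VoiceprintUpdater:
    def __init__(self, enroll_vectors, margin=0.002):
        # S401: register a preset number of voices (their feature vectors).
        self.enrolled = [np.asarray(v, float) for v in enroll_vectors]
        self.history = []  # second score averages of verifications that passed
        # S402: pairwise-score the enrollment vectors; the mean is the first threshold.
        n = len(self.enrolled)
        pairs = [self._score(self.enrolled[i], self.enrolled[j])
                 for i in range(n) for j in range(i + 1, n)]
        self.first_threshold = sum(pairs) / len(pairs)
        self.second_threshold = self.first_threshold + margin  # margin is a placeholder
        self.third_threshold = None

    @staticmethod
    def _score(u, v):
        # Cosine similarity as a stand-in for the actual scoring backend.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def verify(self, verify_vector):
        v = np.asarray(verify_vector, float)
        # S403/S404: average the verification vector's scores against each
        # enrollment vector (feature extraction itself happens upstream).
        second_avg = sum(self._score(e, v) for e in self.enrolled) / len(self.enrolled)
        # S405: compare against the second threshold.
        if second_avg <= self.second_threshold:
            return False
        self.history.append(second_avg)
        # Dynamic third threshold from the history of passed verifications.
        ranked = sorted(self.history, reverse=True)
        if len(ranked) >= 3:
            self.third_threshold = ranked[len(ranked) // 3 - 1]
        # S406: refresh enrollment when the average also clears the third threshold.
        if self.third_threshold is not None and second_avg > self.third_threshold:
            self.enrolled.pop(0)    # assumed policy: drop the oldest vector
            self.enrolled.append(v)
        return True

updater = VoiceprintUpdater([[0.9, 0.1], [0.85, 0.2], [0.95, 0.05]])
accepted = updater.verify([0.88, 0.12])
```

On the first few verifications there is not yet enough history to derive a third threshold, so the sketch accepts without updating enrollment until the history holds at least three passed verifications.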
  • The voice recognition method for updating voiceprint data proposed by the present application addresses a drawback of existing voice recognition methods: over time the user's voice may change due to factors such as age, physical condition, and environment, so performing each verification against the registration-time i-vector may cause verification to fail. With the present method, each verification by the user follows a compare-and-update process: whenever a verification meets the requirements, the user's registration information in the voiceprint library is updated, which improves the accuracy of subsequent voiceprint verification and adapts to changes in the registrant's voice over time.
  • The present application further provides another embodiment, namely a storage medium storing a voice recognition program for updating voiceprint data, the program being executable by at least one processor so as to cause the at least one processor to perform the steps of the voice recognition method for updating voiceprint data described above.
  • The methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also purely in hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied as a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

Disclosed is a voice recognition method for updating voiceprint data, comprising: registering a preset number of registration voices and calculating a feature voice vector of each registration voice; scoring the feature voice vectors of the registration voices pairwise against one another; acquiring a verification voice and calculating its feature voice vector; scoring the feature voice vector of the verification voice against the feature voice vectors of the registration voices; and updating the preset number of registration voices according to the verification voice. The present application further provides a terminal device and a storage medium. By means of the voice recognition method, terminal device, and storage medium provided by the present application, each verification of a user can be carried out according to a compare-and-update process, improving the accuracy of follow-up voiceprint verification; moreover, the method can adapt to changes in a registered person's voice as it fluctuates over time.

Description

更新声纹数据的语音识别方法、终端装置及存储介质Voice recognition method, terminal device and storage medium for updating voiceprint data
优先权申明Priority claim
本申请要求于2018年1月12日提交中国专利局、申请号为201810030623.1,发明名称为“更新声纹数据的语音识别方法、终端装置及存储介质”的中国专利申请的优先权,其内容全部通过引用结合在本申请中。This application claims priority to Chinese Patent Application No. 201810030623.1, entitled "Voice Recognition Method, Terminal Device and Storage Medium for Updating Voiceprint Data", which was submitted to the Chinese Patent Office on January 12, 2018. This is incorporated herein by reference.
技术领域Technical field
本申请涉及语音识别领域,尤其涉及一种更新声纹数据的语音识别方法、终端装置及存储介质。The present application relates to the field of voice recognition, and in particular, to a voice recognition method, a terminal device, and a storage medium for updating voiceprint data.
背景技术Background technique
目前传统的声纹注册识别方法中,一般包括如下几个步骤:1.特征提取,在获得用户注册语音数据后,对数据进行声音特征的提取。2.生成鉴别向量。3.比对验证,声纹库中保留用户注册时的特征向量(i-vector),每次验证时,验证语音提取的i-vector与注册时i-vector进行比较,即利用预先确定的余弦距离公式计算当前鉴别向量与用户对应的注册向量之间的距离,若距离在设定的阈值范围内,则认为两个i-vector为同一人的语音所产生,即验证成功;否则返回失败。At present, the conventional voiceprint registration recognition method generally includes the following steps: 1. Feature extraction, after obtaining the user registration voice data, extracting the sound features of the data. 2. Generate an authentication vector. 3. Alignment verification, the feature vector (i-vector) of the user registration is retained in the voiceprint library, and the i-vector for verifying the voice extraction is compared with the i-vector at the time of registration for each verification, that is, the predetermined cosine is utilized. The distance formula calculates the distance between the current authentication vector and the registration vector corresponding to the user. If the distance is within the set threshold range, the two i-vectors are considered to be generated by the same person's voice, that is, the verification is successful; otherwise, the return fails.
However, under this conventional registration-and-verification scheme, the identity vector used in every comparison is the one generated when the user first registered a voice. Over time, the user's voice may change due to factors such as age, physical condition, and environment, so comparing against the registration-time i-vector at every verification may cause the verification to fail.
Summary of the Invention
In view of this, the present application proposes a voice recognition method, a terminal device, and a storage medium for updating voiceprint data. By implementing the approach described herein, the accuracy of subsequent voiceprint verification can be improved, and the method can adapt to changes in the registrant's voice that fluctuate over time.
First, to achieve the above object, the present application provides a terminal device. The terminal device includes a memory and a processor, and the memory stores a voice recognition program for updating voiceprint data that can be run on the processor. When executed by the processor, the voice recognition program for updating voiceprint data implements the following steps:
registering a preset number of user registration voices, and calculating a feature voice vector of each user registration voice among the preset number of user registration voices;
comparing and scoring the feature voice vectors of the user registration voices pairwise, and taking the first score average as a first threshold;
obtaining a verification voice, and calculating a feature voice vector of the verification voice;
comparing and scoring the feature voice vector of the verification voice against each feature voice vector of the registration voices, and obtaining a second score average;
determining whether the second score average is greater than a second threshold, the second threshold being the first threshold plus a preset value; and
if the second score average is greater than the second threshold, updating the registration voices according to the verification voice.
In addition, to achieve the above object, the present application further provides a voice recognition method for updating voiceprint data, applied to a terminal device, the method comprising:
registering a preset number of user registration voices, and calculating a feature voice vector of each user registration voice among the preset number of user registration voices;
comparing and scoring the feature voice vectors of the user registration voices pairwise, and taking the first score average as a first threshold;
obtaining a verification voice, and calculating a feature voice vector of the verification voice;
comparing and scoring the feature voice vector of the verification voice against each feature voice vector of the registration voices, and obtaining a second score average;
determining whether the second score average is greater than a second threshold, the second threshold being the first threshold plus a preset value; and
if the second score average is greater than the second threshold, updating the registration voices according to the verification voice.
Further, to achieve the above object, the present application also provides a storage medium storing a voice recognition program for updating voiceprint data. The voice recognition program for updating voiceprint data can be executed by at least one processor, so that the at least one processor performs the steps of the voice recognition method for updating voiceprint data as described above.
Compared with the prior art, the voice recognition method, terminal device, and storage medium for updating voiceprint data proposed in the present application operate as follows. First, a preset number of user registration voices are registered, and a feature voice vector is calculated for each of them. Second, the feature voice vectors of the registration voices are compared and scored pairwise, and the first score average is taken as a first threshold. Next, a verification voice is obtained and its feature voice vector is calculated. The feature voice vector of the verification voice is then compared and scored against the feature voice vectors of the registration voices, and a second score average is obtained. Further, it is determined whether the second score average is greater than a second threshold, the second threshold being greater than the first threshold. Finally, if the second score average is greater than the second threshold, the preset number of registration voices are updated according to the verification voice. In this way, the drawback of existing voice recognition methods is overcome, namely that the user's voice may change over time due to factors such as age, physical condition, and environment, so that comparing against the registration-time i-vector at every verification may cause the verification to fail. Instead, every verification performed by the user follows the comparison-and-update process: each time a verification meeting the requirements passes, the user's registration information in the voiceprint library is updated, which improves the accuracy of subsequent voiceprint verification and adapts to changes in the registrant's voice that fluctuate over time.
Brief Description of the Drawings
FIG. 1 is a diagram of the operating environment of a terminal device according to a preferred embodiment of the present application;
FIG. 2 is a block diagram of the program modules of an embodiment of the voice recognition program for updating voiceprint data of the present application;
FIG. 3 is a flowchart of an embodiment of the voice recognition method for updating voiceprint data of the present application.
The realization of the objects, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely intended to explain the present application and are not intended to limit it.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are adopted merely to facilitate the description of the present application and have no specific meaning in themselves. Therefore, "module", "component", and "unit" may be used interchangeably.
The terminal device may be implemented in various forms. For example, the terminal device described in the present application may include mobile terminals such as mobile phones, tablet computers, notebook computers, palmtop computers, personal digital assistants (PDA), portable media players (PMP), navigation devices, wearable devices, smart wristbands, and pedometers, as well as fixed terminals such as digital TVs and desktop computers.
In the following description, a mobile terminal is taken as an example. Those skilled in the art will understand that, apart from elements used specifically for mobile purposes, the configurations according to the embodiments of the present application can also be applied to fixed-type terminals.
Referring to FIG. 1, it is a diagram of the operating environment of a terminal device 100 according to a preferred embodiment of the present application. The terminal device 100 includes a voice recognition program 300 for updating voiceprint data, a memory 20, a processor 30, a sensing unit 40, and the like. The sensing unit 40 may be any of various sensors that sense the user's voice and is mainly used to acquire the verification voice.
The memory 20 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. The processor 30 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
Based on the above operating-environment diagram of the terminal device 100, various embodiments of the method of the present application are proposed.
First, the present application proposes a voice recognition program 300 for updating voiceprint data.
Referring to FIG. 2, it is a block diagram of the program modules of the first embodiment of the voice recognition program 300 for updating voiceprint data of the present application.
In this embodiment, the voice recognition program 300 for updating voiceprint data includes a series of computer program instructions stored in the memory 20. When these computer program instructions are executed by the processor 30, the voice recognition operations for updating voiceprint data of the embodiments of the present application can be implemented. In some embodiments, the voice recognition program 300 for updating voiceprint data may be divided into one or more modules based on the specific operations implemented by the respective parts of the computer program instructions. For example, in FIG. 2, the voice recognition program 300 for updating voiceprint data may be divided into a registration module 301, a first comparison module 302, an acquisition module 303, a second comparison module 304, a determination module 305, and an update module 306, wherein:
The registration module 301 is configured to register a preset number of registration voices and calculate a feature voice vector of each of the preset number of registration voices. For example, the registration module 301 registers N registration voices. In this embodiment, the voice recognition program 300 for updating voiceprint data is stored in the terminal device 100. The terminal device 100 of this embodiment may be any terminal having a voice recognition function, such as a mobile phone, a portable computer, a personal digital assistant, a bank payment terminal, or an access control device; through voice recognition technology, such devices can implement various specific functions and applications. In addition, the terminal device 100 acquires the valid voice when the user performs voice registration, starting from the moment the user initiates voice recording and ending when the user stops recording; in this way, unnecessary noise interference can be avoided and the purity of the voice samples to be processed can be improved. The N registration voices are preferably 3 registration voices; of course, N may also be chosen as another suitable positive integer as required.
In this embodiment, the registration module 301 calculates the feature voice vector of each of the preset number of registration voices in the following manner:
the registration module 301 uses the Mel-frequency cepstral coefficient (MFCC) method to extract the MFCC features of each frame of each voice and assembles them into a matrix, and uses a universal background model (UBM) and a feature voice vector (i-vector) extractor to select the most essential features in the matrix, which form the feature voice vector.
Here, MFCC is the abbreviation of Mel-Frequency Cepstral Coefficients, and the computation involves two key steps: conversion to the Mel frequency scale, followed by cepstral analysis. In this embodiment, each voice is first divided into frames and the speech spectrum of each of the multiple frames is obtained; the obtained spectrum is then passed through a Mel filter bank to obtain the Mel spectrum, where the Mel filter bank converts non-uniform frequencies to a uniform scale. Finally, cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the features of that frame of speech. The so-called cepstral analysis consists of taking the logarithm of the Mel spectrum and then performing an inverse transform; in practice the inverse transform is generally implemented as a discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. In this way, the MFCCs of all frames of a voice are assembled into a vector matrix, and the most essential vector in the matrix is selected through the universal background model (UBM) and the feature voice vector (i-vector) extractor; this vector serves as the feature voice vector of the voice. Selecting the most essential vector in the matrix through the UBM and the i-vector extractor is an existing vector-matrix computation method and is not described further here.
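The framing, Mel-spectrum, and DCT steps described above can be sketched roughly as follows. This is a toy illustration under stated simplifications: the "filter bank" here is a crude equal-width band pooling rather than a true Mel-scaled filter bank, the frame length, hop size, and band count are placeholder choices, and the UBM/i-vector extraction stage is not shown.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frames(signal, frame_len=400, hop=160, n_bands=26, n_coeff=12):
    """Toy MFCC sketch: frame the signal, take a windowed log power
    spectrum, pool it into n_bands bands (a crude stand-in for a Mel
    filter bank), then keep DCT coefficients 2..13 as described above."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    window = np.hamming(frame_len)
    feats = []
    for frame in frames:
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        bands = np.array_split(power, n_bands)            # crude "filter bank"
        log_mel = np.log(np.array([b.sum() for b in bands]) + 1e-10)
        feats.append(dct(log_mel, norm='ortho')[1:1 + n_coeff])  # coeffs 2..13
    return np.array(feats)  # one row of coefficients per frame: the matrix
```

The returned matrix corresponds to the per-frame feature matrix from which the UBM and i-vector extractor would then derive a single feature voice vector.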
The first comparison module 302 is configured to compare and score the feature voice vectors of the registration voices pairwise, and to take the first score average as a first threshold. Specifically, the first comparison module 302 uses a vector dot-product algorithm and the PLDA algorithm to perform the pairwise comparison and scoring of the feature voice vectors of the voices. In this embodiment, the vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described further here.
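The pairwise scoring and first-threshold computation might be sketched as below. Cosine similarity is used here as a stand-in scorer; the application itself names dot-product and PLDA scoring, which are not reproduced in this sketch.

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    """Stand-in pairwise scorer (the application uses dot-product/PLDA)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def first_threshold(enroll_vectors):
    """First threshold: the average of the pairwise scores taken over
    every pair of enrollment feature voice vectors."""
    scores = [cosine(u, v) for u, v in combinations(enroll_vectors, 2)]
    return sum(scores) / len(scores)
```

For the preferred case of 3 registration voices this averages exactly three pairwise scores.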
The acquisition module 303 is configured to obtain a verification voice and calculate the feature voice vector of the verification voice. In this embodiment, after obtaining the verification voice, the acquisition module 303 likewise uses the above MFCC algorithm together with the UBM model and the vector extractor to calculate the feature voice vector of the verification voice.
The second comparison module 304 is configured to compare and score the feature voice vector of the verification voice against each feature voice vector of the registration voices, and to obtain a second score average. In this embodiment, the feature voice vector of the verification voice is compared and scored against the feature voice vectors of the multiple registration voices one by one, yielding multiple corresponding second scores, and the multiple second scores are averaged to obtain the second score average. In this embodiment, the registration module 301 has registered a preset number of registration voices, for example 3 registration voices, in which case the feature voice vectors corresponding to the registration voices form a feature voice vector group containing 3 voice feature vectors. During voice verification, only one verification voice is obtained, and accordingly there is one feature voice vector of the verification voice. The second comparison module 304 compares and scores the feature voice vector of the verification voice against the feature voice vectors of the registration voices pairwise; specifically, the obtained verification voice is compared and scored against each of the 3 feature voice vectors in the feature voice vector group of the registration voices, yielding three scores, and the three scores are averaged to obtain the second score average.
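A minimal sketch of the second-score-average computation, again with cosine similarity standing in for the dot-product/PLDA scorers named in the text:

```python
import numpy as np

def cosine(a, b):
    """Stand-in pairwise scorer (the application uses dot-product/PLDA)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def second_score_average(verify_vector, enroll_vectors):
    """Score the single verification i-vector against each enrollment
    i-vector in turn and average the resulting scores."""
    scores = [cosine(verify_vector, e) for e in enroll_vectors]
    return sum(scores) / len(scores)
```

With 3 registration voices this is simply the mean of three scores, as described above.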
The determination module 305 is configured to determine whether the second score average is greater than a second threshold. In this embodiment, the second threshold is the first threshold plus a preset value. The second threshold is greater than the first threshold, and the preset value by which they differ can be set by the user through repeated experiments according to the actual situation; the present application does not limit the magnitude of this preset value.
The update module 306 is configured to update the registration voices according to the verification voice when the second score average is greater than the second threshold. In this embodiment, the update module 306 updates the registration voices in the following manner:
the update module 306 further determines whether the second score average is greater than a third threshold, the third threshold being greater than the second threshold. The update module 306 updates those verification voices whose second score average is greater than the third threshold into the registration voices, and performs registration according to the updated registration voices. The third threshold can be determined through the following steps:
the update module 306 first selects all verification voices whose second score average is higher than the second threshold and counts them as N; then the second score averages corresponding to the selected verification voices are sorted from high to low; finally, the N/3-th second score average is taken as the third threshold. By dynamically setting the third threshold using the second score averages of the verification voices accumulated during ongoing verification, the registration voices are dynamically updated according to the third threshold, ensuring that the registration voices can change as the user's voice changes across different periods.
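The dynamic third-threshold selection might be sketched as follows. Treating "the N/3-th value" as the 1-indexed position ⌊N/3⌋ in the descending list (clamped to the first element for small N) is an assumption, since the exact indexing convention is not spelled out in the text.

```python
def third_threshold(passing_averages):
    """Dynamic third threshold: sort the second score averages of the
    verification voices that cleared the second threshold from high to
    low, then take the value at position N/3 (assumed 1-indexed and
    floored; this indexing convention is an assumption)."""
    ranked = sorted(passing_averages, reverse=True)
    n = len(ranked)
    return ranked[max(n // 3 - 1, 0)]
```

Because the list of passing averages grows as verifications accumulate, the returned threshold drifts with the user's recent voice, which is the adaptive behavior described above.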
It should be noted that the magnitude relationship among the first, second, and third thresholds is that the first threshold is less than the second threshold and the second threshold is less than the third threshold. The first threshold, that is, the first score average, is essentially an average difference between the registration voices, while the second score average essentially reflects a difference between the verification voice and the registration voices. If the difference between the verification voice and the registration voices exceeds the average difference among the registration voices by a preset value, the gap between the verification voice and the earlier registration voices is relatively large, and the registration voices need to be updated at that point. Furthermore, setting a third threshold on top of the second threshold makes it possible to further screen out the verification voices that differ markedly from the registration voices, and to update the registration voices according to such verification voices.
By executing the above program modules 301-306, the drawback of existing voice recognition methods can be overcome, namely that the user's voice may change over time due to factors such as age, physical condition, and environment, so that comparing against the registration-time i-vector at every verification may cause the verification to fail. Instead, every verification performed by the user follows the comparison-and-update process: each time a verification meeting the requirements passes, the user's registration information in the voiceprint library is updated, which improves the accuracy of subsequent voiceprint verification and adapts to changes in the registrant's voice that fluctuate over time.
In addition, the present application further proposes a voice recognition method for updating voiceprint data.
Referring to FIG. 3, it is a schematic flowchart of the implementation of the first embodiment of the voice recognition method for updating voiceprint data of the present application. In this embodiment, the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
Step S401: register a preset number of registration voices, and calculate a feature voice vector of each of the preset number of registration voices. For example, the terminal device 100 registers N registration voices. In this embodiment, the voice recognition method for updating voiceprint data is carried out by the terminal device 100. The terminal device 100 of this embodiment may be any terminal having a voice recognition function, such as a mobile phone, a portable computer, a personal digital assistant, a bank payment terminal, or an access control device; through voice recognition technology, such devices can implement various specific functions and applications. In addition, the terminal device 100 acquires the valid voice when the user performs voice registration, starting from the moment the user initiates voice recording and ending when the user stops recording; in this way, unnecessary noise interference can be avoided and the purity of the voice samples to be processed can be improved. The N registration voices are preferably 3 registration voices; of course, N may also be chosen as another suitable positive integer as required.
In this embodiment, the terminal device 100 calculates the feature voice vector of each of the preset number of registration voices in the following manner:
the terminal device 100 uses the Mel-frequency cepstral coefficient (MFCC) method to extract the MFCC features of each frame of each voice and assembles them into a matrix, and uses a universal background model (UBM) and a feature voice vector (i-vector) extractor to select the most essential features in the matrix, which form the feature voice vector.
Here, MFCC is the abbreviation of Mel-Frequency Cepstral Coefficients, and the computation involves two key steps: conversion to the Mel frequency scale, followed by cepstral analysis. In this embodiment, each voice is first divided into frames and the speech spectrum of each of the multiple frames is obtained; the obtained spectrum is then passed through a Mel filter bank to obtain the Mel spectrum, where the Mel filter bank converts non-uniform frequencies to a uniform scale. Finally, cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the features of that frame of speech. The so-called cepstral analysis consists of taking the logarithm of the Mel spectrum and then performing an inverse transform; in practice the inverse transform is generally implemented as a discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients. In this way, the MFCCs of all frames of a voice are assembled into a vector matrix, and the most essential vector in the matrix is selected through the universal background model (UBM) and the feature voice vector (i-vector) extractor; this vector serves as the feature voice vector of the voice. Selecting the most essential vector in the matrix through the UBM and the i-vector extractor is an existing vector-matrix computation method and is not described further here.
Step S402: compare and score the feature voice vectors of the registration voices pairwise, and take the first score average as a first threshold. Specifically, the terminal device 100 uses a vector dot-product algorithm and the PLDA algorithm to perform the pairwise comparison and scoring of the feature voice vectors of the voices. In this embodiment, the vector dot-product algorithm and the PLDA algorithm are existing algorithms and are not described further here.
Step S403: obtain a verification voice and calculate the feature voice vector of the verification voice. In this embodiment, after obtaining the verification voice, the terminal device 100 likewise uses the above MFCC algorithm together with the UBM model and the vector extractor to calculate the feature voice vector of the verification voice.
Step S404: compare and score the feature voice vector of the verification voice against each feature voice vector of the registration voices, and obtain a second score average. The feature voice vector of the verification voice is compared and scored against the feature voice vectors of the multiple registration voices one by one, yielding multiple corresponding second scores, and the multiple second scores are averaged to obtain the second score average. In this embodiment, a preset number of registration voices have been registered, for example 3 registration voices, in which case the feature voice vectors corresponding to the registration voices form a feature voice vector group containing 3 voice feature vectors. During voice verification, only one verification voice is obtained, and accordingly there is one feature voice vector of the verification voice. The terminal device 100 compares and scores the feature voice vector of the verification voice against the feature voice vectors of the registration voices pairwise; specifically, the obtained verification voice is compared and scored against each of the 3 feature voice vectors in the feature voice vector group of the registration voices, yielding three scores, and the three scores are averaged to obtain the second score average.
Step S405: determine whether the second score average is greater than a second threshold. In this embodiment, the second threshold is the first threshold plus a preset value. If the second score average is greater than the second threshold, step S406 is performed; otherwise, the flow ends. The second threshold is greater than the first threshold, and the preset value by which they differ can be set by the user through repeated experiments according to the actual situation; the present application does not limit the magnitude of this preset value.
Step S406: update the registered speeches according to the verification speech. In this embodiment, the terminal device 100 updates the registered speeches as follows:
The terminal device 100 first determines whether the second average score is greater than a third threshold, where the third threshold is greater than the second threshold. The terminal device 100 then adds the verification speech whose second average score is greater than the third threshold to the registered speeches, and registers according to the updated registered speeches. In this embodiment, the third threshold is determined by the following steps:
The terminal device 100 first selects all verification speeches whose second average score is higher than the second threshold and counts them as N; it then sorts the second average scores of the selected verification speeches from high to low; finally, it selects the N/3-th second average score as the third threshold. By dynamically setting the third threshold from the second average scores of verification speeches gathered over successive verifications, the registered speeches are updated dynamically according to the third threshold, ensuring that they track changes in the user's voice across different periods.
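The dynamic third-threshold selection just described can be sketched as follows. This is a sketch under a stated assumption: we read "the N/3-th" as the 1-based position N/3, rounded down and clamped to the top score, in the descending-sorted list; the function name is illustrative.

```python
def third_threshold(second_avgs, second_threshold):
    """Derive the third threshold from past verification scores.

    Keeps every second average score above the second threshold,
    sorts the kept scores from high to low, and returns the N/3-th one.
    """
    selected = [s for s in second_avgs if s > second_threshold]
    if not selected:
        return None  # no qualifying history yet; caller keeps its old threshold
    selected.sort(reverse=True)
    n = len(selected)
    index = max(n // 3 - 1, 0)  # 1-based "N/3-th" entry, clamped (assumed reading)
    return selected[index]
```

For six qualifying scores, N/3 = 2, so the second-highest score becomes the new third threshold.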
It should be noted that the first threshold is smaller than the second threshold, and the second threshold is smaller than the third threshold. The first threshold, i.e. the first average score, is essentially an average difference among the registered speeches, while the second average score essentially reflects the difference between the verification speech and the registered speeches. If the difference between the verification speech and the registered speeches exceeds the average difference among the registered speeches by a preset value, the verification speech has drifted considerably from the earlier registered speeches, and the registered speeches need to be updated. Further setting a third threshold above the second threshold filters out the verification speeches that differ markedly from the registered speeches, and the registered speeches are then updated with those verification speeches.
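The threshold ordering described in this paragraph (first threshold < second threshold = first threshold + preset value < third threshold) implies the following decision flow. This is our reading of steps S405–S406, with illustrative names and return labels:

```python
def registration_update_decision(second_avg, first_threshold, preset, third_threshold):
    """Decide what to do with a verification speech, given the thresholds.

    Assumes first_threshold < second_threshold (= first_threshold + preset)
    < third_threshold, as stated in the description.
    """
    second_threshold = first_threshold + preset
    if second_avg <= second_threshold:
        return "end"        # step S405 fails: the flow ends without updating
    if second_avg > third_threshold:
        return "update"     # step S406: fold the verification speech into the registration
    return "no_update"      # passes the second threshold but not the third
```

Only verification speeches that clear both the second and third thresholds are folded back into the registered speeches.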
Through the above steps S401–S406, the voice recognition method for updating voiceprint data proposed in this application addresses a drawback of existing voice recognition methods: a user's voice may change over time with age, physical condition, environment, and other factors, so comparing every verification against the i-vector captured at registration may cause verification to fail. With this method, each verification follows the comparison-and-update flow, i.e. every time a verification meets the requirements and passes, the user's registration information in the voiceprint library is updated, which improves the accuracy of subsequent voiceprint verification and adapts to fluctuations in the registrant's voice over time.
This application further provides another embodiment: a storage medium storing a voice recognition program for updating voiceprint data, the program being executable by at least one processor to cause the at least one processor to perform the steps of the voice recognition method for updating voiceprint data described above.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied as a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and comprising a number of instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. A voice recognition method for updating voiceprint data, applied to a terminal device, wherein the method comprises the steps of:
    registering a preset number of registered speeches, and calculating a feature speech vector of each of the preset number of registered speeches;
    scoring the feature speech vectors of the registered speeches in pairwise comparisons, and obtaining a first average score as a first threshold;
    acquiring a verification speech, and calculating a feature speech vector of the verification speech;
    scoring the feature speech vector of the verification speech against the feature speech vectors of the registered speeches in pairwise comparisons, and obtaining a second average score;
    determining whether the second average score is greater than a second threshold, the second threshold being the first threshold plus a preset value; and
    if the second average score is greater than the second threshold, updating the registered speeches according to the verification speech.
  2. The voice recognition method for updating voiceprint data according to claim 1, wherein the step of calculating the feature speech vector of each of the preset number of registered speeches comprises:
    extracting, using the MFCC method, the MFCC features of each frame of each speech and forming them into a matrix; and
    filtering out the most essential features in the matrix using a UBM and a feature speech vector extractor to form the feature speech vector.
  3. The voice recognition method for updating voiceprint data according to claim 1, wherein the step of calculating the feature speech vector of the verification speech comprises:
    extracting, using the MFCC method, the MFCC features of each frame of the verification speech and forming them into a matrix; and
    filtering out the most essential features in the matrix using a UBM and a feature speech vector extractor to form the feature speech vector of the verification speech.
  4. The voice recognition method for updating voiceprint data according to claim 2 or 3, wherein the step of extracting the MFCC features using the MFCC method and forming them into a matrix comprises:
    framing each speech to obtain the speech spectra of multiple frames;
    passing the speech spectra through a Mel filter bank to obtain Mel spectra;
    performing cepstral analysis on the Mel spectra to obtain Mel-frequency cepstral coefficients (MFCC); and
    forming the MFCC of the speech spectrum of each frame into a vector matrix.
  5. The voice recognition method for updating voiceprint data according to claim 1, wherein the step of scoring the feature speech vectors of the registered speeches in pairwise comparisons comprises:
    scoring the feature speech vectors of the speeches in pairwise comparisons using a vector dot-product algorithm and a PLDA algorithm.
  6. The voice recognition method for updating voiceprint data according to claim 1, wherein the step of updating the preset number of registered speeches according to the verification speech further comprises:
    determining whether the second average score is greater than a third threshold, the third threshold being greater than the second threshold; and
    adding the verification speech whose second average score is greater than the third threshold to the registered speeches, and registering according to the updated registered speeches.
  7. The voice recognition method for updating voiceprint data according to claim 6, wherein the third threshold is determined by the following steps:
    selecting all verification speeches whose second average score is higher than the second threshold, and counting them as N;
    sorting the second average scores of the selected verification speeches from high to low; and
    selecting the N/3-th second average score as the third threshold.
  8. A terminal device, comprising a memory and a processor, the memory storing a voice recognition program for updating voiceprint data that is executable on the processor, the program, when executed by the processor, implementing the following steps:
    registering a preset number of registered speeches, and calculating a feature speech vector of each of the preset number of registered speeches;
    scoring the feature speech vectors of the registered speeches in pairwise comparisons, and obtaining a first average score as a first threshold;
    acquiring a verification speech, and calculating a feature speech vector of the verification speech;
    scoring the feature speech vector of the verification speech against the feature speech vectors of the registered speeches in pairwise comparisons, and obtaining a second average score;
    determining whether the second average score is greater than a second threshold, the second threshold being the first threshold plus a preset value; and
    if the second average score is greater than the second threshold, updating the registered speeches according to the verification speech.
  9. The terminal device according to claim 8, wherein the step of calculating the feature speech vector of each of the preset number of registered speeches comprises:
    extracting, using the MFCC method, the MFCC features of each frame of each speech and forming them into a matrix; and
    filtering out the most essential features in the matrix using a UBM and a feature speech vector extractor to form the feature speech vector.
  10. The terminal device according to claim 8, wherein the step of calculating the feature speech vector of the verification speech comprises:
    extracting, using the MFCC method, the MFCC features of each frame of the verification speech and forming them into a matrix; and
    filtering out the most essential features in the matrix using a UBM and a feature speech vector extractor to form the feature speech vector of the verification speech.
  11. The terminal device according to claim 9 or 10, wherein the step of extracting the MFCC features using the MFCC method and forming them into a matrix comprises:
    framing each speech to obtain the speech spectra of multiple frames;
    passing the speech spectra through a Mel filter bank to obtain Mel spectra;
    performing cepstral analysis on the Mel spectra to obtain Mel-frequency cepstral coefficients (MFCC); and
    forming the MFCC of the speech spectrum of each frame into a vector matrix.
  12. The terminal device according to claim 8, wherein the step of scoring the feature speech vectors of the registered speeches in pairwise comparisons comprises:
    scoring the feature speech vectors of the speeches in pairwise comparisons using a vector dot-product algorithm and a PLDA algorithm.
  13. The terminal device according to claim 8, wherein the step of updating the preset number of registered speeches according to the verification speech further comprises:
    determining whether the second average score is greater than a third threshold, the third threshold being greater than the second threshold; and
    adding the verification speech whose second average score is greater than the third threshold to the registered speeches, and registering according to the updated registered speeches.
  14. The terminal device according to claim 13, wherein the third threshold is determined by the following steps:
    selecting all verification speeches whose second average score is higher than the second threshold, and counting them as N;
    sorting the second average scores of the selected verification speeches from high to low; and
    selecting the N/3-th second average score as the third threshold.
  15. A storage medium storing a voice recognition program for updating voiceprint data, the program being executable by at least one processor to cause the at least one processor to perform the following steps:
    registering a preset number of registered speeches, and calculating a feature speech vector of each of the preset number of registered speeches;
    scoring the feature speech vectors of the registered speeches in pairwise comparisons, and obtaining a first average score as a first threshold;
    acquiring a verification speech, and calculating a feature speech vector of the verification speech;
    scoring the feature speech vector of the verification speech against the feature speech vectors of the registered speeches in pairwise comparisons, and obtaining a second average score;
    determining whether the second average score is greater than a second threshold, the second threshold being the first threshold plus a preset value; and
    if the second average score is greater than the second threshold, updating the registered speeches according to the verification speech.
  16. The storage medium according to claim 15, wherein the step of calculating the feature speech vector of each of the preset number of registered speeches comprises:
    extracting, using the MFCC method, the MFCC features of each frame of each speech and forming them into a matrix; and
    filtering out the most essential features in the matrix using a UBM and a feature speech vector extractor to form the feature speech vector.
  17. The storage medium according to claim 16, wherein the step of extracting the MFCC features using the MFCC method and forming them into a matrix comprises:
    framing each speech to obtain the speech spectra of multiple frames;
    passing the speech spectra through a Mel filter bank to obtain Mel spectra;
    performing cepstral analysis on the Mel spectra to obtain Mel-frequency cepstral coefficients (MFCC); and
    forming the MFCC of the speech spectrum of each frame into a vector matrix.
  18. The storage medium according to claim 15, wherein the step of scoring the feature speech vectors of the registered speeches in pairwise comparisons comprises:
    scoring the feature speech vectors of the speeches in pairwise comparisons using a vector dot-product algorithm and a PLDA algorithm.
  19. The storage medium according to claim 15, wherein the step of updating the preset number of registered speeches according to the verification speech further comprises:
    determining whether the second average score is greater than a third threshold, the third threshold being greater than the second threshold; and
    adding the verification speech whose second average score is greater than the third threshold to the registered speeches, and registering according to the updated registered speeches.
  20. The storage medium according to claim 19, wherein the third threshold is determined by the following steps:
    selecting all verification speeches whose second average score is higher than the second threshold, and counting them as N;
    sorting the second average scores of the selected verification speeches from high to low; and
    selecting the N/3-th second average score as the third threshold.
PCT/CN2018/089415 2018-01-12 2018-05-31 Voice recognition method for updating voiceprint data, terminal device, and storage medium WO2019136911A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810030623.1A CN108269575B (en) 2018-01-12 2018-01-12 Voice recognition method for updating voiceprint data, terminal device and storage medium
CN201810030623.1 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019136911A1 true WO2019136911A1 (en) 2019-07-18

Family

ID=62775513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089415 WO2019136911A1 (en) 2018-01-12 2018-05-31 Voice recognition method for updating voiceprint data, terminal device, and storage medium

Country Status (2)

Country Link
CN (1) CN108269575B (en)
WO (1) WO2019136911A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400567B (en) * 2019-07-30 2021-10-19 深圳秋田微电子股份有限公司 Dynamic update method for registered voiceprint and computer storage medium
CN110660398B (en) * 2019-09-19 2020-11-20 北京三快在线科技有限公司 Voiceprint feature updating method and device, computer equipment and storage medium
CN111785280A (en) * 2020-06-10 2020-10-16 北京三快在线科技有限公司 Identity authentication method and device, storage medium and electronic equipment
CN112289322B (en) * 2020-11-10 2022-11-15 思必驰科技股份有限公司 Voiceprint recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070036289A1 (en) * 2005-07-27 2007-02-15 Fu Guo K Voice authentication system and method using a removable voice id card
CN104616655A (en) * 2015-02-05 2015-05-13 清华大学 Automatic vocal print model reconstruction method and device
CN106157959A (en) * 2015-03-31 2016-11-23 讯飞智元信息科技有限公司 Sound-groove model update method and system
CN106782564A (en) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech data
US9685161B2 (en) * 2012-07-09 2017-06-20 Huawei Device Co., Ltd. Method for updating voiceprint feature model and terminal
CN107424614A (en) * 2017-07-17 2017-12-01 广东讯飞启明科技发展有限公司 A kind of sound-groove model update method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104184587B (en) * 2014-08-08 2016-04-20 腾讯科技(深圳)有限公司 Vocal print generation method, server, client and system
WO2016015687A1 (en) * 2014-07-31 2016-02-04 腾讯科技(深圳)有限公司 Voiceprint verification method and device
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599365A (en) * 2020-04-08 2020-08-28 云知声智能科技股份有限公司 Adaptive threshold generation system and method for voiceprint recognition system
CN111599365B (en) * 2020-04-08 2023-05-05 云知声智能科技股份有限公司 Adaptive threshold generation system and method for voiceprint recognition system
CN112487804A (en) * 2020-11-25 2021-03-12 合肥三恩信息科技有限公司 Chinese novel speech synthesis system based on semantic context scene
CN112487804B (en) * 2020-11-25 2024-04-19 合肥三恩信息科技有限公司 Chinese novel speech synthesis system based on semantic context scene
TWI787996B (en) * 2021-09-08 2022-12-21 華南商業銀行股份有限公司 Voiceprint identification device for financial transaction system and method thereof
TWI817897B (en) * 2021-09-08 2023-10-01 華南商業銀行股份有限公司 Low-noise voiceprint identification device for financial transaction system and method thereof

Also Published As

Publication number Publication date
CN108269575A (en) 2018-07-10
CN108269575B (en) 2021-11-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899180

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.10.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18899180

Country of ref document: EP

Kind code of ref document: A1