WO2021013255A1 - Voiceprint recognition method and device - Google Patents

Voiceprint recognition method and device

Info

Publication number
WO2021013255A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
electronic device
voiceprint recognition
scene
recognition model
Prior art date
Application number
PCT/CN2020/104545
Other languages
English (en)
French (fr)
Inventor
曾夕娟
周小鹏
芦宇
胡伟湘
蔡丹蔚
李明
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
昆山杜克大学 (Duke Kunshan University)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 昆山杜克大学 (Duke Kunshan University)
Publication of WO2021013255A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32 Cryptographic mechanisms including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3226 Cryptographic mechanisms using a predetermined code, e.g. password, passphrase or PIN
    • H04L9/3231 Biological data, e.g. fingerprint, voice or retina

Definitions

  • The embodiments of the present application relate to the field of computer technology, and in particular to a voiceprint recognition method and device.
  • Voiceprint recognition is a technology that automatically recognizes and confirms a speaker's identity from voice signals.
  • The basic scheme of voiceprint recognition includes two stages: voiceprint registration and voiceprint verification.
  • In the voiceprint registration stage, the registrant's registered voice information is converted into a verification model; in the voiceprint verification stage, the verification voice information is scored for similarity against the verification model generated during registration, to determine whether the verification voice comes from the registrant.
  • Far-field voiceprint recognition is more challenging than near-field voiceprint recognition.
  • The main reason is distortion of the voice signal under far-field conditions, which manifests as the superposition of environmental noise and room reverberation.
  • When a speaker speaks in a room or a confined space, sound waves propagate through the air and reflect off walls and obstacles; because the materials absorb sound, the mid and high frequencies of the sound waves attenuate, and the reflections spread through the room again, producing reverberation. As a result, under far-field conditions the registered voice does not match the verification voice, and the accuracy of voiceprint recognition is low.
  • One existing solution is for users to register their voiceprints in the near field and in the far field separately. Specifically, so that the verification voice under far-field conditions matches the registered voice, this solution has the user perform voiceprint registration under near-field conditions and under far-field conditions.
  • However, requiring the user to perform multiple voiceprint registrations under near-field and far-field conditions degrades the user experience.
  • Another solution is front-end voice signal enhancement.
  • Near-field clean voice is collected as the registered voice in the voiceprint registration stage, and in the voiceprint verification stage the collected far-field voice data is processed by the front end to obtain enhanced voice, which is then used as the verification voice input.
  • However, the high-frequency part of the enhanced voice is still lost compared with the near-field clean voice, so the enhanced voice still does not match the registered voice, resulting in low robustness of the voiceprint recognition system and insignificant improvement in the recognition rate.
  • This application provides a voiceprint recognition method and device to solve the low robustness of prior-art voiceprint recognition methods.
  • In a first aspect, the voiceprint registration method includes: the electronic device prompts the user to enter a registered voice; the electronic device collects the registered voice entered by the user; the electronic device generates a sample voice under far-field conditions based on the registered voice; and the electronic device trains a voiceprint recognition model based on the sample voice.
  • The electronic device in the embodiment of the present application can generate sample voices under far-field conditions by simulation from the registered voice, without requiring the user to perform multiple voiceprint registrations under near-field and far-field conditions, thereby improving the user experience.
  • The electronic device trains the voiceprint recognition model on sample voices under far-field conditions, which can improve the robustness of the voiceprint recognition model and thus the accuracy of voiceprint recognition.
  • When the electronic device generates the sample voice under far-field conditions based on the registered voice, it can simulate the reverberation of sound under far-field conditions, and generate sample data of the registered voice under far-field conditions based on that reverberation simulation.
  • By simulating the reverberation of sound under far-field conditions, the sample voice of the registered voice under far-field conditions can be simulated.
  • Alternatively, when the electronic device generates the sample voice under far-field conditions based on the registered voice, it can generate noisy voice from the registered voice and noise data, simulate the reverberation of sound under far-field conditions, and generate sample data of the noisy voice under far-field conditions based on that reverberation simulation.
  • Using noisy voice when simulating the sample voice of the registered voice under far-field conditions makes the sample voice better match real scenes, which can improve the robustness of the voiceprint recognition model and thus the accuracy of voiceprint recognition.
  • When the electronic device simulates sound reverberation under far-field conditions, it can simulate the wall reflections of sound under far-field conditions to obtain a room impulse response (RIR).
  • Obtaining the RIR by simulating the wall reflections of sound makes it possible to simulate the reverberation of sound under far-field conditions.
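  • As a hedged illustration (not the patent's implementation): once an RIR has been obtained, the far-field sample voice can be produced by convolving the near-field registered voice with it. The rir array below is assumed to come from an image-source simulation such as the one sketched later in this document.

      import numpy as np
      from scipy.signal import fftconvolve

      def simulate_far_field(clean_speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
          """Convolve near-field clean speech with a room impulse response (RIR)
          to obtain a simulated reverberant, far-field version of the utterance."""
          reverberant = fftconvolve(clean_speech, rir)[:len(clean_speech)]
          # Rescale so the simulated sample keeps the original amplitude range.
          peak = np.max(np.abs(reverberant)) + 1e-12
          return reverberant * (np.max(np.abs(clean_speech)) / peak)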
  • When the electronic device trains the voiceprint recognition model based on the sample voice, it can perform feature extraction on the sample voice to obtain feature data, and train the voiceprint recognition model based on the feature data.
  • Extracting feature data from the sample voice can improve the robustness of the voiceprint recognition model.
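  • The patent does not name a specific feature type; as one common assumption, mel-frequency cepstral coefficients (MFCCs) can be extracted with the open-source librosa library, as in this minimal sketch.

      import librosa

      def extract_features(wav_path: str, n_mfcc: int = 20):
          """Load an utterance and return MFCC feature vectors (frames x n_mfcc)."""
          y, sr = librosa.load(wav_path, sr=16000)   # resample to 16 kHz
          mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
          return mfcc.T                              # one feature vector per frame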
  • The voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene.
  • In this case, the one or more sub-models can be trained separately based on the feature data.
  • Alternatively, the voiceprint recognition model may include a fusion model, where the fusion model corresponds to one or more scenes.
  • In this case, the fusion model is trained based on the feature data.
  • The electronic device then maintains a single fusion model, which saves computing resources.
  • In a second aspect, the voiceprint recognition method includes: the electronic device prompts the user to enter a verification voice; the electronic device collects the verification voice entered by the user; the electronic device inputs the verification voice into the voiceprint recognition model for matching, to obtain a matching result; and the electronic device determines, based on the matching result, whether the user is the registrant of the voiceprint recognition model.
  • The voiceprint recognition model can be trained using the method described in the first aspect above.
  • That is, the process of training the voiceprint recognition model may include: the electronic device prompts the user to enter a registered voice; the electronic device collects the registered voice entered by the user; the electronic device generates sample voices under far-field conditions based on the registered voice; and the electronic device trains the voiceprint recognition model based on the sample voices.
  • Using the voiceprint recognition model trained as in the first aspect, the electronic device can accurately identify whether the verification voice comes from the registrant.
  • The electronic device may perform scene detection on the verification voice after collecting it.
  • By detecting the scene in which the verification voice was uttered, the electronic device can perform voiceprint recognition on the verification voice in combination with that scene, thereby improving the accuracy of voiceprint recognition.
  • The voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to a scene; inputting the verification voice into the voiceprint recognition model for matching then includes: the electronic device inputs the verification voice into the sub-model corresponding to a first scene for matching, where the first scene is the result of the scene detection.
  • Performing voiceprint recognition on the verification voice in combination with the scene in which it was uttered can improve the accuracy of voiceprint recognition.
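  • A minimal sketch of this routing logic follows; detect_scene and extract_features are assumed helpers (see the sketches elsewhere in this document), and the sub-model scoring interface is illustrative rather than the patent's API.

      def verify(voice, sub_models: dict, threshold: float) -> bool:
          """Match a verification voice against the sub-model of its detected scene."""
          scene = detect_scene(voice)                # first scene, e.g. "home" (assumed helper)
          features = extract_features(voice)         # assumed helper, see MFCC sketch above
          score = sub_models[scene].score(features)  # similarity score from the scene sub-model
          return score >= threshold                  # True: voice is judged to be the registrant's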
  • The electronic device can evaluate the quality of the verification voice to obtain a quality evaluation result. If the quality evaluation result indicates that the verification voice is a high-quality voice, the electronic device may perform incremental learning on the voiceprint recognition model based on the verification voice.
  • The models in the voiceprint recognition model library are thereby updated, so that they become better and better suited to the user's actual usage scenarios.
  • When the electronic device performs incremental learning on the voiceprint recognition model based on the verification voice, it can perform data enhancement processing on the verification voice to obtain processed voice data.
  • The voiceprint recognition model is then incrementally learned based on the processed voice data.
  • The models in the voiceprint recognition model library are thereby updated, so that they become more and more robust as the user uses the device.
  • The models in the model library also become better and better suited to the user's actual usage scenarios.
  • Before performing the data enhancement processing, the electronic device may determine that the first scene in which the verification voice was uttered is a high-frequency scene.
  • The electronic device can then perform data enhancement processing on the verification voice to obtain j sample voices with different noise levels.
  • When the electronic device performs incremental learning on the voiceprint recognition model based on the processed voice data, it can group the j sample voices by noise level to obtain M groups of voice data, where M is an integer greater than 0 and not greater than j, and, based on the M groups of voice data, separately train the sub-model corresponding to the first scene to obtain M high-frequency sub-models, as sketched below.
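  • A sketch of this grouping-and-training step, under the assumption that each augmented sample carries a noise-level tag and that clone_submodel and model.train stand in for the device's actual model-update routines:

      from collections import defaultdict

      def train_high_freq_submodels(samples, clone_submodel):
          """samples: j (voice, noise_level) pairs produced by data enhancement.
          Groups them by noise level into M groups (M <= j) and trains one
          high-frequency sub-model per group from the first scene's sub-model."""
          groups = defaultdict(list)
          for voice, noise_level in samples:
              groups[noise_level].append(voice)
          high_freq_models = {}
          for level, voices in groups.items():   # M groups in total
              model = clone_submodel()           # copy of the first scene's sub-model (assumed)
              model.train(voices)                # placeholder training call
              high_freq_models[level] = model
          return high_freq_models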
  • In a third aspect, the voiceprint registration device includes: a first device, a microphone, and a processor, where the first device is a speaker or a display screen.
  • The processor is configured to: trigger the first device to prompt the user to enter a registered voice; collect the registered voice entered by the user through the microphone; generate a sample voice under far-field conditions based on the registered voice; and train a voiceprint recognition model based on the sample voice.
  • When the processor triggers the first device to prompt the user to enter the registered voice, it can trigger the speaker to play a prompt voice, where the prompt voice is used to prompt the user to enter the registered voice.
  • Alternatively, the processor may trigger the display screen to display prompt text, where the prompt text is used to prompt the user to enter the registered voice.
  • When generating sample voices under far-field conditions based on the registered voice, the processor can be specifically configured to: simulate the reverberation of sound under far-field conditions, and generate sample data of the registered voice under far-field conditions based on that reverberation simulation.
  • Alternatively, when generating sample voices under far-field conditions based on the registered voice, the processor can be specifically configured to: generate noisy voice from the registered voice and noise data; simulate the reverberation of sound under far-field conditions; and generate sample data of the noisy voice under far-field conditions based on that reverberation simulation.
  • When simulating sound reverberation under far-field conditions, the processor can be specifically configured to simulate the wall reflections of sound under far-field conditions to obtain the room impulse response (RIR).
  • When training the voiceprint recognition model based on the sample voice, the processor can be specifically configured to: extract features from the sample voice to obtain feature data, and train the voiceprint recognition model based on the feature data.
  • The voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene.
  • In this case, when training the voiceprint recognition model based on the feature data, the processor can be specifically configured to train the one or more sub-models separately based on the feature data.
  • Alternatively, the voiceprint recognition model may include a fusion model, where the fusion model corresponds to one or more scenes.
  • In this case, when training the voiceprint recognition model based on the feature data, the processor can be specifically configured to train the fusion model based on the feature data.
  • In a fourth aspect, the voiceprint recognition device includes: a first device, a microphone, and a processor, where the first device is a speaker or a display screen.
  • The processor is configured to: trigger the first device to prompt the user to enter a verification voice; collect the verification voice entered by the user through the microphone; input the verification voice into the voiceprint recognition model for matching to obtain a matching result; and determine, based on the matching result, whether the user is the registrant of the voiceprint recognition model.
  • The voiceprint recognition model is obtained through training by the voiceprint registration device of the third aspect.
  • The voiceprint registration device and the voiceprint recognition device can be one device.
  • In that case, the processor of the voiceprint recognition device can also be configured to: trigger the first device to prompt the user to enter a registered voice; collect the registered voice entered by the user through the microphone; generate sample voices under far-field conditions based on the registered voice; and train the voiceprint recognition model based on the sample voices.
  • Alternatively, the aforementioned voiceprint registration device and voiceprint recognition device may be two different devices.
  • In that case, the voiceprint registration device may include: a first device, a microphone, and a processor, where the first device is a speaker or a display screen.
  • Its processor is configured to: trigger the first device to prompt the user to enter a registered voice; collect the registered voice entered by the user through the microphone; generate a sample voice under far-field conditions based on the registered voice; and train a voiceprint recognition model based on the sample voice.
  • When the processor of the voiceprint recognition device triggers the first device to prompt the user to enter the verification voice, it can trigger the speaker to play a prompt voice, where the prompt voice is used to prompt the user to enter the verification voice.
  • Alternatively, the processor may trigger the display screen to display prompt text, where the prompt text is used to prompt the user to enter the verification voice.
  • The processor may also be configured to perform scene detection on the verification voice after the verification voice entered by the user is collected through the microphone.
  • The voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene.
  • When inputting the verification voice into the voiceprint recognition model for matching, the processor may be specifically configured to input the verification voice into the sub-model corresponding to a first scene for matching, where the first scene is the result of the scene detection.
  • The processor can also be configured to: if the user is the registrant of the voiceprint recognition model, evaluate the quality of the verification voice to obtain a quality evaluation result; and, if the quality evaluation result indicates that the verification voice is a high-quality voice, perform incremental learning on the voiceprint recognition model based on the verification voice.
  • When performing incremental learning on the voiceprint recognition model based on the verification voice, the processor can be specifically configured to: perform data enhancement processing on the verification voice to obtain processed voice data, and perform incremental learning on the voiceprint recognition model based on the processed voice data.
  • The processor may also be configured to determine, before performing the data enhancement processing on the verification voice, that the first scene in which the verification voice was uttered is a high-frequency scene.
  • When performing the data enhancement processing on the verification voice, the processor may be specifically configured to obtain j sample voices with different noise levels.
  • When performing incremental learning on the voiceprint recognition model based on the processed voice data, the processor can be specifically configured to: group the j sample voices by noise level to obtain M groups of voice data, where M is an integer greater than 0 and not greater than j, and, based on the M groups of voice data, separately train the sub-model corresponding to the first scene to obtain M high-frequency sub-models.
  • In a fifth aspect, a chip provided by an embodiment of the present application includes a processor and a communication interface, where the communication interface is used to receive code instructions and transmit them to the processor.
  • The processor is used to call the code instructions transmitted by the communication interface to: trigger the speaker or display screen of the electronic device to prompt the user to enter a registered voice; trigger the microphone of the electronic device to collect the registered voice entered by the user; generate a sample voice under far-field conditions based on the registered voice; and train a voiceprint recognition model based on the sample voice.
  • When the processor triggers the speaker to prompt the user to enter the registered voice, it can trigger the speaker to play a prompt voice, where the prompt voice is used to prompt the user to enter the registered voice.
  • When the processor triggers the display screen to prompt the user to enter the registered voice, it can trigger the display screen to display prompt text, where the prompt text is used to prompt the user to enter the registered voice.
  • When generating sample voices under far-field conditions based on the registered voice, the processor can be specifically configured to: simulate the reverberation of sound under far-field conditions, and generate sample data of the registered voice under far-field conditions based on that reverberation simulation.
  • Alternatively, when generating sample voices under far-field conditions based on the registered voice, the processor can be specifically configured to: generate noisy voice from the registered voice and noise data; simulate the reverberation of sound under far-field conditions; and generate sample data of the noisy voice under far-field conditions based on that reverberation simulation.
  • When simulating sound reverberation under far-field conditions, the processor can be specifically configured to simulate the wall reflections of sound under far-field conditions to obtain the room impulse response (RIR).
  • When training the voiceprint recognition model based on the sample voice, the processor can be specifically configured to: extract features from the sample voice to obtain feature data, and train the voiceprint recognition model based on the feature data.
  • The voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene.
  • In this case, when training the voiceprint recognition model based on the feature data, the processor can be specifically configured to train the one or more sub-models separately based on the feature data.
  • Alternatively, the voiceprint recognition model may include a fusion model, where the fusion model corresponds to one or more scenes.
  • In this case, when training the voiceprint recognition model based on the feature data, the processor can be specifically configured to train the fusion model based on the feature data.
  • In a sixth aspect, a chip provided by an embodiment of the present application includes a processor and a communication interface, where the communication interface is used to receive code instructions and transmit them to the processor.
  • The processor is used to call the code instructions transmitted by the communication interface to: trigger the speaker or the display screen of the electronic device to prompt the user to enter a verification voice; collect the verification voice entered by the user through the microphone; input the verification voice into the voiceprint recognition model for matching to obtain a matching result, where the voiceprint recognition model is obtained through training by the device of any one of claims 13 to 18; and determine, based on the matching result, whether the user is the registrant of the voiceprint recognition model.
  • When the processor triggers the speaker of the electronic device to prompt the user to enter the verification voice, it can trigger the speaker to play a prompt voice, where the prompt voice is used to prompt the user to enter the verification voice.
  • When the processor triggers the display screen of the electronic device to prompt the user to enter the verification voice, it can trigger the display screen to display prompt text, where the prompt text is used to prompt the user to enter the verification voice.
  • The processor may also call the code instructions transmitted by the communication interface to: after triggering the microphone of the electronic device to collect the verification voice entered by the user, perform scene detection on the verification voice.
  • The voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene.
  • When inputting the verification voice into the voiceprint recognition model for matching, the processor may be specifically configured to input the verification voice into the sub-model corresponding to a first scene for matching, where the first scene is the result of the scene detection.
  • The processor can also be configured to: if the user is the registrant of the voiceprint recognition model, evaluate the quality of the verification voice to obtain a quality evaluation result; and, if the quality evaluation result indicates that the verification voice is a high-quality voice, perform incremental learning on the voiceprint recognition model based on the verification voice.
  • When performing incremental learning on the voiceprint recognition model based on the verification voice, the processor can be specifically configured to: perform data enhancement processing on the verification voice to obtain processed voice data, and perform incremental learning on the voiceprint recognition model based on the processed voice data.
  • The processor may also be configured to determine, before performing the data enhancement processing on the verification voice, that the first scene in which the verification voice was uttered is a high-frequency scene.
  • When performing the data enhancement processing on the verification voice, the processor may be specifically configured to obtain j sample voices with different noise levels.
  • When performing incremental learning on the voiceprint recognition model based on the processed voice data, the processor can be specifically configured to: group the j sample voices by noise level to obtain M groups of voice data, where M is an integer greater than 0 and not greater than j, and, based on the M groups of voice data, separately train the sub-model corresponding to the first scene to obtain M high-frequency sub-models.
  • The present application also provides a computer-readable storage medium that includes instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
  • The present application also provides a computer program product including instructions which, when run, cause the methods described in the above aspects to be executed.
  • FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by this application;
  • FIG. 2 is a schematic flowchart of a voiceprint recognition method provided by this application;
  • FIG. 3 is a schematic diagram of a display screen provided by this application prompting the user to enter a registered voice;
  • FIG. 4 is a schematic diagram of a user triggering an electronic device to perform voiceprint verification provided by this application;
  • FIG. 5 is a schematic diagram of a display screen provided by this application outputting a recognition result;
  • FIG. 6 is a schematic diagram of a voiceprint recognition process provided by this application;
  • FIG. 7 is a schematic diagram of another display screen provided by this application outputting a recognition result;
  • FIG. 8 is a schematic diagram of another voiceprint recognition process provided by this application.
  • The electronic device in the embodiments of the present application is an electronic device with a voiceprint recognition function.
  • The electronic device in the embodiments of the present application can collect the user's voice data and perform voiceprint recognition on the voice data to determine whether the user is the registrant.
  • The electronic device in the embodiments of the present application may be a portable electronic device, such as a mobile phone, a tablet computer, an artificial intelligence (AI) smart voice terminal, a wearable device, or an augmented reality (AR)/virtual reality (VR) device.
  • Portable electronic devices include, but are not limited to, portable electronic devices running various operating systems.
  • The aforementioned portable electronic device may also be a vehicle-mounted terminal, a laptop computer, or the like. It should also be understood that the electronic devices in the embodiments of the present application may also be desktop computers or smart home devices (such as smart TVs and smart speakers), which is not limited.
  • As shown in FIG. 1, the electronic device includes a processor 110, an internal memory 121, an external memory interface 122, a camera 131, a display screen 132, a sensor module 140, a subscriber identification module (SIM) card interface 151, buttons 152, an audio module 160, a speaker 161, a receiver 162, a microphone 163, an earphone interface 164, a universal serial bus (USB) interface 170, a charging management module 180, a power management module 181, a battery 182, a mobile communication module 191, and a wireless communication module 192.
  • In some embodiments, the electronic device may also include motors, indicators, buttons, and so on.
  • It should be understood that the hardware structure shown in FIG. 1 is only an example.
  • The electronic device of the embodiments of the present application may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration.
  • The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
  • The processor 110 may include one or more processing units.
  • For example, the processor 110 may include an application processor (AP), a modem, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • A buffer may be provided in the processor 110 to store instructions and/or data.
  • In some embodiments, the buffer in the processor 110 may be a cache memory.
  • The buffer can be used to store instructions and/or data that have just been used, generated, or recycled by the processor 110. If the processor 110 needs the instructions or data again, it can call them directly from the buffer. This helps to reduce the time for the processor 110 to obtain instructions or data, thereby improving the efficiency of the system.
  • The internal memory 121 may be used to store programs and/or data.
  • In some embodiments, the internal memory 121 includes a program storage area and a data storage area.
  • The program storage area can store an operating system (such as Android or iOS), a computer program required for at least one function (such as a voiceprint recognition function or a sound playback function), and the like.
  • The data storage area can store data (such as audio data) created and/or collected during the use of the electronic device.
  • The processor 110 may call the programs and/or data stored in the internal memory 121 to cause the electronic device to execute a corresponding method, thereby implementing one or more functions.
  • For example, the processor 110 calls certain programs and/or data in the internal memory, so that the electronic device executes the voiceprint recognition method provided in the embodiments of the present application, thereby realizing the voiceprint recognition function.
  • The internal memory 121 may be a high-speed random access memory and/or a non-volatile memory.
  • For example, the non-volatile memory may include at least one of one or more magnetic disk storage devices, flash memory devices, and/or universal flash storage (UFS).
  • The external memory interface 122 may be used to connect an external memory card (for example, a Micro SD card) to expand the storage capacity of the electronic device.
  • The external memory card communicates with the processor 110 through the external memory interface 122 to realize the data storage function.
  • For example, the electronic device can save files such as images, music, and videos in the external memory card through the external memory interface 122.
  • The camera 131 can be used to capture moving and still images and the like.
  • Generally, the camera 131 includes a lens and an image sensor.
  • The optical image generated by an object through the lens is projected onto the image sensor and then converted into an electrical signal for subsequent processing.
  • The image sensor may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • The image sensor converts the light signal into an electrical signal, and then transfers the electrical signal to the ISP to be converted into a digital image signal.
  • The electronic device may include 1 or N cameras 131, where N is a positive integer greater than 1.
  • The display screen 132 may include a display panel for displaying a user interface.
  • The display panel can adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • The electronic device may include 1 or M display screens 132, where M is a positive integer greater than 1.
  • The electronic device may implement the display function through the GPU, the display screen 132, the application processor, and the like.
  • The sensor module 140 may include one or more sensors, for example, a touch sensor 140A, a gyroscope 140B, an acceleration sensor 140C, a fingerprint sensor 140D, a pressure sensor 140E, etc. In some embodiments, the sensor module 140 may also include an ambient light sensor, a distance sensor, a proximity light sensor, a bone conduction sensor, a temperature sensor, and the like.
  • The touch sensor 140A may also be referred to as a "touch panel".
  • The touch sensor 140A may be disposed on the display screen 132; the touch sensor 140A and the display screen 132 together form what is also called a "touch screen".
  • The touch sensor 140A is used to detect touch operations acting on or near it.
  • The touch sensor 140A may transmit the detected touch operation to the application processor to determine the type of the touch event.
  • The electronic device can provide visual output related to the touch operation through the display screen 132.
  • In other embodiments, the touch sensor 140A may also be disposed on the surface of the electronic device, at a position different from that of the display screen 132.
  • The gyroscope 140B can be used to determine the motion posture of the electronic device.
  • In some embodiments, the angular velocity of the electronic device around three axes (i.e., the x, y, and z axes) can be determined by the gyroscope 140B.
  • The gyroscope 140B can also be used for image stabilization.
  • For example, the gyroscope 140B detects the angle by which the electronic device shakes, calculates the distance that the lens module needs to compensate according to the angle, and lets the lens counteract the shake of the electronic device through reverse movement, thereby achieving anti-shake.
  • The gyroscope 140B can also be used for navigation and somatosensory game scenes.
  • The acceleration sensor 140C can detect the magnitude of the acceleration of the electronic device in various directions (generally along three axes). The magnitude and direction of gravity can be detected when the electronic device is stationary. The acceleration sensor 140C can also be used to recognize the posture of the electronic device, for applications such as landscape/portrait switching and pedometers.
  • The fingerprint sensor 140D is used to collect fingerprints. The electronic device can use the collected fingerprint characteristics to unlock the device, access application locks, take photos, and answer calls.
  • The pressure sensor 140E is used to sense pressure signals and can convert them into electrical signals.
  • In some embodiments, the pressure sensor 140E may be provided on the display screen 132. Touch operations that act on the same touch position but with different strengths can correspond to different operation instructions.
  • The SIM card interface 151 is used to connect a SIM card.
  • The SIM card can be inserted into the SIM card interface 151 or pulled out from it to achieve contact with or separation from the electronic device.
  • The electronic device may support 1 or K SIM card interfaces 151, where K is a positive integer greater than 1.
  • The SIM card interface 151 may support a Nano SIM card, a Micro SIM card, a SIM card, etc.
  • The same SIM card interface 151 can hold multiple cards at the same time, and the types of the multiple cards can be the same or different.
  • The SIM card interface 151 can also be compatible with different types of SIM cards.
  • In some embodiments, the SIM card interface 151 may also be compatible with external memory cards.
  • The electronic device interacts with the network through the SIM card to realize functions such as calls and data communication.
  • In some embodiments, the electronic device may also adopt an eSIM, that is, an embedded SIM card.
  • The eSIM card can be embedded in the electronic device and cannot be separated from it.
  • The button 152 may include a power button, a volume button, and the like.
  • The button 152 may be a mechanical button or a touch button.
  • The electronic device can receive button input and generate button signal input related to user settings and function control of the electronic device.
  • The electronic device can implement audio functions through the audio module 160, the speaker 161, the receiver 162, the microphone 163, the earphone interface 164, and the application processor, for example, an audio playback function, a recording function, a voiceprint registration function, a voiceprint verification function, a voiceprint recognition function, etc.
  • The audio module 160 can be used to perform digital-to-analog and/or analog-to-digital conversion on audio data, and can also be used to encode and/or decode audio data.
  • For example, the audio module 160 may be set independently of the processor, or may be set in the processor 110, or some functional modules of the audio module 160 may be set in the processor 110.
  • The speaker 161, also called a "loudspeaker", is used to convert audio data into sound and play the sound.
  • For example, the electronic device 100 may play music, conduct a hands-free call, or issue a voice prompt through the speaker 161.
  • The receiver 162, also called an "earpiece", is used to convert audio data into sound and play the sound. For example, when the electronic device 100 answers a call, the receiver 162 may be brought close to the ear to answer the call.
  • The microphone 163, also called a "mic", is used to collect sounds (such as ambient sounds, including sounds made by people and sounds made by devices) and convert the sounds into audio electrical data.
  • The user can speak close to the microphone 163, and the microphone 163 collects the sound made by the user.
  • The microphone 163 can collect the surrounding sound in real time to obtain audio data.
  • What the microphone 163 collects depends on the environment. For example, when the surrounding environment is relatively noisy and the user utters the verification speech, the sound collected by the microphone 163 includes the surrounding environmental noise and the sound of the user uttering the verification speech.
  • In a quiet near-field environment, the sound collected by the microphone 163 is the user's verification speech.
  • In a noisy far-field environment, the sound collected by the microphone 163 is the superposition of the surrounding environmental noise and the reverberation of the user's verification speech.
  • When the voiceprint recognition function of the electronic device is turned on but the user does not utter the verification speech, the sound collected by the microphone 163 is only the surrounding environmental noise.
  • The electronic device may be provided with at least one microphone 163.
  • For example, two microphones 163 may be provided in the electronic device, which can realize a noise reduction function in addition to collecting sound.
  • Three, four, or more microphones 163 may also be provided, so that in addition to sound collection and noise reduction, sound source identification or directional recording functions can also be realized.
  • The earphone interface 164 is used to connect wired earphones.
  • The earphone interface 164 may be the USB interface 170, or a 3.5 mm open mobile terminal platform (OMTP) standard interface, a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface, etc.
  • The USB interface 170 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like.
  • The USB interface 170 can be used to connect a charger to charge the electronic device, to transfer data between the electronic device and peripheral devices, and to connect earphones and play audio through them.
  • The USB interface 170 can also be used to connect other electronic devices, such as AR devices and computers.
  • The charging management module 180 is used to receive charging input from a charger.
  • The charger can be a wireless charger or a wired charger.
  • In some embodiments of wired charging, the charging management module 180 may receive the charging input of a wired charger through the USB interface 170.
  • In some embodiments of wireless charging, the charging management module 180 may receive wireless charging input through the wireless charging coil of the electronic device. While charging the battery 182, the charging management module 180 can also supply power to the electronic device through the power management module 181.
  • The power management module 181 is used to connect the battery 182, the charging management module 180, and the processor 110.
  • The power management module 181 receives input from the battery 182 and/or the charging management module 180 and supplies power to the processor 110, the internal memory 121, the display screen 132, the camera 131, and the like.
  • The power management module 181 can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance).
  • In some other embodiments, the power management module 181 may also be provided in the processor 110.
  • In other embodiments, the power management module 181 and the charging management module 180 may also be provided in the same device.
  • The mobile communication module 191 can provide wireless communication solutions applied to the electronic device, including 2G/3G/4G/5G and the like.
  • The mobile communication module 191 may include filters, switches, power amplifiers, low-noise amplifiers (LNA), and the like.
  • The wireless communication module 192 can provide wireless communication solutions applied to the electronic device, including wireless local area network (WLAN) (such as a Wi-Fi network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
  • The wireless communication module 192 may be one or more devices integrating at least one communication processing module.
  • The antenna 1 of the electronic device is coupled with the mobile communication module 191, and the antenna 2 is coupled with the wireless communication module 192, so that the electronic device can communicate with other devices.
  • Specifically, the mobile communication module 191 may communicate with other devices through the antenna 1, and the wireless communication module 192 may communicate with other devices through the antenna 2.
  • The following describes the voiceprint recognition method provided by the embodiments of the present application in detail with reference to the drawings and application scenarios.
  • The following embodiments can all be implemented in the electronic device 100 having the above hardware structure.
  • Near-field conditions: the distance between the sound source and the microphone (mic) is relatively short, for example, the sound source is within 1 meter of the mic.
  • Near-field voice: voice data collected under near-field conditions. For example, when the distance between the sound source and the mic is less than 1 meter, the voice data collected by the mic for the sound source is near-field voice.
  • Near-field voice can include near-field clean voice and near-field noisy voice, where near-field clean voice is noise-free voice data collected under near-field conditions, and near-field noisy voice is voice data with noise collected under near-field conditions.
  • Far-field conditions: the distance between the sound source and the microphone (mic) is relatively long, for example, the sound source is between 1 meter and 10 meters from the mic.
  • Far-field voice: voice data collected under far-field conditions. For example, when the distance between the sound source and the mic is 5 meters, the voice data collected by the mic for the sound source is far-field voice.
  • Far-field voice can include far-field clean voice and far-field noisy voice, where far-field clean voice is noise-free voice data collected under far-field conditions, and far-field noisy voice is voice data with noise collected under far-field conditions.
  • Voiceprint recognition model: a data model that the electronic device establishes based on methods such as the Gaussian mixture model-universal background model (GMM-UBM), support vector machine (SVM), joint factor analysis (JFA), identity vector (i-vector), or x-vector. After establishing the initial voiceprint recognition model, the electronic device uses sample data to train it.
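  • As a simplified illustration of the GMM-UBM family (not the patent's exact algorithm): enrollment fits a speaker GMM on the registrant's feature vectors, and verification scores a log-likelihood ratio against a universal background model fitted beforehand on many speakers' features.

      import numpy as np
      from sklearn.mixture import GaussianMixture

      def enroll(features: np.ndarray, n_components: int = 64) -> GaussianMixture:
          """Fit a speaker-specific GMM on the registrant's feature vectors."""
          gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
          gmm.fit(features)
          return gmm

      def llr_score(speaker_gmm, ubm, features: np.ndarray) -> float:
          """Average per-frame log-likelihood ratio: speaker model vs. background model.
          Higher values indicate the features more likely come from the registrant."""
          return speaker_gmm.score(features) - ubm.score(features)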
  • Multi-scene fusion model: the initial voiceprint recognition model is trained using sample data from multiple scenes; after training, the voiceprint recognition model can be regarded as a multi-scene fusion model.
  • Single-scene model: the initial voiceprint recognition model is trained using sample data of one scene to obtain a single-scene model.
  • In this case, the model corresponding to each scene is a single-scene model.
  • That is, the sample data of one scene is used to train the initial voiceprint recognition model, and after training the model can be regarded as the model corresponding to that scene.
  • For example, the sample data of the home scene is used to train the initial voiceprint recognition model to obtain the model corresponding to the home scene (or the home model), and the sample data of the vehicle scene is used to train the initial voiceprint recognition model to obtain the model corresponding to the vehicle scene (or the vehicle model). By separately training the initial voiceprint recognition model with sample data of different scenes, a single-scene model corresponding to each scene can be obtained.
  • Incremental learning: whenever new sample data is added, the voiceprint recognition model does not need to be rebuilt; instead, the changes caused by the new sample data are applied on the basis of the original voiceprint recognition model. That is, the previously trained voiceprint recognition model is further trained with the new sample data, so that the model is continuously updated, as in the sketch below.
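  • A minimal sketch of incremental learning using scikit-learn's partial_fit interface; the classifier here is only a stand-in for the voiceprint recognition model, whose actual form the patent does not fix, and the data is dummy data.

      import numpy as np
      from sklearn.linear_model import SGDClassifier

      rng = np.random.default_rng(0)
      X0, y0 = rng.normal(size=(100, 20)), rng.integers(0, 2, 100)  # initial training data (dummy)
      clf = SGDClassifier(loss="log_loss").partial_fit(X0, y0, classes=[0, 1])

      # Later, when newly verified high-quality samples arrive, update the
      # existing model in place instead of rebuilding it from scratch:
      X_new, y_new = rng.normal(size=(10, 20)), rng.integers(0, 2, 10)
      clf.partial_fit(X_new, y_new)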
  • Feature extraction: a method of transforming data to highlight its representative features.
  • In the embodiments of the present application, it refers to a method and process of transforming voice data to extract characteristic information from it.
  • Scene detection: by extracting the background data of the voice data, the scene in which the voice data was collected can be judged; one possible approach is sketched below.
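  • One hedged way to realize such scene detection (the patent does not specify the technique): average the background (non-speech) spectrum of the recording and match it against stored per-scene noise profiles. The profile vectors below are placeholders, not values from the patent.

      import numpy as np

      # Assumed per-scene background profiles (e.g. averaged log-mel spectra), learned offline.
      SCENE_PROFILES = {
          "home": np.full(40, 0.2),
          "vehicle": np.full(40, 0.8),
      }

      def detect_scene(background_frames: np.ndarray) -> str:
          """background_frames: (frames x bins) spectra of non-speech segments.
          Returns the scene whose stored profile is most similar (cosine similarity)."""
          profile = background_frames.mean(axis=0)
          def cos(a, b):
              return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
          return max(SCENE_PROFILES, key=lambda s: cos(profile, SCENE_PROFILES[s]))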
  • FIG. 2 exemplarily shows the flow of a voiceprint recognition method provided by an embodiment of the present application; the method is executed by an electronic device.
  • As described above, the basic scheme of voiceprint recognition includes two stages: voiceprint registration and voiceprint verification. Voiceprint registration can be implemented through steps S201 to S204, and voiceprint verification through steps S205 to S209.
  • The electronic device collects the registered voice entered by the user.
  • The registered voice entered by the user may be a near-field clean voice.
  • Specifically, the electronic device may collect the surrounding sound through the microphone 163 and obtain the registered voice entered by the user.
  • The user can speak the registered voice when prompted by the electronic device.
  • For example, the electronic device can display text on the display screen 132 to prompt the user to speak the registration phrase "1234567".
  • The electronic device may also issue a voice prompt through the speaker 161, and so on.
  • The electronic device may automatically prompt the user to speak the registered voice when the user activates the voiceprint recognition function of the electronic device for the first time, or the user may operate the electronic device so that it prompts the user to speak the registered voice.
  • The user can also trigger the electronic device to prompt for the registered voice when subsequently activating the voiceprint recognition function.
  • The user can enter the registered voice multiple times during voiceprint registration, which can improve the accuracy of voiceprint recognition.
  • The electronic device may store the registered voice in a high-quality voice sample library, where the high-quality voice sample library is used to store voices whose voice quality score is greater than or equal to a quality threshold.
  • the electronic device performs data enhancement processing on the registered voice included in the high-quality voice sample library to obtain multiple sample voices.
  • the sample voice can be, but is not limited to: noisy voice generated from registered voice, far-field voice generated from registered voice, far-field noisy voice generated from registered voice, etc.
  • In this way, the electronic device can generate noisy speech, far-field speech, etc. based on the registered voice, without requiring the user to register separately in near-field and far-field scenes, thereby improving the user experience.
  • When the electronic device generates a noisy voice based on the registered voice, this can be achieved by adding the registered voice and one or more noise sources to a simulated room and processing them together to obtain the noisy voice. Specifically, the electronic device can generate noisy voices with different noise levels from the registered voice; for example, different scenes may correspond to different noise levels, so the electronic device can simulate, for each scene, the noisy speech corresponding to that scene.
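As an illustration of the noise-level simulation just described, the following sketch mixes a noise source into a clean registered utterance at a chosen signal-to-noise ratio. The function name, the fixed-SNR interface, and the sample SNR values are assumptions for illustration, not the procedure specified by this application.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay a noise recording on clean registered speech at a target SNR
    (a minimal sketch of the noise-level simulation described above)."""
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise to reach the requested signal-to-noise ratio.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# One registered utterance can yield several noise levels (e.g. per scene):
# noisy_versions = [mix_at_snr(clean, noise, snr) for snr in (20, 10, 5)]
```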
  • When the electronic device generates a far-field voice based on the registered voice, it can use the image source model (ISM) algorithm. The ISM algorithm simulates wall reflections of the sound with virtual sound sources and calculates the room impulse response (RIR) from signal delay and attenuation parameters; sound reaching a wall undergoes a lossy reflection, and the RIR is computed to simulate the reverberation of the sound under far-field conditions. The far-field voice corresponding to the registered voice is then generated according to the RIR simulation.
  • Specifically, the electronic device can generate far-field voices of different far-field levels based on the registered voice. For example, different scenes may correspond to different far-field distances, so the electronic device can simulate, for each scene, the far-field speech corresponding to that scene.
  • The electronic device can also use other methods to simulate reverberation under far-field conditions, for example, convolving the sound with an impulse response to simulate its reverberation under far-field conditions, and so on.
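A minimal sketch of the convolution-based reverberation simulation mentioned above is given below. A full ISM implementation would place image sources according to the room geometry; here a crude synthetic RIR (a direct path plus an exponentially decaying noise tail, with assumed rt60 and length parameters) is convolved with the clean speech.

```python
import numpy as np
from scipy.signal import fftconvolve

def toy_rir(fs: int = 16000, rt60: float = 0.4, length_s: float = 0.5,
            seed: int = 0) -> np.ndarray:
    """Crude synthetic room impulse response; the shape and parameters are
    illustrative assumptions, not an ISM computation."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    tail = rng.standard_normal(n) * np.exp(-6.9 * t / rt60)  # ~60 dB decay at rt60
    rir = np.zeros(n)
    rir[0] = 1.0            # direct path
    rir += 0.3 * tail       # reflections / reverberation
    return rir / np.max(np.abs(rir))

def simulate_far_field(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Far-field speech = clean speech convolved with the room impulse response."""
    return fftconvolve(clean, rir)[: len(clean)]
```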
  • When the electronic device generates a far-field noisy voice based on the registered voice, this can be achieved as follows: add the registered voice and one or more noise sources to a simulated room and process them to obtain a noisy voice; then use the ISM algorithm to calculate the RIR, and generate the far-field noisy speech corresponding to the noisy voice according to the RIR simulation.
  • Specifically, the electronic device can generate far-field noisy speech with different far-field levels and different noise levels based on the registered voice. For example, the electronic device can simulate the registered voice according to the noise characteristics and far-field characteristics of a specific scene to generate the far-field noisy speech corresponding to that scene.
  • In the above process, the noise level can be understood as the noise intensity level, and the far-field level as the far-field distance level.
  • S204 The electronic device performs feature extraction on the sample voice, and trains a model in the voiceprint recognition model library based on the extracted features to obtain a trained model.
  • Exemplarily, the models in the voiceprint recognition model library can be, but are not limited to, established using methods such as GMM-UBM, SVM, JFA, I-vector, X-vector, and the like.
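For concreteness, a minimal GMM-UBM-flavored sketch using scikit-learn is shown below: a universal background model is fitted on pooled background features, a speaker model on the enrollee's features, and verification is scored as a log-likelihood ratio. The component count and the plain refit (rather than the MAP adaptation typical of GMM-UBM systems) are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_feats: np.ndarray, n_components: int = 64) -> GaussianMixture:
    # Universal background model fitted on pooled, speaker-independent features.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return ubm.fit(background_feats)

def enroll_speaker(speaker_feats: np.ndarray, n_components: int = 64) -> GaussianMixture:
    # Speaker model fitted on the enrollee's (possibly augmented) features.
    model = GaussianMixture(n_components=n_components, covariance_type="diag")
    return model.fit(speaker_feats)

def llr_score(feats: np.ndarray, speaker: GaussianMixture, ubm: GaussianMixture) -> float:
    # Average log-likelihood ratio of the verification features.
    return float(speaker.score(feats) - ubm.score(feats))
```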
  • the voiceprint recognition model library may include a multi-scene fusion model. Therefore, the electronic device can use sample voices of multiple scenes to train the multi-scene fusion model.
  • the voiceprint recognition model library may also include models corresponding to multiple scenes respectively. Therefore, for the model corresponding to each scene, the electronic device can use the sample voice corresponding to the scene for training.
  • Alternatively, the voiceprint recognition model library may include both a multi-scene fusion model and a model corresponding to each of multiple scenes. In that case, the electronic device can use sample voices from multiple scenes to train the multi-scene fusion model and, for the model corresponding to each scene, use the sample voice corresponding to that scene for training.
  • If the voiceprint recognition model is a multi-scene fusion model, inputting the verification voice into it yields a single matching score; after learning the data in the high-quality speech sample library, the multi-scene fusion model will match the actual usage scenes better and better. If the voiceprint recognition model consists of models corresponding to multiple scenes, scene detection is performed on the entered verification voice, and the verification voice is input into the model of the corresponding scene for voiceprint recognition. Further, if the verification voice has passed the quality assessment and entered the high-quality voice sample library, it will be data-enhanced, and the model corresponding to the scene can be updated through incremental learning, so that the model of the corresponding scene matches the actual scene better and better.
  • When extracting features from the sample speech, the electronic device can use, but is not limited to, methods such as filter banks (FBank), Mel-frequency cepstral coefficients (MFCC), D-vector, and so on.
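A sketch of MFCC extraction, one of the feature types named above; librosa, the 16 kHz sampling rate, and the 20-coefficient setting are assumptions for illustration, and any FBank/MFCC front end would serve.

```python
import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Load an utterance and return MFCC features, one row per frame."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.T
```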
  • S205 The electronic device collects the verification voice entered by the user.
  • In specific implementation, the user can speak the verification voice when prompted by the electronic device. The method by which the electronic device prompts the user to speak the verification voice is similar to the method of prompting for the registered voice, and the repeated details are not described again.
  • The electronic device may collect the verification voice entered by the user when triggered by a user operation. For example, the user triggers a verification instruction by operating the electronic device; after receiving the verification instruction, the electronic device prompts the user to enter the verification voice and collects the verification voice entered by the user.
  • For example, the user can trigger the verification instruction by tapping the position of the icon corresponding to the voiceprint recognition function on the touch screen of the electronic device, so that the electronic device prompts the user to speak the verification voice; as another example, the user can trigger it by operating a physical entity (such as a physical key, a mouse, or a joystick); as yet another example, the user can trigger the verification instruction through a specific gesture (such as double-tapping the touch screen of the electronic device), so that the electronic device prompts the user to speak the verification voice.
  • For example, the user can speak the keyword "voiceprint recognition" to an electronic device (such as a smartphone or a vehicle-mounted device); after collecting the keyword "voiceprint recognition" through the microphone 163, the electronic device triggers a verification instruction and prompts the user to speak the verification voice.
  • Alternatively, when the user speaks a control command for controlling the electronic device, the electronic device can collect the control command and use it as the verification voice for voiceprint recognition. That is, the electronic device triggers the verification instruction when receiving the control command and uses the control command as the verification voice for voiceprint recognition.
  • For example, as shown in FIG. 4, the user can issue the control command "open music" to an electronic device (such as a smartphone or a vehicle-mounted device); after collecting the user's voice "open music" through the microphone 163, the electronic device uses that voice as the verification voice for voiceprint recognition.
  • As another example, the user can issue the control command "set to 27°C" to an electronic device (such as a smart air conditioner); after collecting the user's voice "set to 27°C" through the microphone 163, the electronic device uses that voice as the verification voice for voiceprint recognition.
  • S206 The electronic device performs feature extraction and scene detection on the verification voice.
  • When performing feature extraction on the verification voice, the electronic device can, but is not limited to, adopt FBank, MFCC, D-vector and other methods.
  • Further, after performing scene detection on the verification voice, the electronic device may attach a scene tag to it. For example, if scene detection determines that the verification voice was entered in a vehicle-mounted scene, the scene tag corresponding to the vehicle-mounted scene may be attached to the verification voice.
  • Exemplarily, scene detection methods may include, but are not limited to, GMM, deep neural networks (DNN), etc. Scene tags can be selected according to the application scenario, such as home scenes, vehicle-mounted scenes, background-music scenes, noisy crowd environments, far-field scenes, near-field scenes, and so on.
  • In some implementations, the electronic device may pre-train a detection model for each scene (based on either the GMM algorithm or the DNN algorithm), input the verification voice into the detection model of each scene in turn to obtain matching scores, and determine the scene corresponding to the verification voice according to the matching score of each scene's model.
  • In other implementations, the electronic device can pre-train a classification model (which can be based on the DNN algorithm), input the verification voice into the classification model, and take the output classification result as the scene corresponding to the verification voice.
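A minimal sketch of such a classification model follows: each utterance is summarized as the mean of its frame-level features and a small MLP predicts a scene tag. The pooling choice, network size, and tag set are assumptions; the application's DNN could be structured differently.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

SCENES = ["home", "vehicle", "mall", "far_field"]  # illustrative tag set

def utterance_vector(frame_feats: np.ndarray) -> np.ndarray:
    # Summarize an utterance as the mean of its frame-level features.
    return frame_feats.mean(axis=0)

def train_scene_classifier(utt_vectors: np.ndarray,
                           scene_labels: list[str]) -> MLPClassifier:
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
    return clf.fit(utt_vectors, scene_labels)

def detect_scene(clf: MLPClassifier, frame_feats: np.ndarray) -> str:
    # Returns the predicted scene tag for one verification utterance.
    return clf.predict(utterance_vector(frame_feats)[None, :])[0]
```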
  • S207: The electronic device inputs the verification voice into the voiceprint recognition model trained in the voiceprint registration stage to obtain a matching score. If the matching score is greater than the matching threshold, it can be determined that the verification voice is from the registrant; otherwise, it is not from the registrant.
  • The matching score methods may include, but are not limited to: cosine distance scoring (CDS), linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and other algorithms.
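As a concrete instance of the CDS option, the sketch below scores an enrolled embedding against a verification embedding with cosine similarity and applies the threshold rule of step S207; the 0.7 threshold is an illustrative value, not one specified by this application.

```python
import numpy as np

def cosine_score(enrolled: np.ndarray, verify: np.ndarray) -> float:
    """Cosine-distance scoring between an enrolled voiceprint embedding
    and a verification embedding."""
    num = float(np.dot(enrolled, verify))
    den = float(np.linalg.norm(enrolled) * np.linalg.norm(verify)) + 1e-12
    return num / den

# Decision rule from step S207 (threshold value illustrative):
# is_registrant = cosine_score(enrolled_emb, verify_emb) > 0.7
```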
  • Specifically, if the voiceprint recognition model is a multi-scene fusion model, a single score can be obtained from its matching score. If the voiceprint recognition model includes models corresponding to multiple scenes, matching scores can be computed separately by the models of the multiple scenes to obtain multiple scores, which are then combined with the scene tag obtained in step S206 into a fusion score in a weighted manner.
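The weighted fusion of per-scene scores can be sketched as follows; up-weighting the detected scene and averaging the rest is an assumed weighting scheme, not the one prescribed here.

```python
import numpy as np

def fuse_scores(scene_scores: dict[str, float], detected_scene: str,
                main_weight: float = 0.6) -> float:
    """Weighted fusion of per-scene matching scores using the scene tag
    from step S206 (weighting scheme is an illustrative assumption)."""
    others = [s for k, s in scene_scores.items() if k != detected_scene]
    rest = float(np.mean(others)) if others else 0.0
    return main_weight * scene_scores[detected_scene] + (1 - main_weight) * rest
```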
  • Further, the electronic device may output the recognition result to the user when determining that the verification voice is not from the registrant. Specifically, the electronic device may output the recognition result on the display screen 132; as shown in FIG. 5, it may display the text "Not the registrant!" on the display screen 132. As another example, the electronic device may broadcast the voice "Not the registrant" through the speaker 161, and so on.
  • S208: When determining that the verification voice comes from the registrant, the electronic device may perform a quality evaluation on the verification voice in combination with its scene tag. If the quality score of the verification voice is greater than the quality threshold, the verification voice can be added to the high-quality voice sample library.
  • Exemplarily, the quality of the verification voice may be evaluated by determining the values of parameters characterizing its voice quality, where such parameters may include, but are not limited to, one or more of: signal-to-noise ratio (SNR), segmental signal-to-noise ratio (SegSNR), perceptual evaluation of speech quality (PESQ), log-likelihood ratio measure (LLR), etc.
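For illustration, global and segmental SNR can be computed as below, assuming the speech and noise components are available separately; the 20 ms frame size at 16 kHz and the example threshold are assumptions.

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Global SNR in dB given separated speech and noise components."""
    return 10 * np.log10(np.mean(speech ** 2) / (np.mean(noise ** 2) + 1e-12))

def segmental_snr_db(speech: np.ndarray, noise: np.ndarray,
                     frame: int = 320) -> float:
    """Segmental SNR: average the per-frame SNR over 20 ms frames at 16 kHz
    (frame size is an illustrative choice)."""
    snrs = []
    for i in range(0, len(speech) - frame, frame):
        s, n = speech[i:i + frame], noise[i:i + frame]
        snrs.append(10 * np.log10(np.mean(s ** 2) / (np.mean(n ** 2) + 1e-12)))
    return float(np.mean(snrs))

# Quality gate from step S208 (threshold illustrative):
# if segmental_snr_db(s, n) > 15: add_to_high_quality_library(utterance)
```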
  • Alternatively, the verification voice can be input into a model used for quality evaluation to determine whether it is high-quality voice, where the quality evaluation model can be based on the GMM algorithm or the DNN algorithm. Specifically, a quality score is obtained after the verification voice is input into the model, and whether the verification voice is high-quality is then determined according to the level of that quality score.
  • In specific implementation, the voiceprint recognition model may include models corresponding to multiple scenes, and the high-quality voice sample library may also be classified and stored by scene; that is, it may include a sample library corresponding to each of the multiple scenes, where the sample library corresponding to a scene can be used to train the model corresponding to that scene. Based on this, one possible implementation is that, if the quality score of the verification voice is greater than the quality threshold, the electronic device adds the verification voice to the sample library corresponding to the scene detected in step S206.
  • Take as an example a voiceprint recognition model that includes models corresponding to scene A, scene B, scene C, and scene D, with the high-quality voice sample library including the sample libraries corresponding to scenes A, B, C, and D. Assuming it is determined in step S206 that the verification voice comes from scene A, the electronic device can add the verification voice to the sample library corresponding to scene A when the quality score of the verification voice is greater than the quality threshold.
  • S209: The electronic device performs data enhancement processing on the voices in the high-quality voice sample library, uses the processed voice data for incremental learning, and updates the voiceprint recognition model.
  • The incremental learning algorithm can include, but is not limited to: Method 1, adding the enhanced voice data to the original registered voice in a weighted manner and using the summed voice data to train the voiceprint recognition model; Method 2, training the voiceprint recognition model obtained in the last round on the enhanced speech data alone to obtain a new voiceprint recognition model, and then combining the new model with the previously trained model by weighted addition to complete the model update.
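Method 2 can be sketched as a parameter-wise weighted addition of the previous model and the model trained on the enhanced data; representing the model as a flat parameter dictionary and the 0.8 weight are assumptions for illustration.

```python
import numpy as np

def weighted_model_update(old_params: dict[str, np.ndarray],
                          new_params: dict[str, np.ndarray],
                          alpha: float = 0.8) -> dict[str, np.ndarray]:
    """Weighted addition of the previous model and the newly trained model
    (Method 2 above); alpha controls how much of the old model is kept."""
    return {name: alpha * old_params[name] + (1 - alpha) * new_params[name]
            for name in old_params}
```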
  • In step S209, the scene tags are used when enhancing the voices in the high-quality voice library, so that richer data can be obtained during the user's use. For example, clean near-field speech can be enhanced into clean far-field speech, and low-noise speech from the home scene can be enhanced into noisy speech of the home scene.
  • Moreover, when the voiceprint recognition model includes models corresponding to multiple scenes and the high-quality speech sample library is classified and stored by scene, the model of the corresponding scene can be updated through incremental learning. By enhancing the high-quality voices from the user's daily verification data and using them for incremental learning, the voiceprint recognition model matches the actual usage scenes better and better, thereby improving the robustness of the voiceprint recognition system.
  • Scenario 1: for situations where the usage scenario changes frequently, such as portable electronic devices like mobile phones, earphones, and bracelets. Such portable devices move through different scenes with the user; for example, the user leaves home and drives to a shopping mall, in which case these devices move from the home scene to the vehicle scene and then into the mall scene. When voiceprint recognition is performed on such portable electronic devices, the following steps S601 to S614 can be used.
  • As shown in FIG. 6, the voiceprint recognition process may specifically include:
  • Step S601: The electronic device collects k registered voices of the user, where k can be an integer greater than or equal to 1. Step S602 is then executed.
  • The user can enter the registration voice multiple times when prompted by the electronic device; for the prompt method, refer to step S201 above, and details are not repeated here. The electronic device can thus collect the user's k registered voices through the microphone 163.
  • Step S602. The electronic device adds k registered voices to the high-quality voice sample library. Step S603 is executed.
  • Step S603: The electronic device performs data enhancement processing on the k registered voices to obtain sample voices; for details of the data enhancement processing, refer to the method described in step S203, which is not repeated here. One registered voice can generate multiple sample voices with different noise levels and different far-field levels. Step S604 is then executed.
  • Specifically, the high-quality speech sample library can be classified and stored by scene. The electronic device can therefore perform data enhancement processing on the k registered voices separately for each scene, generating the sample voices corresponding to that scene.
  • For example, the electronic device can perform data enhancement processing on the k registered voices for scene A: based on one registered voice, it can generate s1 pieces of sample data with different noise levels and different far-field levels, obtaining k × s1 sample voices corresponding to scene A. For scene B, it can similarly generate s2 variants per registered voice, obtaining k × s2 sample voices corresponding to scene B; and for scene C, s3 variants per registered voice, obtaining k × s3 sample voices corresponding to scene C.
  • Further, for each scene, the electronic device may store that scene's sample voices in the corresponding sample library; for example, the k × s1 sample voices of scene A are stored in sample library 1 corresponding to scene A, the k × s2 sample voices of scene B in sample library 2 corresponding to scene B, and the k × s3 sample voices of scene C in sample library 3 corresponding to scene C.
  • In some embodiments, for each scene, the electronic device may use the noise source corresponding to that scene to perform data enhancement processing on the k registered voices to obtain the sample voices corresponding to the scene, where the noise source corresponding to a scene may be noise data collected in that scene, or noise data generated by simulation for that scene, and so on.
  • For example, for scene A the electronic device may use scene A's noise source to perform data enhancement on the registration data, and for scene B it may use scene B's noise source.
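Putting the pieces together, the per-scene augmentation of step S603 might look like the sketch below, which reuses mix_at_snr() and simulate_far_field() from the earlier sketches; the scene list, per-scene SNR grids, and library layout are illustrative assumptions.

```python
import numpy as np

SCENE_SNRS = {"A": [20, 10, 5], "B": [15, 8], "C": [25, 12, 6]}  # assumed grids

def build_sample_libraries(registered: list[np.ndarray],
                           scene_noise: dict[str, np.ndarray],
                           rir: np.ndarray) -> dict[str, list[np.ndarray]]:
    """For each scene, enhance every registered voice at several noise
    levels (and one far-field condition), yielding k x s_i samples."""
    libraries: dict[str, list[np.ndarray]] = {s: [] for s in SCENE_SNRS}
    for scene, snrs in SCENE_SNRS.items():
        for voice in registered:          # k registered voices
            for snr in snrs:              # s_i variants per voice
                noisy = mix_at_snr(voice, scene_noise[scene], snr)
                libraries[scene].append(simulate_far_field(noisy, rir))
    return libraries
```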
  • Step S604: The electronic device performs feature extraction on the sample voices and trains the models in the voiceprint recognition model library based on the extracted features to obtain trained models; for the feature extraction method, refer to step S204, and repeated details are omitted. Step S605 is then executed.
  • The electronic device can establish one multi-scene fusion model, i.e., the voiceprint recognition model library includes a multi-scene fusion model, and can train it with the sample data obtained in step S603.
  • Alternatively, the electronic device can also build models for different scenes, i.e., the voiceprint recognition model library can include multiple models, such as a near-field quiet model, a near-field home model, a far-field home model, a vehicle model, etc., and the sample library of each scene can be used to train the model corresponding to that scene. For example, sample library 1 of scene A is used to train the model corresponding to scene A; exemplarily, the sample library of the near-field quiet scene trains the near-field quiet model, the sample library of the near-field home scene trains the near-field home model, the sample library of the far-field home scene trains the far-field home model, and the sample library of the vehicle scene trains the vehicle model.
  • Exemplarily, the electronic device may establish separate models for the home scene, the vehicle scene, the mall scene, and the work scene. After collecting the registered voice entered by the user, the electronic device performs voice data enhancement based on the registered voice for the home, vehicle, mall, and work scenes respectively, thereby obtaining sample voices for each of those scenes; the sample data of the home scene can then be used to train the home-scene model, the sample data of the vehicle scene to train the vehicle-scene model, the sample data of the mall scene to train the mall-scene model, and the sample data of the work scene to train the work-scene model.
  • Thus, after collecting the verification voice entered by the user, the electronic device can select the model of the corresponding scene based on the result of scene detection on the verification voice. For example, if the scene detection result for the verification voice is the home scene, the electronic device can input the verification voice into the home-scene model for matching.
  • Of course, the electronic device can also build models for different scenes and establish a multi-scene fusion model at the same time, i.e., the voiceprint recognition model library can include a multi-scene fusion model as well as models corresponding to multiple scenes.
  • Step S605 The electronic device collects the verification voice entered by the user. Step S606 is executed.
  • For details of step S605, refer to step S205, which will not be repeated here.
  • Step S606 The electronic device performs feature extraction and scene detection on the verification voice. Step S607 is executed.
  • For details of step S606, refer to step S206, which will not be repeated here.
  • Step S607: The electronic device inputs the verification voice into the model trained in step S604 for matching scoring to obtain a first score. Step S608 is then executed.
  • There are several possible scoring methods. In one, the voiceprint recognition model library includes models corresponding to multiple scenes, and the electronic device selects the model corresponding to the scene detected in step S606 (assumed to be scene A) for matching scoring; that is, the verification voice is input into the model corresponding to scene A to obtain the first score.
  • In another possible method, the voiceprint recognition model library includes a multi-scene fusion model, and the electronic device performs matching scoring with the multi-scene fusion model; that is, the verification voice is input into the multi-scene fusion model to obtain the first score.
  • In yet another possible method, the voiceprint recognition model library includes models corresponding to multiple scenes, and the electronic device inputs the verification voice into the model of each scene for matching scoring, obtains multiple scores, and fuses them into the first score. Exemplarily, the first score may be, but is not limited to, the average of the multiple scores, a weighted combination of the multiple scores, and so on. Other matching-scoring methods can also be used in specific implementation, which are not listed one by one here. It should be added that if, for some reason, it is not desired to maintain multiple models in the voiceprint recognition model library at the same time, only one multi-scene fusion model may be established and trained in step S604.
  • Step S608: The electronic device determines whether the first score is greater than the first threshold. If yes, steps S609 and S611 are performed; if not, step S610 is performed.
  • S609 The electronic device outputs a voiceprint recognition result: it is a registrant.
  • Exemplarily, the electronic device may display the text "Is the registrant" on the display screen 132; the display interface may be as shown in FIG. 7. The electronic device may also broadcast the voice "Is the registrant" through the speaker 161, and so on.
  • S610: The electronic device outputs the voiceprint recognition result: not the registrant. Exemplarily, the electronic device may display the text "Not the registrant" on the display screen 132; the display interface may be as shown in FIG. 5. The electronic device may also broadcast the voice "Not the registrant" through the speaker 161, and so on.
  • Step S611: The electronic device evaluates the quality of the verification voice and obtains a second score. Step S612 is then executed.
  • Specifically, the electronic device can evaluate the quality of the verification voice in combination with the scene detected in step S606. For example, the electronic device may score the verification voice with a quality model corresponding to the scene detected in step S606, and add it to the high-quality voice sample library if the score is higher than the quality evaluation threshold; alternatively, a single quality evaluation method may be used to determine the quality score of the verification voice, which is then compared against the threshold corresponding to the scene detected in step S606 to determine whether it is high-quality speech for that scene.
  • Step S612: The electronic device determines whether the second score is greater than the second threshold. If yes, step S613 is executed; if not, the process ends.
  • Step S613 The electronic device stores the verification voice in a high-quality voice sample library. Step S614 is executed.
  • Specifically, the electronic device may store the verification voice in the sample library corresponding to the scene detected in step S606; for example, if it is detected in step S606 that the verification voice comes from the home scene, the electronic device may store the verification voice in the sample library corresponding to the home scene.
  • Further, the electronic device can also perform data enhancement on the verification voice and store the enhanced voice in the high-quality voice sample library. For example, if the verification voice is a home-scene voice, it can be enhanced to obtain a far-field home voice, or to obtain home-scene voices at other noise levels, where the noise level of the resulting home-scene speech may be greater than that of the verification voice.
  • S614 The electronic device performs incremental learning based on the newly added voice data in the high-quality voice sample library, and updates the model in the voiceprint recognition model library.
  • For the multi-scene fusion model, the electronic device can train the model obtained in the previous round on the newly added voice data from the high-quality voice sample library to obtain a new multi-scene fusion model, and then perform a weighted addition of the new model and the previously trained one to complete the model update. Alternatively, the electronic device can add, in a weighted manner, the newly added voice data of the high-quality voice sample library to the voice data originally stored there, and train the previously obtained multi-scene fusion model on the summed voice data to complete the update.
  • Take as an example the case where the scene detected in step S606 is the vehicle scene and the verification voice is stored in the sample library of the vehicle scene in step S613. The electronic device can train the vehicle-scene model obtained in the previous round on the newly added voice data in the vehicle-scene sample library to obtain a new vehicle-scene model, and then perform a weighted addition of the new and previous vehicle-scene models to complete the update. Alternatively, the electronic device can add, in a weighted manner, the voice data newly added to the vehicle-scene sample library to the voice data originally stored there, and train the previously obtained vehicle-scene model on the summed data to complete the update.
  • The above voiceprint recognition process solves the data mismatch caused by a single registered-voice scene and variable verification-voice scenes by performing multi-scene-tag data enhancement on the original registered voice. Moreover, the models in the voiceprint recognition model library are updated as the user uses the device, so they become more and more applicable to the user's actual scenarios. The voiceprint recognition method therefore improves the robustness of the voiceprint recognition algorithm to multiple and changing scenes.
  • Scenario 2: for situations where the usage scene is usually one particular kind of scene, for example for devices such as smart speakers, smart home devices, and vehicle-mounted devices, the following steps S801 to S817 can be used to realize voiceprint recognition.
  • As shown in FIG. 8, the voiceprint recognition process may specifically include:
  • In this process, step S814 may be executed after step S813. Step S814: The electronic device judges whether the verification voice is high-quality voice in a high-frequency scene; if yes, step S815 is executed, and if not, step S817 is executed, where step S817 can refer to step S614 and is not repeated here.
  • Specifically, if most of the verification voices collected by the electronic device during voiceprint recognition come from a certain scene, that scene can be considered a high-frequency scene.
  • Exemplarily, the electronic device can determine whether the verification voice is high-quality voice in a high-frequency scene in the following manner: for the scene detected in step S806 (assume scene A), the electronic device counts, over the most recent N voiceprint recognition processes, the number of times n that the scene detection result of the verification voice is scene A; if n is greater than a third threshold (or n/N is greater than a fourth threshold), the electronic device can determine that scene A is a high-frequency scene and that the verification voice is high-quality voice in a high-frequency scene; otherwise, the verification voice is not high-quality voice in a high-frequency scene.
  • For example, the electronic device may count, over the most recent 10 voiceprint recognition processes, the number of times n that the scene detection result of the verification voice is the home scene; if n is greater than 5 (the third threshold), it can be judged that the home scene is a high-frequency scene, i.e., the verification voice is high-quality voice in a high-frequency scene; if n is less than or equal to 5, the home scene is not a high-frequency scene, i.e., the verification voice is not high-quality voice in a high-frequency scene. As another example, over the most recent 20 voiceprint recognition processes, let n be the number of times the scene detection result is the vehicle scene; if n/20 is greater than 50% (the fourth threshold), the vehicle scene can be judged to be a high-frequency scene, i.e., the verification voice is high-quality voice in a high-frequency scene; if n/20 is less than or equal to 50%, the vehicle scene is not a high-frequency scene.
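The n-out-of-N rule above can be sketched as a sliding-window counter; N = 20 and the 50% ratio mirror the example, and the class interface is an assumption for illustration.

```python
from collections import deque

class HighFreqSceneDetector:
    """Over the last N recognitions, is scene `target` detected more than a
    fraction `ratio` of the time? (Step S814's high-frequency-scene rule.)"""

    def __init__(self, n: int = 20, ratio: float = 0.5):
        self.window: deque[str] = deque(maxlen=n)
        self.ratio = ratio

    def observe(self, detected_scene: str) -> None:
        # Record the scene detection result of one recognition process.
        self.window.append(detected_scene)

    def is_high_frequency(self, target: str) -> bool:
        if not self.window:
            return False
        return sum(s == target for s in self.window) / self.window.maxlen > self.ratio
```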
  • Step S815: The electronic device performs data enhancement on i sample voices in the sample library of the first scene, where the first scene is the scene detected in step S806, and the i voices may be all of, or part of, the sample voices in the first scene's sample library. Step S816 is then executed.
  • In one possible implementation, the electronic device performs data enhancement on a sample voice to obtain j noise voices with different noise levels, where the noise levels of the j noise voices are all greater than that of the sample voice. In another, the electronic device obtains k far-field voices of different far-field levels, where the far-field levels of the k far-field voices are all greater than that of the sample voice. In yet another, the electronic device first obtains j noise voices with different noise levels and then performs far-field enhancement on each of them, obtaining j × k far-field noise voices.
  • Step S816: The electronic device performs incremental learning based on the voice data obtained in step S815 to obtain sub-models for the high-frequency scene. Specifically, the electronic device may divide the voice data obtained in step S815 into M groups by noise level, where the voice data in the same group have the same noise level or fall within the same noise-level range. Then, for each group, the electronic device trains the first-scene model obtained in the previous round on that group's voice data, obtains the sub-model corresponding to the group, and adds it to the voiceprint recognition model library.
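A sketch of this grouping and per-group training follows; fine_tune() stands in for whatever incremental-learning update is used (an assumed helper), and the 5 dB bucket width is an illustrative choice.

```python
import numpy as np
from collections import defaultdict

def group_by_noise_level(utts: list[tuple[np.ndarray, float]],
                         bucket_db: float = 5.0) -> dict[int, list[np.ndarray]]:
    """Bucket (audio, snr_db) pairs by noise level: one group per 5 dB band."""
    groups: dict[int, list[np.ndarray]] = defaultdict(list)
    for audio, snr_db in utts:
        groups[int(snr_db // bucket_db)].append(audio)
    return groups

def train_sub_models(scene_model, utts, fine_tune):
    """One sub-model per noise-level group, to be added to the model library;
    fine_tune(model, data) is an assumed incremental-learning helper."""
    return {g: fine_tune(scene_model, data)
            for g, data in group_by_noise_level(utts).items()}
```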
  • In the above voiceprint recognition process, the original registered voice is enhanced with multi-scene-tag data to solve the mismatch between the registered-voice scene and the verification-voice scene. In addition, high-quality verification voices from high-frequency scenes are added to the high-quality voice sample library, and data enhancement and incremental learning are performed to refine the models of high-frequency scenes, so that the electronic device can perform voiceprint recognition more accurately at the different noise levels or far-field levels of a high-frequency scene. For example, in a vehicle-mounted scene, sub-models corresponding to 30 km/h, 60 km/h, 90 km/h, and 120 km/h can be matched accurately, instead of one coarse vehicle-scene model; for a smart speaker in a far-field home environment, sub-models corresponding to far-field distances of 3 m, 4 m, and 5 m can be matched accurately, instead of one coarse far-field home-speaker model. Therefore, during use, the verification voice can be matched to the sub-models of the high-frequency scene according to the scene detection results, making voiceprint recognition more accurate; and as usage data grows, the models in the voiceprint recognition model library are continuously updated through incremental learning and become more and more accurate.
  • Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.


Abstract

A voiceprint recognition method and apparatus. The method includes: collecting a registered voice entered by a user (S201); generating, based on the registered voice, sample voices under far-field conditions (S203); training a voiceprint recognition model based on the sample voices (S204); collecting a verification voice entered by the user (S205); inputting the verification voice into the voiceprint recognition model for matching to obtain a matching result (S207); and determining, based on the matching result, whether the user is the registrant of the voiceprint recognition model (S208). The method relates to artificial intelligence and related fields, and is intended to solve the problem of low robustness of voiceprint recognition methods in the prior art.

Description

A voiceprint recognition method and apparatus

Cross-reference to related applications

This application claims priority to the Chinese patent application No. 201910673696.7, filed with the Chinese Patent Office on July 24, 2019 and entitled "Voiceprint recognition method and apparatus", the entire contents of which are incorporated herein by reference.

Technical field

The embodiments of this application relate to the field of computer technology, and in particular to a voiceprint recognition method and apparatus.
Background

Voiceprint recognition is a technology that automatically identifies and confirms the identity of a speaker from voice signals. The basic scheme of voiceprint recognition includes two stages: voiceprint registration and voiceprint verification. In the voiceprint registration stage, the registrant's registered voice information is converted into a verification model; in the voiceprint verification stage, the information of the verification voice is scored for similarity against the verification model generated in the registration stage to judge whether the verification voice comes from the registrant.

Far-field voiceprint recognition is more challenging than near-field voiceprint recognition. The main reason is the distortion of the voice signal under far-field conditions, reflected in superimposed environmental noise and room reverberation. When a speaker talks in a room or an enclosed space, the sound waves propagate through the air and reflect off walls and obstacles; due to absorption by the materials, the high frequencies in the sound waves are attenuated and then diffuse into the room again, causing reverberation. Therefore, under far-field conditions the registered voice and the verification voice do not match, and voiceprint recognition accuracy is low.

One solution to this problem is for the user to perform voiceprint registration separately in the near field and the far field. Specifically, so that verification voices under far-field conditions can match the registered voice, this solution has the user register under both near-field and far-field conditions. However, this requires the user to perform voiceprint registration multiple times, which degrades the user experience.

Another solution is front-end voice signal enhancement. Specifically, clean near-field voice is collected as the registered voice in the registration stage; in the verification stage, the collected far-field voice data is processed by the front end to obtain enhanced voice, which is then input as the verification voice. However, the high-frequency part of the enhanced voice is still degraded relative to clean near-field voice, so the enhanced voice still does not match the registered voice, resulting in low robustness of the voiceprint recognition system and little improvement in the recognition rate.
Summary

This application provides a voiceprint recognition method and apparatus to solve the problem of low robustness of voiceprint recognition methods in the prior art.

In a first aspect, a voiceprint registration method provided by an embodiment of this application includes: an electronic device prompts a user to enter a registered voice; the electronic device collects the registered voice entered by the user; the electronic device generates sample voices under far-field conditions based on the registered voice; and the electronic device trains a voiceprint recognition model based on the sample voices. In the embodiments of this application, the electronic device can generate far-field sample voices from the registered voice by simulation, without requiring the user to perform voiceprint registration multiple times under near-field and far-field conditions, which improves the user experience. Moreover, training the voiceprint recognition model on far-field sample voices improves the robustness of the model and hence the accuracy of voiceprint recognition.

In a possible design, when generating the far-field sample voices, the electronic device may simulate the reverberation of sound under far-field conditions and, based on that simulated reverberation, generate the sample data of the registered voice under far-field conditions.

In a possible design, the electronic device may generate a noisy voice based on the registered voice and noise data, simulate the reverberation of sound under far-field conditions, and generate the far-field sample data of the noisy voice based on the simulated reverberation. Combining a noisy voice when simulating the far-field samples makes the sample voices better match actual scenes, improving the robustness of the model and the accuracy of recognition.

In a possible design, when simulating the reverberation of sound under far-field conditions, the electronic device may simulate wall reflections of the sound based on the far-field conditions to obtain the room impulse response (RIR).

In a possible design, when training the voiceprint recognition model based on the sample voices, the electronic device may perform feature extraction on the sample voices to obtain feature data and train the model based on the feature data, which improves the robustness of the model.

In a possible design, the voiceprint recognition model may include one or more sub-models, where one sub-model corresponds to one scene; when training the model based on the feature data, the electronic device may train the one or more sub-models separately. Training a corresponding sub-model for each scene solves the data mismatch caused by a single registered-voice scene and variable verification-voice scenes.

In a possible design, the voiceprint recognition model may include one fusion model corresponding to one or more scenes; when training the model based on the feature data, the electronic device trains the fusion model. By maintaining a single fusion model, the electronic device saves computing resources.

In a second aspect, a voiceprint recognition method provided by an embodiment of this application includes: the electronic device prompts the user to enter a verification voice; the electronic device collects the verification voice entered by the user; the electronic device inputs the verification voice into the voiceprint recognition model for matching to obtain a matching result; and the electronic device determines, based on the matching result, whether the user is the registrant of the voiceprint recognition model. The voiceprint recognition model may be trained by the method of the first aspect; that is, the training process may include prompting the user to enter a registered voice, collecting it, generating far-field sample voices based on it, and training the model on the sample voices. Using a model trained as in the first aspect, the electronic device can accurately identify whether the verification voice comes from the registrant.

In a possible design, after collecting the verification voice, the electronic device may perform scene detection on it, so that voiceprint recognition can be performed in combination with the scene of the verification voice, improving accuracy.

In a possible design, the voiceprint recognition model may include one or more sub-models, one per scene; inputting the verification voice into the model for matching includes inputting it into the sub-model corresponding to a first scene, where the first scene is the result of scene detection.

In a possible design, if the user is the registrant of the voiceprint recognition model, the electronic device may perform quality evaluation on the verification voice to obtain a quality evaluation result; if the result indicates that the verification voice is high-quality voice, the electronic device may perform incremental learning on the voiceprint recognition model based on the verification voice. By adding high-quality verification voices to the high-quality voice sample library and performing data enhancement and incremental learning to update the models, the models become more and more suited to the user's actual usage scenarios.

In a possible design, when performing incremental learning based on the verification voice, the electronic device may perform data enhancement processing on the verification voice to obtain processed voice data and perform incremental learning on the model based on the processed data, so that the robustness of the models keeps increasing.

In a possible design, before performing data enhancement on the verification voice, the electronic device may determine that the first scene of the verification voice is a high-frequency scene. The data enhancement may yield j sample voices of different noise levels; for incremental learning, the electronic device may group the j sample voices by noise level into M groups of voice data, where M is an integer greater than 0 and not greater than j, and train the sub-model corresponding to the first scene on each of the M groups to obtain M high-frequency sub-models.

In a third aspect, a voiceprint registration apparatus provided by an embodiment of this application includes a first component, a microphone, and a processor, where the first component is a speaker or a display screen. The processor is configured to: trigger the first component to prompt the user to enter a registered voice; collect the registered voice through the microphone; generate far-field sample voices based on the registered voice; and train a voiceprint recognition model based on the sample voices.

In a possible design, when triggering the first component to prompt the user, the processor may trigger the speaker to play a prompt voice, or trigger the display screen to display prompt text, for prompting the user to enter the registered voice. In further possible designs, the processor may be specifically configured to perform the operations described for the first aspect: simulating the reverberation of the registered voice under far-field conditions and generating its far-field sample data; generating a noisy voice from the registered voice and noise data and generating the far-field sample data of the noisy voice from the simulated reverberation; simulating wall reflections based on the far-field conditions to obtain the RIR; performing feature extraction on the sample voices and training the model on the feature data; training one or more per-scene sub-models separately based on the feature data; or training a fusion model corresponding to one or more scenes based on the feature data.

In a fourth aspect, a voiceprint recognition apparatus provided by an embodiment of this application includes a first component, a microphone, and a processor, where the first component is a speaker or a display screen. The processor is configured to: trigger the first component to prompt the user to enter a verification voice; collect the verification voice through the microphone; input the verification voice into the voiceprint recognition model for matching to obtain a matching result; and determine, based on the matching result, whether the user is the registrant of the model, where the voiceprint recognition model is obtained through training by the voiceprint registration apparatus of the third aspect.

The voiceprint registration apparatus and the voiceprint recognition apparatus may be one apparatus, in which case, when training the voiceprint recognition model, the processor of the voiceprint recognition apparatus is further configured to: trigger the first component to prompt the user to enter a registered voice; collect the registered voice through the microphone; generate far-field sample voices based on the registered voice; and train the voiceprint recognition model based on the sample voices. Alternatively, they may be two different apparatuses, in which case the voiceprint registration apparatus includes its own first component (a speaker or a display screen), microphone, and processor configured to perform those registration operations.

In possible designs, the processor of the voiceprint recognition apparatus may further: trigger the speaker to play a prompt voice or the display screen to display prompt text for prompting the user to enter the verification voice; perform scene detection on the verification voice after collecting it through the microphone; input the verification voice into the sub-model corresponding to the first scene (the scene detection result) for matching, where the model includes one or more per-scene sub-models; if the user is the registrant, perform quality evaluation on the verification voice and, if it is high-quality voice, perform incremental learning on the model based on it; perform the incremental learning on data-enhanced voice data; and, after determining that the first scene of the verification voice is a high-frequency scene, enhance the verification voice into j sample voices of different noise levels, group them by noise level into M groups (M being an integer greater than 0 and not greater than j), and train the first-scene sub-model on each group to obtain M high-frequency sub-models.

In a fifth aspect, an embodiment of this application provides a chip including a processor and a communication interface, where the communication interface is used to receive code instructions and transmit them to the processor. The processor is configured to invoke the code instructions transmitted by the communication interface to: trigger the speaker or the display screen of the electronic device to prompt the user to enter a registered voice; trigger the microphone of the electronic device to collect the registered voice entered by the user; generate far-field sample voices based on the registered voice; and train a voiceprint recognition model based on the sample voices. In possible designs, the prompting (a prompt voice through the speaker or prompt text on the display screen), the far-field reverberation simulation, the generation of a noisy voice and of its far-field sample data, the RIR computation from simulated wall reflections, the feature extraction, and the training of per-scene sub-models or of a fusion model can be performed as described for the first aspect.

In a sixth aspect, an embodiment of this application provides a chip including a processor and a communication interface, where the communication interface is used to receive code instructions and transmit them to the processor. The processor is configured to invoke the code instructions to: trigger the speaker or the display screen of the electronic device to prompt the user to enter a verification voice; collect the verification voice through the microphone; input the verification voice into the voiceprint recognition model for matching to obtain a matching result, where the voiceprint recognition model is obtained through training by the apparatus of any one of claims 13 to 18; and determine, based on the matching result, whether the user is the registrant of the model. In possible designs, the prompting, the scene detection after collection, the matching against the sub-model of the first scene (the scene detection result), the quality evaluation of the registrant's verification voice and incremental learning on high-quality voice, the incremental learning on data-enhanced voice data, and the grouping of j enhanced sample voices by noise level into M groups (M being an integer greater than 0 and not greater than j) to train M high-frequency sub-models can be performed as described for the second aspect.

In a seventh aspect, this application further provides a computer-readable storage medium containing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.

In an eighth aspect, this application further provides a computer program product containing instructions which, when run, causes the methods described in the above aspects to be executed.
Brief description of the drawings

FIG. 1 is a schematic diagram of the hardware structure of an electronic device provided by this application;

FIG. 2 is a schematic flowchart of a voiceprint recognition method provided by this application;

FIG. 3 is a schematic diagram of a display screen prompting a user to enter a registered voice according to this application;

FIG. 4 is a schematic diagram of a user triggering an electronic device to perform voiceprint verification according to this application;

FIG. 5 is a schematic diagram of a display screen outputting a recognition result according to this application;

FIG. 6 is a schematic diagram of a voiceprint recognition process provided by this application;

FIG. 7 is a schematic diagram of a display screen outputting a recognition result according to this application;

FIG. 8 is a schematic diagram of another voiceprint recognition process provided by this application.
Detailed description of the embodiments

It should be understood that in this application, unless otherwise stated, "/" means "or"; for example, A/B can mean A or B. "And/or" herein merely describes an association relationship between associated objects, indicating that three relationships can exist; for example, "A and/or B" can mean: A alone, both A and B, or B alone. "At least one" means one or more, and "multiple" means two or more.

In this application, "exemplary", "in some embodiments", "in other embodiments", and the like are used to mean serving as an example, illustration, or explanation. Any embodiment or design described as an "example" in this application should not be construed as preferred over or more advantageous than other embodiments or designs; rather, the word "example" is intended to present a concept in a concrete manner.

In addition, the terms "first" and "second" in this application are used only for the purpose of distinguishing in description, and cannot be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or indicating or implying order.
The electronic device in the embodiments of this application is an electronic device with a voiceprint recognition function. Voiceprint recognition is a technology that automatically identifies and confirms a speaker's identity from voice signals. The electronic device in the embodiments of this application can collect a user's voice data and perform voiceprint recognition on it to judge whether the user is the registrant.

The following describes the electronic device, the graphical user interface (GUI) for such an electronic device, and embodiments for using such an electronic device. For convenience of description, the GUI is hereinafter referred to simply as the user interface.

The electronic device in the embodiments of this application may be a portable electronic device, such as a mobile phone, a tablet computer, an artificial intelligence (AI) smart voice terminal, a wearable device, or an augmented reality (AR)/virtual reality (VR) device. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running the operating systems shown in Figure PCTCN2020104545-appb-000001 or other operating systems. The portable electronic device may also be a vehicle-mounted terminal, a laptop, or the like. It should also be understood that the electronic device of the embodiments of this application may also be a desktop computer or a smart home device (such as a smart TV or a smart speaker), which is not limited here.
By way of example, FIG. 1 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application. Specifically, as shown in the figure, the electronic device includes a processor 110, an internal memory 121, an external memory interface 122, a camera 131, a display screen 132, a sensor module 140, a subscriber identification module (SIM) card interface 151, keys 152, an audio module 160, a speaker 161, a receiver 162, a microphone 163, a headset jack 164, a universal serial bus (USB) interface 170, a charging management module 180, a power management module 181, a battery 182, a mobile communication module 191, and a wireless communication module 192. In other embodiments, the electronic device may further include a motor, an indicator, keys, and the like.

It should be understood that the hardware structure shown in FIG. 1 is only an example. The electronic device of the embodiments of this application may have more or fewer components than shown, may combine two or more components, or may have a different component configuration. The components shown in the figure can be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application-specific integrated circuits.

The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices or integrated in one or more processors.

In some embodiments, a buffer may also be provided in the processor 110 for storing instructions and/or data. As an example, the buffer in the processor 110 may be a cache memory, which can store instructions and/or data that the processor 110 has just used, generated, or uses cyclically. If the processor 110 needs such an instruction or data, it can be called directly from the buffer, which helps reduce the time for the processor 110 to obtain instructions or data and improves system efficiency.

The internal memory 121 can be used to store programs and/or data. In some embodiments, the internal memory 121 includes a program storage area and a data storage area. The program storage area can store the operating system (such as Android or iOS) and the computer programs required by at least one function (such as the voiceprint recognition function or a sound playback function). The data storage area can store data created and/or collected during use of the electronic device (such as audio data). For example, the processor 110 can cause the electronic device to execute a corresponding method by calling the programs and/or data stored in the internal memory 121, thereby realizing one or more functions; for instance, the processor 110 calls certain programs and/or data in the internal memory so that the electronic device executes the voiceprint recognition method provided in the embodiments of this application, thereby realizing the voiceprint recognition function. The internal memory 121 may be a high-speed random access memory and/or a non-volatile memory; for example, the non-volatile memory may include at least one of one or more magnetic disk storage devices, flash memory devices, and/or universal flash storage (UFS).

The external memory interface 122 can be used to connect an external memory card (for example, a Micro SD card) to expand the storage capability of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 122 to realize the data storage function; for example, the electronic device can save images, music, videos, and other files in the external memory card through the external memory interface 122.

The camera 131 can be used to capture moving and still images. Usually, the camera 131 includes a lens and an image sensor: the optical image of an object generated through the lens is projected onto the image sensor and then converted into an electrical signal for subsequent processing. The image sensor may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor; it converts the optical signal into an electrical signal, which is then passed to the ISP to be converted into a digital image signal. Note that the electronic device may include 1 or N cameras 131, where N is a positive integer greater than 1.

The display screen 132 may include a display panel for displaying the user interface. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), etc. The electronic device may include 1 or M display screens 132, where M is a positive integer greater than 1. For example, the electronic device can implement the display function through the GPU, the display screen 132, the application processor, and so on.

The sensor module 140 may include one or more sensors, for example a touch sensor 140A, a gyroscope 140B, an acceleration sensor 140C, a fingerprint sensor 140D, and a pressure sensor 140E. In some embodiments, the sensor module 140 may also include an ambient light sensor, a distance sensor, a proximity sensor, a bone conduction sensor, a temperature sensor, and so on.

The touch sensor 140A may also be called a "touch panel". It can be arranged on the display screen 132, and together they form a touch screen, also called a "touch-control screen". The touch sensor 140A is used to detect touch operations acting on or near it and can pass the detected touch operation to the application processor to determine the touch event type; visual output related to the touch operation can be provided through the display screen 132. In other embodiments, the touch sensor 140A may be arranged on the surface of the electronic device at a position different from that of the display screen 132.

The gyroscope 140B can be used to determine the motion posture of the electronic device. In some embodiments, the angular velocities of the electronic device around three axes (x, y, and z) can be determined through the gyroscope 140B. The gyroscope 140B can be used for image stabilization during shooting: for example, when the shutter is pressed, it detects the shaking angle of the electronic device, calculates the distance the lens module needs to compensate, and lets the lens counteract the shaking through reverse motion. The gyroscope 140B can also be used in navigation and motion-sensing game scenarios.

The acceleration sensor 140C can detect the magnitude of the electronic device's acceleration in various directions (generally three axes), and can detect the magnitude and direction of gravity when the device is stationary. It can also be used to recognize the posture of the electronic device, for applications such as landscape/portrait switching and pedometers.

The fingerprint sensor 140D is used to collect fingerprints. The electronic device can use the collected fingerprint characteristics for fingerprint unlocking, accessing application locks, fingerprint photographing, answering incoming calls with a fingerprint, and so on.

The pressure sensor 140E is used to sense pressure signals and can convert them into electrical signals. For example, the pressure sensor 140E may be arranged on the display screen 132; touch operations acting on the same position but with different intensities may correspond to different operation instructions.

The SIM card interface 151 is used to connect a SIM card. A SIM card can be inserted into or pulled out of the SIM card interface 151 to achieve contact with or separation from the electronic device. The electronic device can support 1 or K SIM card interfaces 151, where K is a positive integer greater than 1. The SIM card interface 151 can support Nano SIM cards, Micro SIM cards, and/or SIM cards. Multiple cards can be inserted into the same SIM card interface 151 at the same time, and the types of the cards can be the same or different. The SIM card interface 151 is also compatible with different types of SIM cards and with external memory cards. The electronic device interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device may also adopt an eSIM, i.e., an embedded SIM card, which can be embedded in the electronic device and cannot be separated from it.

The keys 152 may include a power key, volume keys, etc., and may be mechanical keys or touch keys. The electronic device can receive key inputs and generate key signal inputs related to user settings and function control of the electronic device.

The electronic device can implement audio functions through the audio module 160, the speaker 161, the receiver 162, the microphone 163, the headset jack 164, the application processor, and so on, for example audio playback, recording, voiceprint registration, voiceprint verification, and voiceprint recognition functions.

The audio module 160 can be used for digital-to-analog and/or analog-to-digital conversion of audio data, and for encoding and/or decoding audio data. For example, the audio module 160 may be arranged independently of the processor, or in the processor 110, or some functional modules of the audio module 160 may be arranged in the processor 110.

The speaker 161, also called a "horn", is used to convert audio data into sound and play it. For example, the electronic device 100 can play music, answer hands-free calls, or issue voice prompts through the speaker 161.

The receiver 162, also called an "earpiece", is used to convert audio data into sound and play it; for example, when the electronic device 100 answers a call, the receiver 162 can be brought close to the ear.

The microphone 163, also called a "mouthpiece" or "mic", is used to collect sound (for example, ambient sound, including sounds made by people and by devices) and convert it into audio electrical data. When making a call or sending a voice message, the user can speak close to the microphone 163, which collects the sound the user makes. When the voiceprint recognition function of the electronic device is turned on, the microphone 163 can collect ambient sound in real time to obtain audio data. The sound collected by the microphone 163 is related to the environment. For example, when the surroundings are noisy and the user speaks the verification phrase, the collected sound includes ambient noise plus the user's verification phrase; when the surroundings are quiet, the collected sound is the user's verification phrase; under far-field conditions, the collected sound is the superposition of ambient noise and the reverberation of the user's verification phrase; and when the surroundings are noisy but the user does not speak a verification phrase, the collected sound is only ambient noise.

Note that the electronic device may be provided with at least one microphone 163. For example, with two microphones 163 the electronic device can not only collect sound but also realize a noise reduction function; with three, four, or more microphones 163, it can additionally realize sound source identification, directional recording, and similar functions.

The headset jack 164 is used to connect wired headsets. It may be the USB interface 170, or a 3.5 mm open mobile terminal platform (OMTP) standard interface, a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface, or the like.

The USB interface 170 is an interface conforming to the USB standard specification, and may specifically be a Mini USB, Micro USB, or USB Type-C interface. The USB interface 170 can be used to connect a charger to charge the electronic device, to transfer data between the electronic device and peripheral devices, or to connect headsets and play audio through them. Besides serving as the headset jack 164, the USB interface 170 can also be used to connect other electronic devices, such as AR devices and computers.

The charging management module 180 is used to receive charging input from a charger, which may be a wireless or a wired charger. In some wired charging embodiments, the charging management module 180 can receive the charging input of a wired charger through the USB interface 170; in some wireless charging embodiments, it can receive wireless charging input through the wireless charging coil of the electronic device. While charging the battery 182, the charging management module 180 can also supply power to the electronic device through the power management module 181.

The power management module 181 is used to connect the battery 182 and the charging management module 180 to the processor 110. It receives input from the battery 182 and/or the charging management module 180 and supplies power to the processor 110, the internal memory 121, the display screen 132, the camera 131, and so on. It can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In some other embodiments, the power management module 181 may also be arranged in the processor 110; in still other embodiments, the power management module 181 and the charging management module 180 may be arranged in the same device.

The mobile communication module 191 can provide wireless communication solutions applied to the electronic device, including 2G/3G/4G/5G, and may include filters, switches, power amplifiers, low noise amplifiers (LNA), and the like.

The wireless communication module 192 can provide wireless communication solutions applied to the electronic device, including WLAN (such as Wi-Fi networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR). The wireless communication module 192 may be one or more devices integrating at least one communication processing module.

In some embodiments, antenna 1 of the electronic device is coupled with the mobile communication module 191 and antenna 2 with the wireless communication module 192, so that the electronic device can communicate with other devices; specifically, the mobile communication module 191 can communicate with other devices through antenna 1, and the wireless communication module 192 through antenna 2.
The voiceprint recognition method provided by the embodiments of this application is introduced in detail below with reference to the drawings and application scenarios; the following embodiments can all be implemented in the electronic device 100 having the above hardware structure. To aid understanding of the method, the terms involved in the embodiments of this application are explained first.

Near-field condition: the distance between the sound source and the microphone (mic) is short, for example within 1 meter.

Near-field voice: voice data collected under near-field conditions; for example, when the distance between the sound source and the mic is less than 1 meter, the voice data the mic collects for that source is near-field voice. Near-field voice can include near-field clean voice (noise-free voice data collected under near-field conditions) and near-field noisy voice (noisy voice data collected under near-field conditions).

Far-field condition: the distance between the sound source and the microphone is long, for example between 1 meter and 10 meters.

Far-field voice: voice data collected under far-field conditions; for example, when the distance between the sound source and the mic is about 5 meters, the voice data the mic collects for that source is far-field voice. Far-field voice can include far-field clean voice (noise-free voice data collected under far-field conditions) and far-field noisy voice (noisy voice data collected under far-field conditions).

Voiceprint recognition model: a data model that the electronic device can establish based on methods such as the Gaussian mixture model (GMM)-universal background model (UBM), support vector machine (SVM), joint factor analysis (JFA), identity vector (I-vector), or X-vector. After establishing the initial voiceprint recognition model, the electronic device trains it with sample data; the trained voiceprint recognition model can then be used for voiceprint recognition.

Multi-scene fusion model: the initial voiceprint recognition model is trained with sample data from multiple scenes; once trained, the voiceprint recognition model can be regarded as a multi-scene fusion model.

Single-scene model: training the initial voiceprint recognition model with sample data from one scene yields a single-scene model; in the embodiments of this application, the model corresponding to each scene is a single-scene model. Specifically, the initial voiceprint recognition model is trained with the sample data of one scene, and the trained model can be regarded as the model corresponding to that scene; for example, sample data of the home scene is used to train the initial voiceprint recognition model to obtain the model corresponding to the home scene (the home model), and sample data of the vehicle scene is used to obtain the model corresponding to the vehicle scene (the vehicle model). Thus, by training the initial voiceprint recognition model separately with the sample data of different scenes, the single-scene model corresponding to each scene can be obtained.
增量学习:每当新增样本数据时,并不需要重建声纹识别模型,而是在原有声纹识别模型的基础上,对由于新增样本数据所引起的变化进行更新,也就是在之前训练好的声纹识别模型的基础上采用新增的样本数据进一步训练,从而不断更新声纹识别模型。
特征提取:对数据进行变换,以突出该数据具有代表性特征的一种方法,本申请实施例中可以指对语音数据进行变换提取语音数据中属于特征性的信息的方法及过程。
场景检测:通过提取语音数据的背景数据,从而判断该语音数据所处的场景。
参见图2,示例性的示出了本申请实施例提供的一种声纹识别方法的流程,该方法由电子设备执行。声纹识别的基本方案包括声纹注册和声纹验证两个阶段。其中,声纹注册可以通过步骤S201~S204实现。声纹验证可以通过步骤S205~S209实现。
S201,电子设备采集用户录入的注册语音。其中,用户录入的注册语音可以是近场干净语音。
具体来说,电子设备可以通过麦克风163采集周围环境声音,获取用户录入的注册语音。
具体实施中,用户可以在电子设备的提示下说出注册语音,例如,如图3所示,电子设备可以在显示屏132上显示文字提示用户说出注册语“1234567”。又例如,电子设备也可以通过扬声器161进行语音提示,等等。其中,可以是用户首次启动电子设备的声纹识别功能时电子设备自动提示用户说出注册语音,或者,也可以是用户首次启动电子设备的声纹识别功能时由用户操作电子设备提示用户说出注册语音,或者,也可以是用户在后续启动声纹识别功能时用户根据需求触发电子设备提示用户说出注册语音。
作为一种可能的实施方式,用户在进行声纹注册时可以多次输入注册语音,从而可以提高声纹识别的准确性。
S202,电子设备在采集到注册语音后,可以将注册语音存储到高质量语音样本库,其中,高质量语音样本库用于存储语音质量得分大于或等于质量阈值的语音。
S203,电子设备将高质量语音样本库中包括的注册语音进行数据增强处理,得到多个样本语音。该样本语音可以但不限于为:由注册语音生成的带噪语音,由注册语音生成的远场语音,由注册语音生成的远场带噪语音等。
上述方式中,电子设备可以基于注册语音生成带噪语音、远场语音等,而不需要用户分别在近场、远场等场景分别进行注册,从而可以提升用户体验。
电子设备基于注册语音生成带噪语音时,可以通过如下方式实现:在仿真房间中加入注册语音和噪声源,将注册语音和噪声源进行处理得到带噪语音,其中,噪声源可以有一个或多个。具体的,电子设备基于注册语音可以生成不同噪声等级的带噪语音。例如,不同场景可能对应不同的噪声等级,因此电子设备可以针对每个场景将注册语音仿真生成该场景对应的带噪语音。
电子设备基于注册语音生成远场语音时,可以通过如下方式实现:使用镜像源模型(image source model,ISM)算法,ISM算法以虚拟声音源仿真声音的墙壁反射,根据信号延迟与衰减参数计算房间冲击响应(room impulse response,RIR),其中,语音到墙壁后会有一个有损耗的反射,做RIR是为了仿真模拟远场条件下对声音的混响。根据RIR仿真生成注册语音对应的远场语音。具体的,电子设备基于注册语音可以生成不同远场等级的远场语音。例如,不同场景可能对应不同的远场距离,因此电子设备可以针对每个场景将注册语音仿真生成该场景对应的远场语音。
此外,电子设备还可以采用其他方式仿真远场条件下的混响,例如,对声音进行脉冲响应卷积来模拟远场条件下对声音的混响,等。
电子设备基于注册语音生成远场带噪语音时,可以通过如下方式实现:在仿真房间中加入注册语音和噪声源,将注册语音和噪声源进行处理得到带噪语音,其中,噪声源可以有一个或多个;使用ISM算法计算RIR,根据RIR仿真生成带噪语音对应的远场带噪语音。具体的,电子设备基于注册语音可以生成不同远场等级、不同噪声等级的远场带噪语音。例如,电子设备可以针对具体场景的噪声特点、远场特点等将注册语音仿真生成该场景对应的远场带噪语音。
上述过程中,噪声等级可以理解为噪声强度等级,远场等级可以理解为远场距离等级。
S204,电子设备对样本语音进行特征提取,并基于提取的特征训练声纹识别模型库中的模型,得到训练好的模型。
示例性的,声纹识别模型库中的模型可以但不限于采用GMM-UBM、SVM、JFA、I-vector、X-vector等方法建立得到的。
在具体实施中,声纹识别模型库可以包括多场景融合模型,因此,电子设备可以使用多个场景的样本语音对多场景融合模型进行训练。或者,声纹识别模型库中也可以包括多个场景分别对应的模型,因此,针对每个场景对应的模型,电子设备可以使用该场景对应的样本语音进行训练。或者,声纹识别模型也可以包括多场景融合模型以及多个场景分别对应的模型,因此,电子设备可以使用多个场景的样本语音对多场景融合模型进行训练,并针对每个场景对应的模型,电子设备可以使用该场景对应的样本语音进行训练。
若声纹识别模型为一个多场景融合模型,则验证语音输入该多场景融合模型可以得到一个唯一的匹配得分,在学习到高质量语音样本库的数据后,该多场景融合模型将与实际使用场景越来越匹配。若声纹识别模型为多个场景分别对应的模型,则可以通过对录入的验证语音进行场景检测,将验证语音输入对应场景的模型进行声纹识别。进一步的,若该验证语音已通过质量评估进入高质量语音样本库,则将其进行数据增强,增量学习更新该场景对应的模型,可以使对应场景的模型与实际场景越来越匹配。
电子设备对样本语音进行特征提取时可以但不限于采用滤波器组(filter bank,FBank)、 梅尔频率倒谱系数(mel-frequency cepstral coefficients,MFCC)、D-vector等方法。
S205,电子设备采集用户录入的验证语音。
具体实施中,用户可以在电子设备的提示下说出验证语音。其中,电子设备提示用户说出验证语的方法与电子设备提示用户说出注册语的方法类似,重复之处不再一一赘述。
其中,电子设备可以是在用户的操作触发下采集用户录入的验证语音,例如,用户通过操作电子设备触发验证指令,从而电子设备在收到验证指令后采集提示用户录入验证语音,并采集用户录入的验证语音。例如,用户可以通过点击电子设备的触摸屏上声纹识别功能对应图标的相应位置触发验证指令,从而电子设备提示用户说出验证语音;又例如,用户可以通过操作物理实体(如物理键、鼠标、摇杆等)进行触发;又例如,用户可以通过特定手势(如双击电子设备的触摸屏等等)进行触发验证指令,从而电子设备提示用户说出验证语音。又例如,用户可以向电子设备(如智能手机、车载装置等等)说出关键词“声纹识别”,电子设备通过麦克风163采集到用户发出的关键词“声纹识别”后触发验证指令,并提示用户说出验证语音。
或者,用户也可以在向电子设备说出用于控制电子设备的控制命令时,电子设备可以采集该控制命令,并将该控制命令作为验证语音进行声纹识别。即,电子设备在接收到控制命令时触发验证指令,并将该控制指令作为验证语音进行声纹识别。例如,如图4所示,用户可以向电子设备(如智能手机、车载装置等等)发出控制命令“打开音乐”,电子设备通过麦克风163采集到用户发出的语音“打开音乐”后,将该语音作为验证语音进行声纹识别。又例如,用户可以向电子设备(如智能空调)发出控制命令“调到27℃”,电子设备通过麦克风163采集到用户发出的语音“调到27℃”后,将该语音作为验证语音进行声纹识别。
S206,电子设备对验证语音进行特征提取以及场景检测。
电子设备对验证语音进行特征提取时,可以但不限于采用FBank、MFCC、D-vector等方法。
进一步的,电子设备在对验证语音进行场景检测之后可以给该验证语音加上场景标签。例如,电子设备对验证语音进行场景检测之后确定该验证语音在车载场景中录入的,则可以给该验证语音加上车载场景对应的场景标签。
示例性的,场景检测的方法可以但不限于包括GMM、深度神经网络(deep neural network,DNN)等。场景标签可以根据应用场景来选取,比如居家场景、车载场景、背景音乐场景、嘈杂人声环境、以及远场场景、近场场景等。
一些实施方式中,电子设备可以预先针对每个场景训练一个检测模型(该检测模型可以是基于GMM算法的,也可以是基于DNN算法的),从而电子设备可以将验证语音依次输入各个场景对应的检测模型匹配得分,根据各个场景对应模型的匹配得分确定该验证语音对应的场景。
另一些实施方式中,电子设备也可以预先训练一个分类模型(该分类模型可以是基于DNN算法的),从而电子设备可以将验证语音输入该分类模型,该分类模型可以输出分类结果,该分类结果即为验证语音对应的场景。
S207,电子设备将验证语音输入声纹注册阶段训练好的声纹识别模型进行匹配得分。如果匹配得分大于匹配阈值则可以判断该验证语音来自于注册人,否则不是来自注册人。
其中,匹配得分的方法可以但不限于包括:余弦距离(cosine distance,CDS)、线性 判别分析(linear discriminant analysis,LDA)、概率线性判别分析(prob-ailistic linear discriminant analysis,PLDA)等算法。
具体来说,若声纹识别模型为一个多场景融合模型,则通过该多场景融合模型匹配得分可以得到一个得分,若声纹识别模型包括多个场景分别对应的模型,则可以通过多个场景对应的模型分别进行匹配得分,得到多个得分,然后结合步骤S206中得到的场景标签采用加权的方式得到一个融合得分。
Further, when the electronic device determines that the verification speech does not come from the registered person, it may output the recognition result to the user. Specifically, the device may output the result on display screen 132; as shown in FIG. 5, it may display the text "Not the registered person!" on display screen 132. As another example, the device may announce "Not the registered person" through speaker 161, and so on.
S208: When the electronic device determines that the verification speech comes from the registered person, it may evaluate the quality of the verification speech in combination with the scene label of that speech. If the quality score of the verification speech exceeds the quality threshold, the verification speech may be added to the high-quality speech sample library.
For example, the quality of the verification speech may be evaluated by computing the values of parameters that characterize speech quality, which may include, but are not limited to, one or more of: signal-to-noise ratio (SNR), segmental signal-to-noise ratio (SegSNR), perceptual evaluation of speech quality (PESQ), and the log likelihood ratio measure (LLR).
Alternatively, the verification speech may be fed into a quality-evaluation model, which may be based on either a GMM or a DNN algorithm, to determine whether it is high-quality speech. Specifically, feeding the verification speech into the model yields a quality score, and whether the verification speech is high-quality speech is determined from that score.
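One of the simplest of these checks, an SNR estimate, can be sketched as follows; the frame length, the quietest-10%-of-frames noise-floor heuristic, and the threshold value are assumptions for illustration.

```python
import numpy as np

def estimate_snr_db(y: np.ndarray, frame_len: int = 400) -> float:
    """Crude SNR estimate: treat the quietest 10% of frames as the noise floor."""
    n = len(y) // frame_len * frame_len
    energy = np.mean(y[:n].reshape(-1, frame_len) ** 2, axis=1)
    noise_floor = np.mean(np.sort(energy)[: max(len(energy) // 10, 1)]) + 1e-12
    return 10.0 * np.log10(np.mean(energy) / noise_floor)

QUALITY_THRESHOLD_DB = 15.0  # illustrative quality threshold
# keep = estimate_snr_db(verify_wav) > QUALITY_THRESHOLD_DB
```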
In a specific implementation, the voiceprint recognition model may include per-scenario models, and the high-quality speech sample library may likewise be stored by scenario; that is, the library may include one sample sub-library per scenario, where a scenario's sub-library is used to train that scenario's model. On this basis, one possible implementation is: if the quality score of the verification speech exceeds the quality threshold, the electronic device adds the speech to the sub-library of the scenario detected in step S206.
To illustrate, suppose the voiceprint recognition model includes models for scenarios A, B, C, and D, and the high-quality speech sample library includes corresponding sample sub-libraries for scenarios A, B, C, and D. If step S206 determines the verification speech comes from scenario A, the electronic device adds the verification speech to scenario A's sub-library when its quality score exceeds the quality threshold.
S209: The electronic device applies data augmentation to the speech in the high-quality speech sample library and uses the processed speech data for incremental learning to update the voiceprint recognition model.
Incremental learning algorithms may include, but are not limited to: method 1, adding the augmented speech data to the original registration speech with weights and training the voiceprint recognition model on the combined data; method 2, training the previously trained voiceprint recognition model on the augmented speech data alone to obtain a new model, then performing a weighted addition of the new model and the previously trained model to complete the model update.
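Method 2's weighted addition of the new and previous models can be sketched as parameter interpolation, here for a PyTorch module; the blend factor alpha is an assumed hyperparameter, and integer buffers are simply taken from the newer model.

```python
import copy
import torch

def weighted_model_update(old_model: torch.nn.Module,
                          new_model: torch.nn.Module,
                          alpha: float = 0.8) -> torch.nn.Module:
    """Blend newly trained weights into the previous model (method 2)."""
    merged = copy.deepcopy(old_model)
    old_sd, new_sd = old_model.state_dict(), new_model.state_dict()
    merged_sd = {}
    for k in old_sd:
        if old_sd[k].is_floating_point():
            merged_sd[k] = alpha * old_sd[k] + (1.0 - alpha) * new_sd[k]
        else:
            merged_sd[k] = new_sd[k]  # integer buffers (e.g. counters): take newer
    merged.load_state_dict(merged_sd)
    return merged
```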
In step S209, the speech in the high-quality speech library is data-augmented in combination with its scene labels, which yields richer data as the user keeps using the device. For instance, clean near-field speech can be augmented into clean far-field speech, and low-noise home-scenario speech can be augmented into medium-noise home-scenario speech. Moreover, when the voiceprint recognition model includes per-scenario models and the high-quality sample library is stored by scenario, incremental learning can update the corresponding scenario's model. By augmenting the high-quality speech drawn from the user's everyday verification data and using it for incremental learning to update the voiceprint recognition model, the model can be kept increasingly matched to the actual usage scenarios, improving the robustness of the voiceprint recognition system.
For a better understanding of the embodiments of this application, the voiceprint recognition process is described in detail below with reference to specific application scenarios.
Scenario 1: the usage scenario changes frequently, as with portable electronic devices such as mobile phones, earphones, and wristbands. Because such portable devices move with the user, they pass through different scenarios; for example, the user leaves home and drives to a shopping mall, in which case the device moves from the home scenario into the in-vehicle scenario and then into the mall scenario. When voiceprint recognition is performed on such portable electronic devices, it can be implemented with the following steps S601 to S614.
As shown in FIG. 6, the voiceprint recognition process may specifically include:
S601: The electronic device collects k registration utterances from the user, where k may be an integer greater than or equal to 1. Proceed to step S602.
The user may record registration speech multiple times when prompted by the electronic device; for the prompting method, see step S201 above, which is not repeated here. The electronic device can thus collect the user's k registration utterances through microphone 163.
S602: The electronic device adds the k registration utterances to the high-quality speech sample library. Proceed to step S603.
S603: The electronic device applies data augmentation to the k registration utterances to obtain sample speech. For the augmentation method, see step S203, which is not repeated here. One registration utterance can generate multiple sample utterances at different noise levels and different far-field levels. Proceed to step S604.
Specifically, the high-quality speech sample library can be stored by scenario. The electronic device can therefore augment the k registration utterances separately for each scenario, generating the sample speech corresponding to that scenario.
For example, for scenario A the electronic device may generate, from each registration utterance, s1 sample utterances of different noise levels and far-field levels, yielding k×s1 sample utterances for scenario A; for scenario B, s2 sample utterances per registration utterance, yielding k×s2 sample utterances for scenario B; and for scenario C, s3 sample utterances per registration utterance, yielding k×s3 sample utterances for scenario C.
Further, for each scenario, the electronic device may store that scenario's sample speech in the corresponding sub-library: the k×s1 utterances of scenario A in sample library 1, the k×s2 utterances of scenario B in sample library 2, and the k×s3 utterances of scenario C in sample library 3.
In some embodiments, for each scenario, the electronic device may augment the k registration utterances with that scenario's noise sources to obtain the scenario's sample speech, where a scenario's noise source may be noise data collected in that scenario, or noise data simulated for that scenario, and so on. For example, for scenario A, the electronic device may augment the registration data with scenario A's noise sources, and for scenario B, with scenario B's noise sources, as sketched below.
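To make the per-scenario pass concrete, the sketch below reuses mix_at_snr from the earlier noise-mixing sketch; the scenario profiles, noise arrays, and SNR lists are all assumed inputs for illustration.

```python
def augment_per_scenario(registration_utterances, scene_profiles):
    """Build one sample sub-library per scenario from k registration utterances.

    scene_profiles maps a scene name to its noise array and target SNRs, e.g.
    {"home": {"noise": home_noise, "snrs_db": (20, 10)}, ...} (assumed inputs).
    """
    libraries = {scene: [] for scene in scene_profiles}
    for utt in registration_utterances:          # the k enrolled utterances
        for scene, prof in scene_profiles.items():
            for snr in prof["snrs_db"]:          # s noise levels per scenario
                libraries[scene].append(mix_at_snr(utt, prof["noise"], snr))
    return libraries                             # k x s samples per scenario
```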
S604: The electronic device extracts features from the sample speech and trains the models in the voiceprint recognition model library on the extracted features, obtaining trained models. For the feature-extraction method, see step S204; overlapping details are not repeated. Proceed to step S605.
The electronic device may build a single multi-scenario fused model, i.e., the voiceprint recognition model library contains one fused model, which the device trains on the sample data obtained in step S603.
Alternatively, the electronic device may build separate models for the different scenarios, i.e., the voiceprint recognition model may comprise multiple models, such as a near-field quiet model, a near-field home model, a far-field home model, and an in-vehicle model, each trained on its scenario's sample library. For example, scenario A's model is trained on scenario A's sample library 1; illustratively, the near-field quiet model is trained on the near-field quiet sample library, the near-field home model on the near-field home sample library, the far-field home model on the far-field home sample library, the in-vehicle model on the in-vehicle sample library, and so on.
To illustrate, the electronic device may build models for the home, in-vehicle, mall, and work scenarios. After collecting the user's registration speech, it augments the speech separately for the home, in-vehicle, mall, and work scenarios to obtain sample speech for each, then trains the home model on the home-scenario samples, the in-vehicle model on the in-vehicle samples, the mall model on the mall samples, and the work model on the work samples. Subsequently, after collecting the user's verification speech, the device can select the model of the corresponding scenario based on the scene-detection result for that speech; for example, if scene detection on the verification speech returns the home scenario, the device feeds the verification speech into the home-scenario model for matching.
Of course, the electronic device may also build per-scenario models together with a multi-scenario fused model, i.e., the voiceprint recognition model may include one multi-scenario fused model plus one model per scenario.
S605: The electronic device collects the verification speech recorded by the user. Proceed to step S606.
For details of step S605, see step S205, which is not repeated here.
S606: The electronic device performs feature extraction and scene detection on the verification speech. Proceed to step S607.
For details of step S606, see step S206, which is not repeated here.
S607: The electronic device feeds the verification speech into the model trained in step S604 for matching and obtains a first score. Proceed to step S608.
There are several possible scoring methods. One possible method is: the voiceprint recognition model comprises per-scenario models, and the electronic device selects the model of the scenario detected in step S606 (say, scenario A) for matching, i.e., the verification speech is fed into scenario A's model for matching, yielding the first score.
Another possible method is: the voiceprint recognition model includes a multi-scenario fused model, and the electronic device selects the fused model for matching, i.e., the verification speech is fed into the multi-scenario fused model, yielding the first score.
Yet another possible method is: the voiceprint recognition model comprises per-scenario models, and the electronic device feeds the verification speech into each scenario's model, obtains multiple scores, and fuses them to obtain the first score. For example, the first score may be, but is not limited to, the average of the multiple scores, a weighted combination of the multiple scores, and so on.
Other scoring methods may also be used in specific implementations; they are not enumerated here.
It should be added that if, for some reason, it is undesirable to maintain multiple models in the voiceprint recognition model library at the same time in an actual implementation, step S604 may build and train only one multi-scenario fused model.
S608: The electronic device judges whether the first score exceeds a first threshold. If so, perform steps S609 and S611; if not, perform step S610.
S609: The electronic device outputs the voiceprint recognition result: registered person.
Specifically, the electronic device may present the text "Registered person" on display screen 132; an example interface is shown in FIG. 7.
Alternatively, the electronic device may announce "Registered person" through speaker 161.
S610: The electronic device outputs the voiceprint recognition result: not the registered person.
Specifically, the electronic device may present the text "Not the registered person" on display screen 132; an example interface is shown in FIG. 5.
Alternatively, the electronic device may announce "Not the registered person" through speaker 161.
S611: The electronic device performs quality evaluation on the verification speech and obtains a second score. Proceed to step S612.
Specifically, the electronic device may evaluate the quality of the verification speech in combination with the scenario detected in step S606.
In some embodiments, the electronic device may score the verification speech with the model corresponding to the scenario detected in step S606; if the score exceeds the quality-evaluation threshold, the speech is added to the high-quality speech sample library.
Alternatively, a quality-evaluation method may be used to compute a quality score for the verification speech, and that score is compared against the threshold of the scenario detected in step S606 to decide whether the speech is high-quality speech for that scenario.
S612: The electronic device judges whether the second score exceeds a second threshold. If so, perform step S613; if not, end.
S613: The electronic device stores the verification speech in the high-quality speech sample library. Proceed to step S614.
Specifically, if the high-quality speech sample library is stored by scenario, i.e., it includes one sub-library per scenario, the electronic device may store the verification speech in the sub-library of the scenario detected in step S606; for example, if step S606 detects that the verification speech comes from the home scenario, the device stores it in the home-scenario sub-library.
The electronic device may also augment the verification speech and store the augmented verification speech in the high-quality speech sample library. For example, if the verification speech is home-scenario speech, it may be augmented into far-field home speech, or into home-scenario speech at other noise levels, where the noise level of the augmented home-scenario speech may be higher than that of the verification speech.
S614: The electronic device performs incremental learning based on the speech data newly added to the high-quality speech sample library and updates the models in the voiceprint recognition model library.
Specifically, if the voiceprint recognition model library contains one multi-scenario fused model, the electronic device may train the previously obtained fused model on the newly added speech data to obtain a new fused model, and complete the model update by a weighted addition of the new fused model and the previously trained one. Alternatively, the device may combine the newly added speech data, with weights, with the speech data already stored in the high-quality speech sample library, and train the previously obtained fused model on the combined data to complete the update.
If the voiceprint recognition model library contains per-scenario models, take as an example the case where step S606 detects the in-vehicle scenario and step S613 stores the verification speech in the in-vehicle sample library: the electronic device may train the previously obtained in-vehicle model on the data newly added to the in-vehicle sample library to obtain a new in-vehicle model, and complete the update by a weighted addition of the new in-vehicle model and the previously trained one. Alternatively, it may combine the newly added data, with weights, with the data already stored in the in-vehicle sample library and train the previously obtained in-vehicle model on the combined data to complete the update.
By applying multi-scenario-labeled data augmentation to the original registration speech, the above voiceprint recognition process resolves the data mismatch caused by a single registration scenario versus variable verification scenarios. Furthermore, by adding high-quality verification speech to the high-quality speech sample library and performing data augmentation and incremental learning to update the models in the voiceprint recognition model library, the models become increasingly suited to the user's actual usage scenarios as the user keeps using the device. The above voiceprint recognition method therefore improves the robustness of the voiceprint recognition algorithm across multiple and changing scenarios.
Scenario 2: the usage scenario is usually a single, fixed scenario, as with smart speakers, smart home appliances, and in-vehicle devices. On such devices, voiceprint recognition can be implemented with the following steps S801 to S817.
As shown in FIG. 8, the voiceprint recognition process may specifically include:
S801 to S813: see steps S601 to S613, which are not repeated here.
Step S814 may be performed after step S813.
S814: The electronic device judges whether the verification speech is high-quality speech in a high-frequency scenario. If so, perform step S815; if not, perform step S817, for which see step S614; details are not repeated here.
If most of the verification speech the electronic device collects during voiceprint recognition comes from a certain scenario, that scenario can be considered a high-frequency scenario.
In a specific implementation, the electronic device may judge whether the verification speech is high-quality speech in a high-frequency scenario as follows: for the scenario detected in step S806 (say, scenario A), the device counts the number n of times, among the most recent N voiceprint recognition runs, that scene detection on the verification speech returned scenario A. If n exceeds a third threshold (or n/N exceeds a fourth threshold), the device determines that scenario A is a high-frequency scenario, and the verification speech is therefore high-quality speech in a high-frequency scenario; otherwise, the verification speech is not high-quality speech in a high-frequency scenario.
For example, suppose scene detection on the verification speech in step S806 returns the home scenario. The electronic device can count the number n of the most recent 10 voiceprint recognition runs whose scene-detection result was the home scenario. If n is greater than 5 (the third threshold), the home scenario is judged to be a high-frequency scenario, i.e., the verification speech is high-quality speech in a high-frequency scenario; if n is less than or equal to 5, the home scenario is not a high-frequency scenario, i.e., the verification speech is not high-quality speech in a high-frequency scenario.
As another example, suppose scene detection in step S806 returns the in-vehicle scenario. The electronic device can count the number n of the most recent 20 voiceprint recognition runs whose scene-detection result was the in-vehicle scenario. If n/20 exceeds 50% (the fourth threshold), the in-vehicle scenario is judged to be a high-frequency scenario, i.e., the verification speech is high-quality speech in a high-frequency scenario; if n/20 is less than or equal to 50%, the in-vehicle scenario is not a high-frequency scenario, i.e., the verification speech is not high-quality speech in a high-frequency scenario.
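This n-out-of-N rule can be sketched with a sliding window of recent scene-detection results; the window size and ratio below are illustrative values, not values from this application.

```python
from collections import Counter, deque

class HighFreqSceneTracker:
    """Track the last N scene-detection results and test the n/N threshold."""
    def __init__(self, window: int = 20, ratio: float = 0.5):
        self.history = deque(maxlen=window)  # keeps only the most recent N
        self.ratio = ratio

    def observe(self, scene: str) -> None:
        self.history.append(scene)

    def is_high_frequency(self, scene: str) -> bool:
        if not self.history:
            return False
        n = Counter(self.history)[scene]
        return n / len(self.history) > self.ratio

# tracker.observe("in_vehicle") after every recognition run;
# tracker.is_high_frequency("in_vehicle") implements the n/N > threshold test.
```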
S815: The electronic device applies data augmentation to i sample utterances in the sample library of the first scenario, where the first scenario is the scenario detected in step S806. The i utterances may be all of the sample utterances in the first scenario's sample library, or a subset of them. Proceed to step S816.
Specifically, for each of the i sample utterances, the electronic device may augment it to obtain j noisy utterances at j different noise levels, all higher than that of the sample utterance; or augment it to obtain k far-field utterances at k different far-field levels, all greater than that of the sample utterance; or augment it to obtain j noisy utterances at different noise levels and then augment each of those noisy utterances further, obtaining j×k far-field noisy utterances.
S816: The electronic device performs incremental learning based on the speech data obtained in step S815 and obtains sub-models for the high-frequency scenario.
Specifically, the electronic device may divide the speech data obtained in step S815 into M groups by noise level, where the speech data within a group share the same noise level or fall within the same noise-level range. Then, for each group, the device trains the previously obtained first-scenario model on that group's speech data to obtain the group's sub-model, which is added to the voiceprint recognition model library. More precisely, for each group, the device may train, on that group's speech data, the sub-model corresponding to that group under the first scenario as obtained in the previous round of training.
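The grouping-by-noise-level step can be sketched as follows; the bin width, and the train() call standing in for one incremental training pass, are assumptions made for illustration.

```python
from collections import defaultdict

def group_by_noise_level(augmented, bin_db: float = 5.0):
    """Bucket (waveform, snr_db) pairs into noise-level ranges bin_db wide."""
    groups = defaultdict(list)
    for wav, snr_db in augmented:
        groups[int(snr_db // bin_db)].append(wav)
    return groups  # M groups; each trains one high-frequency sub-model

# for level, wavs in group_by_noise_level(samples).items():
#     submodels[level] = train(prev_scene_model, wavs)  # hypothetical trainer
```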
In the above voiceprint recognition process, applying multi-scenario-labeled data augmentation to the original registration speech resolves the mismatch between the registration scenario and the verification scenarios. In addition, by adding the high-frequency-scenario judgment, high-quality verification speech from high-frequency scenarios is added to the high-quality speech sample library and used for data augmentation and incremental learning, refining the models for those scenarios so that the electronic device can recognize voiceprints more accurately at the different noise levels or far-field levels of a high-frequency scenario. For instance, in the in-vehicle scenario, the device can precisely match sub-models corresponding to speeds of 30 km/h, 60 km/h, 90 km/h, 120 km/h, and so on, rather than one coarse in-vehicle model. Likewise, in a far-field home environment, it can precisely match sub-models corresponding to far-field distances of 3 m, 4 m, 5 m, and so on, rather than one coarse far-field home speaker model. During use, the scene-detection result can thus be matched to a high-frequency-scenario sub-model, making voiceprint recognition more accurate; and as usage data accumulates, the models in the voiceprint recognition model library keep being updated through incremental learning and become more and more accurate.
Those skilled in the art will appreciate that embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data-processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing device create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data-processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data-processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various modifications and variations to this application without departing from its spirit and scope. If such modifications and variations of this application fall within the scope of the claims of this application and their equivalents, this application is intended to encompass them as well.

Claims (28)

  1. A voiceprint registration method, characterized by comprising:
    an electronic device prompting a user to record registration speech;
    the electronic device collecting the registration speech recorded by the user;
    the electronic device generating, based on the registration speech, sample speech under far-field conditions;
    the electronic device training a voiceprint recognition model based on the sample speech.
  2. The method according to claim 1, characterized in that the electronic device generating, based on the registration speech, sample speech under far-field conditions comprises:
    the electronic device simulating the reverberation of sound under far-field conditions;
    the electronic device generating, based on the simulated reverberation under the far-field conditions, sample data of the registration speech under the far-field conditions.
  3. The method according to claim 1, characterized in that the electronic device generating, based on the registration speech, sample speech under far-field conditions comprises:
    the electronic device generating noisy speech based on the registration speech and noise data;
    the electronic device simulating the reverberation of sound under far-field conditions;
    the electronic device generating, based on the simulated reverberation under the far-field conditions, sample data of the noisy speech under the far-field conditions.
  4. The method according to claim 2 or 3, characterized in that the electronic device simulating the reverberation of sound under far-field conditions comprises:
    the electronic device simulating wall reflections of sound based on the far-field conditions to obtain a room impulse response (RIR).
  5. The method according to any one of claims 1 to 4, characterized in that the electronic device training a voiceprint recognition model based on the sample speech comprises:
    the electronic device extracting features from the sample speech to obtain feature data;
    the electronic device training the voiceprint recognition model based on the feature data.
  6. The method according to claim 5, characterized in that the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scenario;
    the electronic device training the voiceprint recognition model based on the feature data comprises:
    the electronic device separately training the one or more sub-models based on the feature data.
  7. The method according to claim 5, characterized in that the voiceprint recognition model comprises one fused model, wherein the fused model corresponds to one or more scenarios;
    the electronic device training the voiceprint recognition model based on the feature data comprises:
    the electronic device training the fused model based on the feature data.
  8. A voiceprint recognition method, characterized in that the method comprises:
    an electronic device prompting a user to record verification speech;
    the electronic device collecting the verification speech recorded by the user;
    the electronic device inputting the verification speech into a voiceprint recognition model for matching to obtain a matching result, wherein the voiceprint recognition model is trained by the method according to any one of claims 1 to 7;
    the electronic device determining, based on the matching result, whether the user is the registered person of the voiceprint recognition model.
  9. The method according to claim 8, characterized in that, after the electronic device collects the verification speech recorded by the user, the method further comprises:
    the electronic device performing scene detection on the verification speech.
  10. The method according to claim 9, characterized in that the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scenario;
    the electronic device inputting the verification speech into the voiceprint recognition model for matching comprises:
    the electronic device inputting the verification speech into the sub-model corresponding to a first scenario for matching, wherein the first scenario is the result of the scene detection.
  11. The method according to any one of claims 8 to 10, characterized in that the method further comprises:
    if the user is the registered person of the voiceprint recognition model, the electronic device performing quality evaluation on the verification speech to obtain a quality evaluation result;
    if the quality evaluation result indicates that the verification speech is high-quality speech, the electronic device performing incremental learning on the voiceprint recognition model based on the verification speech.
  12. The method according to claim 11, characterized in that the electronic device performing incremental learning on the voiceprint recognition model based on the verification speech comprises:
    the electronic device performing data augmentation on the verification speech to obtain processed speech data;
    the electronic device performing incremental learning on the voiceprint recognition model based on the processed speech data.
  13. The method according to claim 12, characterized in that, before the electronic device performs data augmentation on the verification speech, the method further comprises:
    the electronic device determining that a first scenario in which the verification speech is located is a high-frequency scenario;
    the electronic device performing data augmentation on the verification speech comprises:
    the electronic device performing data augmentation on the verification speech to obtain j sample utterances at different noise levels;
    the electronic device performing incremental learning on the voiceprint recognition model based on the processed speech data comprises:
    the electronic device grouping the j sample utterances by noise level to obtain M groups of speech data, wherein M is an integer greater than 0 and not greater than j;
    the electronic device separately training the sub-model corresponding to the first scenario based on the M groups of speech data to obtain M high-frequency sub-models.
  14. A voiceprint registration apparatus, characterized by comprising:
    a first component, a microphone, and a processor, wherein the first component is a speaker or a display screen;
    the processor being configured to:
    trigger the first component to prompt a user to record registration speech;
    collect, through the microphone, the registration speech recorded by the user;
    generate, based on the registration speech, sample speech under far-field conditions;
    train a voiceprint recognition model based on the sample speech.
  15. The apparatus according to claim 14, characterized in that, when generating sample speech under far-field conditions based on the registration speech, the processor is specifically configured to:
    simulate the reverberation of the registration speech under far-field conditions;
    generate, based on the simulated reverberation under the far-field conditions, sample data of the registration speech under the far-field conditions.
  16. The apparatus according to claim 14, characterized in that, when generating sample speech under far-field conditions based on the registration speech, the processor is specifically configured to:
    generate noisy speech based on the registration speech and noise data;
    simulate the reverberation of sound under far-field conditions;
    generate, based on the simulated reverberation under the far-field conditions, sample data of the noisy speech under the far-field conditions.
  17. The apparatus according to claim 15 or 16, characterized in that, when simulating the reverberation of sound under far-field conditions, the processor is specifically configured to:
    simulate wall reflections of sound based on the far-field conditions to obtain a room impulse response (RIR).
  18. The apparatus according to any one of claims 14 to 17, characterized in that, when training a voiceprint recognition model based on the sample speech, the processor is specifically configured to:
    extract features from the sample speech to obtain feature data;
    train the voiceprint recognition model based on the feature data.
  19. The apparatus according to claim 18, characterized in that the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scenario;
    when training the voiceprint recognition model based on the feature data, the processor is specifically configured to:
    separately train the one or more sub-models based on the feature data.
  20. The apparatus according to claim 18, characterized in that the voiceprint recognition model comprises one fused model, wherein the fused model corresponds to one or more scenarios;
    when training the voiceprint recognition model based on the feature data, the processor is specifically configured to:
    train the fused model based on the feature data.
  21. A voiceprint recognition apparatus, characterized in that the apparatus comprises:
    a first component, a microphone, and a processor, wherein the first component is a speaker or a display screen;
    the processor being configured to:
    trigger the first component to prompt a user to record verification speech;
    collect, through the microphone, the verification speech recorded by the user;
    input the verification speech into a voiceprint recognition model for matching to obtain a matching result, wherein the voiceprint recognition model is trained by the apparatus according to any one of claims 14 to 18;
    determine, based on the matching result, whether the user is the registered person of the voiceprint recognition model.
  22. The apparatus according to claim 21, characterized in that the processor is further configured to:
    after collecting the verification speech recorded by the user through the microphone, perform scene detection on the verification speech.
  23. The apparatus according to claim 22, characterized in that the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scenario;
    when inputting the verification speech into the voiceprint recognition model for matching, the processor is specifically configured to:
    input the verification speech into the sub-model corresponding to a first scenario for matching, wherein the first scenario is the result of the scene detection.
  24. The apparatus according to any one of claims 21 to 23, characterized in that the processor is further configured to:
    if the user is the registered person of the voiceprint recognition model, perform quality evaluation on the verification speech to obtain a quality evaluation result;
    if the quality evaluation result indicates that the verification speech is high-quality speech, perform incremental learning on the voiceprint recognition model based on the verification speech.
  25. The apparatus according to claim 24, characterized in that, when performing incremental learning on the voiceprint recognition model based on the verification speech, the processor is specifically configured to:
    perform data augmentation on the verification speech to obtain processed speech data;
    perform incremental learning on the voiceprint recognition model based on the processed speech data.
  26. The apparatus according to claim 25, characterized in that the processor is further configured to:
    before performing data augmentation on the verification speech, determine that a first scenario in which the verification speech is located is a high-frequency scenario;
    when performing data augmentation on the verification speech, the processor is specifically configured to:
    perform data augmentation on the verification speech to obtain j sample utterances at different noise levels;
    when performing incremental learning on the voiceprint recognition model based on the processed speech data, the processor is specifically configured to:
    group the j sample utterances by noise level to obtain M groups of speech data, wherein M is an integer greater than 0 and not greater than j;
    separately train the sub-model corresponding to the first scenario based on the M groups of speech data to obtain M high-frequency sub-models.
  27. A chip, characterized in that the chip is coupled to a memory in an electronic device and performs the method according to any one of claims 1 to 13.
  28. A computer storage medium, characterized in that the computer storage medium stores computer instructions which, when executed by one or more processors, implement the method according to any one of claims 1 to 13.
PCT/CN2020/104545 2019-07-24 2020-07-24 Voiceprint recognition method and apparatus WO2021013255A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910673696.7 2019-07-24
CN201910673696.7A CN112289325A (zh) 2019-07-24 2019-07-24 Voiceprint recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2021013255A1 true WO2021013255A1 (zh) 2021-01-28

Family

ID=74193322

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104545 WO2021013255A1 (zh) Voiceprint recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN112289325A (zh)
WO (1) WO2021013255A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241081B (zh) * 2021-04-25 2023-06-16 华南理工大学 Far-field speaker authentication method and system based on a gradient reversal layer
CN117012205B (zh) * 2022-04-29 2024-08-16 荣耀终端有限公司 Voiceprint recognition method, graphical interface, and electronic device
CN115065912B (zh) * 2022-06-22 2023-04-25 广东帝比电子科技有限公司 Feedback suppression device for screening loudspeaker energy based on voiceprint-screening technology
CN116612766B (zh) * 2023-07-14 2023-11-17 北京中电慧声科技有限公司 Conference system with voiceprint registration function and voiceprint registration method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007279349A (ja) 2006-04-06 2007-10-25 Toshiba Corp Feature quantity correction apparatus, feature quantity correction method, and feature quantity correction program
CN104952450A (zh) 2015-05-15 2015-09-30 百度在线网络技术(北京)有限公司 Far-field recognition processing method and apparatus
CN107481731A (zh) 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 Speech data augmentation method and system
CN107680586A (zh) 2017-08-01 2018-02-09 百度在线网络技术(北京)有限公司 Far-field speech acoustic model training method and system
CN108269567A (zh) 2018-01-23 2018-07-10 北京百度网讯科技有限公司 Method, apparatus, computing device, and computer-readable storage medium for generating far-field speech data
CN109841218A (zh) 2019-01-31 2019-06-04 北京声智科技有限公司 Voiceprint registration method and apparatus for far-field environments

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10096321B2 (en) * 2016-08-22 2018-10-09 Intel Corporation Reverberation compensation for far-field speaker recognition
WO2019129511A1 (en) * 2017-12-26 2019-07-04 Robert Bosch Gmbh Speaker identification with ultra-short speech segments for far and near field voice assistance applications
CN108305633B (zh) * 2018-01-16 2019-03-29 平安科技(深圳)有限公司 Speech verification method and apparatus, computer device, and computer-readable storage medium
CN109509473B (zh) * 2019-01-28 2022-10-04 维沃移动通信有限公司 Voice control method and terminal device

Also Published As

Publication number Publication date
CN112289325A (zh) 2021-01-29

Similar Documents

Publication Publication Date Title
CN111699528B (zh) Electronic device and method for executing functions of the electronic device
WO2021013255A1 (zh) Voiceprint recognition method and apparatus
US20240038218A1 (en) Speech model personalization via ambient context harvesting
CN107799126B (zh) Voice endpoint detection method and apparatus based on supervised machine learning
CN111179961B (zh) Audio signal processing method and apparatus, electronic device, and storage medium
US20220415010A1 (en) Map construction method, relocalization method, and electronic device
CN111696570B (zh) Speech signal processing method, apparatus, device, and storage medium
WO2021135628A1 (zh) Speech signal processing method and speech separation method
US10353495B2 (en) Personalized operation of a mobile device using sensor signatures
CN108922531B (zh) Slot recognition method and apparatus, electronic device, and storage medium
CN114141230A (zh) Electronic device, and speech recognition method and medium therefor
WO2022199500A1 (zh) Model training method, scene recognition method, and related device
CN111863020A (zh) Speech signal processing method, apparatus, device, and storage medium
US11830501B2 (en) Electronic device and operation method for performing speech recognition
CN111341307A (zh) Speech recognition method and apparatus, electronic device, and storage medium
US20230197084A1 (en) Apparatus and method for classifying speakers by using acoustic sensor
CN114429768B (zh) Training method, apparatus, and device for a speaker diarization model, and storage medium
CN115394285A (zh) Voice cloning method, apparatus, device, and storage medium
CN109102812B (zh) Voiceprint recognition method and system, and electronic device
CN114220177A (zh) Lip syllable recognition method, apparatus, device, and medium
CN109102810B (zh) Voiceprint recognition method and apparatus
KR20230084154A (ko) User voice activity detection using a dynamic classifier
WO2022233239A1 (zh) Upgrade method and apparatus, and electronic device
CN111176430A (zh) Interaction method for an intelligent terminal, intelligent terminal, and storage medium
US20240212681A1 (en) Voice recognition device having barge-in function and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20842941

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20842941

Country of ref document: EP

Kind code of ref document: A1