CN112289325A - Voiceprint recognition method and device - Google Patents

Voiceprint recognition method and device

Info

Publication number
CN112289325A
Authority
CN
China
Prior art keywords
voice
voiceprint recognition
scene
electronic device
verification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910673696.7A
Other languages
Chinese (zh)
Inventor
曾夕娟
周小鹏
芦宇
胡伟湘
蔡丹蔚
李明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Duke Kunshan University
Huawei Technologies Co Ltd
Original Assignee
Duke Kunshan University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Duke Kunshan University and Huawei Technologies Co Ltd
Priority to CN201910673696.7A
Priority to PCT/CN2020/104545
Publication of CN112289325A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/20 Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32 Cryptographic mechanisms or cryptographic arrangements including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3226 Verification using a predetermined code, e.g. password, passphrase or PIN
    • H04L9/3231 Biological data, e.g. fingerprint, voice or retina

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voiceprint recognition method and apparatus are provided to address the low robustness of prior-art voiceprint recognition methods. The method relates to fields such as artificial intelligence and specifically includes: the electronic device prompts a user to enter registration voice; the electronic device collects the registration voice entered by the user; the electronic device generates sample voice under far-field conditions based on the registration voice; and the electronic device trains a voiceprint recognition model based on the sample voice.

Description

Voiceprint recognition method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voiceprint recognition method and device.
Background
Voiceprint recognition is a technique for automatically identifying and confirming a speaker's identity from a voice signal. A basic voiceprint recognition scheme includes two stages: voiceprint registration and voiceprint verification. In the voiceprint registration stage, the registrant's registration voice information is converted into a verification model. In the voiceprint verification stage, the verification voice is scored for similarity against the verification model generated during registration to determine whether the verification voice comes from the registrant.
Far-field voiceprint recognition is more challenging than near-field voiceprint recognition, mainly because the speech signal is distorted under far-field conditions by the superposition of ambient noise and room reverberation. When a speaker talks in a room or other enclosed space, sound waves propagate through the air and are reflected by walls and obstacles; high frequencies are attenuated by material absorption, and the reflected sound diffuses back into the room, producing reverberation. As a result, under far-field conditions the registration voice and the verification voice do not match, and voiceprint recognition accuracy is low.
One solution is to perform voiceprint registration separately in the near field and the far field. Specifically, so that verification speech under far-field conditions can match the registration speech, the user registers once under near-field conditions and again under far-field conditions. However, this requires the user to perform voiceprint registration multiple times, which degrades the user experience.
Another solution is front-end speech signal enhancement. Specifically, near-field clean voice is collected as the registration voice in the voiceprint registration stage; in the voiceprint verification stage, the collected far-field voice data is processed by a front end to obtain enhanced voice, which is then used as the verification voice. However, the high-frequency part of the enhanced voice remains degraded relative to near-field clean voice, so the enhanced voice still does not match the registration voice; the voiceprint recognition system therefore remains low in robustness, and the recognition rate improves little.
Disclosure of Invention
The present application provides a voiceprint recognition method and apparatus to address the low robustness of prior-art voiceprint recognition methods.
In a first aspect, a voiceprint registration method provided in an embodiment of the present application includes: the electronic device prompts a user to enter registration voice; the electronic device collects the registration voice entered by the user; the electronic device generates sample voice under far-field conditions based on the registration voice; and the electronic device trains a voiceprint recognition model based on the sample voice. In this embodiment, the electronic device can simulate far-field sample voice from the registration voice alone, so the user does not need to register repeatedly under near-field and far-field conditions, which improves the user experience. In addition, because the voiceprint recognition model is trained on far-field sample voice, both the robustness of the model and the accuracy of voiceprint recognition can be improved.
In one possible design, when the electronic device generates sample voice under far-field conditions based on the registration voice, it may simulate the reverberation of sound under far-field conditions and then generate sample data of the registration voice under far-field conditions based on that reverberation simulation. In this design, simulating far-field reverberation makes it possible to simulate the sample voice that the registration voice would produce under far-field conditions.
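As an illustration of this design, the Python sketch below generates one far-field sample by convolving the registration voice with a room impulse response (RIR). The exponentially decaying noise RIR is only a stand-in for a properly simulated one; the patent does not prescribe this construction, and all function names and parameter values are illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthetic_rir(fs: int, rt60: float = 0.5, length_s: float = 0.8) -> np.ndarray:
    """Crude RIR stand-in: white noise with an exponential decay set by RT60."""
    t = np.arange(int(fs * length_s)) / fs
    decay = np.exp(-6.908 * t / rt60)  # amplitude falls 60 dB over rt60 seconds
    rir = np.random.randn(t.size) * decay
    return rir / np.max(np.abs(rir))

def reverberate(clean: np.ndarray, fs: int, rt60: float = 0.5) -> np.ndarray:
    """Simulate one far-field sample voice from near-field registration voice."""
    rev = fftconvolve(clean, synthetic_rir(fs, rt60))[: clean.size]
    return rev / (np.max(np.abs(rev)) + 1e-9)  # renormalize to avoid clipping
```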
In one possible design, when the electronic device generates sample voice under far-field conditions based on the registration voice, it may generate noisy voice based on the registration voice and noise data, simulate the reverberation of sound under far-field conditions, and generate sample data of the noisy voice under far-field conditions based on the reverberation simulation. Incorporating noise when simulating the far-field sample voice makes the samples better match real scenes, which improves the robustness of the voiceprint recognition model and the accuracy of voiceprint recognition.
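A minimal sketch of the noise-mixing step, assuming the noise data is a raw waveform: the noise is scaled to a target signal-to-noise ratio (SNR) and added to the registration voice before the reverberation simulation is applied. The SNR values shown are illustrative, not taken from the patent.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at the requested SNR to produce a noisy voice."""
    noise = np.resize(noise, speech.shape)  # loop or trim noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Illustrative usage: noisy variants at several SNRs, then reverberated:
# samples = [reverberate(add_noise(reg, noise, snr), fs) for snr in (0, 5, 10, 20)]
```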
In one possible design, when the electronic device simulates the reverberation of sound under far-field conditions, it may obtain a room impulse response (RIR) by simulating the wall reflections of sound under far-field conditions. Obtaining the RIR from simulated wall reflections allows the reverberation of sound under far-field conditions to be reproduced.
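One concrete way to simulate wall reflections is the image-source method. The sketch below uses the third-party pyroomacoustics library; the library choice, room geometry, absorption coefficient, and positions are all assumptions for illustration, not details from the patent.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room = pra.ShoeBox(
    [5.0, 4.0, 3.0],              # room dimensions in meters
    fs=fs,
    materials=pra.Material(0.3),  # wall energy-absorption coefficient
    max_order=10,                 # reflection order for the image-source method
)
room.add_source([1.0, 1.0, 1.5])      # speaker position
room.add_microphone([4.0, 3.0, 1.2])  # mic position, several meters away (far field)
room.compute_rir()
rir = np.asarray(room.rir[0][0])      # RIR from source 0 to microphone 0
# Convolving the registration voice with `rir` yields a far-field sample voice.
```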
In one possible design, when the electronic device trains the voiceprint recognition model based on the sample voice, it can perform feature extraction on the sample voice to obtain feature data and then train the voiceprint recognition model based on the feature data. Extracting feature data from the sample voice improves the robustness of the voiceprint recognition model.
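The patent does not name a feature type; MFCCs are one common choice. A sketch of the feature-extraction step using librosa (an assumed dependency):

```python
import numpy as np
import librosa

def extract_features(wave: np.ndarray, fs: int, n_mfcc: int = 24) -> np.ndarray:
    """Return a (frames, n_mfcc) matrix of MFCC features for one sample voice."""
    mfcc = librosa.feature.mfcc(y=wave, sr=fs, n_mfcc=n_mfcc)
    # Per-utterance cepstral mean normalization, a common robustness step.
    return (mfcc - mfcc.mean(axis=1, keepdims=True)).T
```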
In one possible design, the voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene. When the electronic device trains the voiceprint recognition model based on the feature data, the one or more sub-models can each be trained on the feature data, as sketched below. Training a dedicated sub-model for each scene mitigates the data mismatch that arises when registration happens in a single scene but verification happens in varied scenes.
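A sketch of how per-scene sub-models might be organized. Here `train_submodel` stands in for whichever backend is used (GMM-UBM, i-vector, x-vector, and so on, as listed later in this application), and the scene names are illustrative.

```python
from typing import Any, Callable, Dict, List

def train_scene_submodels(
    features_by_scene: Dict[str, List[Any]],
    train_submodel: Callable[[List[Any]], Any],
) -> Dict[str, Any]:
    """Train one sub-model per scene; the result is a voiceprint model library."""
    return {scene: train_submodel(feats) for scene, feats in features_by_scene.items()}

# Illustrative usage with two scenes:
# models = train_scene_submodels({"home": home_feats, "in_car": car_feats}, train_submodel)
```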
In one possible design, the voiceprint recognition model may include a fusion model, where the fusion model corresponds to one or more scenes. When the electronic device trains the voiceprint recognition model based on the feature data, it trains the fusion model on the feature data. Maintaining a single fusion model saves the electronic device's computing resources.
In a second aspect, a voiceprint recognition method provided in an embodiment of the present application includes: the electronic device prompts a user to enter verification voice; the electronic device collects the verification voice entered by the user; the electronic device inputs the verification voice into a voiceprint recognition model for matching to obtain a matching result; and the electronic device determines, based on the matching result, whether the user is the registrant of the voiceprint recognition model. The voiceprint recognition model may be obtained by training with the method of the first aspect; specifically, training the model may include: the electronic device prompts a user to enter registration voice; the electronic device collects the registration voice entered by the user; the electronic device generates sample voice under far-field conditions based on the registration voice; and the electronic device trains the voiceprint recognition model based on the sample voice. By using a voiceprint recognition model trained as in the first aspect, the electronic device can accurately determine whether the verification voice comes from the registrant.
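A minimal sketch of the matching and decision step, assuming the model produces fixed-length speaker embeddings and that cosine similarity with a fixed threshold is the scoring rule; both are assumptions, since the patent fixes neither.

```python
import numpy as np

def cosine_score(enrolled: np.ndarray, verify: np.ndarray) -> float:
    """Similarity between the enrolled voiceprint and the verification embedding."""
    denom = np.linalg.norm(enrolled) * np.linalg.norm(verify) + 1e-9
    return float(np.dot(enrolled, verify) / denom)

def is_registrant(enrolled: np.ndarray, verify: np.ndarray,
                  threshold: float = 0.6) -> bool:
    """Accept the user as the registrant if the score clears the threshold."""
    return cosine_score(enrolled, verify) >= threshold
```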
In one possible design, the electronic device may perform scene detection on the verification voice after collecting it. Detecting the scene of the verification voice lets the electronic device perform voiceprint recognition with that scene taken into account, which improves the accuracy of voiceprint recognition.
In one possible design, the voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene. In this case, inputting the verification voice into the voiceprint recognition model for matching includes: the electronic device inputs the verification voice into the sub-model corresponding to a first scene, where the first scene is the result of the scene detection. Performing voiceprint recognition with the scene of the verification voice taken into account improves the accuracy of voiceprint recognition.
In one possible design, if the user is the registrant of the voiceprint recognition model, the electronic device may perform quality evaluation on the verification voice to obtain a quality evaluation result. If the result indicates that the verification voice is high-quality voice, the electronic device may perform incremental learning on the voiceprint recognition model based on the verification voice. By adding high-quality verification voice to a high-quality voice sample library and carrying out data enhancement and incremental learning to update the models in the voiceprint recognition model library, the models become increasingly suited to the user's actual usage scenes over time.
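The patent does not define what counts as high-quality voice; the sketch below uses a frame-energy SNR estimate as one plausible gate (an assumption, not the patent's criterion). It assumes the waveform is long enough to contain several frames.

```python
import numpy as np

def estimate_snr_db(wave: np.ndarray, fs: int, frame_ms: int = 25) -> float:
    """Rough SNR estimate: quietest frames taken as noise, loudest as speech."""
    hop = int(fs * frame_ms / 1000)
    frames = wave[: wave.size // hop * hop].reshape(-1, hop)
    energy = np.sort(np.mean(frames ** 2, axis=1) + 1e-12)
    k = max(1, energy.size // 10)  # use the top/bottom 10% of frames
    return 10 * np.log10(np.mean(energy[-k:]) / np.mean(energy[:k]))

def is_high_quality(wave: np.ndarray, fs: int, snr_floor_db: float = 15.0) -> bool:
    return estimate_snr_db(wave, fs) >= snr_floor_db
```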
In one possible design, when the electronic device performs incremental learning on the voiceprint recognition model based on the verification voice, it can perform data enhancement processing on the verification voice to obtain processed voice data and then perform incremental learning on the model based on the processed voice data. By continuing to learn from verification voice and updating the models in the voiceprint recognition model library, the models grow more robust and better matched to the user's actual usage scenes as the user keeps using the device.
In one possible design, before performing data enhancement processing on the verification voice, the electronic device may determine that the first scene of the verification voice is a high-frequency scene. The data enhancement processing then yields j sample voices with different noise levels. When performing incremental learning based on the processed voice data, the electronic device can group the j sample voices by noise level to obtain M groups of voice data, where M is an integer greater than 0 and not greater than j, and train the sub-model corresponding to the first scene on each of the M groups to obtain M high-frequency sub-models, as sketched below.
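A sketch of the grouping step, reusing the `add_noise` mixer sketched earlier; the SNR grid and `train_submodel` are illustrative assumptions. The j augmented copies of the verification voice are ordered by noise level and split into M groups, and one high-frequency sub-model is trained per group.

```python
def build_noise_groups(verify_wave, noise_wave, snrs_db, m_groups: int):
    """Split j noisy variants of the verification voice into M noise-level groups."""
    # j samples, ordered from noisiest (lowest SNR) to cleanest (highest SNR)
    samples = [add_noise(verify_wave, noise_wave, snr) for snr in sorted(snrs_db)]
    j = len(samples)
    bounds = [round(i * j / m_groups) for i in range(m_groups + 1)]  # requires M <= j
    return [samples[bounds[i]: bounds[i + 1]] for i in range(m_groups)]

# Illustrative usage: j = 6 variants split into M = 3 groups, one sub-model each:
# groups = build_noise_groups(wave, noise, snrs_db=[0, 5, 10, 15, 20, 25], m_groups=3)
# high_freq_models = [train_submodel(g) for g in groups]
```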
In a third aspect, a voiceprint registration apparatus provided in an embodiment of the present application includes a first device, a microphone, and a processor, where the first device is a speaker or a display screen. The processor is configured to: trigger the first device to prompt a user to enter registration voice; collect, through the microphone, the registration voice entered by the user; generate sample voice under far-field conditions based on the registration voice; and train a voiceprint recognition model based on the sample voice.
In one possible design, when the processor triggers the first device to prompt the user to enter the registration voice, the processor may trigger the speaker to play a prompt voice, where the prompt voice is used to prompt the user to enter the registration voice. Or the processor can trigger the display screen to display prompt words, wherein the prompt words are used for prompting the user to enter the registration voice.
In one possible design, the processor, when generating the sample speech in far-field conditions based on the enrollment speech, may be specifically configured to: simulating reverberation of registered voice to sound under far-field conditions; generating sample data of the registered voice under the far-field condition based on the reverberation simulation of the sound under the far-field condition.
In one possible design, the processor, when generating the sample speech in far-field conditions based on the enrollment speech, may be specifically configured to: generating a noise voice based on the registration voice and the noise data; simulating reverberation of sound under far-field conditions; generating sample data of a noisy speech under far-field conditions based on a reverberation simulation of the sound under far-field conditions.
In one possible design, the processor, when simulating reverberation of sound under far-field conditions, may be specifically configured to: obtain a room impulse response (RIR) by simulating the wall reflections of sound under far-field conditions.
In one possible design, the processor, when training the voiceprint recognition model based on the sample speech, may be specifically configured to: carrying out feature extraction on the sample voice to obtain feature data; and training the voiceprint recognition model based on the characteristic data.
In one possible design, the voiceprint recognition model may include one or more sub-models, where one sub-model corresponds to one scene. The processor, when training the voiceprint recognition model based on the feature data, may be specifically configured to: one or more sub-models are trained, respectively, based on the feature data.
In one possible design, the voiceprint recognition model may include a fusion model, where the fusion model corresponds to one or more scenes. The processor, when training the voiceprint recognition model based on the feature data, may be specifically configured to: the fusion model is trained based on the feature data.
In a fourth aspect, a voiceprint recognition apparatus provided in an embodiment of the present application includes a first device, a microphone, and a processor, where the first device is a speaker or a display screen. The processor is configured to: trigger the first device to prompt a user to enter verification voice; collect, through the microphone, the verification voice entered by the user; input the verification voice into a voiceprint recognition model for matching to obtain a matching result; and determine, based on the matching result, whether the user is the registrant of the voiceprint recognition model. The voiceprint recognition model is obtained by training with the voiceprint registration apparatus of the third aspect.
The voiceprint registration apparatus and the voiceprint recognition apparatus may be the same apparatus. In this case, when training the voiceprint recognition model, the processor of the voiceprint recognition apparatus may be further configured to: trigger the first device to prompt the user to enter registration voice; collect, through the microphone, the registration voice entered by the user; generate sample voice under far-field conditions based on the registration voice; and train the voiceprint recognition model based on the sample voice.
Alternatively, the voiceprint registration apparatus and the voiceprint recognition apparatus may be two different apparatuses. In this case, the voiceprint registration apparatus may include a first device, a microphone, and a processor, where the first device is a speaker or a display screen, and the processor is configured to: trigger the first device to prompt the user to enter registration voice; collect, through the microphone, the registration voice entered by the user; generate sample voice under far-field conditions based on the registration voice; and train the voiceprint recognition model based on the sample voice.
In one possible design, when the processor triggers the first device to prompt the user to enter the verification voice, the processor may trigger the speaker to play a prompt voice, where the prompt voice prompts the user to enter the verification voice; or the processor may trigger the display screen to display prompt text, where the prompt text prompts the user to enter the verification voice.
In one possible design, the processor may be further configured to: after the verification voice entered by the user is collected through the microphone, scene detection is performed on the verification voice.
In one possible design, the voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene. The processor, when inputting the verification voice into the voiceprint recognition model for matching, may be specifically configured to: input the verification voice into the sub-model corresponding to a first scene for matching, where the first scene is the scene detection result.
In one possible design, the processor may be further configured to: if the user is a registrant of the voiceprint recognition model, performing quality evaluation on the verification voice to obtain a quality evaluation result; and if the quality evaluation result shows that the verification voice is high-quality voice, performing incremental learning on the voiceprint recognition model based on the verification voice.
In one possible design, the processor, when incrementally learning the voiceprint recognition model based on the verification speech, may be specifically configured to: carrying out data enhancement processing on the verification voice to obtain processed voice data; incremental learning of the voiceprint recognition model is performed based on the processed speech data.
In one possible design, the processor may be further configured to: before data enhancement processing is carried out on the verification voice, a first scene where the verification voice is located is determined to be a high-frequency scene. The processor, when performing data enhancement processing on the verification speech, may specifically be configured to: and carrying out data enhancement processing on the verification voice to obtain j sample voices with different noise levels. The processor, when performing incremental learning on the voiceprint recognition model based on the processed speech data, may be specifically configured to: grouping the j sample voices according to the noise level to obtain M groups of voice data, wherein M is an integer which is greater than 0 and not greater than j; and respectively training the submodels corresponding to the first scene based on the M groups of voice data to obtain M high-frequency submodels.
In a fifth aspect, a chip provided in an embodiment of the present application includes a processor and a communication interface, where the communication interface is configured to receive a code instruction and transmit the code instruction to the processor. A processor for invoking code instructions transmitted by the communication interface to perform: triggering a loudspeaker or a display screen of the electronic equipment to prompt a user to enter a registration voice; triggering a microphone of the electronic equipment to collect the registered voice input by the user; generating a sample voice under a far-field condition based on the registered voice; the voiceprint recognition model is trained based on sample speech.
In one possible design, when the processor triggers the speaker to prompt the user to enter the registration voice, the speaker may be triggered to play the prompt voice, where the prompt voice is used to prompt the user to enter the registration voice.
In one possible design, when the processor triggers the display screen to prompt the user to enter the registration voice, the processor may trigger the display screen to display prompt words, where the prompt words are used to prompt the user to enter the registration voice.
In one possible design, the processor, when generating the sample speech in far-field conditions based on the enrollment speech, may be specifically configured to: simulating reverberation of registered voice to sound under far-field conditions; generating sample data of the registered voice under the far-field condition based on the reverberation simulation of the sound under the far-field condition.
In one possible design, the processor, when generating the sample speech in far-field conditions based on the enrollment speech, may be specifically configured to: generating a noise voice based on the registration voice and the noise data; simulating reverberation of sound under far-field conditions; generating sample data of a noisy speech under far-field conditions based on a reverberation simulation of the sound under far-field conditions.
In one possible design, the processor, when simulating reverberation of sound under far-field conditions, may be specifically configured to: obtain a room impulse response (RIR) by simulating the wall reflections of sound under far-field conditions.
In one possible design, the processor, when training the voiceprint recognition model based on the sample speech, may be specifically configured to: carrying out feature extraction on the sample voice to obtain feature data; and training the voiceprint recognition model based on the characteristic data.
In one possible design, the voiceprint recognition model may include one or more sub-models, where one sub-model corresponds to one scene. The processor, when training the voiceprint recognition model based on the feature data, may be specifically configured to: one or more sub-models are trained, respectively, based on the feature data.
In one possible design, the voiceprint recognition model may include a fusion model, where the fusion model corresponds to one or more scenes. The processor, when training the voiceprint recognition model based on the feature data, may be specifically configured to: the fusion model is trained based on the feature data.
In a sixth aspect, a chip provided in an embodiment of the present application includes a processor and a communication interface, where the communication interface is configured to receive code instructions and transmit them to the processor. The processor is configured to invoke the code instructions transmitted by the communication interface to perform: triggering a speaker or a display screen of the electronic device to prompt a user to enter verification voice; collecting, through the microphone of the electronic device, the verification voice entered by the user; inputting the verification voice into a voiceprint recognition model for matching to obtain a matching result, where the voiceprint recognition model is obtained by training by the apparatus of any one of claims 13-18; and determining, based on the matching result, whether the user is the registrant of the voiceprint recognition model.
In one possible design, when the processor triggers the speaker of the electronic device to prompt the user to enter the verification voice, the processor may trigger the speaker to play a prompt voice, where the prompt voice prompts the user to enter the verification voice.
In one possible design, when the processor triggers the display screen of the electronic device to prompt the user to enter the verification voice, the processor may trigger the display screen to display prompt text, where the prompt text prompts the user to enter the verification voice.
In one possible design, the processor may further invoke the code instructions transmitted by the communication interface to perform: after triggering the microphone of the electronic device to collect the verification voice entered by the user, performing scene detection on the verification voice.
In one possible design, the voiceprint recognition model may include one or more sub-models, where each sub-model corresponds to one scene. The processor, when inputting the verification voice into the voiceprint recognition model for matching, may be specifically configured to: input the verification voice into the sub-model corresponding to a first scene for matching, where the first scene is the scene detection result.
In one possible design, the processor may be further configured to: if the user is a registrant of the voiceprint recognition model, performing quality evaluation on the verification voice to obtain a quality evaluation result; and if the quality evaluation result shows that the verification voice is high-quality voice, performing incremental learning on the voiceprint recognition model based on the verification voice.
In one possible design, the processor, when incrementally learning the voiceprint recognition model based on the verification speech, may be specifically configured to: carrying out data enhancement processing on the verification voice to obtain processed voice data; incremental learning of the voiceprint recognition model is performed based on the processed speech data.
In one possible design, the processor may be further configured to: before data enhancement processing is carried out on the verification voice, a first scene where the verification voice is located is determined to be a high-frequency scene. The processor, when performing data enhancement processing on the verification speech, may specifically be configured to: and carrying out data enhancement processing on the verification voice to obtain j sample voices with different noise levels. The processor, when performing incremental learning on the voiceprint recognition model based on the processed speech data, may be specifically configured to: grouping the j sample voices according to the noise level to obtain M groups of voice data, wherein M is an integer which is greater than 0 and not greater than j; and respectively training the submodels corresponding to the first scene based on the M groups of voice data to obtain M high-frequency submodels.
In a seventh aspect, the present application also provides a computer-readable storage medium including instructions which, when executed on a computer, cause the computer to perform the method of the above aspects.
In an eighth aspect, the present application also provides a computer program product comprising instructions which, when executed, cause the method of the above aspects to be performed.
Drawings
FIG. 1 is a schematic diagram of a hardware structure of an electronic device according to this application;
FIG. 2 is a schematic flowchart of a voiceprint recognition method according to this application;
FIG. 3 is a schematic diagram of a display screen prompting a user to enter registration voice according to this application;
FIG. 4 is a schematic diagram of a user triggering an electronic device to perform voiceprint verification according to this application;
FIG. 5 is a schematic diagram of a display screen outputting a recognition result according to this application;
FIG. 6 is a schematic diagram of a voiceprint recognition process according to this application;
FIG. 7 is a schematic diagram of another display screen outputting a recognition result according to this application;
FIG. 8 is a schematic diagram of another voiceprint recognition process according to this application.
Detailed Description
It should be understood that in this application, "/" indicates an "or" relationship; for example, A/B may indicate A or B. "And/or" merely describes an association between objects and indicates that three relationships are possible; for example, A and/or B may indicate: A alone, both A and B, or B alone. "At least one" means one or more, and "a plurality" means two or more.
In this application, "exemplary", "in some embodiments", "in other embodiments", and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs; rather, such terms are intended to present concepts in a concrete fashion.
Furthermore, the terms "first", "second", and the like are used for description only and should not be understood as indicating or implying relative importance, implicitly indicating the number of technical features concerned, or implying any order.
The electronic device in the embodiment of the application is an electronic device with a voiceprint recognition function. Voiceprint recognition is a technique for automatically identifying and confirming the identity of a speaker through a voice signal. The electronic equipment in the embodiment of the application can collect voice data of a user and perform voiceprint recognition on the voice data to judge whether the user is a registrant or not.
The following describes electronic devices, Graphical User Interfaces (GUIs) for such electronic devices, and embodiments for using such electronic devices. For convenience of description, the GUI will be simply referred to as a user interface hereinafter.
The electronic device in the embodiment of the present application may be a portable electronic device, such as a mobile phone, a tablet computer, an artificial intelligence (AI) voice terminal, a wearable device, or an augmented reality (AR)/virtual reality (VR) device. Exemplary embodiments of the portable electronic device include, but are not limited to, devices running Android, iOS, or another operating system. The portable electronic device may also be a vehicle-mounted terminal, a laptop computer, or the like. It should also be understood that the electronic device in the embodiment of the present application may instead be a desktop computer or a smart home device (e.g., a smart television or a smart speaker), among others; this is not limited here.
For example, as shown in fig. 1, a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application is shown. Specifically, as shown in the figure, the electronic device includes a processor 110, an internal memory 121, an external memory interface 122, a camera 131, a display 132, a sensor module 140, a Subscriber Identity Module (SIM) card interface 151, a key 152, an audio module 160, a speaker 161, a receiver 162, a microphone 163, an earphone interface 164, a Universal Serial Bus (USB) interface 170, a charging management module 180, a power management module 181, a battery 182, a mobile communication module 191, and a wireless communication module 192. In other embodiments, the electronic device may also include motors, indicators, keys, and the like.
It should be understood that the hardware configuration shown in fig. 1 is only one example. The electronic devices of the embodiments of the application may have more or fewer components than the electronic devices shown in the figures, may combine two or more components, or may have different configurations of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
Processor 110 may include one or more processing units, among others. For example: the processor 110 may include an Application Processor (AP), a modem, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors.
In some embodiments, a buffer may also be provided in the processor 110 for storing instructions and/or data. As an example, the buffer in the processor 110 may be a cache memory. The buffer may hold instructions and/or data that the processor 110 has just used, generated, or used cyclically. If the processor 110 needs the instructions or data again, it can fetch them directly from the buffer, which reduces access time and thus helps improve system efficiency.
The internal memory 121 may be used to store programs and/or data. In some embodiments, the internal memory 121 includes a program storage area and a data storage area. The program storage area may store an operating system (e.g., Android or iOS), a computer program required by at least one function (e.g., a voiceprint recognition function or a sound playing function), and the like. The data storage area may store data (e.g., audio data) created and/or collected during use of the electronic device. For example, the processor 110 may call programs and/or data stored in the internal memory 121 so that the electronic device executes a corresponding method to implement one or more functions; for instance, the processor 110 calls certain programs and/or data in the internal memory so that the electronic device executes the voiceprint recognition method provided in the embodiments of the present application, thereby implementing the voiceprint recognition function. The internal memory 121 may include high-speed random access memory and/or nonvolatile memory. For example, the nonvolatile memory may include at least one of one or more magnetic disk storage devices, flash memory devices, and/or universal flash storage (UFS).
The external memory interface 122 may be used to connect an external memory card (e.g., a Micro SD card) to extend the storage capability of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 122 to implement a data storage function. For example, the electronic device may save files such as images, music, videos, and the like in the external memory card through the external memory interface 122.
The camera 131 may be used to capture moving images (video), still images, and the like. Typically, the camera 131 includes a lens and an image sensor. An optical image of an object is generated through the lens and projected onto the image sensor, which converts it into an electrical signal for subsequent processing. For example, the image sensor may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The image sensor converts the optical signal into an electrical signal and then transmits it to the ISP, which converts it into a digital image signal. It should be noted that the electronic device may include 1 or N cameras 131, where N is a positive integer greater than 1.
The display screen 132 may include a display panel for displaying a user interface. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. It should be noted that the electronic device may include 1 or M display screens 132, where M is a positive integer greater than 1. For example, the electronic device may implement the display function via the GPU, the display screen 132, the application processor, and the like.
The sensor module 140 may include one or more sensors. For example, the touch sensor 140A, the gyroscope 140B, the acceleration sensor 140C, the fingerprint sensor 140D, the pressure sensor 140E, and the like. In some embodiments, the sensor module 140 may also include an ambient light sensor, a distance sensor, a proximity light sensor, a bone conduction sensor, a temperature sensor, and the like.
Here, the touch sensor 140A may also be referred to as a "touch panel". The touch sensor 140A may be disposed on the display screen 132, and the touch sensor 140A and the display screen 132 form a touch screen, which is also called a "touch screen". The touch sensor 140A is used to detect a touch operation applied thereto or nearby. The touch sensor 140A may pass the detected touch operation to an application processor to determine the touch event type. The electronic device may provide visual output related to touch operations, etc. through the display screen 132. In other embodiments, the touch sensor 140A may be disposed on a surface of the electronic device at a different location than the display screen 132.
The gyroscope 140B may be used to determine the motion posture of the electronic device. In some embodiments, the angular velocity of the electronic device about three axes (i.e., the x, y, and z axes) may be determined by the gyroscope 140B. The gyroscope 140B may be used for image stabilization during photographing. Illustratively, when the shutter is pressed, the gyroscope 140B detects the shake angle of the electronic device, calculates the compensation distance for the lens module according to that angle, and lets the lens counteract the shake through reverse movement, thereby achieving stabilization. The gyroscope 140B may also be used in navigation and motion-sensing game scenarios.
The acceleration sensor 140C can detect the magnitude of acceleration of the electronic device in various directions (typically three axes). When the electronic device is at rest, the magnitude and direction of gravity can be detected. The acceleration sensor 140C may also be used to recognize the posture of the electronic device, and be applied to horizontal and vertical screen switching, pedometer, and other applications.
The fingerprint sensor 140D is used to capture a fingerprint. The electronic equipment can utilize the collected fingerprint characteristics to realize fingerprint unlocking, application lock access, fingerprint photographing, incoming call answering and the like.
The pressure sensor 140E is used for sensing a pressure signal, and can convert the pressure signal into an electrical signal. For example, the pressure sensor 140E may be disposed on the display screen 132. The touch operations which act on the same touch position but have different touch operation intensities can correspond to different operation instructions.
The SIM card interface 151 is used to connect a SIM card. A SIM card can be attached to or detached from the electronic device by inserting it into or pulling it out of the SIM card interface 151. The electronic device may support 1 or K SIM card interfaces 151, where K is a positive integer greater than 1. The SIM card interface 151 may support a Nano SIM card, a Micro SIM card, and/or a standard SIM card. Multiple cards can be inserted into the same SIM card interface 151 at the same time; the cards may be of the same type or different types. The SIM card interface 151 may also be compatible with different types of SIM cards and with an external memory card. The electronic device implements functions such as calling and data communication through the interaction of the SIM card with the network. In some embodiments, the electronic device may also employ an eSIM, namely an embedded SIM card; the eSIM card can be embedded in the electronic device and cannot be separated from it.
The keys 152 may include a power on key, a volume key, and the like. The keys 152 may be mechanical keys or touch keys. The electronic device may receive a key input, and generate a key signal input related to user settings and function control of the electronic device.
The electronic device may implement audio functions through the audio module 160, the speaker 161, the receiver 162, the microphone 163, the headphone interface 164, and the application processor, etc. Such as an audio playing function, a recording function, a voiceprint registration function, a voiceprint verification function, a voiceprint recognition function, etc.
The audio module 160 may be used to perform digital-to-analog conversion, and/or analog-to-digital conversion on the audio data, and may also be used to encode and/or decode the audio data. For example, the audio module 160 may be disposed independently of the processor, may be disposed in the processor 110, or may dispose some functional modules of the audio module 160 in the processor 110.
The speaker 161 is used to convert audio data into sound and play it. For example, the electronic device 100 may play music, take a hands-free call, or issue a voice prompt through the speaker 161.
A receiver 162, also called "earpiece", is used to convert audio data into sound and play the sound. For example, when the electronic device 100 answers a call, the answer can be made by placing the receiver 162 close to the ear of the person.
The microphone 163, also referred to as a "mic", is used to collect sound (e.g., ambient sound, including sound made by people and sound made by devices) and convert it into audio electrical data. When making a call or sending voice, the user can speak with the mouth close to the microphone 163, which collects the user's voice. When the voiceprint recognition function of the electronic device is turned on, the microphone 163 may collect ambient sound in real time to obtain audio data. What the microphone 163 collects depends on the environment. For example, when the surroundings are noisy and the user speaks the verification utterance, the sound collected by the microphone 163 includes the ambient noise and the user's voice speaking the verification utterance. When the surroundings are quiet and the user speaks the verification utterance, the sound collected by the microphone 163 is only the user's voice speaking the verification utterance. When the user speaks the verification utterance under far-field conditions, the sound collected by the microphone 163 is a superposition of ambient noise and the reverberated voice of the user speaking the verification utterance. And when the surroundings are noisy but the user does not speak, the sound collected by the microphone 163 is only the ambient noise.
It should be noted that the electronic device may be provided with at least one microphone 163. For example, two microphones 163 are provided in the electronic device, and in addition to collecting sound, a noise reduction function can be realized. For example, three, four or more microphones 163 may be further disposed in the electronic device, so that the recognition of the sound source, the directional recording function, or the like may be further implemented on the basis of implementing sound collection and noise reduction.
The earphone interface 164 is used to connect wired earphones. The earphone interface 164 may be the USB interface 170, a 3.5 mm Open Mobile Terminal Platform (OMTP) standard interface, a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface, or the like.
The USB interface 170 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 170 may be used to connect a charger to charge the electronic device, to transmit data between the electronic device and a peripheral device, or to connect earphones and play audio through them. Besides the earphone interface 164, the USB interface 170 may also be used to connect other electronic devices, such as AR devices and computers.
The charging management module 180 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 180 may receive charging input from a wired charger via the USB interface 170. In some wireless charging embodiments, the charging management module 180 may receive wireless charging input through a wireless charging coil of the electronic device. While charging the battery 182, the charging management module 180 may also supply power to the electronic device through the power management module 181.
The power management module 181 is used to connect the battery 182, the charging management module 180 and the processor 110. The power management module 181 receives input from the battery 182 and/or the charging management module 180 to power the processor 110, the internal memory 121, the display 132, the camera 131, and the like. The power management module 181 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), and the like. In some other embodiments, the power management module 181 may also be disposed in the processor 110. In other embodiments, the power management module 181 and the charging management module 180 may be disposed in the same device.
The mobile communication module 191 may provide a solution including 2G/3G/4G/5G wireless communication, etc. applied to the electronic device. The mobile communication module 191 may include a filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like.
The wireless communication module 192 may provide a solution for wireless communication applied to an electronic device, including WLAN (e.g., Wi-Fi network), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 192 may be one or more devices that integrate at least one communication processing module.
In some embodiments, the antenna 1 of the electronic device is coupled to the mobile communication module 191 and the antenna 2 is coupled to the wireless communication module 192, so that the electronic device can communicate with other devices. Specifically, the mobile communication module 191 may communicate with other devices through the antenna 1, and the wireless communication module 192 may communicate with other devices through the antenna 2.
The voiceprint recognition method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings and application scenarios. The following embodiments may be implemented in the electronic device 100 having the above-described hardware structure.
In order to better understand the voiceprint recognition method provided in the embodiments of the present application, the terms involved in the embodiments are explained first.
Near-field conditions: the distance between the sound source and the microphone (mic) is relatively short, for example, the distance between the sound source and the mic is within 1 meter.
Near-field speech: speech data collected under near-field conditions; for example, speech data collected by the mic when the distance between the sound source and the mic is 1 meter or less is near-field speech. Near-field speech may include near-field clean speech, which is noise-free speech data collected under near-field conditions, and near-field noisy speech, which is noisy speech data collected under near-field conditions.
Far-field conditions: the distance between the sound source and the microphone (mic) is relatively long, for example, the distance between the sound source and the mic is between 1 meter and 10 meters.
Far-field speech: speech data collected under far-field conditions; for example, speech data collected by the mic when the distance between the sound source and the mic is 5 meters is far-field speech. Far-field speech may include far-field clean speech, which is noise-free speech data collected under far-field conditions, and far-field noisy speech, which is noisy speech data collected under far-field conditions.
Voiceprint recognition model: a data model established by the electronic device based on methods such as Gaussian mixture model–universal background model (GMM-UBM), support vector machine (SVM), joint factor analysis (JFA), identity vector (I-vector), and X-vector. After establishing the initial voiceprint recognition model, the electronic device trains it with sample data; the trained voiceprint recognition model can then be used for voiceprint recognition.
Multi-scene fusion model: an initial voiceprint recognition model trained with sample data from a plurality of scenes; after training, the voiceprint recognition model can be regarded as a multi-scene fusion model.
Single-scene model: a model obtained by training the initial voiceprint recognition model with sample data from one scene. After such training, the model can be regarded as the model corresponding to that scene. For example, training the initial voiceprint recognition model with sample data from a home scene yields the model corresponding to the home scene (the home model for short), and training it with sample data from a vehicle-mounted scene yields the model corresponding to the vehicle-mounted scene (the vehicle-mounted model for short). Thus, by training the initial voiceprint recognition model with sample data from different scenes, a single-scene model corresponding to each scene can be obtained.
Incremental learning: when new sample data is added, the voiceprint recognition model does not need to be rebuilt; instead, the changes brought by the new sample data are applied on top of the original voiceprint recognition model. That is, the previously trained voiceprint recognition model is further trained with the newly added sample data, so that the voiceprint recognition model is continuously updated.
Feature extraction: in the embodiments of this application, refers to the method and process of transforming voice data to extract its characteristic information.
Scene detection: determining the scene in which the voice data was recorded by extracting background data from the voice data.
Referring to fig. 2, a flow of a voiceprint recognition method provided by an embodiment of the present application is exemplarily shown, and the method is executed by an electronic device. The basic scheme of voiceprint recognition includes two stages of voiceprint registration and voiceprint verification. The voiceprint registration can be realized through steps S201 to S204. Voiceprint verification can be achieved through steps S205 to S209.
S201, the electronic equipment collects registration voice input by a user. Wherein the registration voice entered by the user may be a near-field clean voice.
Specifically, the electronic device may capture ambient sounds through the microphone 163 to obtain the registration voice entered by the user.
In particular implementations, the user may speak the registration voice at the prompt of the electronic device. For example, as shown in FIG. 3, the electronic device may display text on the display screen 132 prompting the user to speak the registration voice "1234567". As another example, the electronic device may give a voice prompt through the speaker 161. The electronic device may automatically prompt the user to speak the registration voice when the voiceprint recognition function is started for the first time; alternatively, the user may operate the electronic device to trigger the prompt at first start, or trigger it as required when subsequently using the voiceprint recognition function.
As a possible implementation manner, the user may input the registration voice multiple times when performing the voiceprint registration, so that the accuracy of voiceprint recognition may be improved.
S202, after the electronic device collects the registration voice, the electronic device may store the registration voice into a high-quality voice sample library, where the high-quality voice sample library is used to store voices whose voice quality score is greater than or equal to a quality threshold.
S203, the electronic equipment performs data enhancement processing on the registered voice included in the high-quality voice sample library to obtain a plurality of sample voices. The sample speech may be, but is not limited to: a noisy speech generated from the registered speech, a far-field noisy speech generated from the registered speech, and the like.
In the above manner, the electronic device can generate noisy speech, far-field speech, and the like based on the registration voice, without requiring the user to register separately under near-field, far-field, and other conditions, which improves the user experience.
When the electronic device generates noisy speech based on the registration voice, this may be implemented as follows: the registration voice and one or more noise sources are added to a simulated room and processed to obtain the noisy speech. Specifically, the electronic device may generate noisy speech of different noise levels based on the registration voice. For example, different scenes may correspond to different noise levels, so the electronic device may, for each scene, simulate the registration voice to generate the noisy speech for that scene.
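For illustration only, the noise-adding enhancement described above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions (the function name, the mono float waveform, and the example SNR values are illustrative, not taken from the patent):

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the target SNR (in dB).

    A minimal sketch of noise-based data enhancement: the noise is tiled or
    cropped to the speech length, scaled so that the resulting signal-to-noise
    ratio equals `snr_db`, and added to the speech.
    """
    # Repeat or crop the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: simulate three noise levels (e.g., three scenes) from one registration voice.
# noisy_versions = [add_noise(enrollment, scene_noise, snr) for snr in (20, 10, 5)]
```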
When the electronic device generates far-field speech based on the registration voice, this may be implemented as follows: using an image source model (ISM) algorithm, the wall reflections of sound are simulated with virtual sound sources, and the room impulse response (RIR) is calculated from signal delay and attenuation parameters; the RIR is used to simulate the reverberation of sound under far-field conditions. Far-field speech corresponding to the registration voice is then generated from the RIR simulation. Specifically, the electronic device may generate far-field speech of different far-field levels based on the registration voice. For example, different scenes may correspond to different far-field distances, so the electronic device may, for each scene, simulate the registration voice to generate the far-field speech for that scene.
In addition, the electronic device may also simulate reverberation under far-field conditions in other ways, for example, by convolving the sound with an impulse response.
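The convolution-based simulation mentioned above can likewise be sketched. Note that the patent computes the RIR with the ISM algorithm from room geometry; the synthetic RIR below (a direct-path impulse plus an exponentially decaying noise tail) is only a stand-in of our own so the convolution step can be shown end to end:

```python
import numpy as np

def synthetic_rir(fs: int = 16000, rt60: float = 0.4, length_s: float = 0.5,
                  direct_delay_s: float = 0.01) -> np.ndarray:
    """Crude stand-in for an ISM-computed room impulse response.

    A real implementation would derive the RIR from room geometry and the
    source/mic positions with the image-source model; here we use a
    direct-path impulse followed by an exponentially decaying noise tail
    whose envelope reaches -60 dB at t = rt60.
    """
    n = int(length_s * fs)
    rir = np.zeros(n)
    rir[int(direct_delay_s * fs)] = 1.0            # direct path
    t = np.arange(n) / fs
    decay = 10 ** (-3 * t / rt60)                  # amplitude envelope
    rir += 0.3 * np.random.randn(n) * decay        # reflections / reverb tail
    return rir / np.max(np.abs(rir))

def simulate_far_field(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve (near-field) speech with the RIR to simulate far-field pickup."""
    return np.convolve(speech, rir)[: len(speech)]
```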
When the electronic device generates far-field noisy speech based on the registration voice, this may be implemented as follows: the registration voice and one or more noise sources are added to a simulated room and processed to obtain noisy speech; the RIR is then calculated with the ISM algorithm, and far-field noisy speech corresponding to the noisy speech is generated from the RIR simulation. Specifically, the electronic device may generate far-field noisy speech of different far-field levels and different noise levels based on the registration voice. For example, the electronic device may simulate the registration voice with respect to the noise characteristics, far-field characteristics, and the like of a particular scene to generate the far-field noisy speech corresponding to that scene.
In the above process, the noise level may be understood as a noise intensity level, and the far-field level may be understood as a far-field distance level.
S204, the electronic device performs feature extraction on the sample voices and trains the models in the voiceprint recognition model library based on the extracted features to obtain the trained models.
Illustratively, the models in the voiceprint recognition model library may be established by methods such as, but not limited to, GMM-UBM, SVM, JFA, I-vector, and X-vector.
In particular implementations, the voiceprint recognition model library may include a multi-scene fusion model, in which case the electronic device trains the multi-scene fusion model with sample voices of multiple scenes. Alternatively, the voiceprint recognition model library may include models corresponding to a plurality of scenes, in which case the model corresponding to each scene is trained with the sample voices of that scene. Or the voiceprint recognition model library may include both a multi-scene fusion model and models corresponding to a plurality of scenes, in which case the electronic device trains the multi-scene fusion model with sample voices of multiple scenes, and trains the model corresponding to each scene with the sample voices of that scene.
If the voiceprint recognition model is a multi-scene fusion model, the verification voice is input into the multi-scene fusion model to obtain a single matching score, and after the data of the high-quality voice sample library is learned, the multi-scene fusion model matches the actual usage scenes more and more closely. If the voiceprint recognition model includes models corresponding to each of a plurality of scenes, scene detection is performed on the input verification voice, and voiceprint recognition is performed with the model of the scene corresponding to the verification voice. Furthermore, if the verification voice passes the quality evaluation and enters the high-quality voice sample library, data enhancement is performed on it, and the model corresponding to the scene is updated through incremental learning, so that the model matches the actual scene more and more closely.
The electronic device may perform feature extraction on the sample voices by methods such as, but not limited to, filter banks (FBank), mel-frequency cepstral coefficients (MFCCs), and D-vectors.
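As a minimal illustration of this step, the following sketch extracts FBank and MFCC features with the librosa library; the patent names the feature types but does not prescribe a library, and the sampling rate and frame/hop sizes here are conventional assumptions of ours:

```python
import librosa
import numpy as np

def extract_features(wav_path: str, n_mels: int = 40, n_mfcc: int = 20):
    """Extract FBank (log mel filter-bank) and MFCC features from an utterance.

    25 ms frames with a 10 ms hop at 16 kHz (n_fft=400, hop_length=160) are
    conventional choices, not values from the patent.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    # Log mel filter-bank energies (FBank): shape (n_mels, frames).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)
    # MFCCs: shape (n_mfcc, frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=400,
                                hop_length=160, n_mfcc=n_mfcc)
    return fbank.T, mfcc.T  # frames-first, ready for model input
```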
S205, the electronic equipment collects verification voice input by the user.
In a specific implementation, the user may speak the verification voice at the prompt of the electronic device. The method by which the electronic device prompts the user to speak the verification voice is similar to that for the registration voice, and the details are not repeated here.
The electronic device may collect the verification voice entered by the user upon an operation trigger of the user; that is, the user triggers a verification instruction by operating the electronic device, and after receiving the verification instruction, the electronic device prompts the user to enter the verification voice and collects it. For example, the user may trigger the verification instruction by tapping the icon corresponding to the voiceprint recognition function on the touch screen of the electronic device, so that the electronic device prompts the user to speak the verification voice; as another example, the user may trigger it by operating a physical entity (e.g., a physical key, a mouse, a joystick, etc.); as yet another example, the user may trigger the verification instruction through a specific gesture (e.g., double-tapping the touch screen of the electronic device). As a further example, the user may speak the keyword "voiceprint recognition" to the electronic device (e.g., a smart phone, a vehicle-mounted device, etc.); after collecting the keyword through the microphone 163, the electronic device triggers the verification instruction and prompts the user to speak the verification voice.
Alternatively, when the user speaks a control command for controlling the electronic device, the electronic device may collect the control command and perform voiceprint recognition using the control command as the verification voice. That is, the electronic device triggers the verification instruction upon receiving the control command and uses the control command as the verification voice for voiceprint recognition. For example, as shown in fig. 4, the user may issue the control command "open music" to an electronic device (e.g., a smart phone, a vehicle-mounted device, etc.); the electronic device collects the voice "open music" through the microphone 163 and then performs voiceprint recognition using that voice as the verification voice. As another example, the user may issue the control command "tune to 27 ℃" to an electronic device (e.g., a smart air conditioner); after collecting the voice "tune to 27 ℃" through the microphone 163, the electronic device uses it as the verification voice for voiceprint recognition.
And S206, the electronic equipment performs feature extraction and scene detection on the verification voice.
When the electronic device performs feature extraction on the verification speech, methods such as FBank, MFCC, D-vector and the like can be adopted.
Further, the electronic device may add a scene tag to the verification voice after performing scene detection on the verification voice. For example, after the electronic device performs scene detection on the verification voice, it is determined that the verification voice is recorded in the vehicle-mounted scene, and then a scene tag corresponding to the vehicle-mounted scene may be added to the verification voice.
Exemplary, methods of scene detection may include, but are not limited to, GMM, Deep Neural Network (DNN), and the like. The scene tags can be selected according to application scenes, such as a home scene, a vehicle-mounted scene, a background music scene, a noisy human voice environment, a far-field scene, a near-field scene, and the like.
In some embodiments, the electronic device may train a detection model in advance for each scene (the detection model may be based on a GMM algorithm or a DNN algorithm), so that the electronic device may input the verification speech into the detection models corresponding to the scenes to obtain matching scores, and determine the scene corresponding to the verification speech according to the matching scores of the models.
In other embodiments, the electronic device may also pre-train a classification model (the classification model may be based on a DNN algorithm), so that the electronic device may input the verification speech into the classification model, and the classification model may output a classification result, where the classification result is a scene corresponding to the verification speech.
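The per-scene detection-model variant can be sketched as follows, using one Gaussian mixture model per scene. The scene tags, the component count, and the assumption of precomputed per-scene background feature matrices are ours, purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_scene_models(training_features: dict) -> dict:
    """Train one GMM detection model per scene.

    `training_features` maps a scene tag (e.g., "home", "in_vehicle") to a
    (frames x dims) feature matrix of background audio from that scene;
    8 mixture components is an illustrative choice.
    """
    return {tag: GaussianMixture(n_components=8, random_state=0).fit(feats)
            for tag, feats in training_features.items()}

def detect_scene(scene_models: dict, features: np.ndarray) -> str:
    """Score the verification speech against each scene GMM and return the
    scene tag with the highest average log-likelihood."""
    scores = {tag: gmm.score(features) for tag, gmm in scene_models.items()}
    return max(scores, key=scores.get)
```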
S207, the electronic device inputs the verification voice into the voiceprint recognition model trained in the voiceprint registration stage and performs matching scoring. If the matching score is greater than the matching threshold, the verification speech may be determined to be from the registrant; otherwise, it is determined not to be from the registrant.
Methods for matching scoring may include, but are not limited to: cosine distance scoring (CDS), linear discriminant analysis (LDA), probabilistic linear discriminant analysis (PLDA), and the like.
Specifically, if the voiceprint recognition model is a multi-scene fusion model, a single score is obtained from the matching score of the multi-scene fusion model. If the voiceprint recognition model includes models corresponding to multiple scenes, matching scoring may be performed separately with the models corresponding to the scenes to obtain multiple scores, and a fused score is then obtained by weighting, in combination with the scene tag obtained in step S206.
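As an illustration of this scoring step, the sketch below computes a cosine-distance (CDS) match score between voiceprint embeddings and a weighted fusion of per-scene scores; the exact weighting scheme and the threshold name are assumptions, since the patent does not fix them:

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, verify_emb: np.ndarray) -> float:
    """Cosine-distance (CDS) matching score between the enrollment and
    verification voiceprint embeddings (e.g., I-vectors or X-vectors)."""
    denom = np.linalg.norm(enroll_emb) * np.linalg.norm(verify_emb) + 1e-12
    return float(np.dot(enroll_emb, verify_emb) / denom)

def fused_score(per_scene_scores: dict, scene_weights: dict) -> float:
    """Weighted fusion of per-scene model scores; the weights would be set
    from the scene tag of step S206 (weighting scheme is illustrative)."""
    total_w = sum(scene_weights.get(s, 0.0) for s in per_scene_scores) + 1e-12
    return sum(scene_weights.get(s, 0.0) * v
               for s, v in per_scene_scores.items()) / total_w

# is_registrant = fused_score(scores, weights) > MATCH_THRESHOLD  # assumed threshold
```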
Further, the electronic device may output the recognition result to the user upon determining that the verification speech is not from the registrant. Specifically, the electronic device may output the recognition result on the display screen 132; as shown in FIG. 5, the electronic device may display the text "Not the registrant!". As another example, the electronic device may broadcast the voice "not the registrant" through the speaker 161.
S208, when the electronic device determines that the verification voice comes from the registrant, the electronic device can perform quality evaluation on the verification voice by combining the scene label of the verification voice. If the quality score of the verification voice is greater than the quality threshold, the verification voice may be added to the high quality voice sample library.
For example, the method for performing quality evaluation on the verification speech may be: determining values of parameters characterizing the voice quality of the verification speech to determine whether it is high-quality speech, where the parameters characterizing voice quality may include, but are not limited to, one or more of the following: signal-to-noise ratio (SNR), segmental SNR, perceptual evaluation of speech quality (PESQ), log-likelihood ratio measure (LLR), and the like.
Alternatively, the verification speech may be input into a model for quality evaluation to determine whether the verification speech is high-quality speech, wherein the model for quality evaluation may be based on the GMM algorithm or the DNN algorithm. Specifically, the verification speech may be input into the model to obtain a quality score, and then whether the verification speech is high-quality speech may be determined according to the quality score.
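For illustration, a simple SNR-based quality gate corresponding to the parameter-based evaluation above might look as follows. The noise estimate is assumed to come from non-speech frames (e.g., via a voice activity detector), and the threshold value is illustrative, not a value from the patent:

```python
import numpy as np

def snr_db(speech: np.ndarray, noise_estimate: np.ndarray) -> float:
    """Global SNR (in dB) of the verification voice, given a noise estimate
    taken, for example, from non-speech frames (the VAD is assumed)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise_estimate ** 2) + 1e-12
    return 10.0 * np.log10(p_speech / p_noise + 1e-12)

QUALITY_THRESHOLD_DB = 15.0  # illustrative value, not from the patent

def is_high_quality(speech: np.ndarray, noise_estimate: np.ndarray) -> bool:
    """SNR-based stand-in for the quality gate of step S208."""
    return snr_db(speech, noise_estimate) > QUALITY_THRESHOLD_DB
```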
In a specific implementation, the voiceprint recognition model may include models corresponding to a plurality of scenes, and the high-quality speech sample library may likewise be stored by scene; that is, the high-quality speech sample library may include sample libraries corresponding to the plurality of scenes, where the sample library corresponding to a scene is used to train the model corresponding to that scene. On this basis, one possible implementation is that, if the quality score of the verification voice is greater than the quality threshold, the electronic device adds the verification voice to the sample library corresponding to the scene detected in step S206.
As an example, assume that the voiceprint recognition model includes models corresponding to scenes A, B, C, and D, and that the high-quality speech sample library includes sample libraries corresponding to scenes A, B, C, and D. If the verification speech is determined in step S206 to be from scene A, the electronic device may add the verification speech to the sample library corresponding to scene A when its quality score is greater than the quality threshold.
S209, the electronic equipment performs data enhancement processing on the voice of the high-quality voice sample library, uses the processed voice data for incremental learning, and updates the voiceprint recognition model.
Algorithms for incremental learning may include, but are not limited to: Method 1, the enhanced voice data is added to the original registration voice by weighting, and the resulting voice data is used to train the voiceprint recognition model; Method 2, the voiceprint recognition model obtained from the last training is further trained on the enhanced voice data to obtain a new voiceprint recognition model, and the new voiceprint recognition model is weight-added to the previously trained voiceprint recognition model to complete the model update.
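Method 2 above can be sketched as a weighted interpolation of model parameters. The parameter-dictionary representation, the interpolation weight alpha, and the assumed training helper are illustrative choices of ours:

```python
import numpy as np

def incremental_update(old_params: dict, new_params: dict, alpha: float = 0.9) -> dict:
    """Weight-add a freshly trained model to the previously trained one
    (Method 2 above). `old_params` / `new_params` map parameter names to
    numpy arrays; alpha = 0.9 keeps 90% of the old model and is purely an
    illustrative choice."""
    return {name: alpha * old_params[name] + (1.0 - alpha) * new_params[name]
            for name in old_params}

# Usage sketch: train on the enhanced speech first, then interpolate.
# new_params = train(last_params, enhanced_speech_features)  # assumed helper
# updated_params = incremental_update(last_params, new_params)
```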
In step S209, the scene tags are combined when enhancing the speech in the high-quality speech sample library, so that richer data can be obtained during use. For example, near-field clean speech may be enhanced to produce far-field clean speech, and low-noise speech in a home scene may be enhanced to produce noisy speech in the home scene. Moreover, when the voiceprint recognition model includes models corresponding to a plurality of scenes and the high-quality voice sample library is stored by scene, the model corresponding to each scene can be updated through incremental learning. Using high-quality speech from the user's daily verification data for incremental learning after data enhancement, and updating the voiceprint recognition model accordingly, makes the voiceprint recognition model match the actual usage scenes more and more closely and improves the robustness of the voiceprint recognition system.
For better understanding of the embodiments of the present application, the voiceprint recognition process is described in detail below with reference to a specific application scenario.
Scene one: for cases where the usage scene changes frequently, for example portable electronic devices such as mobile phones, earphones, and bracelets. Such portable electronic devices move through different scenes with the user; for example, when the user leaves home and drives to a shopping mall, the portable electronic device moves from a home scene into a vehicle-mounted scene and then into a mall scene. When these portable electronic devices perform voiceprint recognition, the following steps S601 to S614 may be used.
As shown in fig. 6, the voiceprint recognition process may specifically include:
S601, the electronic device collects k registration voices of the user, where k may be an integer greater than or equal to 1. Step S602 is performed.
The user may enter the registration voice for multiple times under the prompt of the electronic device, and the prompting method may specifically refer to the method in step S201, which is not described herein repeatedly. Thus, the electronic device can collect k registered voices of the user through the microphone 163.
S602, the electronic equipment adds the k pieces of registered voice into a high-quality voice sample library. Step S603 is performed.
And S603, the electronic equipment performs data enhancement processing on the k pieces of registered voice to obtain sample voice. The method of data enhancement processing may specifically refer to the method in step S203, and details are not repeated here. Wherein, one registered voice can generate a plurality of sample voices with different noise levels and different far-field levels. Step S604 is performed.
In particular, a library of high quality speech samples may be stored for different scene classifications. Therefore, the electronic device can perform data enhancement processing on the k pieces of registered voices respectively according to different scenes, so that sample voices corresponding to the scenes can be generated according to different scenes.
For example, for scene A, the electronic device may perform data enhancement processing on the k registration voices: based on each registration voice, it may generate s1 pieces of sample data with different noise levels and different far-field levels, thereby obtaining k × s1 sample voices corresponding to scene A. Similarly, for scene B, it may generate s2 pieces of sample data per registration voice, obtaining k × s2 sample voices corresponding to scene B; and for scene C, it may generate s3 pieces of sample data per registration voice, obtaining k × s3 sample voices corresponding to scene C.
Further, the electronic device may store, for each scene, the sample voices of that scene into the corresponding sample library: for example, the k × s1 sample voices of scene A into sample library 1 corresponding to scene A, the k × s2 sample voices of scene B into sample library 2 corresponding to scene B, and the k × s3 sample voices of scene C into sample library 3 corresponding to scene C.
In some embodiments, for each scene, the electronic device may perform data enhancement processing on the k pieces of registered voices by using a noise source corresponding to the scene to obtain sample voices corresponding to the scene, where the noise source corresponding to the scene may be noise data collected in the scene, or noise data generated for simulation of the scene, and so on. For example, for a scene a, the electronic device may perform data enhancement processing on the registration data using the noise source of the scene a, and for a scene B, the electronic device may perform data enhancement processing on the registration data using the noise source of the scene B.
S604, the electronic equipment extracts the characteristics of the sample voice, trains the models in the voiceprint recognition model base based on the extracted characteristics, and obtains the trained models. The method for extracting features specifically refers to the method described in step S204, and repeated details are not repeated. Step S605 is executed.
The electronic equipment can establish a multi-scene fusion model, namely a voiceprint recognition model library comprises the multi-scene fusion model. The electronic device may train the multi-scene fusion model by using the sample data acquired in step S603.
Alternatively, the electronic device may establish separate models for different scenes; that is, the voiceprint recognition model library may include multiple models, such as a near-field quiet model, a near-field home model, a far-field home model, and a vehicle-mounted model. The model corresponding to each scene may be trained with the sample library of that scene; for example, the model corresponding to scene A is trained with sample library 1 of scene A. Illustratively, the near-field quiet model is trained with the sample library of the near-field quiet scene, the near-field home model with the sample library of the near-field home scene, the far-field home model with the sample library of the far-field home scene, and the vehicle-mounted model with the sample library of the vehicle-mounted scene.
For example, the electronic device may establish models for a home scene, a vehicle-mounted scene, a mall scene, and a work scene. After collecting the registration voice entered by the user, the electronic device performs voice data enhancement for each of these scenes based on the registration voice, thereby obtaining sample voices for the home, vehicle-mounted, mall, and work scenes, and then trains the model of each scene with the sample data of that scene. Accordingly, after the electronic device collects the verification voice entered by the user, it may select the model corresponding to the detected scene; for example, if the result of scene detection on the verification voice is the home scene, the electronic device may input the verification voice into the model of that scene for matching.
Of course, the electronic device may also establish models for different scenes and at the same time establish a multi-scene fusion model; that is, the voiceprint recognition model library may include a multi-scene fusion model together with models corresponding to multiple scenes.
S605, the electronic equipment collects verification voice input by the user. Step S606 is performed.
Step S605 may refer to step S205 specifically, and details are not repeated here.
And S606, the electronic equipment performs feature extraction and scene detection on the verification voice. Step S607 is executed.
Step S606 may refer to step S206, and details are not repeated here.
S607, the electronic device inputs the verification voice into the model trained in step S604 for matching scoring to obtain a first score. Step S608 is performed.
There are various methods for performing the matching scoring. One possible method is: the voiceprint recognition model includes models corresponding to a plurality of scenes, and the electronic device selects the model corresponding to the scene detected in step S606 (assumed to be scene A) for matching scoring; that is, the verification speech is input into the model corresponding to scene A to obtain the first score.
Another possible method is: the voiceprint recognition model comprises a multi-scene fusion model, the electronic equipment can select the multi-scene fusion model to perform matching scoring, namely, the verification voice is input into the multi-scene fusion model to perform matching scoring, and therefore a first score is obtained.
Yet another possible method is: the voiceprint recognition model comprises a plurality of models corresponding to the scenes respectively, the electronic equipment can input the verification voice into the models corresponding to the scenes respectively for matching scores to obtain a plurality of scores, and the scores are fused to obtain a first score. Illustratively, the first score may be, but is not limited to: an average of multiple scores, a weighted value of multiple scores, and so on.
Other methods for matching scores may also be used in implementations, and are not listed here.
It should be added that if it is not desired to maintain multiple models in the voiceprint recognition model library at the same time for some reason in the actual implementation process, only one multi-scene fusion model may be established and trained in step S604.
S608, the electronic device judges whether the first score is larger than a first threshold. If yes, go to step S609 and step S611; if not, go to step S610.
S609, the electronic equipment outputs a voiceprint recognition result: is a registrant.
Specifically, the electronic device may present the text "is a registered person" in the display screen 132, and the exemplary presentation interface may be as shown in fig. 7.
Alternatively, the electronic device may broadcast the voice "is the registrant" through the speaker 161.
S610, the electronic equipment outputs a voiceprint recognition result: not a registrant.
Specifically, the electronic device may present the text "not registered" in the display screen 132, and the exemplary presentation interface may be as shown in fig. 5.
Alternatively, the electronic device may broadcast the voice "not the registrant" through the speaker 161.
S611, the electronic equipment performs quality evaluation on the verification voice to obtain a second score. Step S612 is performed.
Specifically, the electronic device may perform quality evaluation on the verification speech in combination with the scene detected in step S606.
In some embodiments, the electronic device may score the verification speech with the model corresponding to the scene detected in step S606, and add the verification speech to the high-quality speech sample library if the score is higher than the quality evaluation threshold.
Alternatively, a quality evaluation method may be used to determine a quality evaluation score of the verification speech, and the score is compared with the threshold corresponding to the scene detected in step S606 to determine whether the verification speech is high-quality speech of that scene.
S612, the electronic equipment judges whether the second score is larger than a second threshold value. If yes, go to step S613; if not, the process is ended.
S613, the electronic equipment stores the verification voice to a high-quality voice sample library. Step S614 is performed.
Specifically, if the high-quality voice sample library is stored by scene, that is, it includes sample libraries corresponding to a plurality of scenes, the electronic device may store the verification voice into the sample library corresponding to the scene detected in step S606. For example, if it is detected in step S606 that the verification voice originates from the home scene, the electronic device may store the verification voice into the sample library corresponding to the home scene.
The electronic device may also perform data enhancement on the verification voice and store the enhanced verification voice into the high-quality voice sample library. For example, if the verification speech is home-scene speech, it may be enhanced to obtain far-field home speech, or enhanced to obtain home-scene speech of other noise levels, where the noise level of the home-scene speech obtained after the enhancement may be greater than that of the verification speech.
And S614, the electronic equipment performs incremental learning based on the newly added voice data of the high-quality voice sample library, and updates the model in the voiceprint recognition model library.
Specifically, if the voiceprint recognition model library includes one multi-scene fusion model, the electronic device may train the multi-scene fusion model obtained from the last training on the newly added voice data of the high-quality voice sample library to obtain a new multi-scene fusion model, and then weight-add the new multi-scene fusion model to the previously trained one to complete the model update. Alternatively, the electronic device may weight and add the newly added voice data of the high-quality voice sample library to the voice data originally stored there, and train the multi-scene fusion model obtained from the last training on the resulting voice data to complete the model update.
If the voiceprint recognition model library includes models corresponding to a plurality of scenes, take as an example that the scene detected in step S606 is the vehicle-mounted scene and that the verification speech is stored into the sample library of the vehicle-mounted scene in step S613. The electronic device may train the vehicle-mounted scene model obtained from the last training on the voice data newly added to the sample library of the vehicle-mounted scene to obtain a new vehicle-mounted scene model, and then weight-add the new vehicle-mounted scene model to the previously trained one to complete the model update. Alternatively, the electronic device may weight and add the newly added voice data of the vehicle-mounted sample library to the voice data originally stored there, and train the vehicle-mounted scene model obtained from the last training on the resulting voice data to complete the model update.
In the above voiceprint recognition process, multi-scene-tag data enhancement is performed on the original registration voice, which solves the data mismatch caused by a single registration scene and variable verification scenes. High-quality verification voice is added to the high-quality voice sample library, data enhancement and incremental learning are performed, and the models in the voiceprint recognition model library are updated, so that during use the models adapt more and more to the user's actual usage scenes. Therefore, this voiceprint recognition method improves the robustness of the voiceprint recognition algorithm to multiple and changing scenes.
Scene two: for cases where the usage scene is usually one fixed scene, for example devices such as smart speakers, smart home devices, and vehicle-mounted devices, the following steps S801 to S817 may be used to implement voiceprint recognition.
As shown in fig. 8, the voiceprint recognition process may specifically include:
S801 to S813 may refer to steps S601 to S613, and details are not repeated here.
Wherein step S814 may be performed after step S813.
S814, the electronic device judges whether the verification voice is high-quality voice in a high-frequency scene. If yes, go to step S815. If not, go to step S817. Step S817 may refer to step S614, and is not repeated herein.
During voiceprint recognition, most of the verification voices collected by the electronic device come from one particular scene; such a scene can be regarded as a high-frequency scene.
In a specific implementation, the electronic device may determine whether the verification speech is high-quality speech in a high-frequency scene as follows: for the scene detected in step S806 (assumed to be scene A), the electronic device counts the number of times n that the scene detection result was scene A in the last N voiceprint recognition processes. If n is greater than a third threshold (or n/N is greater than a fourth threshold), the electronic device determines that scene A is a high-frequency scene and that the verification speech is high-quality speech in a high-frequency scene; otherwise, the verification speech is not high-quality speech in a high-frequency scene.
For example, assuming that the result of scene detection on the verification speech in step S806 is the home scene, the electronic device may count the number of times n that the scene detection result was the home scene in the last 10 voiceprint recognitions. If n is greater than 5 (i.e., the third threshold), it may be determined that the home scene is a high-frequency scene, that is, the verification speech is high-quality speech in a high-frequency scene; if n is less than or equal to 5, the home scene is not a high-frequency scene, i.e., the verification speech is not high-quality speech in a high-frequency scene.
For another example, assuming that the result of scene detection on the verification speech in step S806 is the vehicle-mounted scene, the electronic device may count the number of times n that the scene detection result was the vehicle-mounted scene in the last 20 voiceprint recognition processes. If n/20 is greater than 50% (i.e., the fourth threshold), it may be determined that the vehicle-mounted scene is a high-frequency scene, that is, the verification speech is high-quality speech in a high-frequency scene; if n/20 is less than or equal to 50%, the vehicle-mounted scene is not a high-frequency scene, i.e., the verification speech is not high-quality speech in a high-frequency scene.
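The counting rule described in these examples can be sketched as follows; the class name is ours, and the window size and ratio threshold are configurable, mirroring the third/fourth thresholds above:

```python
from collections import deque

class HighFrequencySceneDetector:
    """Track scene tags over the last N voiceprint recognitions and decide
    whether a scene is a high-frequency scene, mirroring the n/N rule above."""

    def __init__(self, window: int = 20, ratio_threshold: float = 0.5):
        self.history = deque(maxlen=window)       # last N scene tags
        self.ratio_threshold = ratio_threshold    # the "fourth threshold"

    def record(self, scene_tag: str) -> None:
        """Call once per voiceprint recognition with the detected scene."""
        self.history.append(scene_tag)

    def is_high_frequency(self, scene_tag: str) -> bool:
        """True if the tag's share of the recent window exceeds the threshold."""
        if not self.history:
            return False
        n = sum(1 for s in self.history if s == scene_tag)
        return n / len(self.history) > self.ratio_threshold

# Example: with window=20 and ratio_threshold=0.5, a scene seen in more than
# 10 of the last 20 recognitions counts as a high-frequency scene.
```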
S815, the electronic device performs data enhancement on the i sample voices in the sample library of the first scene, where the first scene is the scene detected in step S806. The i pieces of speech may be all sample speech of the sample library of the first scene, or may be partial sample speech of the sample library of the first scene. Step S816 is performed.
Specifically, for each of the i sample voices, the electronic device may perform data enhancement to obtain j noise voices of different noise levels, where the noise levels of the j noise voices are all greater than that of the sample voice. Alternatively, for each of the i sample voices, the electronic device may perform data enhancement to obtain k far-field voices of different far-field levels, where the far-field levels of the k far-field voices are all greater than that of the sample voice. Or, for each of the i sample voices, the electronic device may perform data enhancement to obtain j noise voices of different noise levels, and then perform data enhancement on each noise voice to obtain j × k far-field noise voices.
And S816, the electronic equipment performs incremental learning based on the voice data obtained in the step S815 to obtain the submodel in the high-frequency scene.
Specifically, the electronic device may divide the voice data obtained in step S815 into M groups by noise level, where the voice data in the same group have the same noise level, or noise levels within the same range. Then, for each group, the electronic device trains the model of the first scene obtained from the last training with the voice data of that group to obtain the sub-model corresponding to the group, and adds the sub-model to the voiceprint recognition model library. More specifically, for each group, the electronic device may train the sub-model corresponding to that group under the first scene, as obtained from the last training, with the voice data of the group.
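The grouping step can be sketched as follows; the `noise_level` attribute and the commented fine-tuning helper are assumptions of ours for illustration:

```python
from collections import defaultdict

def group_by_noise_level(enhanced_voices) -> dict:
    """Divide the enhanced voices into M groups by noise level (step S816).
    Each item is assumed to carry a `noise_level` attribute assigned during
    data enhancement; this attribute is an assumption for illustration."""
    groups = defaultdict(list)
    for voice in enhanced_voices:
        groups[voice.noise_level].append(voice)
    return dict(groups)

# Usage sketch: fine-tune one sub-model per group from the first-scene model.
# for level, voices in group_by_noise_level(samples).items():
#     submodels[level] = fine_tune(first_scene_model, voices)  # assumed helper
```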
In the above voiceprint recognition process, multi-scene-tag data enhancement is performed on the original registration voice, which solves the mismatch between the registration scene and the verification scene. High-frequency scene determination is added, high-quality verification voice in the high-frequency scene is added to the high-quality voice sample library, data enhancement and incremental learning are performed, and the model of the high-frequency scene is refined, so that the electronic device can perform voiceprint recognition more accurately at the different noise levels or far-field levels of the high-frequency scene. For example, in the vehicle-mounted scene, sub-models corresponding to speeds of 30 km/h, 60 km/h, 90 km/h, 120 km/h, and the like can be matched precisely, instead of one coarse vehicle-mounted scene model. As another example, in a far-field home environment, sub-models corresponding to far-field distances of 3 m, 4 m, 5 m, and the like can be matched precisely, instead of one coarse far-field home speaker model. Therefore, during use, the sub-model of the high-frequency scene can be matched according to the scene detection result, making voiceprint recognition more accurate; and as usage data grows, the models in the voiceprint recognition model library are continuously updated through incremental learning, so voiceprint recognition becomes more and more accurate.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (28)

1. A voiceprint registration method, comprising:
the electronic equipment prompts a user to input registration voice;
the electronic equipment collects the registration voice input by the user;
the electronic device generates a sample voice under far-field conditions based on the enrollment voice;
the electronic device trains a voiceprint recognition model based on the sample speech.
2. The method of claim 1, wherein the electronic device generates sample speech in far-field conditions based on the enrollment speech, comprising:
the electronic device simulates reverberation on sound under far-field conditions;
the electronic device generates sample data of the registered voice under the far-field condition based on a reverberation simulation of a sound under the far-field condition.
3. The method of claim 1, wherein the electronic device generates sample speech in far-field conditions based on the enrollment speech, comprising:
the electronic device generating a noise voice based on the registration voice and noise data;
the electronic device simulates reverberation on sound under far-field conditions;
the electronic device generates sample data of the noise speech under the far-field condition based on a reverberation simulation of sound under the far-field condition.
4. The method of claim 2 or 3, wherein the electronic device simulates reverberation on sound in far-field conditions, comprising:
the electronic device simulates wall reflections of sound based on far-field conditions to obtain a room impact response RIR.
5. The method of any of claims 1-4, wherein the electronic device trains a voiceprint recognition model based on the sample speech, comprising:
the electronic equipment performs feature extraction on the sample voice to obtain feature data;
the electronic device trains a voiceprint recognition model based on the feature data.
6. The method of claim 5, wherein the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scene;
the electronic device trains a voiceprint recognition model based on the feature data, including:
the electronic device trains the one or more sub-models based on the feature data, respectively.
7. The method of claim 5, wherein the voiceprint recognition model comprises a fusion model, wherein the fusion model corresponds to one or more scenes;
the electronic device trains a voiceprint recognition model based on the feature data, including:
the electronic device trains the fusion model based on the feature data.
8. A voiceprint recognition method, the method comprising:
the electronic equipment prompts a user to input verification voice;
the electronic equipment collects the verification voice input by the user;
the electronic device inputs the verification voice into a voiceprint recognition model for matching to obtain a matching result, wherein the voiceprint recognition model is obtained by training based on the method of any one of claims 1-7;
the electronic device determines whether the user is a registrant of the voiceprint recognition model based on the matching result.
9. The method of claim 8, after the electronic device collects the verification voice entered by the user, further comprising:
and the electronic equipment performs scene detection on the verification voice.
10. The method of claim 9, wherein the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scene;
the electronic device matching the verification voice against the voiceprint recognition model comprises:
the electronic device inputs the verification voice into the sub-model corresponding to a first scene for matching, wherein the first scene is the result of the scene detection.
11. The method of any of claims 8 to 10, further comprising:
if the user is the registrant of the voiceprint recognition model, the electronic equipment carries out quality evaluation on the verification voice to obtain a quality evaluation result;
and if the quality evaluation result shows that the verification voice is high-quality voice, the electronic equipment performs incremental learning on the voiceprint recognition model based on the verification voice.
12. The method of claim 11, wherein the electronic device incrementally learns the voiceprint recognition model based on the verification speech, comprising:
the electronic equipment performs data enhancement processing on the verification voice to obtain processed voice data;
the electronic device incrementally learns the voiceprint recognition model based on the processed speech data.
13. The method of claim 12, wherein prior to the electronic device performing data enhancement processing on the verification speech, the method further comprises:
the electronic equipment determines that a first scene where the verification voice is located is a high-frequency scene;
the electronic equipment performs data enhancement processing on the verification voice, and the data enhancement processing comprises the following steps:
the electronic equipment performs data enhancement processing on the verification voice to obtain j sample voices with different noise levels;
the electronic device incrementally learns the voiceprint recognition model based on the processed speech data, including:
the electronic equipment groups the j sample voices according to the noise level to obtain M groups of voice data, wherein M is an integer which is more than 0 and not more than j;
and the electronic equipment trains the submodels corresponding to the first scene respectively based on the M groups of voice data to obtain M high-frequency submodels.
14. A voiceprint registration apparatus, comprising:
the device comprises a first device, a microphone and a processor, wherein the first device is a loudspeaker or a display screen;
the processor is configured to perform:
triggering the first device to prompt a user to enter a registration voice;
collecting the registration voice input by the user through the microphone;
generating a sample voice under far-field conditions based on the registered voice;
training a voiceprint recognition model based on the sample speech.
15. The apparatus as claimed in claim 14, wherein said processor, when generating the sample speech in far-field conditions based on the enrollment speech, is specifically configured to:
simulating reverberation on sound under far-field conditions;
generating sample data of the registered voice under the far-field condition based on the reverberation simulation of the sound under the far-field condition.
16. The apparatus as claimed in claim 14, wherein said processor, when generating the sample speech in far-field conditions based on the enrollment speech, is specifically configured to:
generating a noise voice based on the registration voice and noise data;
simulating reverberation of sound under far-field conditions;
generating sample data of the noise speech under the far-field condition based on a reverberation simulation of sound under the far-field condition.
17. The apparatus of claim 15 or 16, wherein the processor, when simulating reverberation for sound in far-field conditions, is specifically configured to:
the room impulse response (RIR) is obtained by simulating wall reflections of sound based on far-field conditions.
18. The apparatus according to any of the claims 14 to 17, wherein the processor, when training a voiceprint recognition model based on the sample speech, is specifically configured to:
extracting the characteristics of the sample voice to obtain characteristic data;
and training a voiceprint recognition model based on the characteristic data.
19. The apparatus of claim 18, wherein the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scene;
the processor, when training the voiceprint recognition model based on the feature data, is specifically configured to:
training the one or more sub-models based on the feature data, respectively.
20. The apparatus of claim 18, wherein the voiceprint recognition model comprises a fusion model, wherein the fusion model corresponds to one or more scenes;
the processor, when training the voiceprint recognition model based on the feature data, is specifically configured to:
training the fusion model based on the feature data.
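Claims 19 and 20 describe two model topologies over the same feature data: one sub-model per scene, or a single fusion model across scenes. A sketch of the routing difference, with `train` standing in for any trainer:

    from collections import defaultdict

    def build_models(feature_data, train):
        # feature_data: iterable of (scene_label, features) pairs
        by_scene = defaultdict(list)
        for scene, feats in feature_data:
            by_scene[scene].append(feats)
        # claim 19 variant: one sub-model per scene
        sub_models = {scene: train(feats) for scene, feats in by_scene.items()}
        # claim 20 variant: one fusion model over all scenes
        fusion_model = train([feats for _, feats in feature_data])
        return sub_models, fusion_model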
21. A voiceprint recognition apparatus, said apparatus comprising:
a first device, a microphone, and a processor, wherein the first device is a loudspeaker or a display screen;
the processor is configured to perform:
triggering the first device to prompt a user to enter a verification voice;
collecting, through the microphone, the verification voice entered by the user;
inputting the verification voice into a voiceprint recognition model for matching to obtain a matching result, wherein the voiceprint recognition model is obtained through training by the apparatus according to any one of claims 13 to 18;
determining whether the user is a registrant of the voiceprint recognition model based on the matching result.
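The matching and decision steps are not pinned down by the claim; a common realization, assumed here, is cosine similarity between a verification embedding and the enrolled embedding, compared against a tuned threshold:

    import numpy as np

    def is_registrant(verification_emb, enrolled_emb, threshold=0.7):
        # cosine similarity as the matching result; the threshold is illustrative
        score = np.dot(verification_emb, enrolled_emb) / (
            np.linalg.norm(verification_emb) * np.linalg.norm(enrolled_emb))
        return score >= threshold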
22. The apparatus of claim 21, wherein the processor is further configured to:
after collecting, through the microphone, the verification voice entered by the user, perform scene detection on the verification voice.
23. The apparatus of claim 22, wherein the voiceprint recognition model comprises one or more sub-models, wherein one sub-model corresponds to one scene;
the processor, when inputting the verification voice into the voiceprint recognition model for matching, is specifically configured to:
input the verification voice into a sub-model corresponding to a first scene for matching, wherein the first scene is a result of the scene detection.
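To show how a detected first scene selects a sub-model, here is a deliberately crude detector based on frame-energy spread; an actual device would more likely use a trained classifier over acoustic features. Everything below is an assumption for illustration.

    import numpy as np

    def detect_scene(verification_voice, frame=256):
        # assumes the voice is at least one frame long
        frames = verification_voice[: len(verification_voice) // frame * frame].reshape(-1, frame)
        energies = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        # a wide energy spread suggests speech over a quiet floor; flat suggests noise
        spread = energies.max() - np.percentile(energies, 10)
        return "quiet" if spread > 25 else "noisy"

    # routing: sub_models[detect_scene(v)] receives the verification voice for matching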
24. The apparatus of any of claims 21 to 23, wherein the processor is further configured to:
if the user is the registrant of the voiceprint recognition model, performing quality evaluation on the verification voice to obtain a quality evaluation result;
and if the quality evaluation result shows that the verification voice is high-quality voice, performing incremental learning on the voiceprint recognition model based on the verification voice.
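A plausible quality gate before incremental learning, with thresholds that are purely illustrative rather than taken from the claims, could combine the match confidence with a minimum signal level:

    import numpy as np

    def quality_ok(verification_voice, match_score,
                   min_score=0.85, min_rms_db=-35.0):
        # only high-confidence, sufficiently loud utterances feed incremental learning
        rms_db = 10 * np.log10(np.mean(verification_voice ** 2) + 1e-12)
        return match_score >= min_score and rms_db >= min_rms_db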
25. The apparatus as recited in claim 24, wherein said processor, when incrementally learning said voiceprint recognition model based on said verification speech, is specifically configured to:
performing data enhancement processing on the verification voice to obtain processed voice data;
performing incremental learning on the voiceprint recognition model based on the processed speech data.
26. The apparatus of claim 25, wherein the processor is further configured to:
before performing data enhancement processing on the verification voice, determine that a first scene in which the verification voice is located is a high-frequency scene;
the processor, when performing data enhancement processing on the verification voice, is specifically configured to:
perform data enhancement processing on the verification voice to obtain j sample voices with different noise levels;
the processor, when performing incremental learning on the voiceprint recognition model based on the processed speech data, is specifically configured to:
group the j sample voices by noise level to obtain M groups of voice data, wherein M is an integer greater than 0 and not greater than j;
and train, respectively based on the M groups of voice data, the sub-models corresponding to the first scene, to obtain M high-frequency sub-models.
27. A chip, wherein the chip is coupled to a memory in an electronic device, and the chip is configured to perform the method according to any one of claims 1 to 13.
28. A computer storage medium having stored therein computer instructions which, when executed by one or more processors, implement the method of any one of claims 1 to 13.
CN201910673696.7A 2019-07-24 2019-07-24 Voiceprint recognition method and device Pending CN112289325A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910673696.7A CN112289325A (en) 2019-07-24 2019-07-24 Voiceprint recognition method and device
PCT/CN2020/104545 WO2021013255A1 (en) 2019-07-24 2020-07-24 Voiceprint recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910673696.7A CN112289325A (en) 2019-07-24 2019-07-24 Voiceprint recognition method and device

Publications (1)

Publication Number Publication Date
CN112289325A true CN112289325A (en) 2021-01-29

Family

ID=74193322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910673696.7A Pending CN112289325A (en) 2019-07-24 2019-07-24 Voiceprint recognition method and device

Country Status (2)

Country Link
CN (1) CN112289325A (en)
WO (1) WO2021013255A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4245617B2 (en) * 2006-04-06 2009-03-25 株式会社東芝 Feature amount correction apparatus, feature amount correction method, and feature amount correction program

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104952450A (en) * 2015-05-15 2015-09-30 百度在线网络技术(北京)有限公司 Far field identification processing method and device
US20180053512A1 (en) * 2016-08-22 2018-02-22 Intel Corporation Reverberation compensation for far-field speaker recognition
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system
CN107680586A (en) * 2017-08-01 2018-02-09 百度在线网络技术(北京)有限公司 Far field Speech acoustics model training method and system
WO2019129511A1 (en) * 2017-12-26 2019-07-04 Robert Bosch Gmbh Speaker identification with ultra-short speech segments for far and near field voice assistance applications
CN108305633A (en) * 2018-01-16 2018-07-20 平安科技(深圳)有限公司 Speech verification method, apparatus, computer equipment and computer readable storage medium
CN108269567A (en) * 2018-01-23 2018-07-10 北京百度网讯科技有限公司 For generating the method, apparatus of far field voice data, computing device and computer readable storage medium
CN109509473A (en) * 2019-01-28 2019-03-22 维沃移动通信有限公司 Sound control method and terminal device
CN109841218A (en) * 2019-01-31 2019-06-04 北京声智科技有限公司 A kind of voiceprint registration method and device for far field environment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241081A (en) * 2021-04-25 2021-08-10 华南理工大学 Far-field speaker authentication method and system based on gradient inversion layer
CN113241081B (en) * 2021-04-25 2023-06-16 华南理工大学 Far-field speaker authentication method and system based on gradient inversion layer
WO2023207185A1 (en) * 2022-04-29 2023-11-02 荣耀终端有限公司 Voiceprint recognition method, graphical interface, and electronic device
CN117012205A (en) * 2022-04-29 2023-11-07 荣耀终端有限公司 Voiceprint recognition method, graphical interface and electronic equipment
CN115065912A (en) * 2022-06-22 2022-09-16 广州市迪声音响有限公司 Feedback inhibition device for screening sound box energy based on voiceprint screen technology
CN115065912B (en) * 2022-06-22 2023-04-25 广东帝比电子科技有限公司 Feedback inhibition device for screening sound box energy based on voiceprint screen technology
CN116612766A (en) * 2023-07-14 2023-08-18 北京中电慧声科技有限公司 Conference system with voiceprint registration function and voiceprint registration method
CN116612766B (en) * 2023-07-14 2023-11-17 北京中电慧声科技有限公司 Conference system with voiceprint registration function and voiceprint registration method

Also Published As

Publication number Publication date
WO2021013255A1 (en) 2021-01-28

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN112289325A (en) Voiceprint recognition method and device
CN107408386B (en) Electronic device is controlled based on voice direction
EP4113451A1 (en) Map construction method and apparatus, repositioning method and apparatus, storage medium, and electronic device
CN104303177A (en) Instant translation system
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
CN110047468B (en) Speech recognition method, apparatus and storage medium
CN109360549B (en) Data processing method, wearable device and device for data processing
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
CN113099031B (en) Sound recording method and related equipment
CN108877787A (en) Audio recognition method, device, server and storage medium
CN107945806B (en) User identification method and device based on sound characteristics
CN111863020A (en) Voice signal processing method, device, equipment and storage medium
CN113393856B (en) Pickup method and device and electronic equipment
CN114067776A (en) Electronic device and audio noise reduction method and medium thereof
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
CN113220590A (en) Automatic testing method, device, equipment and medium for voice interaction application
US20230197084A1 (en) Apparatus and method for classifying speakers by using acoustic sensor
CN113920979A (en) Voice data acquisition method, device, equipment and computer readable storage medium
WO2022233239A1 (en) Upgrading method and apparatus, and electronic device
CN109102810B (en) Voiceprint recognition method and device
CN111091807A (en) Speech synthesis method, speech synthesis device, computer equipment and storage medium
CN111696566A (en) Voice processing method, apparatus and medium
CN115035886B (en) Voiceprint recognition method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination