WO2022007846A1 - Speech enhancement method, device, system and storage medium - Google Patents

Speech enhancement method, device, system and storage medium

Info

Publication number
WO2022007846A1
Application PCT/CN2021/105003 (CN2021105003W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
verified
registered
speech
scene
Prior art date
Application number
PCT/CN2021/105003
Other languages
English (en)
Chinese (zh)
Inventor
胡伟湘
黄劲文
曾夕娟
芦宇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022007846A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L 2013/021 - Overlap-add techniques

Definitions

  • the present application relates to the technical field of biometrics, and in particular, to a speech enhancement method, device, system, and computer-readable storage medium.
  • biometric authentication technology based on biometric identification has gradually been popularized and applied in the fields of family life and public safety.
  • biometric features that can be applied to biometric authentication include the fingerprint, face, iris, DNA, voiceprint, etc.
  • among them, voiceprint recognition technology, also known as speaker recognition technology, collects sound samples in a non-contact manner, and the collection method is unobtrusive, so it is more easily accepted by users.
  • Some embodiments of the present application provide a speech enhancement method, a terminal device, a speech enhancement system, and a computer-readable storage medium.
  • the present application is described below from several aspects; for the embodiments and beneficial effects of the following aspects, reference may be made to one another.
  • in a first aspect, an embodiment of the present application provides a voice enhancement method, applied to an electronic device, including: collecting a voice to be verified; determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified; enhancing a registered voice based on the environmental noise and/or the environmental characteristic parameters; and comparing the voice to be verified with the enhanced registered voice to determine whether the voice to be verified and the registered voice are from the same user.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • in this way, the main difference between the two lies in their effective speech components, so after the two are compared through the voiceprint recognition algorithm, a more accurate recognition result can be obtained.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the registration speech is the speech from the registration speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which can improve the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is sound picked up by a secondary microphone of the electronic device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the voice to be verified is less than the duration of the registered voice. In this way, the user only needs to input a short voice to be verified, which improves the user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • in this implementation, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which is beneficial to improving the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined by applying a scene recognition algorithm to the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
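  • As an illustration (not part of the application), scene recognition with a GMM can be sketched as follows: one Gaussian mixture model per scene type is trained on that scene's acoustic features, and the voice to be verified is assigned to the scene whose model scores it highest; the predicted scene type would then select the template noise to superimpose. The feature shapes, component count, and use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_scene_models(features_by_scene, n_components=8):
    """Train one GMM per scene type on (n_frames, n_features) matrices."""
    models = {}
    for scene, feats in features_by_scene.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        models[scene] = gmm
    return models

def classify_scene(models, feats):
    """Pick the scene whose GMM yields the highest average log-likelihood."""
    return max(models, key=lambda scene: models[scene].score(feats))
```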
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the electronic device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device.
  • the far-field simulation adjusts the acquisition distance of the registered voice (the distance between the voice acquisition device of the registered voice and the user who produced the registered voice) to simulate the acquisition distance of the voice to be verified (the distance between the voice acquisition device of the voice to be verified and the user who produced the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which is beneficial to improving the recognition accuracy.
  • performing the far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device includes: establishing, based on the image source (mirror source) model and according to this distance, the impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation of the registered voice.
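  • As a hedged sketch of this far-field simulation, the pyroomacoustics library (an assumed tooling choice, not named in the application) implements the image source method; the room dimensions, the user position, and the device position below are placeholder values standing in for estimates derived from the measured distance.

```python
import numpy as np
from scipy.signal import fftconvolve
import pyroomacoustics as pra  # implements the image source method

fs = 16000
room = pra.ShoeBox([5.0, 4.0, 3.0], fs=fs, max_order=10)  # assumed room size
room.add_source([1.0, 1.0, 1.5])  # assumed position of the user
mics = pra.MicrophoneArray(np.array([[4.0], [3.0], [1.5]]), fs)
room.add_microphone_array(mics)   # assumed position of the electronic device
room.compute_rir()
rir = room.rir[0][0]              # impulse response: mic 0, source 0

def far_field_simulate(registered: np.ndarray) -> np.ndarray:
    """Convolve the registered voice with the room impulse response."""
    return fftconvolve(registered, rir)[: len(registered)]
```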
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which is beneficial to improving the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices is multiple; and, based on environmental noise and/or environmental characteristic parameters, the multiple registered voices are respectively enhanced to obtain multiple enhanced registered voices.
  • in this way, a plurality of enhanced registered voices are obtained, and the voice to be verified can be matched against each of the plurality of enhanced registered voices to obtain a plurality of similarity matching results, from which a final recognition result can be determined.
  • in this way, the error of any single matching result is averaged out, which is beneficial to improving the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user includes: extracting the characteristic parameters of the voice to be verified and the characteristic parameters of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on the characteristic parameters of the enhanced registered voice through a parameter recognition model, so as to obtain the voice template of the speaker to be verified and the voice template of the registered speaker respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
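  • To make this pipeline concrete, here is a toy sketch (not the application's implementation): the mean MFCC vector stands in for the voice template that an i-vector, TDNN, or ResNet model would produce, cosine similarity serves as the template matching step, and the 0.8 threshold is an assumed value.

```python
import numpy as np
import librosa

def voice_template(y: np.ndarray, sr: int) -> np.ndarray:
    """Toy voice template: the mean 20-dimensional MFCC vector."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Template matching by cosine similarity."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_user(verify, enhanced_registered, sr, threshold=0.8):
    """Decide whether the two voices come from the same user."""
    return cosine_score(voice_template(verify, sr),
                        voice_template(enhanced_registered, sr)) > threshold
```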
  • in a second aspect, an embodiment of the present application provides a voice enhancement method, including: a terminal device collects the voice to be verified and sends it to a server communicatively connected to the terminal device; the server determines the environmental noise and/or environmental characteristic parameters contained in the voice to be verified; the server enhances the registered voice based on the environmental noise and/or the environmental characteristic parameters; the server compares the voice to be verified with the enhanced registered voice and determines whether the voice to be verified and the registered voice are from the same user; and the server sends the determination result to the terminal device.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • in this way, the main difference between the two lies in their effective speech components, so comparing them through the voiceprint recognition algorithm yields a more accurate recognition result.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the speaker recognition algorithm is implemented on the server, which can save local computing resources of the terminal device.
  • the registration speech is the speech from the registration speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which can improve the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is the sound picked up by the secondary microphone of the terminal device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the voice to be verified is less than the duration of the registered voice. In this way, the user only needs to input a short voice to be verified, which improves the user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • in this implementation, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which is beneficial to improving the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined by applying a scene recognition algorithm to the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device.
  • the far-field simulation adjusts the acquisition distance of the registered voice (the distance between the voice acquisition device of the registered voice and the user who produced the registered voice) to simulate the acquisition distance of the voice to be verified (the distance between the voice acquisition device of the voice to be verified and the user who produced the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which is beneficial to improving the recognition accuracy.
  • performing the far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source (mirror source) model and according to this distance, the impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation of the registered voice.
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which is beneficial to improving the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices is multiple; and, based on environmental noise and/or environmental characteristic parameters, the server enhances the multiple registered voices respectively, so as to obtain multiple enhanced registered voices.
  • in this way, a plurality of enhanced registered voices are obtained, and the voice to be verified can be matched against each of the plurality of enhanced registered voices to obtain a plurality of similarity matching results, from which a final recognition result can be determined.
  • in this way, the error of any single matching result is averaged out, which is beneficial to improving the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user includes: extracting the characteristic parameters of the voice to be verified and the characteristic parameters of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on the characteristic parameters of the enhanced registered voice through a parameter recognition model, so as to obtain the voice template of the speaker to be verified and the voice template of the registered speaker respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
  • embodiments of the present application provide an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor which, when executing the instructions in the memory, causes the electronic device to execute the voice enhancement method provided by any embodiment of the first aspect of the present application.
  • an embodiment of the present application provides a speech enhancement system, including a terminal device and a server communicatively connected to the terminal device, wherein,
  • the terminal device collects the voice to be verified, and sends the voice to be verified to the server;
  • the server is used to determine the environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance the registered voice based on the environmental noise and/or the environmental characteristic parameters, compare the voice to be verified with the enhanced registered voice, and determine whether the voice to be verified and the registered voice come from the same user;
  • the server is also used to send the determination result of determining that the voice to be verified and the registered voice come from the same user to the terminal device.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • in this way, the main difference between the two lies in their effective speech components, so comparing them through the voiceprint recognition algorithm yields a more accurate recognition result.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the speaker recognition algorithm is implemented on the server, which can save local computing resources of the terminal device.
  • the registration speech is the speech from the registration speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which can improve the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is the sound picked up by the secondary microphone of the terminal device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the voice to be verified is less than the duration of the registered voice. In this way, the user only needs to input a short voice to be verified, which improves the user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • in this implementation, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which is beneficial to improving the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined by applying a scene recognition algorithm to the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device.
  • the far-field simulation adjusts the acquisition distance of the registered voice (the distance between the voice acquisition device of the registered voice and the user who produced the registered voice) to simulate the acquisition distance of the voice to be verified (the distance between the voice acquisition device of the voice to be verified and the user who produced the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which is beneficial to improving the recognition accuracy.
  • performing the far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source (mirror source) model and according to this distance, the impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation of the registered voice.
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which is beneficial to improving the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices is multiple; and, based on environmental noise and/or environmental characteristic parameters, the server enhances the multiple registered voices respectively, so as to obtain multiple enhanced registered voices.
  • in this way, a plurality of enhanced registered voices are obtained, and the voice to be verified can be matched against each of the plurality of enhanced registered voices to obtain a plurality of similarity matching results, from which a final recognition result can be determined.
  • in this way, the error of any single matching result is averaged out, which is beneficial to improving the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user includes: extracting the characteristic parameters of the voice to be verified and the characteristic parameters of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on the characteristic parameters of the enhanced registered voice through a parameter recognition model, so as to obtain the voice template of the speaker to be verified and the voice template of the registered speaker respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
  • in a fifth aspect, an embodiment of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are executed on a computer, the computer is caused to execute the method provided by any embodiment of the first aspect of the present application, or to execute the method provided by any embodiment of the second aspect of the present application.
  • for the beneficial effects that can be achieved in the fifth aspect, reference may be made to the beneficial effects of the foregoing aspects.
  • FIG. 1a shows an exemplary application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 1b shows another exemplary application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 2 shows a schematic structural diagram of a speech enhancement device provided by an embodiment of the present application
  • FIG. 3 shows a flowchart of a speech enhancement method provided by an embodiment of the present application
  • FIG. 4 shows a flowchart of a speech enhancement method provided by another embodiment of the present application.
  • FIG. 5 shows an application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 6 shows a structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 shows a block diagram of a system-on-chip (SoC) provided by an embodiment of the present application.
  • Speaker recognition technology, also known as voiceprint recognition technology, uses the uniqueness of a speaker's voiceprint to identify the speaker's identity. Because each person's vocal organs (for example, the tongue, teeth, larynx, lungs, nasal cavity, and vocal tract) are innately different, and vocalization habits and the like differ through acquired experience, each person's voiceprint features are unique. By analyzing the voiceprint features, the identity of the speaker can be identified.
  • the specific process of speaker identification is to collect the voice of the speaker whose identity is to be confirmed and compare it with the voice of a specific speaker, so as to confirm whether the speaker whose identity is to be confirmed is that specific speaker.
  • the voice of the speaker whose identity is to be confirmed is called “voice to be verified”
  • the speaker whose identity is to be confirmed is called “speaker to be verified”
  • the voice of a specific speaker is called “registered voice”
  • the specific speaker is called the "registered speaker".
  • the above process is described by taking the voiceprint unlocking function of the mobile phone (ie, unlocking the screen of the mobile phone by means of voiceprint recognition) as an example.
  • the mobile phone owner records his own voice (the voice is the registered voice) in the mobile phone through the microphone on the mobile phone.
  • when unlocking, the current user of the mobile phone enters real-time voice (this voice is the voice to be verified) through the mobile phone microphone, and the mobile phone uses the built-in voiceprint recognition program to compare the voice to be verified with the registered voice, so as to determine whether the current user of the mobile phone is the owner of the mobile phone.
  • if the voice to be verified matches the registered voice, it is determined that the current user of the mobile phone is the owner, the current user has passed identity authentication, and the mobile phone completes the subsequent screen unlocking action; if the voice to be verified does not match the registered voice, it is determined that the current user is not the owner, the current user has not passed identity authentication, and the mobile phone can refuse the subsequent screen unlocking action.
  • voiceprint recognition technology can be applied to the field of family life, for voice control of smart phones, smart cars, and smart homes (e.g., smart audio and video equipment, smart lighting systems, smart door locks); it can be applied to the payment field, where voiceprint authentication is combined with other authentication methods (such as passwords and dynamic verification codes) to perform double or multiple authentication of the user's identity and improve payment security; it can be applied to the information security field, where voiceprint authentication serves as a way to log in to an account; and it can be applied to the judicial field, where the voiceprint is used as auxiliary evidence of identity.
  • the main device for voiceprint recognition can be an electronic device other than a mobile phone, such as a mobile device, including wearable devices (e.g., wristbands and earphones) and vehicle-mounted terminals, or a fixed device, including smart home devices and network servers.
  • the voiceprint recognition algorithm can be implemented in the cloud in addition to the terminal. For example, after the mobile phone collects the voice to be verified, the collected voice to be verified can be sent to the cloud, and the voice to be verified is recognized by the voiceprint recognition algorithm in the cloud. After the recognition is completed, the cloud returns the recognition result to the mobile phone. Through the cloud recognition mode, users can share the computing resources in the cloud to save the local computing resources of the mobile phone.
  • when the voice of the speaker to be verified is collected, if there is noisy human-voice interference in the surrounding environment, this noise is picked up by the microphone together with the speech and becomes part of the voice to be verified.
  • as a result, the voice to be verified contains not only the voice of the speaker to be verified but also noise components, which reduces the voiceprint recognition rate.
  • This embodiment does not limit the scene of the voiceprint recognition, for example, it may also be a home scene, a car scene, a meeting place scene, a cinema scene, and the like.
  • when the owner of the mobile phone needs to unlock the phone through voiceprint recognition, if there is noise in the surrounding environment, the sound collected by the mobile phone microphone contains not only the owner's voice but also the noise in the environment. After this real-time voice is compared with the registered voice preset in the mobile phone by the owner, the result may be that the two do not match. Even if the current user of the mobile phone is the owner, the mobile phone may still report that the user identity authentication fails, which affects the user experience.
  • some technical solutions remove noise components in the voice to be verified by performing denoising processing on the voice to be verified, so as to improve the recognition rate of the voiceprint.
  • however, the voice to be verified after denoising still contains some noise components, and some valid voice components (the voice components of the speaker to be verified) are also removed. As a result, the denoised voice to be verified may still fail to be recognized correctly, and the voiceprint recognition rate is not significantly improved.
  • in other technical solutions, the user records registered voices in multiple different scenarios (for example, a home scenario, a cinema scenario, an outdoor noisy scenario, etc.), and when performing voiceprint recognition, the voice to be verified is compared with the registered voice recorded in the corresponding scenario, in order to improve the voiceprint recognition rate.
  • however, in such solutions the user needs to record registered voices separately in multiple different scenarios, and the user experience is poor.
  • the embodiments of the present application provide a voice enhancement method, which is used to improve the voiceprint recognition rate and the robustness of the voiceprint recognition method, and improve user experience.
  • in the voice enhancement method, a noise component corresponding to the noise component in the voice to be verified is superimposed on the registered voice, and the registered voice with the superimposed noise component is then compared with the voice to be verified to obtain the recognition result.
  • that is, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components; in this way, the main difference between the voice to be verified and the enhanced registered voice is the difference between their effective speech components.
  • after comparing the two through the voiceprint recognition algorithm, a more accurate recognition result can be obtained.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the "valid speech component” is the speech component from the speaker, for example, the valid speech component in the speech to be verified is the speech component of the speaker to be verified, and the valid speech component in the enhanced registered speech is the speech component of the registered speaker .
  • FIG. 2 shows the structure of the mobile phone 100 .
  • the mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, an antenna, a communication module 150, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a camera 193, a display screen 194, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • the mobile phone 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units; for example, the processor 110 may include an application processor (AP), a modem processor, a controller, a digital signal processor (DSP), a baseband processor, etc. The different processing units may be independent devices or may be integrated in one or more processors.
  • the processor can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the interface may include an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, and/or a general-purpose input/output (general-purpose input/output, GPIO) interface, etc.
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the PCM interface can also be used for audio communication, to sample, quantize, and encode analog signals.
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the audio module 170, and the like.
  • the GPIO interface can also be configured as an I2S interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • in other embodiments, the mobile phone 100 may also adopt an interface connection manner different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the wireless communication function of the mobile phone 100 may be implemented by an antenna, a communication module 150, a modem processor, a baseband processor, and the like.
  • Antennas are used to transmit and receive electromagnetic wave signals.
  • each antenna in the mobile phone 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
  • the antennas can be multiplexed into the diversity antennas of the wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the mobile phone 100 .
  • the communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like.
  • the communication module 150 can receive electromagnetic waves through the antenna, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna.
  • at least part of the functional modules of the communication module 150 may be provided in the processor 110 .
  • at least some of the functional modules of the communication module 150 may be provided in the same device as at least some of the modules of the processor 110 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modulation and demodulation processor may be independent of the processor 110, and may be provided in the same device as the communication module 150 or other functional modules.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100 .
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, to save files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), a voiceprint recognition program, a voice signal front-end processing program, and the like.
  • the storage data area can store data (such as audio data, phone book, etc.) created during the use of the mobile phone 100, and data required for voiceprint recognition, such as audio data of registered voice, trained voice parameter recognition model, etc.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the mobile phone 100 by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the mobile phone 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be answered by placing the receiver 170B close to the human ear.
  • the microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the user can input a sound signal into the microphone 170C by speaking close to it.
  • the mobile phone 100 may be provided with at least one microphone 170C.
  • the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals.
  • in this embodiment, the mobile phone 100 has two microphones, one at the top and one at the bottom: one microphone 170C is provided on the bottom side of the mobile phone 100, and the other microphone 170C is provided on the top side of the mobile phone 100.
  • when the user speaks, the mouth is usually close to the microphone 170C on the bottom side, so the user's voice generates a larger audio signal Va in that microphone, which is referred to herein as the "main mic"; the top-side microphone, which picks up a smaller audio signal Vb, is referred to herein as the "sub-mic".
  • the noise source is usually at essentially the same distance from the main mic and from the sub-mic, that is, the intensity of the noise picked up by the main mic and the sub-mic can be considered essentially the same.
  • therefore, the noise signal and the user's speech signal can be separated by using the signal strength difference caused by the different mic positions. For example, after taking the difference between the audio signal picked up by the main mic and the audio signal picked up by the sub-mic (that is, subtracting the sub-mic signal from the main mic signal), the user's voice signal can be obtained (this is the principle of dual-mic active noise cancellation). Furthermore, after removing the user's voice signal from the main mic signal, the noise signal can be separated. Alternatively, since the audio signal Vb on the sub-mic is significantly smaller than the audio signal Va on the main mic, the signal picked up by the sub-mic can itself be considered a noise signal.
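  • A minimal sketch of this dual-mic separation, assuming the two mic signals are time-aligned, equal-gain numpy arrays (real devices would need alignment and calibration first):

```python
import numpy as np

def separate_dual_mic(main_sig: np.ndarray, sub_sig: np.ndarray):
    """Noise reaches both mics with roughly equal intensity, while the
    user's voice dominates the main mic, so subtraction isolates the voice."""
    voice = main_sig - sub_sig   # the common noise component cancels
    noise = main_sig - voice     # what remains, approximately the sub-mic signal
    return voice, noise
```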
  • a setting method of the dual mics of the mobile phone 100 is given above, but this is only an exemplary description; the microphones may be arranged in other ways, for example, with the main mic on the front of the mobile phone 100 and the sub-mic on the back.
  • the mobile phone 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone jack 170D is used to connect wired earphones.
  • the earphone interface 170D may be a universal serial bus (USB) interface, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
  • this embodiment provides a voice enhancement method: after the voice to be verified is collected, the noise contained in the voice to be verified is separated from it, and the separated noise is then superimposed on the registered voice. In this way, the voice to be verified and the noise-superimposed registered voice have similar noise components, and the main difference between the two is the difference between their effective voice components, which can improve the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the speech enhancement method provided by this embodiment includes the following steps:
  • S110 Collect registered voice.
  • the mobile phone 100 has a voiceprint unlocking application (which may be a system application or a third-party application).
  • when the owner of the mobile phone 100 registers a user account for the voiceprint unlocking application, the owner's own voice is collected through the mobile phone 100, and the voiceprint unlocking application uses this voice as the reference voice for subsequent voiceprint recognition. This voice is the registered voice.
  • the present application is not limited to this.
  • for example, in other embodiments, the owner of the mobile phone 100 enters the registered voice through the setup wizard of the mobile phone 100, and the voiceprint unlocking application of the mobile phone 100 uses this voice as the reference voice for voiceprint recognition.
  • the registered voice is the voice recorded by the owner of the mobile phone 100 in a quiet environment, so that there is no obvious noise component in the registered voice.
  • for example, if the signal-to-noise ratio (i.e., the ratio of the owner's voice signal strength to the noise signal strength) in the recording environment is higher than a set value (e.g., 30 dB), the recording environment is considered quiet.
  • alternatively, if the intensity of the noise signal in the recording environment of the registered voice is lower than a set value (e.g., 20 dB), the recording environment is considered a quiet environment.
  • in this embodiment, the registered voice of the owner is collected through the microphone of the mobile phone 100.
  • in this embodiment, the registered voice is near-field voice.
  • for example, when recording, the distance between the owner's mouth and the main mic of the mobile phone 100 is kept within 30 cm to 1 m (e.g., within 30 cm if the owner holds the mobile phone 100 and speaks into the main mic), which avoids attenuation of the owner's voice caused by a long propagation distance.
  • in this embodiment, when recording the registered voice, the owner enters 6 voice segments to form 6 registered voices. Entering multiple voice segments helps to improve the flexibility of speech recognition and the richness of the voiceprint information.
  • the length of each registered voice is 10-30s. Further, each registered voice corresponds to different text content, so as to enrich the voiceprint information contained in the registered voice.
  • after collecting the registered voice, the mobile phone 100 stores the audio signal of the registered voice in the internal memory. However, the present application is not limited to this: the mobile phone 100 may also upload the audio signal of the registered voice to the cloud, so as to perform voiceprint recognition in the cloud recognition mode.
  • the above recording method, recording length, and quantity of the registered voice are only exemplary descriptions, and the present application is not limited thereto.
  • the registered voice may be recorded by other recording devices (eg, voice recorder, dedicated microphone, etc.), the number of registered voices may be one, and the length of the registered voice may be greater than 30s.
  • for the convenience of description, step S110 is mentioned first. It can be understood that step S110, as the data preparation process of the voice enhancement method, is relatively independent of a single voice enhancement pass and does not need to be performed together with the other steps of the voice enhancement method every time.
  • S120 Collect the voice to be verified, and the voice to be verified is the voice recorded by the current user of the mobile phone in a noisy human voice scene.
  • the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario.
  • the current user of the mobile phone is the person who currently operates the mobile phone 100 , which may be the owner himself or someone other than the owner himself.
  • the voice to be verified is collected through the microphone of the mobile phone 100 .
  • the microphone of the mobile phone 100 is turned on.
  • the current user of the mobile phone 100 can input the voice to be verified through the microphone of the mobile phone 100 to unlock the mobile phone through voiceprint recognition.
  • voiceprint unlocking is convenient, for example, when the user needs to operate the mobile phone 100 from a distance (e.g., to open an application in the mobile phone, such as a music application or a phone application), or when the user needs to operate the mobile phone while both hands are occupied (e.g., when doing housework).
  • the to-be-verified voice is a voice with specific content.
  • the voice to be verified may also be voice of any text content.
  • the length of the voice to be verified is 10-30 s, so that the voice to be verified can contain relatively rich voiceprint information, which is beneficial to improve the voiceprint recognition rate.
  • this application does not limit this.
  • in other embodiments, the length of the voice to be verified may be less than 10 s, that is, less than the length of the registered voice. In this case, the user only needs to enter a shorter voice to be verified, which is beneficial to the user experience.
  • when the length of the voice to be verified is less than the length of the registered voice, some voice fragments can be intercepted from the voice to be verified and spliced with the originally collected voice to be verified, so that the spliced voice has substantially the same length as the registered voice. In this way, in the subsequent steps of this embodiment (described in detail below), the feature parameters extracted from the registered voice and those extracted from the voice to be verified have the same dimension, which is convenient for comparing their similarity. In the description herein, no distinction is made between the originally collected voice to be verified and the spliced voice to be verified; both are referred to as the voice to be verified.
  • the meaning of splicing the A voice and the B voice is to connect the A voice and the B voice end to end, so that the length of the spliced voice is the sum of the lengths of the A voice and the B voice.
  • the present application does not limit the connection order of the A voice and the B voice.
  • the A voice may be connected after the B voice, or the A voice may be connected before the B voice.
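  • A minimal sketch of this splicing step, assuming the voices are 1-D numpy sample arrays at the same sample rate: copies of the short voice to be verified are connected end to end and then truncated to the registered-voice length.

```python
import numpy as np

def splice_to_length(voice: np.ndarray, target_len: int) -> np.ndarray:
    """Splice copies of a short utterance end to end, then truncate,
    so its length matches the registered voice."""
    reps = int(np.ceil(target_len / len(voice)))
    return np.tile(voice, reps)[:target_len]
```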
  • the noise contained in the voice to be verified is the sound generated by sound sources other than the current user of the mobile phone 100 in the recognition scene, for example: the sound of household equipment (for example, a vacuum cleaner) in a home scene; the sound of the car radio and of the engine in a car scene; the sound projected in a cinema and the voices of other audience members in a cinema scene; and so on.
  • In some embodiments, the sound picked up by the microphone of the mobile phone 100 is directly determined as the noise contained in the voice to be verified, so that the noise contained in the voice to be verified can be determined easily.
  • the present application is not limited to this.
  • For example, in some embodiments, the initial segment of the speech to be verified is assumed to contain only noise components, so that after the initial segment is copied multiple times, the result is determined as the noise contained in the speech to be verified.
  • For another example, in other embodiments, the speech to be verified is divided into multiple speech frames, and the energy in each speech frame is calculated. Since the energy in noise is generally smaller than that in valid speech, a speech frame whose energy is smaller than a predetermined value can be determined as a noise frame, thereby simplifying the noise extraction process.
  • other methods in the prior art may also be used to determine the noise in the speech to be verified, which will not be described in detail.
  • The energy of a speech frame is the sum of the squares of the signal values of the speech signals included in the frame: if the signal value of the i-th speech signal in the frame is x_i and the number of speech signals in the frame is N, the frame energy is E = Σ_{i=1}^{N} x_i².
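  • A minimal sketch of this energy-based noise-frame detection (the frame length and threshold are illustrative assumptions, not values from the original):

```python
import numpy as np

def find_noise_frames(signal: np.ndarray, frame_len: int = 320, threshold: float = 1e-3):
    """Split `signal` into frames of `frame_len` samples (20 ms at 16 kHz)
    and flag frames whose energy falls below `threshold` as noise frames."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sum(frames ** 2, axis=1)   # E = sum of squared sample values per frame
    noise_mask = energies < threshold        # low-energy frames are treated as noise
    return frames[noise_mask], energies
```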
  • S140 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • the present application is not limited to this, and in other embodiments, the superposition of the registration speech signal and the noise signal may also be completed in the frequency domain.
  • the embodiment of the present application realizes the enhancement of the registered voice signal by simply superimposing the numerical value of the voice signal, and the algorithm is simple.
  • the length of the noise is equal to the length of the registered voice. In other embodiments, the length of the noise may be smaller than the length of the registered voice.
  • the number of registered voices is 6. Therefore, noises contained in the voices to be verified are respectively superimposed on the 6 registered voices to obtain 6 enhanced registered voices.
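  • As a sketch of step S140 (assuming time-domain superposition; tiling a shorter noise is an illustrative choice, not mandated by the original), the noise samples are simply added to the registered-voice samples:

```python
import numpy as np

def enhance_registered_voice(registered: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Superimpose `noise` on `registered` by adding signal values sample by sample."""
    if len(noise) < len(registered):
        reps = int(np.ceil(len(registered) / len(noise)))
        noise = np.tile(noise, reps)   # repeat the noise when it is shorter than the registered voice
    return registered + noise[: len(registered)]

# With 6 registered voices, 6 enhanced registered voices are obtained:
# enhanced = [enhance_registered_voice(r, noise) for r in registered_voices]
```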
  • S150 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice. Since the MFCC method can better conform to the auditory perception characteristics of the human ear, in this embodiment, the feature parameters in the speech signal are extracted by the Mel-Frequency Cepstrum Coefficient (MFCC) method.
  • Specifically, the audio signal S_T representing the speech to be verified is first divided into a series of speech frames x(n), where n is the index of the speech frame.
  • In some embodiments, the length of each speech frame is 10-30 ms; for example, with a frame length of 20 ms, an audio signal S_T of length 10 s is divided into 500 speech frames.
  • the MFCC feature extraction method includes the steps of Fourier transform, Mel filtering, discrete cosine transform, etc. on the speech frame x(n).
  • the order of the discrete cosine transform is 20. Therefore, the MFCC feature parameter of each speech frame x(n) has 20 dimensions.
  • the extraction process can be adjusted as required. For example, differential calculation may be performed on the MFCC feature parameters extracted above. For example, after taking the first-order difference and the second-order difference of the MFCC feature parameters extracted above, for each speech frame, a set of 60-dimensional MFCC feature parameters is obtained.
  • other parameters of the extraction process such as the length and number of speech frames, the order of discrete cosine transform, etc., can also be adjusted according to the computing capability of the device and the requirements of recognition accuracy.
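  • A minimal sketch of this MFCC-plus-deltas extraction, assuming the librosa library (the sampling rate, FFT size, and hop length are illustrative assumptions; the 20 MFCC orders and 60-dimensional output mirror the values above):

```python
import librosa
import numpy as np

def extract_mfcc(path: str) -> np.ndarray:
    """Return per-frame 60-dimensional features: 20 MFCCs plus first- and second-order deltas."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=512, hop_length=320)   # ~20 ms hop at 16 kHz
    delta1 = librosa.feature.delta(mfcc, order=1)            # first-order difference
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order difference
    return np.vstack([mfcc, delta1, delta2]).T               # shape: (n_frames, 60)
```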
  • the feature parameters in the speech signal can also be extracted by other methods, for example, the log mel method, the Linear Predictive Cepstrum Coefficient (LPCC) method, and the like.
  • the identification model for parameter identification is not limited in this application, and can be a probability model, such as an identity vector (I-vector) model; or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model, ResNet model, etc.
  • the 10,000-dimensional feature parameters of the speech to be verified are input into the recognition model, and the speech template of the current user of the mobile phone 100 is obtained after the dimensionality reduction and abstraction of the recognition model.
  • the speech template of the current user of the mobile phone 100 is a 512-dimensional feature vector, denoted as A.
  • Similarly, the feature parameters of the 6 enhanced registered voices are input into the recognition model to obtain 6 owner voice templates; each voice template is a 512-dimensional feature vector, and the 6 owner voice templates are denoted B1, B2, ..., B6.
  • the template matching method may be a cosine distance method, a linear discriminant method, or a probabilistic linear discriminant analysis method, or the like.
  • the cosine distance method is used as an example for description below.
  • The cosine distance method evaluates the similarity of two feature vectors by computing the cosine of the angle between them. Taking the feature vector A (the feature vector corresponding to the voice template of the current user of the mobile phone 100) and the feature vector B1 (the feature vector corresponding to one owner voice template) as an example, the cosine similarity can be expressed as:

cos θ1 = ( Σ_{i=1}^{n} a_i · b_i ) / ( √(Σ_{i=1}^{n} a_i²) · √(Σ_{i=1}^{n} b_i²) )

  • where a_i is the i-th coordinate of the feature vector A, b_i is the i-th coordinate of the feature vector B1, and θ1 is the angle between the feature vectors A and B1.
  • The larger the value of cos θ1, the closer the directions of the feature vectors A and B1 and the higher their similarity; the smaller the value of cos θ1, the lower the similarity between the two feature vectors.
  • If the similarity P between the current user's voice and the owner's voice is greater than a set value (for example, 0.8), it is determined that the current user of the mobile phone 100 is the owner himself, and the mobile phone 100 unlocks the screen; otherwise, it is determined that the current user of the mobile phone 100 is not the owner himself, and the mobile phone 100 does not unlock the screen.
  • Specifically, the voice to be verified is compared with each of the six enhanced registered voices to obtain six cosine similarity results, which are then averaged to obtain the final similarity P between the current user's voice and the owner's voice: P = (cos θ1 + cos θ2 + ... + cos θ6)/6.
  • In this way, the matching error between the voice to be verified and any single enhanced registered voice is averaged out, which is beneficial to the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm. A sketch of this matching step follows.
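  • A compact sketch of the template-matching decision (the threshold 0.8 mirrors the set value above; the function names are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) between two 512-dimensional voice templates."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_owner(user_template: np.ndarray, owner_templates: list, threshold: float = 0.8) -> bool:
    """Average the cosine similarities against all owner templates and compare to the set value."""
    p = np.mean([cosine_similarity(user_template, b) for b in owner_templates])
    return p > threshold   # True -> unlock the screen
```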
  • The voiceprint recognition algorithm (the algorithms corresponding to steps S130 to S170) can be implemented on the mobile phone 100 to realize offline recognition of the voiceprint, or implemented in the cloud to save the local computing resources of the mobile phone 100.
  • the voiceprint recognition algorithm is implemented in the cloud
  • the mobile phone 100 uploads the to-be-verified voice collected in step S120 to the cloud server, and the cloud server uses the voiceprint recognition algorithm to authenticate the identity of the current user of the mobile phone 100, and returns the authentication result.
  • the mobile phone 100 decides whether to unlock the screen according to the authentication result.
  • a reverberation component is added to the registration speech to obtain an enhanced registration speech.
  • the voice of the speaker to be verified will generate reverberation in the room, and the reverberation, as a part of the interference factor, will have a certain impact on the recognition rate of the voiceprint.
  • Specifically, reverberation prediction is performed on the registered voice based on the recognition scene; that is, the reverberation that the registered voice would generate in the recognition scene is simulated, and the simulated reverberation components are added to the registered voice.
  • This makes the non-speech components of the voice to be verified and the non-speech components of the enhanced registered voice as close as possible, thereby improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the reverberation generated by the registered speech in the recognition scene is estimated.
  • the image source model method can simulate the reflection path of the sound wave in the room, and calculate the room impulse response function (RIR) of the sound field according to the delay and attenuation parameters of the sound wave.
  • the reverberation generated by the registered speech in the room is obtained by convolving the audio signal of the registered speech with the impulse response function.
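  • As an illustration of this convolution step (the exponentially decaying impulse response below is a toy stand-in for a full image-source simulation; all names and constants are assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve

def toy_rir(fs: int = 16000, rt60: float = 0.4) -> np.ndarray:
    """Crude room impulse response: a direct path followed by exponentially decaying reflections."""
    n = int(fs * rt60)
    rir = np.random.randn(n) * np.exp(-6.9 * np.arange(n) / n)  # ~60 dB decay over rt60
    rir[0] = 1.0                                                # direct path
    return rir / np.max(np.abs(rir))

def add_reverberation(registered: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve the registered voice with the (simulated) impulse response of the scene."""
    return fftconvolve(registered, rir)[: len(registered)]
```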
  • In the recognition scene, the distance between the speaker to be verified and the microphone may be large (for example, more than 1 m), so the voice of the speaker is attenuated by the time it reaches the microphone. Therefore, in some embodiments, in order to account for this distance factor, far-field simulation is also performed on the registered voice when the reverberation of the registered voice is estimated by the image source model method.
  • Specifically, the distance between the registered-voice source and the voice receiving device in the simulated sound field is set according to the distance between the speaker to be verified and the microphone, so that the simulated acquisition distance of the registered voice matches that of the voice to be verified. This further reduces the differences between the voice to be verified and the enhanced registered voice other than the effective voice components, improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • In some embodiments, the voice to be verified is also subjected to front-end processing, for example, echo cancellation, de-reverberation, active noise reduction, dynamic gain, directional pickup, etc.
  • In this case, the enhanced registered voice is subjected to the same front-end processing as the voice to be verified (that is, the voice to be verified and the enhanced registered voice are passed through the same front-end processing algorithm module) to further improve the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the feature parameter extraction step of the speech signal (ie, step S150 ) may be omitted, and the speech signal may be recognized directly through a deep neural network model.
  • this embodiment is used to provide another voice enhancement method.
  • In this embodiment, the scene of the voice to be verified is also recognized to obtain the scene type corresponding to the voice to be verified.
  • the enhanced registration voice is also determined according to the above scene type.
  • the speech enhancement method performed by the mobile phone 100 according to this embodiment includes the following steps:
  • the registered voice is the voice recorded by the owner of the mobile phone 100 in a quiet environment, so that there is no obvious noise component in the registered voice.
  • the voice to be verified is the voice recorded by the current user of the mobile phone in the noisy human voice scene.
  • the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario.
  • The current user of the mobile phone is the person who currently operates the mobile phone 100, which may be the owner himself or someone other than the owner himself.
  • S230 Determine the noise contained in the speech to be verified.
  • The noise contained in the voice to be verified is the sound generated by sound sources other than the current user of the mobile phone 100 in the recognition scene.
  • S240 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • steps S210-S240 are substantially the same as steps S110-S140 in Embodiment 1, and detailed processes in the steps are not repeated.
  • The number of registered voices is the same as in the first embodiment, that is, 6. Therefore, in step S240, the noise contained in the voice to be verified is superimposed on each of the 6 registered voices to obtain 6 enhanced registered voices.
  • S250 Determine the scene type corresponding to the voice to be verified. Specifically, after the voice to be verified is collected, the scene type corresponding to the voice to be verified is identified by a voice recognition algorithm, such as a GMM method, a DNN method, or the like.
  • The label value of the scene type can be, for example: a home scene, a car scene, an outdoor noisy scene, a venue scene, a cinema scene, etc. A sketch of such a scene classifier follows.
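  • A minimal sketch of a GMM-based scene classifier, assuming scikit-learn and one GMM per scene trained offline on MFCC features (all names and settings here are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_scene_models(features_per_scene: dict) -> dict:
    """Fit one GMM per scene label on its training MFCC frames (shape: n_frames x n_dims)."""
    return {scene: GaussianMixture(n_components=8, covariance_type="diag").fit(X)
            for scene, X in features_per_scene.items()}

def classify_scene(models: dict, mfcc_frames: np.ndarray) -> str:
    """Label the voice to be verified with the scene whose GMM gives the highest log-likelihood."""
    return max(models, key=lambda scene: models[scene].score(mfcc_frames))
```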
  • the template noise is noise corresponding to the scene type determined in step S250, for example, template noise is noise recorded under the scene determined in step S250.
  • One scene type can correspond to multiple groups of template noise.
  • For example, the scene type corresponding to the voice to be verified is determined in step S250 to be a home scene, and three groups of template noise are recorded in the home scene (for example, the sound generated by home audio and video equipment, the background voices generated when family members talk, and/or noise from household appliances, etc.).
  • In this way, each of the 6 registered voices is enhanced separately with the noise extracted from the voice to be verified and with each of the 3 groups of template noise, so that 6 + 6 × 3 = 24 enhanced registered voices are formed in total.
  • step S270 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice, refer to step S150 in the first embodiment. However, it can be understood that, in this embodiment, the characteristic parameters in the 24 enhanced registered voices are extracted respectively.
  • S280 Perform parameter identification on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice to obtain the voice template of the current user of the mobile phone 100 and the voice template of the owner of the mobile phone 100 respectively, refer to S160 in the first embodiment.
  • the obtained 24 host voice templates are respectively recorded as B1, B2, . . . , B24.
  • step S290 Match the voice template of the owner of the mobile phone 100 with the voice template of the current user of the mobile phone 100 to obtain a recognition result.
  • Referring to step S170, the cosine similarities between the 24 owner voice templates and the voice template of the current user of the mobile phone 100 are cos θ1, cos θ2, ..., cos θ24, respectively, and the similarity P between the current user's voice and the owner's voice is P = (cos θ1 + cos θ2 + ... + cos θ24)/24.
  • If the similarity P between the current user's voice and the owner's voice is greater than the set value (for example, 0.8), it is determined that the current user of the mobile phone 100 is the owner himself, and the mobile phone 100 unlocks the screen; otherwise, it is determined that the current user is not the owner himself, and the mobile phone 100 does not unlock the screen.
  • steps S230 and S240 are omitted, that is, the step of enhancing the registered voice according to the noise contained in the voice to be verified is omitted, and the registered voice is only enhanced according to the template noise corresponding to the recognition scene.
  • In this case, 18 enhanced registered voices are obtained, corresponding to the owner voice templates B7, B8, ..., B24, respectively, and the similarity P between the current user's voice and the owner's voice is P = (cos θ7 + cos θ8 + ... + cos θ24)/18.
  • the implementation body of the voiceprint recognition algorithm (implemented locally in the mobile phone 100 or in the cloud), other processing of speech (for example, reverberation estimation, far-field simulation, Front-end processing, etc.), etc., may refer to the introduction in Embodiment 1, and will not be repeated here.
  • the scene type corresponding to the voice to be verified, the distance between the speaker to be verified and the microphone, etc. are all environmental characteristic parameters in the voice to be verified.
  • This embodiment changes the application scenario of the voice enhancement method on the basis of the first embodiment. Specifically, the voice enhancement method in this embodiment is applied to the scenario shown in FIG. 5 for controlling the smart speaker 200 .
  • the smart speaker 200 has a voice recognition function, and the user can interact with the smart speaker 200 through voice, so as to perform functions such as song on demand, weather query, schedule management, and smart home control through the smart speaker 200 .
  • the method authenticates the identity of the user to determine whether the current user is the owner of the smart speaker 200, and then determines whether the current user has the authority to control the smart speaker 200 to perform the operation.
  • the speech enhancement method of this embodiment includes:
  • S310 Collect registered voice.
  • the registration voice from the owner of the smart speaker 200 is collected through the microphone of the smart speaker 200, but the application is not limited to this.
  • the registered voice can be saved locally in the smart speaker 200 to recognize the user's voiceprint through the smart speaker 200 to realize offline recognition of the voiceprint; the registered voice can also be uploaded to the cloud to use The computing resources in the cloud recognize the user's voiceprint to save the local computing resources of the smart speaker 200 .
  • S320 Collect the voice to be verified.
  • the voice to be verified is collected through the microphone of the smart speaker 200 .
  • For the acquisition parameters of the voice to be verified (for example, the duration and text content of the voice to be verified), refer to the description in Embodiment 1.
  • S330 Determine the noise contained in the speech to be verified.
  • Specifically, the speech to be verified is divided into a plurality of speech frames, and the energy in each speech frame is calculated. Since the energy in noise is generally smaller than that in valid speech, a speech frame whose energy is smaller than a predetermined value can be determined as a noise frame, thereby simplifying the noise extraction process.
  • S340 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • S350 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice.
  • the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice are extracted by the MFCC method.
  • the recognition model for parameter recognition is not limited in this embodiment, and may be a probability model, such as an identity vector (I-vector) model; or a deep neural network model, such as a Time-Delay Neural Network (TDNN) ) model, ResNet model, etc.
  • The template matching method may be a cosine distance method, a linear discriminant method, a probabilistic linear discriminant analysis method, or the like. If the similarity between the current user's voice and the owner's voice is greater than the set value, it is determined that the current user of the smart speaker 200 is the owner himself, and the smart speaker 200 performs the corresponding operation in response to the user's voice command; otherwise, it is determined that the current user of the smart speaker 200 is not the owner himself, and the smart speaker 200 ignores the user's voice command.
  • the speech enhancement method in this embodiment is substantially the same as the speech enhancement method in Embodiment 1 except for the application scenario. Therefore, for technical details not described in this embodiment, reference may be made to the description in Embodiment 1.
  • the voiceprint recognition algorithm (the algorithms corresponding to steps S330 to S370 ) can be implemented on the smart speaker 200 to realize offline recognition of voiceprints; it can also be implemented in the cloud to save the local smart speaker 200 computing resources.
  • the voiceprint recognition algorithm is implemented in the cloud
  • The smart speaker 200 uploads the to-be-verified voice collected in step S320 to the cloud server.
  • The cloud server uses the voiceprint recognition algorithm to authenticate the identity of the current user of the smart speaker 200 and returns the authentication result to the smart speaker 200; the smart speaker 200 then determines whether to execute the user's voice command according to the authentication result.
  • Electronic device 400 may include one or more processors 401 coupled to controller hub 403 .
  • The controller hub 403 communicates with the processor 401 via a multidrop bus such as a Front Side Bus (FSB), a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection 406.
  • Processor 401 executes instructions that control general types of data processing operations.
  • In one embodiment, the controller hub 403 includes, but is not limited to, a Graphics & Memory Controller Hub (GMCH) (not shown) and an Input/Output Hub (IOH) (which may be on a separate chip) (not shown), where the GMCH includes the memory and graphics controllers and is coupled to the IOH.
  • Electronic device 400 may also include a coprocessor 402 and memory 404 coupled to controller hub 403 .
  • In some embodiments, one or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 coupled directly to the processor 401; in that case, the controller hub 403 and the IOH are in a single chip.
  • the memory 404 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two.
  • Memory 404 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, in particular temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the speech enhancement method described in FIGS. 3 and 4 .
  • When the instructions are executed on a computer, the computer is caused to execute the methods disclosed in the first embodiment and/or the second embodiment.
  • In one embodiment, the coprocessor 402 is a special-purpose processor, such as, for example, a high-throughput MIC (Many Integrated Core) processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose computing on graphics processing units) processor, an embedded processor, or the like.
  • the electronic device 400 may further include a network interface (NIC, Network Interface Controller) 406 .
  • the network interface 406 may include a transceiver for providing a radio interface for the electronic device 400 to communicate with any other suitable devices (eg, front-end modules, antennas, etc.).
  • network interface 406 may be integrated with other components of electronic device 400 .
  • the network interface 406 can implement the functions of the communication unit in the above-mentioned embodiments.
  • the electronic device 400 may further include an input/output (I/O, Input/Output) device 405 .
  • The I/O device 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 400.
  • It should be noted that Figure 6 is exemplary only. That is, although Figure 6 shows that the electronic device 400 includes multiple components such as the processor 401, the controller hub 403, and the memory 404, in practical applications a device using the methods of the present application may include only some of the components of the electronic device 400, for example, only the processor 401 and the network interface 406. The optional components in Figure 6 are shown in dashed lines.
  • The SoC 500 includes: an interconnect unit 550 coupled to the processor 510; a system agent unit 580; a bus controller unit 590; an integrated memory controller unit 540; a set of one or more coprocessors 520, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random-Access Memory (SRAM) unit 530; and a Direct Memory Access (DMA) unit 560.
  • In one embodiment, the coprocessor 520 includes a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU (general-purpose computing on graphics processing units) processor, a high-throughput MIC processor, an embedded processor, or the like.
  • Static random access memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, in particular temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the speech enhancement method described in FIGS. 3 and 4 .
  • When the instructions are executed on a computer, the computer is caused to execute the methods disclosed in the first embodiment and/or the second embodiment.
  • Program code may be applied to input instructions to perform the functions described herein and to generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
  • the program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • the program code may also be implemented in assembly or machine language, if desired.
  • the mechanisms described herein are not limited to the scope of any particular programming language. In either case, the language may be a compiled language or an interpreted language.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor; when read by a machine, the instructions cause the machine to fabricate logic that implements the techniques described herein.
  • Such representations, referred to as "IP (Intellectual Property) cores," may be stored on a tangible computer-readable storage medium and supplied to various customers or production facilities to be loaded into the fabrication machines that actually make the logic or the processor.
  • In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set.
  • For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core.
  • The instruction converter can be implemented in software, hardware, firmware, or a combination thereof, and may be located on-processor, off-processor, or partially on-processor and partially off-processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Collating Specific Patterns (AREA)
  • Telephone Function (AREA)

Abstract

The present invention relates to an artificial intelligence (AI)-based speech enhancement method, a terminal device, a speech enhancement system, and a computer-readable storage medium. An electronic device collects the speech to be verified; the electronic device determines the environmental noise and/or an environmental characteristic parameter contained in the speech to be verified; the electronic device then enhances a registration speech based on the environmental noise and/or the environmental characteristic parameter; finally, the electronic device compares the speech to be verified with the enhanced registration speech to determine whether the speech to be verified and the registration speech belong to the same user. In the embodiments of the present invention, the registration speech is enhanced according to a noise component in the speech to be verified, so that the enhanced registration speech and the speech to be verified have similar noise components, and a more accurate recognition result can be obtained.
PCT/CN2021/105003 2020-07-08 2021-07-07 Procédé d'amélioration de la qualité de la parole, dispositif, système et support de stockage WO2022007846A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010650893.X 2020-07-08
CN202010650893.XA CN113921013A (zh) 2020-07-08 2020-07-08 语音增强方法、设备、系统以及存储介质

Publications (1)

Publication Number Publication Date
WO2022007846A1 true WO2022007846A1 (fr) 2022-01-13

Family

ID=79231704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/105003 WO2022007846A1 (fr) 2020-07-08 2021-07-07 Procédé d'amélioration de la qualité de la parole, dispositif, système et support de stockage

Country Status (2)

Country Link
CN (1) CN113921013A (fr)
WO (1) WO2022007846A1 (fr)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101051463A (zh) * 2006-04-06 2007-10-10 株式会社东芝 说话人认证的验证方法及装置
WO2010049695A1 (fr) * 2008-10-29 2010-05-06 British Telecommunications Public Limited Company Vérification de locuteur
CN106384588A (zh) * 2016-09-08 2017-02-08 河海大学 基于矢量泰勒级数的加性噪声与短时混响的联合补偿方法
CN108022591A (zh) * 2017-12-30 2018-05-11 北京百度网讯科技有限公司 车内环境中语音识别的处理方法、装置和电子设备
CN108257606A (zh) * 2018-01-15 2018-07-06 江南大学 一种基于自适应并行模型组合的鲁棒语音身份识别方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117268796A (zh) * 2023-11-16 2023-12-22 天津大学 车辆故障声学事件检测方法
CN117268796B (zh) * 2023-11-16 2024-01-26 天津大学 车辆故障声学事件检测方法
CN117725187A (zh) * 2024-02-08 2024-03-19 人和数智科技有限公司 一种适用于社会救助的问答系统
CN117725187B (zh) * 2024-02-08 2024-04-30 人和数智科技有限公司 一种适用于社会救助的问答系统

Also Published As

Publication number Publication date
CN113921013A (zh) 2022-01-11

Similar Documents

Publication Publication Date Title
CN107799126B (zh) 基于有监督机器学习的语音端点检测方法及装置
CN111091828B (zh) 语音唤醒方法、设备及系统
WO2022007846A1 (fr) Procédé d'amélioration de la qualité de la parole, dispositif, système et support de stockage
CN111179911B (zh) 目标语音提取方法、装置、设备、介质和联合训练方法
CN113129917A (zh) 基于场景识别的语音处理方法及其装置、介质和系统
CN108711429B (zh) 电子设备及设备控制方法
CN107240405B (zh) 一种音箱及告警方法
JP2009509575A (ja) 音響的外耳特徴付けのための方法及び装置
WO2014117722A1 (fr) Procédé de traitement de la parole, dispositif et appareil terminal
WO2022033556A1 (fr) Dispositif électronique et procédé de reconnaissance vocale associé, et support
WO2021013255A1 (fr) Procédé et appareil de reconnaissance d'empreinte vocale
CN113830026A (zh) 一种设备控制方法及计算机可读存储介质
CN115312068B (zh) 语音控制方法、设备及存储介质
CN113539290B (zh) 语音降噪方法和装置
CN204791241U (zh) 一种语音交互式门禁系统
CN115482830A (zh) 语音增强方法及相关设备
WO2022199405A1 (fr) Procédé et appareil de commande vocale
WO2021031811A1 (fr) Procédé et dispositif d'amélioration vocale
CN114067782A (zh) 音频识别方法及其装置、介质和芯片系统
CN113611318A (zh) 一种音频数据增强方法及相关设备
CN116386623A (zh) 一种智能设备的语音交互方法、存储介质及电子装置
US11783809B2 (en) User voice activity detection using dynamic classifier
CN109922397A (zh) 音频智能处理方法、存储介质、智能终端及智能蓝牙耳机
WO2022052691A1 (fr) Procédé de traitement de la voix multidispositif, support, dispositif électronique et système
CN115116458A (zh) 语音数据转换方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21837111

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21837111

Country of ref document: EP

Kind code of ref document: A1