WO2022007846A1 - Speech enhancement method, device, system, and storage medium

Info

Publication number
WO2022007846A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
verified
registered
speech
scene
Application number
PCT/CN2021/105003
Other languages
French (fr)
Chinese (zh)
Inventor
胡伟湘
黄劲文
曾夕娟
芦宇
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2022007846A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00: Speaker identification or verification
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L2013/021: Overlap-add techniques

Definitions

  • the present application relates to the technical field of biometrics, and in particular, to a speech enhancement method, device, system, and computer-readable storage medium.
  • biometric authentication technology has gradually been popularized and applied in the fields of family life and public safety.
  • Biometric features that can be applied to biometric authentication include fingerprints, faces, irises, DNA, voiceprints, etc.
  • voiceprint recognition technology, also known as speaker recognition technology, collects sound samples without physical contact, and the collection method is relatively unobtrusive, so it is more easily accepted by users.
  • Some embodiments of the present application provide a speech enhancement method, a terminal device, a speech enhancement system, and a computer-readable storage medium.
  • the present application is described below from various aspects, and the embodiments and beneficial effects of the following aspects can be referred to each other.
  • an embodiment of the present application provides a voice enhancement method, applied to an electronic device, including: collecting a voice to be verified; determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified; enhancing the registered voice based on the environmental noise and/or the environmental characteristic parameters; and comparing the voice to be verified with the enhanced registered voice to determine whether the voice to be verified and the registered voice are from the same user.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • the remaining difference between the two then lies mainly in their effective speech components, so comparing them with a voiceprint recognition algorithm yields a more accurate recognition result.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the registered speech is speech from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which improves the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
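  • the superposition step described above can be sketched as follows; this is a minimal NumPy illustration, where the function name `superimpose_noise` and the toy signals are invented for the example, and a real implementation would also match sampling rates and possibly scale the noise to a target level:

```python
import numpy as np

def superimpose_noise(registered: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Overlay ambient noise (picked up during verification) onto the
    registered speech so both signals share similar noise components."""
    # Tile or trim the noise so it covers the whole registered utterance.
    reps = int(np.ceil(len(registered) / len(noise)))
    noise = np.tile(noise, reps)[: len(registered)]
    return registered + noise

# Toy signals standing in for real recordings (16 kHz assumed).
rng = np.random.default_rng(0)
registered = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
ambient = 0.05 * rng.standard_normal(4000)

enhanced = superimpose_noise(registered, ambient)
```

The enhanced registered voice has the same length as the original but now carries the verification-time noise floor.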
  • the ambient noise is sound picked up by a secondary microphone of the electronic device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the to-be-verified speech is less than the duration of the registered speech. In this way, the user can input a short voice to be verified, which is beneficial to improve the user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • the registration speech is enhanced by superimposing template noise on the registration speech, so that the enhanced registration speech and the to-be-verified speech have noise components as close as possible, which is beneficial to improve the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined according to the scene recognition algorithm that recognizes the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
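  • scene classification from the collected audio can be illustrated with a heavily simplified likelihood-scoring sketch: one diagonal Gaussian per scene stands in for a full GMM or DNN, and the scene set, feature vectors, and model parameters here are invented for illustration:

```python
import numpy as np

# Hypothetical per-scene models: mean and variance of an acoustic feature
# vector (e.g., averaged log-mel energies), learned offline. A full GMM
# would use several mixture components per scene.
SCENE_MODELS = {
    "home":    (np.array([0.2, 0.1]), np.array([0.05, 0.05])),
    "vehicle": (np.array([0.8, 0.3]), np.array([0.10, 0.05])),
    "outdoor": (np.array([0.5, 0.9]), np.array([0.20, 0.10])),
}

def log_gauss(x: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    # Diagonal-covariance Gaussian log-density.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify_scene(feature: np.ndarray) -> str:
    # Pick the scene whose model assigns the highest log-likelihood.
    scores = {s: log_gauss(feature, m, v) for s, (m, v) in SCENE_MODELS.items()}
    return max(scores, key=scores.get)

scene = classify_scene(np.array([0.78, 0.32]))  # near the "vehicle" mean
```

The winning scene type then selects which template noise to superimpose on the registered voice.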
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the electronic device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device.
  • far-field simulation of the registered voice simulates extending the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.
  • performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device includes: establishing, based on the image source model and that distance, the impulse response function of the site where the voice to be verified is acquired; and convolving the impulse response function with the audio signal of the registered voice to perform far-field simulation on the registered voice.
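  • the convolution step of the far-field simulation can be sketched as follows; here a toy impulse response with a direct path and a single reflection stands in for one derived from the image source model, and all parameter values (sampling rate, reflection gain, path lengths) are illustrative:

```python
import numpy as np

def simulate_far_field(registered: np.ndarray, distance_m: float,
                       fs: int = 16000, c: float = 343.0) -> np.ndarray:
    """Convolve registered speech with a toy room impulse response.

    A real implementation would derive the impulse response from the
    image source model for the verification site; here we approximate
    it with a direct path attenuated by 1/distance plus one weaker,
    later reflection on an assumed 1.5x-longer path.
    """
    direct_delay = int(fs * distance_m / c)           # direct propagation delay
    reflect_delay = int(fs * (distance_m * 1.5) / c)  # reflected-path delay
    h = np.zeros(reflect_delay + 1)
    h[direct_delay] = 1.0 / max(distance_m, 1.0)      # direct-path gain
    h[reflect_delay] = 0.3 / max(distance_m, 1.0)     # reflection gain
    return np.convolve(registered, h)

speech = np.ones(100)  # placeholder for the registered waveform
far = simulate_far_field(speech, distance_m=2.0)
```

The output is silent until the direct-path delay elapses and is attenuated with distance, mimicking far-field acquisition.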
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which helps improve the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices may be multiple; and, based on the environmental noise and/or environmental characteristic parameters, the multiple registered voices are respectively enhanced to obtain multiple enhanced registered voices.
  • with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and a final decision can be made by fusing these results.
  • in this way, the error of any single matching result is averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that they are from the same user includes: extracting the characteristic parameters of the voice to be verified and of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on these characteristic parameters through a parameter recognition model to obtain the voice template of the speaker to be verified and the voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or, the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or, the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
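  • the cosine-distance template matching step can be sketched as follows; the embedding values and the decision threshold are invented for illustration, and in practice the templates would come from an i-vector or neural model with the threshold tuned on development data:

```python
import numpy as np

def cosine_score(template_a: np.ndarray, template_b: np.ndarray) -> float:
    """Cosine similarity between two voice templates (e.g., i-vectors
    or neural embeddings); 1.0 means identical direction."""
    return float(np.dot(template_a, template_b) /
                 (np.linalg.norm(template_a) * np.linalg.norm(template_b)))

def same_speaker(enrolled: np.ndarray, verified: np.ndarray,
                 threshold: float = 0.7) -> bool:
    # The threshold is a hypothetical operating point trading off
    # false accepts against false rejects.
    return cosine_score(enrolled, verified) >= threshold

# Toy 3-dimensional templates standing in for real embeddings.
enrolled = np.array([0.9, 0.1, 0.4])
probe = np.array([0.8, 0.2, 0.5])
match = same_speaker(enrolled, probe)
```

Because the enhanced registered voice and the voice to be verified share similar noise components, the score mostly reflects the speakers' effective speech, which is the point of the enhancement.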
  • an embodiment of the present application provides a voice enhancement method, including: a terminal device collects the voice to be verified and sends it to a server communicatively connected to the terminal device; the server determines the environmental noise and/or environmental characteristic parameters contained in the voice to be verified; the server enhances the registered voice based on the environmental noise and/or the environmental characteristic parameters; the server compares the voice to be verified with the enhanced registered voice and determines whether the voice to be verified and the registered voice are from the same user; and the server sends the determination result to the terminal device.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • the remaining difference between the two then lies mainly in their effective speech components.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the speaker recognition algorithm is implemented on the server, which can save local computing resources of the terminal device.
  • the registered speech is speech from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which improves the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is the sound picked up by the secondary microphone of the terminal device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the to-be-verified speech is less than the duration of the registered speech. In this way, the user can input a short voice to be verified, which is beneficial to improve user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • the registration speech is enhanced by superimposing template noise on the registration speech, so that the enhanced registration speech and the to-be-verified speech have noise components as close as possible, which is beneficial to improve the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined according to the scene recognition algorithm that recognizes the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device.
  • far-field simulation of the registered voice simulates extending the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.
  • performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model and that distance, the impulse response function of the site where the voice to be verified is acquired, and convolving the impulse response function with the audio signal of the registered voice.
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which helps improve the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices may be multiple; and, based on the environmental noise and/or environmental characteristic parameters, the server enhances the multiple registered voices respectively to obtain multiple enhanced registered voices.
  • with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and a final decision can be made by fusing these results.
  • in this way, the error of any single matching result is averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that they are from the same user includes: extracting the characteristic parameters of the voice to be verified and of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on these characteristic parameters through a parameter recognition model to obtain the voice template of the speaker to be verified and the voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or, the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or, the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
  • embodiments of the present application provide an electronic device, including: a memory for storing instructions executed by one or more processors of the electronic device; and a processor which, when it executes the instructions in the memory, causes the electronic device to execute the speech enhancement method provided by any embodiment of the first aspect of the present application.
  • an embodiment of the present application provides a speech enhancement system, including a terminal device and a server communicatively connected to the terminal device, wherein,
  • the terminal device collects the voice to be verified, and sends the voice to be verified to the server;
  • the server is used to determine the environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance the registered voice based on the environmental noise and/or the environmental characteristic parameters, compare the voice to be verified with the enhanced registered voice, and determine whether the voice to be verified and the registered voice come from the same user;
  • the server is also used to send the determination result of determining that the voice to be verified and the registered voice come from the same user to the terminal device.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • the remaining difference between the two then lies mainly in their effective speech components.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the speaker recognition algorithm is implemented on the server, which can save local computing resources of the terminal device.
  • the registered speech is speech from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which improves the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is the sound picked up by the secondary microphone of the terminal device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the to-be-verified speech is less than the duration of the registered speech. In this way, the user can input a short voice to be verified, which is beneficial to improve user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • the registration speech is enhanced by superimposing template noise on the registration speech, so that the enhanced registration speech and the to-be-verified speech have noise components as close as possible, which is beneficial to improve the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined according to the scene recognition algorithm that recognizes the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device.
  • far-field simulation of the registered voice simulates extending the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.
  • performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model and that distance, the impulse response function of the site where the voice to be verified is acquired, and convolving the impulse response function with the audio signal of the registered voice.
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which helps improve the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices may be multiple; and, based on the environmental noise and/or environmental characteristic parameters, the server enhances the multiple registered voices respectively to obtain multiple enhanced registered voices.
  • with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and a final decision can be made by fusing these results.
  • in this way, the error of any single matching result is averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that they are from the same user includes: extracting the characteristic parameters of the voice to be verified and of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on these characteristic parameters through a parameter recognition model to obtain the voice template of the speaker to be verified and the voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or, the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or, the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
  • an embodiment of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are executed on a computer, the computer is caused to execute the method provided by any embodiment of the first aspect of the present application, or the method provided by any embodiment of the second aspect of the present application. For the beneficial effects that can be achieved in this fifth aspect, reference may be made to the foregoing aspects.
  • Fig. 1a shows an exemplary application scenario of the speech enhancement method provided by the embodiment of the present application
  • Fig. 1b shows another exemplary application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 2 shows a schematic structural diagram of a speech enhancement device provided by an embodiment of the present application
  • FIG. 3 shows a flowchart of a speech enhancement method provided by an embodiment of the present application
  • FIG. 4 shows a flowchart of a speech enhancement method provided by another embodiment of the present application.
  • FIG. 5 shows an application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 6 shows a structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 shows a block diagram of a system-on-chip (SoC) provided by an embodiment of the present application.
  • Speaker recognition technology is also known as voiceprint recognition technology.
  • voiceprint recognition technology uses the uniqueness of a speaker's voiceprint to identify the speaker's identity. Because each person's vocal organs (for example, tongue, teeth, larynx, lungs, nasal cavity, vocal tract, etc.) are innately different, and vocalization habits and the like differ through acquired experience, each person's voiceprint features are unique. By analyzing voiceprint features, the identity of the speaker can be identified.
  • the specific process of speaker identification is to collect the voice of the speaker whose identity is to be confirmed, and compare it with the voice of a specific speaker to confirm whether the speaker whose identity is to be confirmed is the specific speaker.
  • the voice of the speaker whose identity is to be confirmed is called “voice to be verified”
  • the speaker whose identity is to be confirmed is called “speaker to be verified”
  • the voice of a specific speaker is called “registered voice”
  • the specific speaker is called the "registered speaker".
  • the above process is described below by taking the voiceprint unlocking function of a mobile phone (i.e., unlocking the phone screen by means of voiceprint recognition) as an example.
  • the mobile phone owner records his own voice (the voice is the registered voice) in the mobile phone through the microphone on the mobile phone.
  • the current user of the mobile phone enters real-time voice (the voice to be verified) through the mobile phone microphone, and the mobile phone uses its built-in voiceprint recognition program to compare the voice to be verified with the registered voice, so as to determine whether the current user of the mobile phone is the owner.
  • if the voice to be verified matches the registered voice, it is judged that the current user of the mobile phone is the owner; the user passes identity authentication, and the mobile phone completes the subsequent screen unlocking action. If the voice to be verified does not match the registered voice, it is judged that the current user is not the owner; the user fails identity authentication, and the mobile phone can refuse the subsequent screen unlocking action.
  • voiceprint recognition technology can be applied in the field of family life, for voice control of smartphones, smart cars, and smart homes (e.g., smart audio and video equipment, smart lighting systems, smart door locks); it can also be applied in the field of payment, where voiceprint authentication is combined with other authentication methods (such as passwords or dynamic verification codes) for double or multiple authentication of the user's identity to improve payment security; it can also be applied in the field of information security, where voiceprint authentication serves as a way to log in to an account; and it can also be applied in the judicial field, where the voiceprint serves as auxiliary evidence for judging identity.
	• the main device for voiceprint recognition can be an electronic device other than a mobile phone, such as a mobile device, including wearable devices (such as wristbands, earphones, etc.) and vehicle terminals; or a fixed device, including smart home devices, network servers, etc.
  • the voiceprint recognition algorithm can be implemented in the cloud in addition to the terminal. For example, after the mobile phone collects the voice to be verified, the collected voice to be verified can be sent to the cloud, and the voice to be verified is recognized by the voiceprint recognition algorithm in the cloud. After the recognition is completed, the cloud returns the recognition result to the mobile phone. Through the cloud recognition mode, users can share the computing resources in the cloud to save the local computing resources of the mobile phone.
	• when the voice of the speaker to be verified is collected, if there is noisy human voice noise in the surrounding environment, the noise will be picked up by the microphone together with the speaker's voice and become part of the voice to be verified.
  • the voice to be verified not only includes the voice of the speaker to be verified, but also contains noise components, which will reduce the recognition rate of the voiceprint.
  • This embodiment does not limit the scene of the voiceprint recognition, for example, it may also be a home scene, a car scene, a meeting place scene, a cinema scene, and the like.
	• when the owner of the mobile phone needs to unlock the mobile phone through voiceprint recognition, if there is noise in the surrounding environment, the sound collected by the mobile phone microphone includes not only the owner's voice but also the noise in the environment. After this real-time voice is compared with the registered voice preset in the mobile phone by the owner, the result may be that the two do not match. That is, even if the current user of the mobile phone is the owner, the mobile phone may still determine that the user identity authentication fails, thus affecting the user experience.
  • some technical solutions remove noise components in the voice to be verified by performing denoising processing on the voice to be verified, so as to improve the recognition rate of the voiceprint.
	• the voice to be verified after the denoising process still contains some noise components, and some valid voice components (the voice components of the speaker to be verified) are also removed. As a result, the voice to be verified after the denoising process may still not be recognized correctly, and the voiceprint recognition rate is not significantly improved.
	• in some other technical solutions, the user records registration voices in multiple different scenarios (for example, home scenarios, cinema scenarios, outdoor noisy scenarios, etc.), and when performing voiceprint recognition, the voice to be verified is compared with the registered voice recorded in the corresponding scenario, in order to improve the voiceprint recognition rate.
	• however, the user needs to record registration voices separately in multiple different scenarios, so the user experience is poor.
  • the embodiments of the present application provide a voice enhancement method, which is used to improve the voiceprint recognition rate and the robustness of the voiceprint recognition method, and improve user experience.
	• a noise component corresponding to the noise component in the voice to be verified is superimposed on the registered voice, and the registered voice with the superimposed noise component is then compared with the voice to be verified to obtain the recognition result.
	• in other words, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components; in this way, the main difference between the voice to be verified and the enhanced registered voice is the difference between their two effective speech components, and after the two are compared through the voiceprint recognition algorithm, a more accurate recognition result can be obtained.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
	• the "valid speech component" is the speech component from the speaker; for example, the valid speech component in the voice to be verified is the speech component of the speaker to be verified, and the valid speech component in the enhanced registered voice is the speech component of the registered speaker.
  • FIG. 2 shows the structure of the mobile phone 100 .
  • the mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, an antenna, a communication module 150, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a camera 193, a display screen 194, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • the mobile phone 100 may include more or less components than shown, or some components may be combined, or some components may be separated, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a controller, a digital signal processor (digital signal processor, DSP), baseband processor, etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the processor can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the interface may include an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, and/or a general-purpose input/output (general-purpose input/output, GPIO) interface, etc.
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals.
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the audio module 170, and the like.
  • the GPIO interface can also be configured as an I2S interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • the mobile phone 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the wireless communication function of the mobile phone 100 may be implemented by an antenna, a communication module 150, a modem processor, a baseband processor, and the like.
  • Antennas are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antennas can be multiplexed into the diversity antennas of the wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the mobile phone 100 .
  • the communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like.
	• the communication module 150 can receive electromagnetic waves through the antenna, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then convert it into electromagnetic waves for radiation through the antenna.
  • at least part of the functional modules of the communication module 150 may be provided in the processor 110 .
  • at least some of the functional modules of the communication module 150 may be provided in the same device as at least some of the modules of the processor 110 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modulation and demodulation processor may be independent of the processor 110, and may be provided in the same device as the communication module 150 or other functional modules.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100 .
	• the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), a voiceprint recognition program, a voice signal front-end processing program, and the like.
  • the storage data area can store data (such as audio data, phone book, etc.) created during the use of the mobile phone 100, and data required for voiceprint recognition, such as audio data of registered voice, trained voice parameter recognition model, etc.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the mobile phone 100 by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
	• the mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
	• the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the mobile phone 100 can listen to music through the speaker 170A, or listen to a hands-free call.
	• the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be answered by placing the receiver 170B close to the human ear.
	• the microphone 170C, also called a "mic" or "microphone", is used to convert sound signals into electrical signals.
	• the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C.
  • the mobile phone 100 may be provided with at least one microphone 170C.
  • the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals.
	• the mobile phone 100 has two microphones at the top and bottom: one microphone 170C is provided on the bottom side of the mobile phone 100, and the other microphone 170C is provided on the top side of the mobile phone 100.
	• when the user speaks into the mobile phone 100, the mouth is usually close to the microphone 170C on the bottom side, so the user's voice generates a larger audio signal Va in this microphone, which is referred to herein as the "main mic"; the user's voice generates a smaller audio signal Vb in the microphone on the top side, which is referred to herein as the "secondary mic".
	• the distance between the noise sound source and the main mic is basically the same as the distance between the noise sound source and the secondary mic; that is, it can be considered that the noise intensity at the main mic and the secondary mic is basically the same.
	• the noise signal and the user speech signal can be separated by using the signal strength difference caused by the different positions of the two mics. For example, after the audio signal picked up by the main mic and the audio signal picked up by the secondary mic are differenced (that is, the signal in the secondary mic is subtracted from the signal in the main mic), the user's voice signal can be obtained (this is the principle of dual-mic active noise cancellation). Furthermore, after removing the user's voice signal from the main mic signal, the noise signal can be separated. Alternatively, since the audio signal Vb on the secondary mic is significantly smaller than the audio signal Va on the main mic, the signal picked up by the secondary mic can be treated as a noise signal.
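	• The dual-mic separation described above can be sketched as follows. This is a minimal illustration with synthetic stand-in signals; the signal names, mixing ratios, and sample counts are assumptions for illustration, not measured values.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 8000)
voice = np.sin(2 * np.pi * 220.0 * t)          # stands in for the user's speech
noise = 0.3 * rng.standard_normal(t.shape)     # stands in for ambient noise

# The voice dominates on the bottom (main) mic and is attenuated on the
# top (secondary) mic, while the noise reaches both with similar intensity.
main_mic = voice + noise                 # Va
secondary_mic = 0.1 * voice + noise      # Vb

# Differencing the two mic signals cancels the (roughly common) noise,
# leaving an estimate of the user's voice signal.
voice_estimate = main_mic - secondary_mic

# Removing the voice estimate from the main mic signal leaves a noise
# estimate (here it coincides with the voice-poor secondary mic signal).
noise_estimate = main_mic - voice_estimate
```

	• In this toy setup the recovered voice estimate is simply a scaled copy of the clean voice, which is why the method improves with larger level differences between the two mics.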
  • a setting method of dual mics of the mobile phone 100 is given above, but this is only an exemplary description, and other setting methods can be used for the microphones, for example, the main mic is arranged on the front of the mobile phone 100, and the secondary mic is arranged on the back of the mobile phone.
  • the mobile phone 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone jack 170D is used to connect wired earphones.
	• the earphone interface 170D may be a universal serial bus (USB) interface, a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
	• this embodiment provides a voice enhancement method: after the voice to be verified is collected, the noise contained in the voice to be verified is separated from the voice to be verified, and the separated noise is then superimposed on the registered voice. In this way, the voice to be verified and the registered voice with the superimposed noise have similar noise components, and the main difference between the two is the difference between their effective voice components, which can improve the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the speech enhancement method provided by this embodiment includes the following steps:
  • S110 Collect registered voice.
  • the mobile phone 100 has a voiceprint unlocking application (which may be a system application or a third-party application).
	• when the owner of the mobile phone 100 registers a user account of the voiceprint unlocking application, he collects his own voice through the mobile phone 100, and the voiceprint unlocking application uses this voice as the reference voice for subsequent voiceprint recognition; this voice is the registered voice.
  • the present application is not limited to this.
	• the owner of the mobile phone 100 enters the registration voice through the setup wizard of the mobile phone 100, and the voiceprint unlocking application of the mobile phone 100 uses the voice as the reference voice for voiceprint recognition.
  • the registered voice is the voice recorded by the owner of the mobile phone 100 in a quiet environment, so that there is no obvious noise component in the registered voice.
	• in some embodiments, if the signal-to-noise ratio (ie, the ratio of the host voice signal strength to the noise signal strength) in the recording environment is higher than a set value (eg, 30dB), the recording environment is considered to be a quiet environment.
	• in other embodiments, if the intensity of the noise signal in the registered voice recording environment is lower than a set value (eg, 20dB), the recording environment is considered to be a quiet environment.
  • the registration voice from the host is collected through the microphone of the mobile phone 100 .
  • the registered voice is near-field voice.
	• the distance between the owner's mouth and the main mic of the mobile phone 100 should be kept within 30cm to 1m; for example, if the owner holds the mobile phone 100 and speaks to the main mic, the distance between the owner's mouth and the main mic is usually within 30cm, which can avoid attenuation of the host voice due to a long propagation distance.
	• when recording the registered voice, the owner enters 6 voice segments to form 6 registered voices; entering multiple voice segments helps to improve the flexibility of speech recognition and the richness of the voiceprint information.
  • the length of each registered voice is 10-30s. Further, each registered voice corresponds to different text content, so as to enrich the voiceprint information contained in the registered voice.
	• after collecting the registered voice, the mobile phone 100 stores the audio signal of the registered voice in the internal memory. However, the present application is not limited to this, and the mobile phone 100 may also upload the audio signal of the registered voice to the cloud, so as to recognize the voiceprint through the cloud recognition mode.
  • the above recording method, recording length, and quantity of the registered voice are only exemplary descriptions, and the present application is not limited thereto.
  • the registered voice may be recorded by other recording devices (eg, voice recorder, dedicated microphone, etc.), the number of registered voices may be one, and the length of the registered voice may be greater than 30s.
	• step S110 is mentioned first for convenience of description. It can be understood that step S110, as the data preparation process of the speech enhancement method, is relatively independent of the single speech enhancement process, and does not need to be performed every time together with the other steps of the speech enhancement method.
  • S120 Collect the voice to be verified, and the voice to be verified is the voice recorded by the current user of the mobile phone in a noisy human voice scene.
  • the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario.
  • the current user of the mobile phone is the person who currently operates the mobile phone 100 , which may be the owner himself or someone other than the owner himself.
  • the voice to be verified is collected through the microphone of the mobile phone 100 .
  • the microphone of the mobile phone 100 is turned on.
  • the current user of the mobile phone 100 can input the voice to be verified through the microphone of the mobile phone 100 to unlock the mobile phone through voiceprint recognition.
	• for example, in some scenarios the user needs to operate the mobile phone 100 from a distance (eg, to open an application in the mobile phone, such as a music application or a phone application), or needs to operate the mobile phone when both hands are occupied (eg, when doing housework).
  • the to-be-verified voice is a voice with specific content.
  • the voice to be verified may also be voice of any text content.
  • the length of the voice to be verified is 10-30 s, so that the voice to be verified can contain relatively rich voiceprint information, which is beneficial to improve the voiceprint recognition rate.
  • this application does not limit this.
	• in some embodiments, the length of the voice to be verified is less than 10s, and thus less than the length of the registered voice. In this case, the user only needs to enter a shorter voice to be verified, which is beneficial to improving the user experience.
	• if the length of the voice to be verified is less than the length of the registered voice, some voice fragments can be intercepted from the voice to be verified and spliced with the originally collected voice to be verified, so that the spliced voice has substantially the same length as the registered voice. In this way, in the subsequent steps of this embodiment (described in detail below), the feature parameters extracted from the registered voice and the feature parameters extracted from the voice to be verified have the same dimension, which is convenient for comparing the similarity of the two. In the description herein, the originally collected voice to be verified and the spliced voice are not distinguished; both are referred to as the voice to be verified.
  • the meaning of splicing the A voice and the B voice is to connect the A voice and the B voice end to end, so that the length of the spliced voice is the sum of the lengths of the A voice and the B voice.
  • the present application does not limit the connection order of the A voice and the B voice.
  • the A voice may be connected after the B voice, or the A voice may be connected before the B voice.
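	• The splicing step above can be sketched as follows: fragments copied from the short voice to be verified are connected end to end until its length matches the registered voice. The function name and the 16 kHz sample rate are illustrative assumptions.

```python
import numpy as np

def splice_to_length(verify: np.ndarray, target_len: int) -> np.ndarray:
    # Repeatedly append fragments copied from the beginning of the
    # voice to be verified until the target length is reached.
    spliced = verify.copy()
    while len(spliced) < target_len:
        fragment = verify[: target_len - len(spliced)]   # intercept a fragment
        spliced = np.concatenate([spliced, fragment])    # connect end to end
    return spliced

fs = 16000                                   # assumed sample rate
verify = np.arange(8 * fs, dtype=float)      # an 8 s voice to be verified
spliced = splice_to_length(verify, 20 * fs)  # match a 20 s registered voice
```

	• After splicing, feature extraction yields the same number of frames for both signals, which is what makes the later template comparison dimensionally consistent.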
	• the noise contained in the voice to be verified is the sound generated by sound sources other than the current user of the mobile phone 100 in the recognition scene, for example, the sound of household equipment (for example, a vacuum cleaner) in a home scene, the sound of the car radio and the engine in a car scene, and the sound projected in the theater and the voices of other audiences in a cinema scene, etc.
  • the sound picked up by the mic of the mobile phone 100 is determined as the noise contained in the voice to be verified, so that the noise contained in the voice to be verified can be easily determined.
  • the present application is not limited to this.
	• for example, in some embodiments, it is considered that the initial segment of the speech to be verified contains only noise components, so that after the initial segment is copied multiple times, the result is determined as the noise contained in the speech to be verified. For another example, in other embodiments, the speech to be verified is divided into multiple speech frames, and the energy of each speech frame is calculated; if the energy of a speech frame is lower than a set value, the speech frame can be determined as a noise frame, thereby simplifying the noise extraction process.
  • other methods in the prior art may also be used to determine the noise in the speech to be verified, which will not be described in detail.
	• the energy of a speech frame is the sum of the squares of the signal values of the speech signals included in the speech frame; that is, if the signal value of the i-th speech signal in the speech frame is x_i, and the number of speech signals in the speech frame is N, the energy of the speech frame is E = x_1^2 + x_2^2 + ... + x_N^2.
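	• The frame-energy computation and the low-energy noise-frame test described above can be sketched as follows; the frame length and threshold values here are illustrative tuning assumptions.

```python
import numpy as np

def frame_energy(frame) -> float:
    # Energy of a speech frame: sum of squared signal values.
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def noise_frames(signal: np.ndarray, frame_len: int, threshold: float):
    # Split the signal into non-overlapping frames and keep the frames
    # whose energy falls below the threshold as candidate noise frames.
    frames = [signal[i : i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [f for f in frames if frame_energy(f) < threshold]
```

	• Frames selected this way can then be concatenated to form the noise signal that step S140 superimposes on the registered voice.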
  • S140 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • the present application is not limited to this, and in other embodiments, the superposition of the registration speech signal and the noise signal may also be completed in the frequency domain.
  • the embodiment of the present application realizes the enhancement of the registered voice signal by simply superimposing the numerical value of the voice signal, and the algorithm is simple.
  • the length of the noise is equal to the length of the registered voice. In other embodiments, the length of the noise may be smaller than the length of the registered voice.
  • the number of registered voices is 6. Therefore, noises contained in the voices to be verified are respectively superimposed on the 6 registered voices to obtain 6 enhanced registered voices.
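	• The time-domain superposition of step S140 can be sketched as a sample-wise addition of the noise signal values and the registered voice signal values. Tiling a short noise signal to cover the full registered voice is an illustrative choice here, not a step mandated by the text.

```python
import numpy as np

def enhance_registered(registered, noise) -> np.ndarray:
    # Add the noise signal values to the registered voice signal values.
    registered = np.asarray(registered, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(registered):
        reps = -(-len(registered) // len(noise))  # ceiling division
        noise = np.tile(noise, reps)              # repeat the noise to cover
    return registered + noise[: len(registered)]
```

	• Applying this to each of the 6 registered voices with the noise extracted from the voice to be verified yields the 6 enhanced registered voices used in step S150.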
  • S150 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice. Since the MFCC method can better conform to the auditory perception characteristics of the human ear, in this embodiment, the feature parameters in the speech signal are extracted by the Mel-Frequency Cepstrum Coefficient (MFCC) method.
	• first, the audio signal S_T of the speech to be verified is divided into a series of speech frames x(n), where n is the sequence number of the speech frame.
  • the length of each speech frame is 10-30ms.
	• for example, an audio signal S_T with a length of 10s is divided into 500 speech frames.
  • the MFCC feature extraction method includes the steps of Fourier transform, Mel filtering, discrete cosine transform, etc. on the speech frame x(n).
  • the order of the discrete cosine transform is 20. Therefore, the MFCC feature parameter of each speech frame x(n) has 20 dimensions.
  • the extraction process can be adjusted as required. For example, differential calculation may be performed on the MFCC feature parameters extracted above. For example, after taking the first-order difference and the second-order difference of the MFCC feature parameters extracted above, for each speech frame, a set of 60-dimensional MFCC feature parameters is obtained.
  • other parameters of the extraction process such as the length and number of speech frames, the order of discrete cosine transform, etc., can also be adjusted according to the computing capability of the device and the requirements of recognition accuracy.
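	• The differential step above can be sketched as follows: starting from 20-dimensional MFCC features per frame (assumed to come from a Fourier transform, Mel filtering, and discrete cosine transform pipeline, which is not reproduced here), first- and second-order differences along the time axis are appended, giving 60 dimensions per frame.

```python
import numpy as np

def add_deltas(mfcc: np.ndarray) -> np.ndarray:
    # mfcc has shape (num_frames, 20); take differences along time,
    # repeating the first frame so the frame count is preserved.
    delta1 = np.diff(mfcc, axis=0, prepend=mfcc[:1])      # first-order difference
    delta2 = np.diff(delta1, axis=0, prepend=delta1[:1])  # second-order difference
    return np.concatenate([mfcc, delta1, delta2], axis=1)

mfcc = np.random.default_rng(0).standard_normal((500, 20))  # 500 frames x 20 dims
features = add_deltas(mfcc)   # shape (500, 60)
```

	• With 500 frames of 20-dimensional base MFCCs, this matches the 10,000 base feature values mentioned later in the text.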
  • the feature parameters in the speech signal can also be extracted by other methods, for example, the log mel method, the Linear Predictive Cepstrum Coefficient (LPCC) method, and the like.
  • the identification model for parameter identification is not limited in this application, and can be a probability model, such as an identity vector (I-vector) model; or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model, ResNet model, etc.
  • the 10,000-dimensional feature parameters of the speech to be verified are input into the recognition model, and the speech template of the current user of the mobile phone 100 is obtained after the dimensionality reduction and abstraction of the recognition model.
  • the speech template of the current user of the mobile phone 100 is a 512-dimensional feature vector, denoted as A.
	• similarly, the feature parameters of the 6 enhanced registered voices are input into the recognition model to obtain 6 host voice templates; each voice template is a 512-dimensional feature vector, and the 6 host voice templates are denoted as B1, B2, ..., B6.
  • the template matching method may be a cosine distance method, a linear discriminant method, or a probabilistic linear discriminant analysis method, or the like.
  • the cosine distance method is used as an example for description below.
	• the cosine distance method evaluates the similarity of two feature vectors by computing the cosine of the angle between them. Taking the feature vector A (the feature vector corresponding to the voice template of the current user of the mobile phone 100) and the feature vector B1 (the feature vector corresponding to a host voice template of the mobile phone 100) as an example, the cosine similarity can be expressed as cos θ1 = (Σ a_i·b_i) / (√(Σ a_i^2)·√(Σ b_i^2)), where a_i is the i-th coordinate of the feature vector A, b_i is the i-th coordinate of the feature vector B1, and θ1 is the angle between the feature vector A and the feature vector B1.
	• the larger the value of cos θ1, the closer the directions of the feature vector A and the feature vector B1, and the higher the similarity of the two feature vectors; the smaller the value of cos θ1, the lower the similarity between the two feature vectors.
	• if the similarity P between the current user's voice and the host's voice is greater than a set value (for example, 0.8), it is determined that the current user of the mobile phone 100 is the host, and the mobile phone 100 unlocks the screen; otherwise, it is determined that the current user of the mobile phone 100 is not the host, and the mobile phone 100 does not unlock the screen.
	• the to-be-verified voice is compared with the six enhanced registered voices to obtain six cosine similarity calculation results, and the six results are then averaged to obtain the final similarity P between the current user's voice and the host's voice.
  • the matching errors between the voice to be verified and the single enhanced registered voice can be averaged, which is beneficial to improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
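	• The cosine matching and averaging described above can be sketched as follows; the 0.8 threshold follows the example in the text, and the function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match(user_template, host_templates, threshold: float = 0.8):
    # Compare the user's template A with each host template B1..B6 and
    # average the cosine similarities into the final similarity P.
    p = float(np.mean([cosine_similarity(user_template, b)
                       for b in host_templates]))
    return p, p > threshold
```

	• Averaging over the six templates smooths out the matching error against any single enhanced registered voice, as noted above.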
  • the voiceprint recognition algorithm (the algorithms corresponding to steps S130 to S170 ) can be implemented on the mobile phone 100 to realize the offline recognition of the voiceprint; it can also be implemented in the cloud to save the mobile phone 100 local computing resources.
  • the voiceprint recognition algorithm is implemented in the cloud
  • the mobile phone 100 uploads the to-be-verified voice collected in step S120 to the cloud server, and the cloud server uses the voiceprint recognition algorithm to authenticate the identity of the current user of the mobile phone 100, and returns the authentication result.
  • the mobile phone 100 decides whether to unlock the screen according to the authentication result.
  • a reverberation component is added to the registration speech to obtain an enhanced registration speech.
  • the voice of the speaker to be verified will generate reverberation in the room, and the reverberation, as a part of the interference factor, will have a certain impact on the recognition rate of the voiceprint.
  • Reverberation prediction is performed on the registered voice based on the recognition scene, that is, the reverberation that the registered voice would produce in the recognition scene is simulated, and the reverberation components obtained from this simulation are added to the registered voice, so that the non-speech components of the voice to be verified and those of the enhanced registered voice are as close as possible, thereby improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the reverberation generated by the registered speech in the recognition scene is estimated.
  • the image source model method can simulate the reflection path of the sound wave in the room, and calculate the room impulse response function (RIR) of the sound field according to the delay and attenuation parameters of the sound wave.
  • the reverberation generated by the registered speech in the room is obtained by convolving the audio signal of the registered speech with the impulse response function.
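The convolution step can be sketched as below. The RIR here is a toy two-reflection response rather than one computed by an actual image source model, and all sample values are illustrative:

```python
def convolve(signal, rir):
    # Discrete convolution: y[n] = sum_k x[k] * h[n - k]
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += x * h
    return out

registered = [1.0, 0.5, 0.25]   # registered-speech samples (illustrative)
rir = [1.0, 0.0, 0.3]           # direct path plus one delayed, attenuated reflection
reverberant = convolve(registered, rir)  # registered speech with simulated reverberation
```

In practice the convolution would be run with an FFT-based routine over an RIR computed for the actual room geometry, but the operation is the same.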
  • The distance between the speaker to be verified and the microphone may be large (for example, more than 1 m), so the voice of the speaker to be verified is somewhat attenuated by the time it reaches the microphone. Therefore, in some embodiments, to account for the distance between the speaker to be verified and the microphone, far-field simulation is also performed on the registered voice when the reverberation of the registered voice is estimated with the image source model method.
  • The distance between the registered voice and the voice receiving device in the simulated sound field is set according to the distance between the speaker to be verified and the microphone, so that the acquisition distance of the registered voice is simulated to be the same as that of the voice to be verified. This further reduces the differences between the voice to be verified and the enhanced registered voice other than the effective speech components, improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • The voice to be verified is also subjected to front-end processing, for example, echo cancellation, de-reverberation, active noise reduction, dynamic gain, directional pickup, etc.
  • The enhanced registered voice is subjected to the same front-end processing as the voice to be verified (that is, the voice to be verified and the enhanced registered voice are passed through the same front-end processing algorithm module) to further improve the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the feature parameter extraction step of the speech signal (ie, step S150 ) may be omitted, and the speech signal may be recognized directly through a deep neural network model.
  • this embodiment is used to provide another voice enhancement method.
  • The scene of the voice to be verified is also recognized to obtain the scene type corresponding to the voice to be verified.
  • the enhanced registration voice is also determined according to the above scene type.
  • the speech enhancement method performed by the mobile phone 100 according to this embodiment includes the following steps:
  • the registered voice is the voice recorded by the owner of the mobile phone 100 in a quiet environment, so that there is no obvious noise component in the registered voice.
  • the voice to be verified is the voice recorded by the current user of the mobile phone in the noisy human voice scene.
  • the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario.
  • The current user of the mobile phone is the person who currently operates the mobile phone 100, and may be the owner himself or someone other than the owner.
  • S230 Determine the noise contained in the speech to be verified.
  • the noise contained in the voice to be verified is the sound generated by other sound sources other than the current user of the mobile phone 100 in the recognition scene.
  • S240 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
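The superposition in step S240 is plain sample-wise addition; a minimal sketch follows. Tiling the noise to match the registration length is an assumption for illustration, not spelled out in the text:

```python
def superimpose(registration, noise):
    # Add the noise signal values to the registration-speech signal values.
    # The noise segment is tiled if it is shorter than the registration speech.
    return [s + noise[i % len(noise)] for i, s in enumerate(registration)]

registration = [0.5, -0.2, 0.1, 0.4]  # clean registered-speech samples (illustrative)
noise = [0.05, -0.03]                 # noise extracted from the voice to be verified
enhanced = superimpose(registration, noise)
```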
  • steps S210-S240 are substantially the same as steps S110-S140 in Embodiment 1, and detailed processes in the steps are not repeated.
  • The number of registered voices is the same as in the first embodiment, that is, six. Therefore, in step S240, the noise contained in the voice to be verified is superimposed on each of the six registered voices to obtain six enhanced registered voices.
  • S250 Determine the scene type corresponding to the voice to be verified. Specifically, after the voice to be verified is collected, the scene type corresponding to it is identified by a scene recognition algorithm, such as a GMM method or a DNN method.
  • the label value of the scene type can be a home scene; a car scene; an outdoor noisy scene; a venue scene; a cinema scene, etc.
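A toy stand-in for the GMM/DNN scene classifier mentioned above: it scores a single made-up 1-D acoustic feature against one Gaussian per scene label. A real classifier would use full mixture models or a neural network over spectral features; the parameters below are invented for illustration:

```python
import math

# Hypothetical per-scene Gaussian parameters (mean, std) of a 1-D feature.
SCENES = {
    "home scene": (0.2, 0.1),
    "car scene": (0.6, 0.15),
    "outdoor noisy scene": (0.9, 0.2),
}

def log_likelihood(x, mean, std):
    # Log density of a 1-D Gaussian.
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def classify_scene(feature):
    # Pick the scene label whose model gives the feature the highest likelihood.
    return max(SCENES, key=lambda s: log_likelihood(feature, *SCENES[s]))

label = classify_scene(0.55)
```

The chosen label is then used to look up the corresponding template noise.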
  • The template noise is noise corresponding to the scene type determined in step S250, for example, noise recorded in the scene determined in step S250.
  • One scene type can correspond to multiple groups of template noise.
  • For example, the scene type corresponding to the voice to be verified is determined in step S250 to be a home scene, and three groups of template noise are recorded in the home scene (for example, sound produced by home audio-visual equipment, background voices produced when family members talk, and/or noise from household appliances).
  • a total of 24 enhanced registration voices are formed.
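The count of 24 follows from combining each of the six registered voices with the extracted noise plus the three groups of template noise. A sketch of the bookkeeping, with string labels standing in for audio signals:

```python
registered_voices = [f"registered_{i}" for i in range(1, 7)]     # six registered voices
noises = ["extracted_noise"] + [f"template_noise_{j}" for j in range(1, 4)]

# Every (registered voice, noise) pair yields one enhanced registered voice.
enhanced = [(r, n) for r in registered_voices for n in noises]
count = len(enhanced)  # 6 voices x 4 noises = 24
```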
  • S270 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voices; refer to step S150 in the first embodiment. It can be understood that, in this embodiment, the feature parameters of each of the 24 enhanced registered voices are extracted.
  • S280 Perform parameter identification on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice to obtain the voice template of the current user of the mobile phone 100 and the voice template of the owner of the mobile phone 100 respectively, refer to S160 in the first embodiment.
  • The obtained 24 owner voice templates are respectively recorded as B1, B2, ..., B24.
  • S290 Match the voice template of the owner of the mobile phone 100 with the voice template of the current user of the mobile phone 100 to obtain a recognition result.
  • Referring to step S170 in the first embodiment, the cosine similarities between the 24 owner voice templates and the current-user voice template of the mobile phone 100 are cos θ1, cos θ2, ..., cos θ24, respectively.
  • If the similarity P between the current user's voice and the owner's voice is greater than the set value (for example, 0.8), it is determined that the current user of the mobile phone 100 is the owner himself, and the mobile phone 100 unlocks the screen; otherwise, it is determined that the current user of the mobile phone 100 is not the owner himself, and the mobile phone 100 will not unlock the screen.
  • steps S230 and S240 are omitted, that is, the step of enhancing the registered voice according to the noise contained in the voice to be verified is omitted, and the registered voice is only enhanced according to the template noise corresponding to the recognition scene.
  • In this case, 18 enhanced registered voices are obtained; the corresponding owner voice templates are recorded as B7, B8, ..., B24, and the similarity P between the current user's voice of the mobile phone 100 and the owner's voice is P = (cos θ7 + cos θ8 + ... + cos θ24)/18.
  • For the implementation body of the voiceprint recognition algorithm (implemented locally on the mobile phone 100 or in the cloud), other processing of the speech (for example, reverberation estimation, far-field simulation, front-end processing, etc.), and so on, refer to the introduction in Embodiment 1; details are not repeated here.
  • the scene type corresponding to the voice to be verified, the distance between the speaker to be verified and the microphone, etc. are all environmental characteristic parameters in the voice to be verified.
  • This embodiment changes the application scenario of the voice enhancement method on the basis of the first embodiment. Specifically, the voice enhancement method in this embodiment is applied to the scenario shown in FIG. 5 for controlling the smart speaker 200 .
  • the smart speaker 200 has a voice recognition function, and the user can interact with the smart speaker 200 through voice, so as to perform functions such as song on demand, weather query, schedule management, and smart home control through the smart speaker 200 .
  • the method authenticates the identity of the user to determine whether the current user is the owner of the smart speaker 200, and then determines whether the current user has the authority to control the smart speaker 200 to perform the operation.
  • the speech enhancement method of this embodiment includes:
  • S310 Collect registered voice.
  • the registration voice from the owner of the smart speaker 200 is collected through the microphone of the smart speaker 200, but the application is not limited to this.
  • the registered voice can be saved locally in the smart speaker 200 to recognize the user's voiceprint through the smart speaker 200 to realize offline recognition of the voiceprint; the registered voice can also be uploaded to the cloud to use The computing resources in the cloud recognize the user's voiceprint to save the local computing resources of the smart speaker 200 .
  • S320 Collect the voice to be verified.
  • the voice to be verified is collected through the microphone of the smart speaker 200 .
  • Acquisition parameters of the voice to be verified include, for example, the duration and text content of the voice to be verified.
  • S330 Determine the noise contained in the speech to be verified.
  • The voice to be verified is divided into a plurality of speech frames, and the energy of each frame is calculated. Since the energy of noise is generally lower than that of valid speech, a speech frame whose energy is below a predetermined value can be determined to be a noise frame, thereby simplifying the noise extraction process.
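The energy-gated noise-frame selection in step S330 can be sketched as below; the frame length and threshold are made-up values for illustration:

```python
def frame_energy(frame):
    # Sum of squared sample values in the frame.
    return sum(s * s for s in frame)

def noise_frames(samples, frame_len, threshold):
    # Split the voice to be verified into frames and keep the low-energy ones,
    # treating them as noise rather than valid speech.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [f for f in frames if frame_energy(f) < threshold]

samples = [0.9, -0.8, 0.7, 0.01, 0.02, -0.01]  # loud speech followed by quiet noise
noise = noise_frames(samples, frame_len=3, threshold=0.1)
```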
  • S340 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • S350 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice.
  • the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice are extracted by the MFCC method.
  • The recognition model used for parameter recognition is not limited in this embodiment, and may be a probability model, such as an identity vector (I-vector) model, or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model or a ResNet model.
  • The template matching method may be a cosine distance method, a linear discriminant method, a probabilistic linear discriminant analysis method, or the like. If the similarity between the current user's voice and the owner's voice is greater than the set value, it is determined that the current user of the smart speaker 200 is the owner himself, and the smart speaker 200 performs the corresponding operation in response to the user's voice command; otherwise, it is determined that the current user of the smart speaker 200 is not the owner himself, and the smart speaker 200 ignores the user's voice command.
  • the speech enhancement method in this embodiment is substantially the same as the speech enhancement method in Embodiment 1 except for the application scenario. Therefore, for technical details not described in this embodiment, reference may be made to the description in Embodiment 1.
  • The voiceprint recognition algorithm (the algorithms corresponding to steps S330 to S370) can be implemented on the smart speaker 200 to realize offline recognition of voiceprints; it can also be implemented in the cloud to save the local computing resources of the smart speaker 200.
  • the voiceprint recognition algorithm is implemented in the cloud
  • The smart speaker 200 uploads the voice to be verified collected in step S320 to the cloud server.
  • The cloud server uses the voiceprint recognition algorithm to authenticate the identity of the current user of the smart speaker 200 and returns the authentication result to the smart speaker 200, which determines whether to execute the user's voice command according to the result.
  • Electronic device 400 may include one or more processors 401 coupled to controller hub 403 .
  • The controller hub 403 communicates with the processor 401 via a multi-drop bus such as a Front Side Bus (FSB), a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection 406.
  • Processor 401 executes instructions that control general types of data processing operations.
  • The controller hub 403 includes, but is not limited to, a Graphics & Memory Controller Hub (GMCH) (not shown) and an Input/Output Hub (IOH) (which may be on a separate chip) (not shown), where the GMCH includes memory and graphics controllers and is coupled to the IOH.
  • Electronic device 400 may also include a coprocessor 402 and memory 404 coupled to controller hub 403 .
  • One or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 coupled directly to the processor 401; in this case, the controller hub 403 and the IOH are in a single chip.
  • the memory 404 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two.
  • Memory 404 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, in particular temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the speech enhancement method described in FIGS. 3 and 4 .
  • When the instructions are executed on a computer, the computer is caused to execute the method disclosed in the first embodiment and/or the second embodiment.
  • The coprocessor 402 is a special-purpose processor, such as a high-throughput Many Integrated Core (MIC) processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU (General-Purpose computing on Graphics Processing Units), an embedded processor, or the like.
  • the electronic device 400 may further include a network interface (NIC, Network Interface Controller) 406 .
  • The network interface 406 may include a transceiver for providing a radio interface for the electronic device 400 to communicate with any other suitable devices (e.g., front-end modules, antennas, etc.).
  • network interface 406 may be integrated with other components of electronic device 400 .
  • the network interface 406 can implement the functions of the communication unit in the above-mentioned embodiments.
  • the electronic device 400 may further include an input/output (I/O, Input/Output) device 405 .
  • The I/O device 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 400.
  • Figure 6 is exemplary only. That is, although FIG. 6 shows that the electronic device 400 includes multiple components such as the processor 401, the controller hub 403, and the memory 404, in practical applications a device using the methods of the present application may include only some of these components, for example, only the processor 401 and the network interface 406. The optional components in FIG. 6 are shown with dashed lines.
  • The SoC 500 includes: an interconnect unit 550 coupled to the processor 510; a system agent unit 580; a bus controller unit 590; an integrated memory controller unit 540; a set of one or more coprocessors 520, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random-Access Memory (SRAM) unit 530; and a Direct Memory Access (DMA) unit 560.
  • The coprocessor 520 includes a special-purpose processor, such as a network or communications processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
  • Static random access memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, in particular temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the speech enhancement method described in FIGS. 3 and 4 .
  • When the instructions are executed on a computer, the computer is caused to execute the method disclosed in the first embodiment and/or the second embodiment.
  • Program code may be applied to input instructions to perform the functions described herein and to generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • A processing system includes any system having a processor such as a Digital Signal Processor (DSP), a microcontroller, an Application-Specific Integrated Circuit (ASIC), or a microprocessor.
  • the program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • the program code may also be implemented in assembly or machine language, if desired.
  • the mechanisms described herein are not limited to the scope of any particular programming language. In either case, the language may be a compiled language or an interpreted language.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor; the instructions, when read by a machine, cause the machine to fabricate logic that implements the techniques described herein.
  • These representations, referred to as "IP (Intellectual Property) cores," may be stored on tangible computer-readable storage media and provided to multiple customers or production facilities for loading into the manufacturing machines that actually manufacture the logic or processor.
  • an instruction converter may be used to convert instructions from a source instruction set to a target instruction set.
  • An instruction translator may transform (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core.
  • Instruction translators can be implemented in software, hardware, firmware, or a combination thereof.
  • the instruction translator may be on-processor, off-processor, or partially on-processor and partially off-processor.

Abstract

The present application provides an artificial intelligence (AI)-based speech enhancement method, a terminal device, a speech enhancement system, and a computer readable storage medium. An electronic device acquires speech to be verified; the electronic device determines at least one of environmental noise and an environment feature parameter comprised in the speech to be verified; the electronic device then enhances a registration speech on the basis of the environmental noise and/or the environment feature parameter; finally, the electronic device compares the speech to be verified with the enhanced registration speech to determine whether the speech to be verified and the registration speech are from the same user. In embodiments of the present application, the registration speech is enhanced according to a noise component in the speech to be verified so as to cause the enhanced registration speech and the speech to be verified to have similar noise components, so that a more accurate recognition result can be obtained.

Description

Speech Enhancement Method, Device, System and Storage Medium
This application claims priority to Chinese Patent Application No. 202010650893.X, entitled "Speech Enhancement Method, Device, System and Storage Medium", filed with the China Patent Office on July 8, 2020, the entire content of which is incorporated herein by reference.
Technical Field

The present application relates to the technical field of biometrics, and in particular to a speech enhancement method, device, system, and computer-readable storage medium.
Background

At present, biometric authentication technology based on biometric identification has gradually been popularized and applied in fields such as family life and public security. Biometric features that can be applied to biometric authentication include fingerprints, faces, irises, DNA, voiceprints, and the like. Among them, voiceprint recognition technology (also known as speaker recognition technology), which uses the voiceprint as the identification feature, collects sound samples without contact; the collection is more concealed and is therefore more easily accepted by users.

In the prior art, when there is noise in the environment in which a sound sample is collected, the voiceprint recognition rate is affected.
SUMMARY OF THE INVENTION

Some embodiments of the present application provide a speech enhancement method, a terminal device, a speech enhancement system, and a computer-readable storage medium. The present application is described below from several aspects, whose embodiments and beneficial effects may be referred to one another.

In a first aspect, an embodiment of the present application provides a speech enhancement method applied to an electronic device, including: collecting a voice to be verified; determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified; enhancing a registered voice based on the environmental noise and/or the environmental characteristic parameters; and comparing the voice to be verified with the enhanced registered voice to determine whether the voice to be verified and the registered voice are from the same user.

According to the embodiments of the present application, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components. In this way, the main difference between the voice to be verified and the enhanced registered voice lies in the difference between their effective speech components, and comparing the two with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in the embodiments of the present application, the user only needs to record the registered voice in a quiet environment rather than recording it separately in multiple scenarios, so the user experience is better.
In some embodiments, the registered voice is a voice from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered voice, which can improve the recognition accuracy.

In some embodiments, enhancing the registered voice based on the environmental noise includes superimposing the environmental noise on the registered voice. The implementation of the present application obtains the enhanced registered voice by superimposing the environmental noise on the registered voice, and the algorithm is simple.

In some embodiments, the environmental noise is the sound picked up by a secondary microphone of the electronic device. This makes it convenient to determine the noise contained in the voice to be verified.

In some embodiments, the duration of the voice to be verified is shorter than that of the registered voice. In this way, the user can record a short voice to be verified, which helps improve the user experience.
In some embodiments, the environmental characteristic parameters include the scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to the scene type, and superimposing the template noise on the registered voice.

According to the embodiments of the present application, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.

In some embodiments, the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm. In some embodiments, the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.

In some embodiments, the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene. These scene types cover the places of the user's daily activities, which helps improve the user experience.
In some embodiments, the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the electronic device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device. The far-field simulation maps the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).

According to the embodiments of the present application, the far-field simulation of the registered voice takes into account the attenuation of the voice to be verified during propagation, so that the enhanced registered voice and the voice to be verified have components that are as close as possible, which helps improve the recognition accuracy.

In some embodiments, performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device includes: establishing, based on the image source model method and according to that distance, an impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation.
In some embodiments, the voice to be verified and the enhanced registered voice are both speech processed by the same front-end processing algorithm. Front-end processing removes interfering factors from the speech, which helps improve the accuracy of voiceprint recognition.
In some embodiments, the front-end processing algorithm includes at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
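Of the front-end steps listed, dynamic gain is the simplest to illustrate. The sketch below assumes a plain RMS-normalizing automatic gain control; the function name and target level are illustrative, not taken from the application.

```python
import numpy as np

def dynamic_gain(signal, target_rms=0.1, eps=1e-8):
    """Scale the signal so its RMS level matches target_rms. Applying the
    same normalization to both the voice to be verified and the enhanced
    registered voice keeps their levels comparable before matching."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / (rms + eps))
```

Running both inputs through identical front-end steps like this is what guarantees that any remaining difference between them reflects the speakers rather than the processing.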
In some embodiments, there are multiple registered voices, and each registered voice is enhanced based on the environmental noise and/or the environmental characteristic parameters, so as to obtain multiple enhanced registered voices.
According to this embodiment of the present application, with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and the similarity between the voice of the speaker to be verified and the voice of the registered speaker can then be judged comprehensively from these results. The error of any single matching result is thereby averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
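The score-averaging idea can be sketched as follows, assuming feature templates have already been extracted as vectors; the cosine similarity and the threshold value are illustrative choices, not mandated by the text.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_against_templates(probe, templates, threshold=0.7):
    """Match the voice to be verified against every enhanced registered
    template and average the similarity scores, so the error of any
    single match is smoothed out."""
    scores = [cosine_score(probe, t) for t in templates]
    return np.mean(scores) >= threshold, scores
```

A single noisy match can then fall below threshold without causing a false rejection, as long as the average over all enhanced templates stays high.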
In some embodiments, comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice come from the same user includes: extracting characteristic parameters of the voice to be verified and of the enhanced registered voice through a characteristic parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on those of the enhanced registered voice through a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker through a template matching algorithm, and determining from the matching result that the voice to be verified and the registered voice come from the same user.
In some embodiments, the characteristic parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network (TDNN) model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
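As an illustration of the log-mel variant named above, a self-contained numpy sketch follows; the filterbank size, FFT length, and hop are arbitrary example values, and MFCCs would add a DCT over the log filterbank energies.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, fs=16000, n_fft=512, hop=160, n_mels=26):
    """Frame the signal, take magnitude spectra, and pool them through a
    triangular mel filterbank; the log of the filterbank energies is the
    log-mel feature mentioned in the text."""
    # build triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)
    # frame, window, FFT, mel pooling
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * np.hamming(n_fft)
        mag = np.abs(np.fft.rfft(frame))
        frames.append(np.log(fbank @ mag + 1e-8))
    return np.array(frames)          # shape: (n_frames, n_mels)
```

The resulting per-frame feature matrix is what the parameter recognition model (i-vector, TDNN, or ResNet) would consume to produce a voice template.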
In a second aspect, an embodiment of the present application provides a speech enhancement method, including: a terminal device collects the voice to be verified and sends it to a server communicatively connected to the terminal device; the server determines environmental noise and/or environmental characteristic parameters contained in the voice to be verified; the server enhances the registered voice based on the environmental noise and/or the environmental characteristic parameters; the server compares the voice to be verified with the enhanced registered voice and determines that the voice to be verified and the registered voice come from the same user; and the server sends, to the terminal device, the result of determining that the voice to be verified and the registered voice come from the same user.
According to this embodiment of the present application, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components. The main difference between the two then lies in the difference between their effective speech components, and comparing them with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in this embodiment the user only needs to record the registered voice in a quiet environment, rather than recording it separately in multiple scenarios, so the user experience is better. Moreover, because the speaker recognition algorithm runs on the server, local computing resources of the terminal device are saved.
In some embodiments, the registered voice is speech from the registered speaker collected in a quiet environment. In this way, the registered voice contains no obvious noise component, which improves recognition accuracy.
In some embodiments, enhancing the registered voice based on the environmental noise includes superimposing the environmental noise on the registered voice. This embodiment obtains the enhanced registered voice simply by superimposing the environmental noise on the registered voice, so the algorithm is simple.
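The superposition step can be sketched directly; the only detail to handle is that the captured noise may be shorter than the registered voice, so this illustrative helper tiles it (a simplification, not the application's implementation).

```python
import numpy as np

def superimpose_noise(registered, noise):
    """Add the ambient noise captured alongside the voice to be verified
    onto the clean registered voice, tiling or trimming the noise so it
    covers the full length of the registration."""
    if len(noise) < len(registered):
        reps = int(np.ceil(len(registered) / len(noise)))
        noise = np.tile(noise, reps)
    return registered + noise[:len(registered)]
```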
In some embodiments, the environmental noise is the sound picked up by a secondary microphone of the terminal device. This embodiment makes it convenient to determine the noise contained in the voice to be verified.
In some embodiments, the duration of the voice to be verified is shorter than the duration of the registered voice. The user can thus record a shorter voice for verification, which improves the user experience.
In some embodiments, the environmental characteristic parameters include the scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
According to this embodiment of the present application, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
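A minimal sketch of the template-noise lookup, with a hypothetical in-memory noise library standing in for recorded template clips; the scene names and noise arrays are placeholders, not part of the application.

```python
import numpy as np

# Hypothetical template-noise library: one representative noise clip per
# scene type. In practice these would be recorded or curated clips.
rng = np.random.default_rng(0)
TEMPLATE_NOISE = {
    "home":    0.01 * rng.standard_normal(16000),
    "vehicle": 0.05 * rng.standard_normal(16000),
    "outdoor": 0.08 * rng.standard_normal(16000),
    "venue":   0.06 * rng.standard_normal(16000),
    "cinema":  0.02 * rng.standard_normal(16000),
}

def enhance_with_scene_noise(registered, scene):
    """Look up the template noise for the recognized scene type and
    superimpose it on the registered voice, tiled or trimmed to length."""
    noise = TEMPLATE_NOISE[scene]
    reps = -(-len(registered) // len(noise))      # ceiling division
    return registered + np.tile(noise, reps)[:len(registered)]
```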
In some embodiments, the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm. In some embodiments, the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.
In some embodiments, the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a conference venue scene; a cinema scene. The scene types of this embodiment cover the places where users carry out daily activities, which helps improve the user experience.
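The GMM scene recognizer named above can be approximated, for illustration only, by a single diagonal Gaussian per scene fit to frame-level features; a real system would use a multi-component GMM or a DNN. Classification picks the scene with the highest total log-likelihood over the utterance's frames.

```python
import numpy as np

class DiagonalGaussianSceneClassifier:
    """Single diagonal Gaussian per scene, fit to per-frame feature
    vectors; a one-component stand-in for a GMM scene recognizer."""

    def fit(self, features_by_scene):
        self.params = {}
        for scene, feats in features_by_scene.items():   # feats: (n, d)
            mu = feats.mean(axis=0)
            var = feats.var(axis=0) + 1e-6               # avoid zero variance
            self.params[scene] = (mu, var)
        return self

    def log_likelihood(self, feats, scene):
        mu, var = self.params[scene]
        ll = -0.5 * (np.log(2 * np.pi * var) + (feats - mu) ** 2 / var)
        return float(ll.sum())

    def predict(self, feats):
        return max(self.params, key=lambda s: self.log_likelihood(feats, s))
```

The predicted scene label is then used to select the matching template noise for enhancing the registered voice.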
In some embodiments, the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to that distance. Herein, the far-field simulation of the registered voice serves to map the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produced the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produced the voice to be verified).
According to this embodiment of the present application, the far-field simulation of the registered voice accounts for the attenuation that the voice to be verified undergoes during propagation, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
In some embodiments, performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model method and that distance, an impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation.
In some embodiments, the voice to be verified and the enhanced registered voice are both speech processed by the same front-end processing algorithm. Front-end processing removes interfering factors from the speech, which helps improve the accuracy of voiceprint recognition.
In some embodiments, the front-end processing algorithm includes at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
In some embodiments, there are multiple registered voices, and the server enhances each registered voice based on the environmental noise and/or the environmental characteristic parameters, so as to obtain multiple enhanced registered voices.
According to this embodiment of the present application, with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and the similarity between the voice of the speaker to be verified and the voice of the registered speaker can then be judged comprehensively from these results. The error of any single matching result is thereby averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
In some embodiments, comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice come from the same user includes: extracting characteristic parameters of the voice to be verified and of the enhanced registered voice through a characteristic parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on those of the enhanced registered voice through a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker through a template matching algorithm, and determining from the matching result that the voice to be verified and the registered voice come from the same user.
In some embodiments, the characteristic parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network (TDNN) model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor which, when executing the instructions in the memory, causes the electronic device to perform the speaker recognition method provided by any embodiment of the first aspect of the present application. For the beneficial effects achievable by the third aspect, reference may be made to the beneficial effects of the method provided by any embodiment of the first aspect, which are not repeated here.
In a fourth aspect, an embodiment of the present application provides a speech enhancement system, including a terminal device and a server communicatively connected to the terminal device, wherein:
the terminal device collects the voice to be verified and sends it to the server; the server is configured to determine environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance the registered voice based on the environmental noise and/or the environmental characteristic parameters, compare the voice to be verified with the enhanced registered voice, and determine that the voice to be verified and the registered voice come from the same user; and the server is further configured to send, to the terminal device, the result of determining that the voice to be verified and the registered voice come from the same user.
According to this embodiment of the present application, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components. The main difference between the two then lies in the difference between their effective speech components, and comparing them with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in this embodiment the user only needs to record the registered voice in a quiet environment, rather than recording it separately in multiple scenarios, so the user experience is better. Moreover, because the speaker recognition algorithm runs on the server, local computing resources of the terminal device are saved.
In some embodiments, the registered voice is speech from the registered speaker collected in a quiet environment. In this way, the registered voice contains no obvious noise component, which improves recognition accuracy.
In some embodiments, enhancing the registered voice based on the environmental noise includes superimposing the environmental noise on the registered voice. This embodiment obtains the enhanced registered voice simply by superimposing the environmental noise on the registered voice, so the algorithm is simple.
In some embodiments, the environmental noise is the sound picked up by a secondary microphone of the terminal device. This embodiment makes it convenient to determine the noise contained in the voice to be verified.
In some embodiments, the duration of the voice to be verified is shorter than the duration of the registered voice. The user can thus record a shorter voice for verification, which improves the user experience.
In some embodiments, the environmental characteristic parameters include the scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
According to this embodiment of the present application, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
In some embodiments, the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm. In some embodiments, the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.
In some embodiments, the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a conference venue scene; a cinema scene. The scene types of this embodiment cover the places where users carry out daily activities, which helps improve the user experience.
In some embodiments, the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to that distance. Herein, the far-field simulation of the registered voice serves to map the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produced the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produced the voice to be verified).
According to this embodiment of the present application, the far-field simulation of the registered voice accounts for the attenuation that the voice to be verified undergoes during propagation, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
In some embodiments, performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model method and that distance, an impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation.
In some embodiments, the voice to be verified and the enhanced registered voice are both speech processed by the same front-end processing algorithm. Front-end processing removes interfering factors from the speech, which helps improve the accuracy of voiceprint recognition.
In some embodiments, the front-end processing algorithm includes at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
In some embodiments, there are multiple registered voices, and the server enhances each registered voice based on the environmental noise and/or the environmental characteristic parameters, so as to obtain multiple enhanced registered voices.
According to this embodiment of the present application, with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and the similarity between the voice of the speaker to be verified and the voice of the registered speaker can then be judged comprehensively from these results. The error of any single matching result is thereby averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
In some embodiments, comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice come from the same user includes: extracting characteristic parameters of the voice to be verified and of the enhanced registered voice through a characteristic parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on those of the enhanced registered voice through a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker through a template matching algorithm, and determining from the matching result that the voice to be verified and the registered voice come from the same user.
In some embodiments, the characteristic parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network (TDNN) model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the method provided by any embodiment of the first aspect of the present application, or cause the computer to perform the method provided by any embodiment of the second aspect of the present application. For the beneficial effects achievable by the fifth aspect, reference may be made to the beneficial effects of the method provided by any embodiment of the first or second aspect, which are not repeated here.
Description of Drawings
Fig. 1a shows an exemplary application scenario of the speech enhancement method provided by an embodiment of the present application;
Fig. 1b shows another exemplary application scenario of the speech enhancement method provided by an embodiment of the present application;
Fig. 2 shows a schematic structural diagram of a speech enhancement device provided by an embodiment of the present application;
Fig. 3 shows a flowchart of a speech enhancement method provided by an embodiment of the present application;
Fig. 4 shows a flowchart of a speech enhancement method provided by another embodiment of the present application;
Fig. 5 shows an application scenario of the speech enhancement method provided by an embodiment of the present application;
Fig. 6 shows a structural diagram of an electronic device provided by an embodiment of the present application;
Fig. 7 shows a block diagram of a system-on-chip (SoC) provided by an embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
Speaker recognition technology (also known as voiceprint recognition technology) identifies a speaker's identity by exploiting the uniqueness of the speaker's voiceprint. Each person's vocal organs (for example, the tongue, teeth, larynx, lungs, nasal cavity, and vocal tract) differ innately, and speaking habits differ through acquired behavior, so each person's voiceprint features are unique; by analyzing these features, the speaker's identity can be recognized.
The specific process of speaker recognition is to collect the voice of a speaker whose identity is to be confirmed and compare it with the voice of a specific speaker, to confirm whether the speaker to be confirmed is that specific speaker. Herein, the voice of the speaker whose identity is to be confirmed is called the "voice to be verified", and that speaker is called the "speaker to be verified"; the voice of the specific speaker is called the "registered voice", and the specific speaker is called the "registered speaker".
Referring to Fig. 1a, the above process is described by taking the voiceprint unlocking function of a mobile phone (that is, unlocking the phone screen by means of voiceprint recognition) as an example. Before using the voiceprint unlocking function, the phone's owner records his or her own voice into the phone through the phone's microphone (this voice is the registered voice).
When the phone screen needs to be unlocked by means of voiceprint recognition, the current user of the phone records a real-time voice through the phone's microphone (this voice is the voice to be verified), and the phone compares the voice to be verified with the registered voice through a built-in voiceprint recognition program to judge whether the current user is the phone's owner. If the voice to be verified matches the registered voice, the current user is judged to be the owner, the user passes identity authentication, and the phone performs the subsequent screen unlocking action; if the voice to be verified does not match the registered voice, the current user is judged not to be the owner, the user fails identity authentication, and the phone can refuse the subsequent screen unlocking action.
以上以手机的声纹解锁功能为例对声纹识别技术的应用进行了说明,但本申请不限于此,声纹识别技术可应用于需要对说话人的身份进行识别的其他场景。例如,声纹识别技术可以应用于家庭生活领域,对智能手机、智能汽车、智能家居(例如,智能音视频设备、智能照明系统、智能门锁)等进行语音控制;声纹识别技术还可以应用于支付领域,将声纹认证与其他认证手段(例如,密码、动态验证码等)相结合对用户的身份进行双重或多重认证,以提高支付的安全性;声纹识别技术还可以应用于信息安全领域,将声纹认证作为登录账号的方式;声纹识别技术还可以应用于司法领域,将声纹作为判断身份的辅助证据等。The application of the voiceprint recognition technology is described above by taking the voiceprint unlocking function of a mobile phone as an example, but the present application is not limited to this, and the voiceprint recognition technology can be applied to other scenarios where the identity of the speaker needs to be recognized. For example, voiceprint recognition technology can be applied to the field of family life, and voice control of smart phones, smart cars, smart homes (eg, smart audio and video equipment, smart lighting systems, smart door locks), etc.; voiceprint recognition technology can also be applied In the field of payment, the voiceprint authentication is combined with other authentication methods (such as passwords, dynamic verification codes, etc.) to perform double or multiple authentication of the user's identity to improve the security of payment; voiceprint recognition technology can also be applied to information In the security field, voiceprint authentication is used as a way to log in to an account; voiceprint recognition technology can also be applied to the judicial field, using voiceprint as auxiliary evidence for judging identity.
Moreover, the device that performs voiceprint recognition may be an electronic device other than a mobile phone: for example, a mobile device, including a wearable device (e.g., a wristband or earphones) or an in-vehicle terminal; or a fixed device, including a smart home appliance or a network server. In addition, the voiceprint recognition algorithm may be implemented in the cloud as well as on the terminal. For example, after the mobile phone collects the speech to be verified, it may send the collected speech to the cloud, where a voiceprint recognition algorithm performs the recognition; after recognition is completed, the cloud returns the recognition result to the mobile phone. Through this cloud recognition mode, users can share the cloud's computing resources and thus save local computing resources on the mobile phone.
In the scenario shown in Figure 1b, if there is noisy babble in the surrounding environment while the speech of the speaker to be verified is being collected, this noise is picked up by the microphone together with the speech and becomes part of the speech to be verified. The speech to be verified then contains noise components in addition to the voice of the speaker to be verified, which lowers the voiceprint recognition rate.
This embodiment does not limit the voiceprint recognition scenario; for example, it may also be a home scenario, an in-vehicle scenario, a conference-hall scenario, a cinema scenario, and the like.
When the owner of the mobile phone needs to unlock it through voiceprint recognition and there is noise in the surrounding environment, the sound collected by the mobile phone's microphone contains not only the owner's voice but also the environmental noise. As a result, after the mobile phone compares the collected real-time speech of the owner with the registered speech preset in the phone, it may conclude that the two do not match. Even if the current user of the mobile phone is the owner, the phone may still report that identity authentication has failed, which degrades the user experience.
In the prior art, some technical solutions perform denoising on the speech to be verified to remove its noise components and thereby improve the voiceprint recognition rate. However, the denoised speech still contains residual noise components, and some valid speech components (the speech components of the speaker to be verified) are removed along with the noise. Consequently, the denoised speech may still fail to be recognized correctly, and the improvement in the voiceprint recognition rate is not significant.
Other prior-art solutions improve the voiceprint recognition rate by recording registered speech separately in different scenarios. Specifically, the user records registered speech in multiple different scenarios (e.g., a home scenario, a cinema scenario, a noisy outdoor scenario), and during voiceprint recognition the speech to be verified is compared with the registered speech recorded in the corresponding scenario, so as to improve the recognition rate. In this prior art, the user has to record registered speech separately in multiple scenarios, so the user experience is poor.
To this end, embodiments of the present application provide a speech enhancement method for improving the voiceprint recognition rate and the robustness of the voiceprint recognition method, while also improving the user experience. In the present application, after the speech to be verified is collected, a noise component corresponding to the noise component in the speech to be verified is superimposed on the registered speech, and the registered speech with the superimposed noise is then compared with the speech to be verified to obtain the recognition result. In other words, the registered speech is enhanced according to the noise component of the speech to be verified, so that the enhanced registered speech and the speech to be verified have similar noise components. The main difference between the speech to be verified and the enhanced registered speech then lies in the difference between their valid speech components, and comparing the two with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in the embodiments of the present application, the user only needs to record the registered speech in a quiet environment, and there is no need to record registered speech separately in multiple scenarios, so the user experience is better.
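The core idea above — superimposing the separated noise on the registered speech before comparison — can be sketched as follows. This is a minimal illustration, assuming speech signals are sample arrays; the tiling of the noise to cover the registered speech, and the cosine-similarity comparison of feature vectors, are illustrative choices rather than details taken from the disclosure:

```python
import numpy as np

def enhance_registered_speech(registered: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Superimpose the noise separated from the speech to be verified onto the
    registered speech, so both signals share similar noise components."""
    # Tile (or truncate) the noise so it covers the registered speech exactly.
    reps = int(np.ceil(len(registered) / len(noise)))
    noise = np.tile(noise, reps)[: len(registered)]
    return registered + noise

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compare two voiceprint feature vectors; higher means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In a full pipeline, `cosine_similarity` would be applied to feature parameters extracted from the two signals (by a speaker-embedding model), not to the raw waveforms.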
Here, a "valid speech component" is the speech component that originates from the speaker; for example, the valid speech component of the speech to be verified is the speech component of the speaker to be verified, and the valid speech component of the enhanced registered speech is the speech component of the registered speaker.
The technical solution of the present application is introduced below, still with reference to the voiceprint unlocking function of the mobile phone in Figure 1b, but it can be understood that the present application is not limited thereto.
FIG. 2 shows the structure of the mobile phone 100. The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, an antenna, a communication module 150, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a camera 193, a display screen 194, and the like.

It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a controller, a digital signal processor (DSP), a baseband processor, and the like. The different processing units may be independent devices or may be integrated in one or more processors.

The processor can generate operation control signals according to instruction opcodes and timing signals, completing the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs those instructions or data again, it can call them directly from this memory, which avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, and/or a general-purpose input/output (GPIO) interface, among others.

The I2S interface can be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170. The PCM interface can also be used for audio communication, sampling, quantizing, and encoding analog signals.

The GPIO interface can be configured by software. It can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the audio module 170, and the like. The GPIO interface can also be configured as an I2S interface, etc.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present invention are only schematic and do not constitute a structural limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt interface connection manners different from those in the foregoing embodiment, or a combination of multiple interface connection manners.
The wireless communication function of the mobile phone 100 may be implemented by the antenna, the communication module 150, the modem processor, the baseband processor, and the like.

The antenna is used to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization; for example, an antenna may be multiplexed as a diversity antenna for a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.

The communication module 150 may provide wireless communication solutions applied on the mobile phone 100, including 2G/3G/4G/5G. The communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The communication module 150 can receive electromagnetic waves through the antenna, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The communication module 150 can also amplify a signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna. In some embodiments, at least some of the functional modules of the communication module 150 may be provided in the processor 110. In some embodiments, at least some of the functional modules of the communication module 150 and at least some of the modules of the processor 110 may be provided in the same device.

The modem processor may include a modulator and a demodulator. The modulator modulates a low-frequency baseband signal to be transmitted into a medium- or high-frequency signal. The demodulator demodulates a received electromagnetic wave signal into a low-frequency baseband signal, and then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the communication module 150 or other functional modules.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement data storage functions, for example, saving files such as music and videos on the external memory card.

The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store an operating system, application programs required for at least one function (e.g., a sound playback function or an image playback function), a voiceprint recognition program, a speech-signal front-end processing program, and the like. The data storage area can store data created during use of the mobile phone 100 (e.g., audio data, a phone book) as well as data required for voiceprint recognition, such as the audio data of the registered speech and a trained speech-parameter recognition model. In addition, the internal memory 121 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the mobile phone 100 by running instructions stored in the internal memory 121 and/or instructions stored in the memory provided in the processor.
The mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.

The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.

The speaker 170A, also called a "loudspeaker," is used to convert an audio electrical signal into a sound signal. The mobile phone 100 can play music or conduct a hands-free call through the speaker 170A.

The receiver 170B, also called an "earpiece," is used to convert an audio electrical signal into a sound signal. When the mobile phone 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the ear.

The microphone 170C, also called a "mic" or "mouthpiece," is used to convert a sound signal into an electrical signal. When recording the registered speech or the speech to be verified, the user can speak with the mouth close to the microphone 170C so that the sound signal is input into the microphone 170C. The mobile phone 100 may be provided with at least one microphone 170C.
In other embodiments, the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. Specifically, the mobile phone 100 has one microphone at the top and one at the bottom: one microphone 170C is provided on the bottom edge of the mobile phone 100, and the other microphone 170C is provided on the top edge. When a user makes a call or sends a voice message, the mouth is usually close to the microphone on the bottom edge; the user's voice therefore produces a relatively large audio signal Va in this microphone, which is referred to herein as the "main mic." At the same time, the user's voice also produces a certain amount of audio signal Vb in the microphone 170C on the top edge, but because this microphone is farther from the user's mouth, its audio signal Vb is significantly smaller than the signal Va on the main mic; this microphone is referred to herein as the "secondary mic."
As for the noise in the environment, since the noise source is usually far from the mobile phone 100, the distances from the noise source to the main mic and to the secondary mic can be regarded as essentially equal; that is, the noise intensity collected by the main mic and by the secondary mic can be regarded as essentially the same.
The difference in signal strength caused by the different positions of the two mics can be used to separate the noise signal from the user's speech signal. For example, by differencing the audio signal picked up by the main mic and the audio signal picked up by the secondary mic (i.e., subtracting the secondary-mic signal from the main-mic signal), the user's speech signal can be obtained (this is the principle of dual-mic active noise reduction). Then, after removing the user's speech signal from the main-mic signal, the noise signal can be separated out. Alternatively, since the audio signal Vb on the secondary mic is significantly smaller than the audio signal Va on the main mic, the signal picked up by the secondary mic can itself be regarded as the noise signal.
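Under the simplifying assumptions stated above — the noise arrives at both mics with essentially equal intensity, and the secondary mic picks up little of the user's voice — the separation can be sketched as follows. The function name and sample arrays are illustrative, not part of the disclosure:

```python
import numpy as np

def separate_dual_mic(main: np.ndarray, secondary: np.ndarray):
    """Split the main-mic signal into speech and noise estimates.

    Assumes the noise component is essentially identical on both mics and
    the user's voice on the secondary mic is negligible. Under these
    assumptions the noise estimate reduces to the secondary-mic signal.
    """
    speech = main - secondary   # the difference cancels the shared noise
    noise = main - speech       # remove the speech estimate to leave noise
    return speech, noise
```

In practice the two channels would first need gain and delay calibration; this sketch only mirrors the idealized description in the text.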
The above gives one arrangement of the dual mics of the mobile phone 100, but this is only an exemplary description; the microphones may be arranged in other ways, for example, with the main mic on the front of the mobile phone 100 and the secondary mic on the back.

In other embodiments, the mobile phone 100 may further be provided with three, four, or more microphones 170C to collect sound signals and reduce noise, and also to identify sound sources and implement a directional recording function, among other things.

The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be a universal serial bus (USB) interface, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
[Embodiment 1]
The technical solution of this embodiment is described below with reference to the mobile phone voiceprint unlocking scenario in Figure 1b. It can be understood that the present application is not limited thereto, and the speech enhancement method of the present application can also be applied to scenarios other than the one shown in Figure 1b.

Referring to FIG. 3, this embodiment provides a speech enhancement method. After the speech to be verified is collected, the noise contained in the speech to be verified is separated from it, and the separated noise is then superimposed on the registered speech. In this way, the speech to be verified and the noise-superimposed registered speech have similar noise components, and the main difference between them lies in the difference between their valid speech components, which improves the voiceprint recognition rate and the robustness of the voiceprint recognition method. Specifically, the speech enhancement method provided by this embodiment includes the following steps:
S110: Collect the registered speech. To provide the voiceprint unlocking function, the mobile phone 100 has a voiceprint unlocking application (which may be a system application or a third-party application). To use the voiceprint unlocking function of the mobile phone 100, when the owner registers a user account with the voiceprint unlocking application, the owner's own speech is collected through the mobile phone 100, and the voiceprint unlocking application uses this speech as the reference speech for subsequent voiceprint recognition; this speech is the registered speech. However, the present application is not limited thereto. For example, in other embodiments, when the mobile phone 100 is powered on for the first time, the owner records the registered speech through the setup wizard of the mobile phone 100, and the voiceprint unlocking application of the mobile phone 100 uses this speech as the reference speech for voiceprint recognition.
Here, the registered speech is speech recorded by the owner of the mobile phone 100 in a quiet environment, so that the registered speech contains no obvious noise component.
Whether the environment is quiet can be characterized by the signal-to-noise ratio of the registered-speech recording environment (i.e., the ratio of the strength of the owner's speech signal to the strength of the noise signal): when the signal-to-noise ratio of the recording environment is higher than a set value (e.g., 30 dB), the recording environment is considered quiet. Alternatively, when the intensity of the noise signal in the registered-speech recording environment is lower than a set value (e.g., 20 dB), the recording environment is considered quiet.
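The SNR-based quietness check described above can be sketched as follows, assuming the speech and noise signals are available as separate sample arrays (the function names and the mean-power formulation are illustrative; the 30 dB threshold is the example value from the text):

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels, computed from mean sample powers."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

def is_quiet_environment(speech: np.ndarray, noise: np.ndarray,
                         snr_threshold_db: float = 30.0) -> bool:
    """The environment counts as quiet when the SNR exceeds the threshold."""
    return snr_db(speech, noise) > snr_threshold_db
```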
In this embodiment, the registered speech from the owner is collected through the microphone of the mobile phone 100. The registered speech is near-field speech. When recording the registered speech, the distance between the owner's mouth and the main mic of the mobile phone 100 is kept within 30 cm to 1 m; for example, the owner holds the mobile phone 100 and speaks directly toward the main mic, keeping the mouth within 30 cm of the main mic. This avoids attenuation of the owner's speech due to a long propagation distance.
When recording the registered speech, the owner records six utterances to form six registered speech entries. Recording multiple utterances helps improve the flexibility of speech recognition and the richness of the voiceprint information.
To balance the user's operating experience with ensuring that each registered speech entry contains sufficient voiceprint information, the length of each registered speech entry is 10 to 30 s. Further, each registered speech entry corresponds to different text content, so as to enrich the voiceprint information contained in the registered speech. After collecting the registered speech, the mobile phone 100 stores the audio signal of the registered speech in the internal memory. However, the present application is not limited to this; the mobile phone 100 may also upload the audio signal of the registered speech to the cloud, so as to recognize the voiceprint through the cloud recognition mode.

The above recording manner, length, and number of registered speech entries are only exemplary, and the present application is not limited thereto. For example, in other examples, the registered speech may be recorded by other recording devices (e.g., a voice recorder or a dedicated microphone), the number of registered speech entries may be one, and the length of a registered speech entry may be greater than 30 s.
For coherence of the description, step S110 is mentioned first. It can be understood that step S110, as the data preparation process of the speech enhancement method, is relatively independent of any single speech enhancement run and does not need to occur together with the other steps of the method every time.
S120: Collect the speech to be verified. The speech to be verified is speech recorded by the current user of the mobile phone in a noisy-babble scenario. In other words, the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario. In addition, the current user of the mobile phone is the person currently operating the mobile phone 100, who may be the owner or someone other than the owner.
In this embodiment, the speech to be verified is collected through the microphone of the mobile phone 100. When the screen of the mobile phone 100 is locked, the microphone of the mobile phone 100 is on; at this time, the current user of the mobile phone 100 can record the speech to be verified through the microphone to unlock the phone through voiceprint recognition. For example, when the user needs to operate the mobile phone 100 from a distance (e.g., to open an application on the phone, such as a music application or a phone application), or needs to operate the mobile phone 100 while both hands are occupied (e.g., while doing housework), the user inputs the speech to be verified through the microphone of the mobile phone 100 to unlock the phone through voiceprint recognition.

The speech to be verified is speech with specific content. In other embodiments, the speech to be verified may also be speech with arbitrary text content.
In this embodiment, the length of the speech to be verified is 10 to 30 s, so that it can contain relatively rich voiceprint information, which helps improve the voiceprint recognition rate. However, the present application does not limit this. For example, in other embodiments, the length of the speech to be verified is less than 10 s and thus shorter than the length of the registered speech; in this case, the user can record a shorter speech to be verified, which helps improve the user experience. When the length of the speech to be verified is less than that of the registered speech, a partial speech segment can be cut from the speech to be verified and spliced with the originally collected speech to be verified, so that the spliced speech has essentially the same length as the registered speech. In this way, in the subsequent steps of this embodiment (described in detail below), the feature parameters extracted from the registered speech and those extracted from the speech to be verified have the same dimensions, which facilitates comparing their similarity. In the description herein, no distinction is made between the originally collected speech to be verified and the spliced speech to be verified; both are referred to as the speech to be verified.
Herein, splicing voice A with voice B means connecting voice A and voice B end to end, so that the length of the spliced voice is the sum of the lengths of voice A and voice B. On this basis, this application does not limit the connection order of voice A and voice B; for example, voice A may be connected after voice B, or before it.
S130: Determine the noise contained in the voice to be verified. In this embodiment, the noise contained in the voice to be verified is the sound generated, in the recognition scene, by sound sources other than the current user of the mobile phone 100: for example, the sound of household appliances (e.g., a vacuum cleaner) or of running water while washing dishes in a home scene; the sound of the car radio or the engine in an in-vehicle scene; the sound of the projection audio system or the voices of other audience members in a cinema scene.
In this embodiment, the sound picked up by the secondary microphone of the mobile phone 100 is determined as the noise contained in the voice to be verified, so that this noise can be determined conveniently. However, this application is not limited thereto. For example, in some embodiments, the initial segment of the voice to be verified is considered to contain only noise components, so the initial segment is copied multiple times and the result is determined as the noise contained in the voice to be verified. As another example, in other embodiments, the voice to be verified is divided into multiple speech frames, and the energy of each speech frame is calculated. Since the energy of noise is usually lower than that of valid speech, when the energy of a speech frame is lower than a predetermined value, the frame can be determined as a noise frame, which simplifies the noise extraction process. In addition, other methods in the prior art may also be used to determine the noise in the voice to be verified, which are not described one by one here.
The energy of a speech frame is the sum of the squares of the signal values of the speech signals included in that frame. Illustratively, let the signal value of the i-th speech signal in a frame be $x_i$ and the number of speech signals in the frame be $N$; the energy of the frame is then

$$E = \sum_{i=1}^{N} x_i^2$$
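The frame-energy criterion can be expressed as a short Python sketch. This is illustrative only: the threshold value and the list-of-frames representation are assumptions, not values fixed by the patent.

```python
def frame_energy(frame):
    # E = sum of x_i^2 over the sample values x_i in the frame
    return sum(x * x for x in frame)

def noise_frames(frames, threshold):
    """Label a frame as noise when its energy falls below `threshold`
    (the threshold is application-dependent)."""
    return [f for f in frames if frame_energy(f) < threshold]
```

For example, with frames `[0.1, 0.1]` and `[1.0, 1.0]` and a threshold of 0.5, only the first frame is classified as noise.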
S140: Superimpose the noise contained in the voice to be verified on the registered voice to obtain an enhanced registered voice. In this embodiment, the signal values of the noise signal are added to the signal values of the registered voice signal in the time domain to obtain the enhanced registered voice. However, this application is not limited thereto; in other embodiments, the superposition of the registered voice signal and the noise signal may also be performed in the frequency domain. The embodiment of this application enhances the registered voice signal by simply adding signal values, so the algorithm is simple.
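The time-domain superposition of step S140 amounts to sample-wise addition. A minimal sketch, assuming both signals are lists of sample values; when the noise is shorter than the registered voice (as permitted below), the tail of the registered voice is left unchanged, which is one possible convention the patent does not fix:

```python
def superimpose(registered, noise):
    """Add noise sample values onto registered-voice sample values
    in the time domain to form the enhanced registered voice."""
    out = list(registered)
    for i, n in enumerate(noise):
        out[i] += n
    return out
```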
In this embodiment, the length of the noise is equal to the length of the registered voice; in other embodiments, the length of the noise may be smaller than that of the registered voice.
In this embodiment, there are 6 registered voices. Therefore, the noise contained in the voice to be verified is superimposed on each of the 6 registered voices to obtain 6 enhanced registered voices.
S150: Extract the feature parameters of the voice to be verified and of the enhanced registered voices. Since the Mel-frequency cepstrum coefficient (MFCC) method matches the auditory perception characteristics of the human ear well, this embodiment extracts the feature parameters of the speech signals using the MFCC method.
First, the extraction of feature parameters is introduced taking the voice to be verified as an example. For ease of description, the audio signal of the voice to be verified is denoted S_T. Before feature extraction, the audio signal S_T is divided into a series of speech frames x(n), where n is the number of speech frames. Considering that the motion model of the vocal organs remains basically stable within 10-30 ms, the length of each speech frame is 10-30 ms. Specifically, in this embodiment, the 10 s audio signal S_T is divided into 500 speech frames.
After framing the audio signal S_T, the feature parameters of each speech frame x(n) are extracted by the MFCC method. MFCC feature extraction includes applying a Fourier transform, Mel filtering, and a discrete cosine transform to the speech frame x(n); the feature parameters of a frame are the coefficients of the cosine functions of each order after the discrete cosine transform. In this embodiment, the order of the discrete cosine transform is 20, so the MFCC feature parameters of each speech frame x(n) have 20 dimensions.
After concatenating the feature parameters of all speech frames x(n), the MFCC feature parameters of the audio signal S_T of the voice to be verified are obtained; it can be understood that their dimension is 20 × 500 = 10000.
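The size bookkeeping above can be checked with a small sketch (illustrative only; the function name is an assumption): a 10 s utterance cut into 20 ms frames gives 500 frames, and with 20 DCT coefficients per frame the concatenated feature vector is 10000-dimensional.

```python
def mfcc_dimensions(duration_ms, frame_ms, dct_order):
    """Number of frames and total concatenated MFCC dimension
    for a signal of duration_ms split into frame_ms frames."""
    n_frames = duration_ms // frame_ms
    return n_frames, n_frames * dct_order

# 10 s signal, 20 ms frames, 20 coefficients per frame
assert mfcc_dimensions(10_000, 20, 20) == (500, 10_000)
```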
The extraction of the feature parameters of the enhanced registered voices follows the above process and is not repeated here. It can be understood that a separate set of MFCC feature parameters is obtained for each enhanced registered voice.
It should be noted that the above is a description of the principle of the MFCC method; in an actual implementation, the extraction process can be adjusted as required. For example, differential calculation may be performed on the extracted MFCC feature parameters: after taking the first-order and second-order differences of the MFCC feature parameters extracted above, a set of 60-dimensional MFCC feature parameters is obtained for each speech frame. In addition, other parameters of the extraction process, such as the length and number of speech frames and the order of the discrete cosine transform, can also be adjusted according to the computing capability of the device and the required recognition accuracy.
In addition to the MFCC method, the feature parameters of a speech signal can also be extracted by other methods, for example, the log-mel method or the linear predictive cepstrum coefficient (LPCC) method.
S160: Perform parameter recognition on the feature parameters of the voice to be verified and of the enhanced registered voices, to obtain the voice template of the current user of the mobile phone 100 and the voice templates of the owner of the mobile phone 100, respectively. This application does not limit the recognition model used for parameter recognition: it may be a probabilistic model, for example, an identity-vector (i-vector) model, or a deep neural network model, for example, a time-delay neural network (TDNN) model or a ResNet model.
The 10000-dimensional feature parameters of the voice to be verified are input into the recognition model; after dimensionality reduction and abstraction by the model, the voice template of the current user of the mobile phone 100 is obtained. In this embodiment, the voice template of the current user of the mobile phone 100 is a 512-dimensional feature vector, denoted A.
Correspondingly, the feature parameters of the 6 enhanced registered voices are input into the recognition model to obtain 6 voice templates of the owner of the mobile phone 100. Each voice template is a 512-dimensional feature vector; the 6 owner voice templates are denoted B1, B2, ..., B6.
It can be understood that the dimensions of the above feature vectors are merely exemplary and can be adjusted in practice according to the computing capability of the device and the required recognition accuracy.
S170: Match the voice templates of the owner of the mobile phone 100 against the voice template of the current user of the mobile phone 100 to obtain a recognition result. In this application, the template matching method may be the cosine distance method, linear discriminant analysis, probabilistic linear discriminant analysis, or the like. The cosine distance method is used as an example below.
The cosine distance method evaluates the similarity of two feature vectors by computing the cosine of the angle between them. Taking feature vector A (corresponding to the voice template of the current user of the mobile phone 100) and feature vector B1 (corresponding to an owner voice template of the mobile phone 100) as an example, their cosine similarity can be expressed as:
$$\cos\theta_1 = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}$$

where $a_i$ is the i-th coordinate of feature vector A, $b_i$ is the i-th coordinate of feature vector B1, and $\theta_1$ is the angle between feature vectors A and B1. The larger the value of $\cos\theta_1$, the closer the directions of feature vectors A and B1, and the higher the similarity of the two feature vectors; conversely, the smaller the value of $\cos\theta_1$, the lower their similarity.
For the 6 enhanced registered voices, 6 owner voice templates B1, B2, ..., B6 are obtained, whose cosine similarities with the voice template of the current user of the mobile phone 100 are cosθ1, cosθ2, ..., cosθ6, respectively. Averaging the 6 cosine similarities gives the similarity between the current user's voice and the owner's voice: P = (cosθ1 + cosθ2 + ... + cosθ6)/6.
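The scoring in steps S170 and above can be sketched as follows; a minimal Python illustration, assuming the templates are plain lists of coordinates (function names are not from the patent):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (sum a_i * b_i) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def average_similarity(user_template, owner_templates):
    # P = mean of cos(theta_k) over all owner voice templates B1..Bk
    sims = [cosine_similarity(user_template, t) for t in owner_templates]
    return sum(sims) / len(sims)
```

For instance, a user template identical in direction to one owner template and orthogonal to another yields similarities 1 and 0, so P = 0.5.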
If the similarity P between the current user's voice and the owner's voice is greater than a set value (for example, 0.8), the current user of the mobile phone 100 is determined to be the owner, and the mobile phone 100 unlocks the screen; otherwise, the current user is determined not to be the owner, and the mobile phone 100 does not unlock the screen.
In this embodiment, the voice to be verified is compared with each of the 6 enhanced registered voices to obtain 6 cosine similarity results, which are then averaged to obtain the final similarity P between the current user's voice and the owner's voice. This averages out the matching error between the voice to be verified and any single enhanced registered voice, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
It should be noted that, in this embodiment, the voiceprint recognition algorithm (corresponding to steps S130-S170) can be implemented on the mobile phone 100 to realize offline voiceprint recognition, or in the cloud to save local computing resources of the mobile phone 100. When the voiceprint recognition algorithm is implemented in the cloud, the mobile phone 100 uploads the voice to be verified collected in step S120 to a cloud server; the cloud server authenticates the identity of the current user of the mobile phone 100 using the voiceprint recognition algorithm and returns the authentication result to the phone, and the mobile phone 100 decides whether to unlock the screen according to the result.
The implementation of the speech enhancement method of this embodiment has been described above. It can be understood that the above is merely exemplary; on the premise of conforming to the inventive concept of this application, those skilled in the art can make other variations on the basis of the above embodiment.
For example, in some embodiments, in addition to enhancing the registered voice according to the noise in the voice to be verified, a reverberation component is also added to the registered voice to obtain the enhanced registered voice.
When sound waves propagate indoors, they are reflected multiple times by the walls of the room and by indoor obstacles. As a result, after the sound source stops, several sound waves remain superimposed and mixed together, so that the sound is perceived to persist for a while after the source has stopped. This persistence of sound due to multiple reflections of sound waves is called reverberation.
When the recognition scene of voiceprint recognition is an indoor scene, the voice of the speaker to be verified produces reverberation in the room; as part of the interference factors, this reverberation affects the voiceprint recognition rate. Therefore, in some embodiments, reverberation estimation is performed on the registered voice based on the recognition scene; that is, the reverberation of the registered voice in the recognition scene is simulated, and, based on this simulation, the reverberation component that the registered voice would produce in the recognition scene is added to the registered voice. In this way, the non-speech components of the voice to be verified and of the enhanced registered voice are made as close as possible, thereby improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
Optionally, the reverberation that the registered voice would produce in the recognition scene is estimated based on the image source model (ISM) method. The image source model method can simulate the reflection paths of sound waves in a room and calculate the room impulse response (RIR) of the room's sound field from the delay and attenuation parameters of the sound waves. After the room impulse response is obtained, the audio signal of the registered voice is convolved with the impulse response to obtain the reverberation that the registered voice would produce in the room.
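The final convolution step can be sketched directly; the ISM simulation that produces the RIR is not reproduced here, and the short example RIR below is an assumption for illustration only:

```python
def convolve(signal, rir):
    """Naive time-domain convolution of the registered voice with a
    room impulse response (RIR), yielding the reverberant signal."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out
```

In practice an FFT-based convolution would be used for efficiency; the direct form above only shows the operation.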
In addition, in some cases, for example, when voice-controlling an intelligent robot or a smart home, the distance between the speaker to be verified and the microphone may be large (for example, more than 1 m), so that the voice of the speaker to be verified is attenuated by the time it reaches the microphone. Therefore, in some embodiments, to account for the distance between the voice to be verified and the microphone, far-field simulation is also performed on the registered voice when estimating its reverberation by the image source model method. That is, when calculating the room impulse response according to the image source model method, the distance between the registered voice and the voice receiving device in the simulated sound field is set according to the distance between the speaker to be verified and the microphone. In this way, the collection distance of the registered voice is simulated to be the same as that of the voice to be verified, which further reduces the differences, other than the valid speech components, between the voice to be verified and the enhanced registered voice, and improves the voiceprint recognition rate and the robustness of the voiceprint recognition method.
As another example, in some embodiments, before the voice to be verified is compared with the enhanced voice (that is, before step S50), front-end processing is also performed on the voice to be verified, for example, echo cancellation, dereverberation, active noise reduction, dynamic gain, directional sound pickup, and the like. To reduce the differences, other than the valid speech components, between the voice to be verified and the enhanced registered voice, the same front-end processing is performed on the enhanced registered voice as on the voice to be verified (that is, the voice to be verified and the enhanced registered voice pass through the same front-end processing algorithm module), which further improves the voiceprint recognition rate and the robustness of the voiceprint recognition method.
As yet another example, in some embodiments, the feature parameter extraction step (that is, step S150) may be omitted, and the speech signal may be recognized directly by a deep neural network model.
[Embodiment 2]
Referring to FIG. 4, this embodiment provides another speech enhancement method. Unlike Embodiment 1, in this embodiment, after the voice to be verified is collected, the scene in which it was collected is also recognized to obtain the scene type corresponding to the voice to be verified. Then, in addition to determining enhanced registered voices according to the noise contained in the voice to be verified, enhanced registered voices are also determined according to that scene type. Specifically, the speech enhancement method performed by the mobile phone 100 according to this embodiment includes the following steps:
S210: Collect the registered voice. Here, the registered voice is recorded by the owner of the mobile phone 100 in a quiet environment, so that it contains no obvious noise component.
S220: Collect the voice to be verified. Here, the voice to be verified is recorded by the current user of the mobile phone in a scene with noisy human voices; in other words, the user can unlock the phone screen by means of voiceprint recognition in this scene. The current user of the mobile phone is the person currently operating the mobile phone 100, who may be the owner or someone other than the owner.
S230: Determine the noise contained in the voice to be verified. In this embodiment, the noise contained in the voice to be verified is the sound generated, in the recognition scene, by sound sources other than the current user of the mobile phone 100.
S240: Superimpose the noise contained in the voice to be verified on the registered voices to obtain enhanced registered voices. In this embodiment, the signal values of the noise signal are added to the signal values of the registered voice signal in the time domain to obtain the enhanced registered voices.
In this embodiment, steps S210-S240 are substantially the same as steps S110-S140 in Embodiment 1, and their details are not repeated. The number of registered voices is the same as in Embodiment 1, that is, 6; therefore, in step S240, the noise contained in the voice to be verified is superimposed on each of the 6 registered voices to obtain 6 enhanced registered voices.
S250: Determine the scene type corresponding to the voice to be verified. Specifically, after the voice to be verified is collected, the scene type corresponding to it is recognized by a speech recognition algorithm, for example, a GMM method or a DNN method. In the recognition algorithm, the label values of the scene type may include a home scene, an in-vehicle scene, a noisy outdoor scene, a conference scene, a cinema scene, and so on.
S260: Superimpose template noise on the registered voices. The template noise is noise corresponding to the scene type determined in step S250; for example, it is noise recorded in the scene determined in step S250. Each scene type may correspond to multiple groups of template noise. In this embodiment, it is assumed that the scene type determined in step S250 for the voice to be verified is a home scene, and that 3 groups of template noise have been recorded for the home scene (for example, sound produced by home audio/video equipment, background speech of family members in conversation, and/or noise produced by household appliances).
Then, each of the 3 groups of template noise is superimposed on each of the 6 registered voices, forming 3 × 6 = 18 enhanced registered voices. Together with the 6 enhanced registered voices formed in step S240, a total of 24 enhanced registered voices are formed in this embodiment.
S270: Extract the feature parameters of the voice to be verified and of the enhanced registered voices; refer to step S150 in Embodiment 1. It can be understood that, in this embodiment, the feature parameters of each of the 24 enhanced registered voices are extracted.
S280: Perform parameter recognition on the feature parameters of the voice to be verified and of the enhanced registered voices, to obtain the voice template of the current user of the mobile phone 100 and the voice templates of the owner of the mobile phone 100, respectively; refer to S160 in Embodiment 1. It can be understood that, in this embodiment, 24 owner voice templates are obtained, denoted B1, B2, ..., B24.
S290: Match the voice templates of the owner of the mobile phone 100 against the voice template of the current user of the mobile phone 100 to obtain a recognition result; refer to step S170 in Embodiment 1. It can be understood that, in this embodiment, the cosine similarities between the 24 owner voice templates and the voice template of the current user of the mobile phone 100 are cosθ1, cosθ2, ..., cosθ24, respectively. Averaging the 24 cosine similarities gives the similarity between the current user's voice and the owner's voice: P = (cosθ1 + cosθ2 + ... + cosθ24)/24.
If the similarity P between the current user's voice and the owner's voice is greater than a set value (for example, 0.8), the current user of the mobile phone 100 is determined to be the owner, and the mobile phone 100 unlocks the screen; otherwise, the current user is determined not to be the owner, and the mobile phone 100 does not unlock the screen.
It can be understood that the above is merely an exemplary description of the technical solution of this application; on this basis, those skilled in the art can make other variations. For example, steps S230 and S240 may be omitted, that is, the step of enhancing the registered voices according to the noise contained in the voice to be verified is omitted, and the registered voices are enhanced only according to the template noise corresponding to the recognition scene. In this case, there are 18 enhanced registered voices, whose corresponding owner voice templates are B7, B8, ..., B24; correspondingly, the similarity between the voice of the current user of the mobile phone 100 and the owner's voice is P = (cosθ7 + cosθ8 + ... + cosθ24)/18.
In addition, for technical details not mentioned in this embodiment, for example, where the voiceprint recognition algorithm is implemented (locally on the mobile phone 100 or in the cloud) and other processing of the voice (for example, reverberation estimation, far-field simulation, and front-end processing), refer to the description in Embodiment 1; they are not repeated here.
Herein, the scene type corresponding to the voice to be verified, the distance between the speaker to be verified and the microphone, and the like are all environmental feature parameters of the voice to be verified.
[Embodiment 3]
On the basis of Embodiment 1, this embodiment changes the application scenario of the speech enhancement method. Specifically, the speech enhancement method of this embodiment is applied to the scenario of controlling a smart speaker 200 shown in FIG. 5. The smart speaker 200 has a speech recognition function, and the user can interact with the smart speaker 200 by voice to perform functions such as song requests, weather queries, schedule management, and smart home control.
In this embodiment, when the user issues a voice command to make the smart speaker 200 perform an operation (for example, playing the day's schedule, playing songs from a specific playlist, or controlling a smart home), the smart speaker authenticates the identity of the user based on the voiceprint recognition method, to determine whether the current user is the owner of the smart speaker 200 and thus whether the current user has the authority to make the smart speaker 200 perform that operation.
Specifically, the speech enhancement method of this embodiment includes:
S310: Collect the registered voice. In this embodiment, the registered voice of the owner of the smart speaker 200 is collected through the microphone of the smart speaker 200, but this application is not limited thereto; in other embodiments, the registered voice may also be collected through a mobile phone, a dedicated microphone, or the like. After the registered voice is collected, it can be stored locally on the smart speaker 200, so that the smart speaker 200 recognizes the user's voiceprint and offline recognition is realized; or it can be uploaded to the cloud, so that the user's voiceprint is recognized using cloud computing resources, saving local computing resources of the smart speaker 200.
S320: Collect the voice to be verified. In this embodiment, the voice to be verified is collected through the microphone of the smart speaker 200. For its acquisition parameters (for example, duration and text content), refer to the description in Embodiment 1; they are not repeated here.
S330: Determine the noise contained in the voice to be verified. In this embodiment, the voice to be verified is divided into a plurality of speech frames, and the energy of each frame is calculated. Because the energy of noise is generally lower than that of valid speech, a frame whose energy falls below a predetermined value can be classified as a noise frame, which simplifies the noise-extraction process.
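The energy thresholding of S330 can be sketched in a few lines of NumPy. The frame length, hop size, and threshold ratio below are illustrative choices, not values specified by the patent:

```python
import numpy as np

def split_noise_frames(signal, frame_len=400, hop=160, ratio=0.1):
    """Frame a waveform and flag low-energy frames as noise (S330 sketch).

    Frames whose energy falls below `ratio` times the mean frame energy
    are treated as noise frames. All parameter values are illustrative.
    Assumes len(signal) >= frame_len.
    """
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    energy = (frames ** 2).sum(axis=1)          # per-frame energy
    noise_mask = energy < ratio * energy.mean()  # low energy -> noise frame
    return frames, noise_mask
```

The frames flagged by `noise_mask` would then be concatenated to form the noise estimate used in the next step.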
S340: Superimpose the noise contained in the voice to be verified onto the registration voice to obtain an enhanced registration voice. In this embodiment, the sample values of the noise signal are added to the sample values of the registration-voice signal in the time domain to obtain the enhanced registration voice.
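The time-domain superposition of S340 is a sample-wise addition. Tiling the noise up to the enrolment length is an assumption for the case where the two signals differ in length; the patent does not specify how that case is handled:

```python
import numpy as np

def augment_enrollment(enroll, noise):
    """Add verification-time noise onto the enrolment signal (S340 sketch).

    The noise is repeated/truncated to the enrolment length (an assumption)
    before the sample-wise addition in the time domain.
    """
    reps = int(np.ceil(len(enroll) / len(noise)))
    noise_full = np.tile(noise, reps)[: len(enroll)]
    return enroll + noise_full
```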
S350: Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registration voice, for example with the MFCC method.
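A production system would typically call a library routine (e.g. librosa's MFCC) for S350. As a self-contained illustration, a minimal NumPy MFCC pipeline — framing, power spectrum, triangular mel filterbank, log, DCT — looks roughly like this; all sizes are illustrative defaults, not values from the patent:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch; assumes len(signal) >= n_fft."""
    # frame the signal and apply a Hann window
    n = 1 + max(0, (len(signal) - n_fft) // hop)
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank: filter centers equally spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for j in range(n_mels):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2.0 * n_mels)))
    return logmel @ dct.T  # shape: (num_frames, n_ceps)
```

The same routine would be applied to both the voice to be verified and the enhanced registration voice, so that the two feature streams are directly comparable.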
S360: Perform parameter recognition on the feature parameters of the voice to be verified and the feature parameters of the enhanced registration voice to obtain, respectively, a voice template of the current user of the smart speaker 200 and a voice template of the owner of the smart speaker 200. This embodiment does not limit the recognition model: it may be a probabilistic model, such as an identity-vector (i-vector) model, or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model or a ResNet model.
S370: Match the voice template of the owner of the smart speaker 200 against the voice template of the current user of the smart speaker 200 to obtain a recognition result. In this embodiment, the template-matching method may be the cosine-distance method, linear discriminant analysis, probabilistic linear discriminant analysis, or the like. If the similarity between the current user's voice and the owner's voice exceeds a set value, the current user of the smart speaker 200 is judged to be the owner, and the smart speaker 200 performs the corresponding operation in response to the user's voice command; otherwise, the current user is judged not to be the owner, and the smart speaker 200 ignores the voice command.
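The cosine-distance matching named in S370 reduces to a normalized dot product between the two templates. The acceptance threshold below is illustrative; the patent does not specify a value:

```python
import numpy as np

def cosine_match(template_user, template_owner, threshold=0.7):
    """Cosine-similarity template matching (S370 sketch).

    Returns (accept, score): accept is True when the current user's template
    is close enough to the owner's. The 0.7 threshold is an assumption.
    """
    a = np.asarray(template_user, dtype=float)
    b = np.asarray(template_owner, dtype=float)
    score = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return score >= threshold, score
```

With this decision rule, the speaker would execute the command when `accept` is True and ignore it otherwise.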
It should be noted that, apart from the application scenario, the speech enhancement method of this embodiment is substantially the same as that of Embodiment 1; for technical details not described in this embodiment, refer to the description of Embodiment 1.
As in Embodiment 1, the voiceprint recognition algorithm (the algorithm corresponding to steps S330 to S370) can be implemented on the smart speaker 200 for offline voiceprint recognition, or in the cloud to save the local computing resources of the smart speaker 200. When the voiceprint recognition algorithm is implemented in the cloud, the smart speaker 200 uploads the voice to be verified collected in step S320 to a cloud server; the cloud server authenticates the identity of the current user of the smart speaker 200 with the voiceprint recognition algorithm and returns the authentication result to the smart speaker 200, which then decides according to that result whether to execute the user's voice command.
In addition, those skilled in the art may also apply the speech enhancement method of Embodiment 2 to the smart-speaker control scenario shown in FIG. 5; details are not repeated here.
Referring now to FIG. 6, a block diagram of an electronic device 400 according to one embodiment of the present application is shown. The electronic device 400 may include one or more processors 401 coupled to a controller hub 403. In at least one embodiment, the controller hub 403 communicates with the processors 401 via a multidrop bus such as a Front Side Bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 406. The processors 401 execute instructions that control general types of data-processing operations. In one embodiment, the controller hub 403 includes, but is not limited to, a Graphics & Memory Controller Hub (GMCH) (not shown) and an Input/Output Hub (IOH) (which may be on a separate chip; not shown), where the GMCH includes memory and graphics controllers and is coupled to the IOH.
The electronic device 400 may also include a coprocessor 402 and a memory 404 coupled to the controller hub 403. Alternatively, one or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 coupled directly to the processor 401 and to the controller hub 403, and the controller hub 403 and the IOH residing in a single chip.
The memory 404 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two. The memory 404 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions; in particular, the computer-readable storage medium stores temporary and permanent copies of the instructions. The instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the speech enhancement method described with reference to FIG. 3 and FIG. 4. When the instructions run on a computer, they cause the computer to execute the method disclosed in Embodiment 1 and/or Embodiment 2.
In one embodiment, the coprocessor 402 is a special-purpose processor, such as, for example, a high-throughput Many Integrated Core (MIC) processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose computing on graphics processing units), or an embedded processor. The optional nature of the coprocessor 402 is indicated in FIG. 6 by dashed lines.
In one embodiment, the electronic device 400 may further include a Network Interface Controller (NIC) 406. The network interface 406 may include a transceiver that provides a radio interface for the electronic device 400 to communicate with any other suitable devices (such as front-end modules, antennas, and the like). In various embodiments, the network interface 406 may be integrated with other components of the electronic device 400. The network interface 406 can implement the functions of the communication unit in the above embodiments.
The electronic device 400 may further include an input/output (I/O) device 405. The I/O device 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral-component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 400.
It is worth noting that FIG. 6 is merely exemplary. That is, although FIG. 6 shows the electronic device 400 as including multiple components such as the processor 401, the controller hub 403, and the memory 404, in practical applications a device using the methods of the present application may include only some of these components, for example only the processor 401 and the network interface 406. The optional nature of components in FIG. 6 is shown with dashed lines.
Referring now to FIG. 7, a block diagram of a System on Chip (SoC) 500 according to an embodiment of the present application is shown. In FIG. 7, similar components bear the same reference numerals, and the dashed boxes are optional features of more advanced SoCs. In FIG. 7, the SoC 500 includes: an interconnect unit 550 coupled to a processor 510; a system agent unit 580; a bus controller unit 590; an integrated memory controller unit 540; a set of one or more coprocessors 520, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random-Access Memory (SRAM) unit 530; and a Direct Memory Access (DMA) unit 560. In one embodiment, the coprocessor 520 includes a special-purpose processor, such as a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, or an embedded processor.
The Static Random-Access Memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions; in particular, the computer-readable storage medium stores temporary and permanent copies of the instructions. The instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the speech enhancement method described with reference to FIG. 3 and FIG. 4. When the instructions run on a computer, they cause the computer to execute the method disclosed in Embodiment 1 and/or Embodiment 2.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone.
The method implementations of the present application may be implemented in software, magnetic components, firmware, and the like.
Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application-Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly or machine language if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language; in any case, the language may be a compiled or an interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium. The instructions represent various logic in the processor and, when read by a machine, cause the machine to fabricate logic that performs the techniques described herein. Such representations, known as "IP (Intellectual Property) cores", may be stored on a tangible computer-readable storage medium and supplied to various customers or production facilities to be loaded into the fabrication machines that actually manufacture the logic or processor.
In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may transform (for example, using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof, and may be on-processor, off-processor, or partly on and partly off the processor.

Claims (26)

  1. A speech enhancement method, applied to an electronic device, comprising:
    collecting a voice to be verified;
    determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified;
    enhancing a registered voice based on the environmental noise and/or the environmental characteristic parameters; and
    comparing the voice to be verified with the enhanced registered voice, and determining that the voice to be verified and the registered voice are from the same user.
  2. The method according to claim 1, wherein enhancing the registered voice based on the environmental noise comprises: superimposing the environmental noise on the registered voice.
  3. The method according to claim 1, wherein the environmental noise is sound picked up by a secondary microphone of the electronic device.
  4. The method according to claim 1, wherein the duration of the voice to be verified is shorter than the duration of the registered voice.
  5. The method according to claim 1, wherein the environmental characteristic parameters comprise a scene type corresponding to the voice to be verified; and
    enhancing the registered voice based on the environmental characteristic parameters comprises: determining, based on the scene type corresponding to the voice to be verified, template noise corresponding to the scene type, and superimposing the template noise on the registered voice.
  6. The method according to claim 5, wherein the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm.
  7. The method according to claim 6, wherein the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.
  8. The method according to claim 7, wherein the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a venue scene; a cinema scene.
  9. The method according to claim 1, wherein the voice to be verified and the enhanced registered voice are voices processed by the same front-end processing algorithm.
  10. The method according to claim 9, wherein the front-end processing algorithm comprises at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
  11. The method according to claim 1, wherein there are a plurality of registered voices, and the plurality of registered voices are separately enhanced based on the environmental noise and/or the environmental characteristic parameters to obtain a plurality of enhanced registered voices.
  12. The method according to claim 1, wherein comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user comprises:
    extracting feature parameters of the voice to be verified and feature parameters of the enhanced registered voice with a feature-parameter extraction algorithm;
    performing parameter recognition on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice with a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and
    matching the voice template of the speaker to be verified against the voice template of the registered speaker with a template matching algorithm, and determining according to the matching result that the voice to be verified and the registered voice are from the same user.
  13. The method according to claim 12, wherein:
    the feature-parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or
    the parameter recognition model is an identity-vector model, a time-delay neural network model, or a ResNet model; and/or
    the template matching algorithm is the cosine-distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
  14. A speech enhancement system, comprising a terminal device and a server communicatively connected to the terminal device, wherein:
    the terminal device is configured to collect a voice to be verified and send the voice to be verified to the server;
    the server is configured to determine environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance a registered voice based on the environmental noise and/or the environmental characteristic parameters, and compare the voice to be verified with the enhanced registered voice to determine that the voice to be verified and the registered voice are from the same user; and
    the server is further configured to send, to the terminal device, the determination result that the voice to be verified and the registered voice are from the same user.
  15. The system according to claim 14, wherein enhancing the registered voice based on the environmental noise comprises: superimposing the environmental noise on the registered voice.
  16. The system according to claim 14, wherein the environmental noise is sound picked up by a secondary microphone of the terminal device.
  17. The system according to claim 14, wherein the duration of the voice to be verified is shorter than the duration of the registered voice.
  18. The system according to claim 14, wherein the environmental characteristic parameters comprise a scene type corresponding to the voice to be verified; and
    enhancing the registered voice based on the environmental characteristic parameters comprises: determining, based on the scene type corresponding to the voice to be verified, template noise corresponding to the scene type, and superimposing the template noise on the registered voice.
  19. The system according to claim 18, wherein the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm.
  20. The system according to claim 18, wherein the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a venue scene; a cinema scene.
  21. The system according to claim 14, wherein the voice to be verified and the enhanced registered voice are voices processed by the same front-end processing algorithm.
  22. The system according to claim 21, wherein the front-end processing algorithm comprises at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
  23. The system according to claim 14, wherein there are a plurality of registered voices, and the server separately enhances the plurality of registered voices based on the environmental noise and/or the environmental characteristic parameters to obtain a plurality of enhanced registered voices.
  24. The system according to claim 14, wherein comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user comprises:
    extracting feature parameters of the voice to be verified and feature parameters of the enhanced registered voice with a feature-parameter extraction algorithm;
    performing parameter recognition on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice with a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and
    matching the voice template of the speaker to be verified against the voice template of the registered speaker with a template matching algorithm, and determining according to the matching result that the voice to be verified and the registered voice are from the same user.
  25. An electronic device, comprising:
    a memory configured to store instructions to be executed by one or more processors of the electronic device; and
    a processor which, when executing the instructions in the memory, causes the electronic device to perform the speech enhancement method according to any one of claims 1 to 13.
  26. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions which, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 13.
PCT/CN2021/105003 2020-07-08 2021-07-07 Speech enhancement method, device, system, and storage medium WO2022007846A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010650893.XA CN113921013A (en) 2020-07-08 2020-07-08 Speech enhancement method, apparatus, system, and storage medium
CN202010650893.X 2020-07-08

Publications (1)

Publication Number Publication Date
WO2022007846A1 true WO2022007846A1 (en) 2022-01-13

Family

ID=79231704


Country Status (2)

Country Link
CN (1) CN113921013A (en)
WO (1) WO2022007846A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117268796A (en) * 2023-11-16 2023-12-22 天津大学 Vehicle fault acoustic event detection method
CN117725187A (en) * 2024-02-08 2024-03-19 人和数智科技有限公司 Question-answering system suitable for social assistance
CN117725187B (en) * 2024-02-08 2024-04-30 人和数智科技有限公司 Question-answering system suitable for social assistance

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101051463A (en) * 2006-04-06 2007-10-10 株式会社东芝 Verification method and device identified by speaking person
WO2010049695A1 (en) * 2008-10-29 2010-05-06 British Telecommunications Public Limited Company Speaker verification
CN106384588A (en) * 2016-09-08 2017-02-08 河海大学 Additive noise and short time reverberation combined compensation method based on vector Taylor series
CN108022591A (en) * 2017-12-30 2018-05-11 北京百度网讯科技有限公司 The processing method of speech recognition, device and electronic equipment in environment inside car
CN108257606A (en) * 2018-01-15 2018-07-06 江南大学 A kind of robust speech personal identification method based on the combination of self-adaptive parallel model



Also Published As

Publication number Publication date
CN113921013A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN111091828B (en) Voice wake-up method, device and system
WO2021143599A1 (en) Scene recognition-based speech processing method and apparatus, medium and system
CN107240405B (en) Sound box and alarm method
WO2014117722A1 (en) Speech processing method, device and terminal apparatus
WO2021013255A1 (en) Voiceprint recognition method and apparatus
CN114141230A (en) Electronic device, and voice recognition method and medium thereof
CN115482830B (en) Voice enhancement method and related equipment
WO2022007846A1 (en) Speech enhancement method, device, system, and storage medium
CN113830026A (en) Equipment control method and computer readable storage medium
CN114067782A (en) Audio recognition method and device, medium and chip system thereof
CN113539290B (en) Voice noise reduction method and device
WO2022199405A1 (en) Voice control method and apparatus
WO2021031811A1 (en) Method and device for voice enhancement
CN113611318A (en) Audio data enhancement method and related equipment
WO2023124248A1 (en) Voiceprint recognition method and apparatus
CN115312068B (en) Voice control method, equipment and storage medium
CN116386623A (en) Voice interaction method of intelligent equipment, storage medium and electronic device
US11783809B2 (en) User voice activity detection using dynamic classifier
CN109922397A (en) Audio intelligent processing method, storage medium, intelligent terminal and smart bluetooth earphone
WO2022052691A1 (en) Multi-device voice processing method, medium, electronic device, and system
CN115116458A (en) Voice data conversion method and device, computer equipment and storage medium
CN115424628B (en) Voice processing method and electronic equipment
CN115331672B (en) Device control method, device, electronic device and storage medium
CN114093380B (en) Voice enhancement method, electronic equipment, chip system and readable storage medium

Legal Events

Date Code Title Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (ref document number: 21837111; country of ref document: EP; kind code of ref document: A1)
NENP  Non-entry into the national phase (ref country code: DE)
122   Ep: pct application non-entry in european phase (ref document number: 21837111; country of ref document: EP; kind code of ref document: A1)