WO2022007846A1 - Speech enhancement method, device, system, and storage medium

Info

Publication number
WO2022007846A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
verified
registered
speech
scene
Application number
PCT/CN2021/105003
Other languages
French (fr)
Chinese (zh)
Inventor
胡伟湘
黄劲文
曾夕娟
芦宇
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2022007846A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00: Speaker identification or verification
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/20: Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L2013/021: Overlap-add techniques

Definitions

  • the present application relates to the technical field of biometrics, and in particular, to a speech enhancement method, device, system, and computer-readable storage medium.
  • biometric authentication technology has gradually been popularized and applied in the fields of family life and public safety.
  • Biometric features that can be applied to biometric authentication include fingerprints, faces, irises, DNA, voiceprints, etc.
  • voiceprint recognition technology, also known as speaker recognition technology, collects sound samples without physical contact, and the collection method is relatively unobtrusive, so it is more easily accepted by users.
  • Some embodiments of the present application provide a speech enhancement method, a terminal device, a speech enhancement system, and a computer-readable storage medium.
  • the present application is described below from various aspects, and the embodiments and beneficial effects of the following aspects can be referred to each other.
  • an embodiment of the present application provides a voice enhancement method, applied to an electronic device, including: collecting a voice to be verified; determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified; enhancing the registered voice based on the environmental noise and/or the environmental characteristic parameters; and comparing the voice to be verified with the enhanced registered voice to determine whether the voice to be verified and the registered voice are from the same user.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • the remaining difference between the two then lies mainly in their effective speech components, so comparing them with a voiceprint recognition algorithm yields a more accurate recognition result.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the registered speech is speech from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which improves the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
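  • the superposition step described above can be sketched as follows; this is a minimal NumPy illustration, where the function name `superimpose_noise` and the toy signals are invented for the example, and a real implementation would also match sampling rates and possibly scale the noise to a target level:

```python
import numpy as np

def superimpose_noise(registered: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Overlay ambient noise (picked up during verification) onto the
    registered speech so both signals share similar noise components."""
    # Tile or trim the noise so it covers the whole registered utterance.
    reps = int(np.ceil(len(registered) / len(noise)))
    noise = np.tile(noise, reps)[: len(registered)]
    return registered + noise

# Toy signals standing in for real recordings (16 kHz assumed).
rng = np.random.default_rng(0)
registered = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
ambient = 0.05 * rng.standard_normal(4000)

enhanced = superimpose_noise(registered, ambient)
```

The enhanced registered voice has the same length as the original but now carries the verification-time noise floor.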
  • the ambient noise is sound picked up by a secondary microphone of the electronic device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the to-be-verified speech is less than the duration of the registered speech. In this way, the user can input a short voice to be verified, which is beneficial to improve the user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • the registration speech is enhanced by superimposing template noise on the registration speech, so that the enhanced registration speech and the to-be-verified speech have noise components as close as possible, which is beneficial to improve the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined according to the scene recognition algorithm that recognizes the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
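  • scene classification from the collected audio can be illustrated with a heavily simplified likelihood-scoring sketch: one diagonal Gaussian per scene stands in for a full GMM or DNN, and the scene set, feature vectors, and model parameters here are invented for illustration:

```python
import numpy as np

# Hypothetical per-scene models: mean and variance of an acoustic feature
# vector (e.g., averaged log-mel energies), learned offline. A full GMM
# would use several mixture components per scene.
SCENE_MODELS = {
    "home":    (np.array([0.2, 0.1]), np.array([0.05, 0.05])),
    "vehicle": (np.array([0.8, 0.3]), np.array([0.10, 0.05])),
    "outdoor": (np.array([0.5, 0.9]), np.array([0.20, 0.10])),
}

def log_gauss(x: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    # Diagonal-covariance Gaussian log-density.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify_scene(feature: np.ndarray) -> str:
    # Pick the scene whose model assigns the highest log-likelihood.
    scores = {s: log_gauss(feature, m, v) for s, (m, v) in SCENE_MODELS.items()}
    return max(scores, key=scores.get)

scene = classify_scene(np.array([0.78, 0.32]))  # near the "vehicle" mean
```

The winning scene type then selects which template noise to superimpose on the registered voice.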
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the electronic device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device.
  • far-field simulation of the registered voice simulates extending the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.
  • performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device includes: establishing, based on the image source model and that distance, the impulse response function of the site where the voice to be verified is acquired; and convolving the impulse response function with the audio signal of the registered voice to perform far-field simulation on the registered voice.
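  • the convolution step of the far-field simulation can be sketched as follows; here a toy impulse response with a direct path and a single reflection stands in for one derived from the image source model, and all parameter values (sampling rate, reflection gain, path lengths) are illustrative:

```python
import numpy as np

def simulate_far_field(registered: np.ndarray, distance_m: float,
                       fs: int = 16000, c: float = 343.0) -> np.ndarray:
    """Convolve registered speech with a toy room impulse response.

    A real implementation would derive the impulse response from the
    image source model for the verification site; here we approximate
    it with a direct path attenuated by 1/distance plus one weaker,
    later reflection on an assumed 1.5x-longer path.
    """
    direct_delay = int(fs * distance_m / c)           # direct propagation delay
    reflect_delay = int(fs * (distance_m * 1.5) / c)  # reflected-path delay
    h = np.zeros(reflect_delay + 1)
    h[direct_delay] = 1.0 / max(distance_m, 1.0)      # direct-path gain
    h[reflect_delay] = 0.3 / max(distance_m, 1.0)     # reflection gain
    return np.convolve(registered, h)

speech = np.ones(100)  # placeholder for the registered waveform
far = simulate_far_field(speech, distance_m=2.0)
```

The output is silent until the direct-path delay elapses and is attenuated with distance, mimicking far-field acquisition.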
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which helps improve the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices may be multiple; and, based on the environmental noise and/or environmental characteristic parameters, the multiple registered voices are respectively enhanced to obtain multiple enhanced registered voices.
  • with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and a final decision can be made by fusing these results.
  • in this way, the error of any single matching result is averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that they are from the same user includes: extracting the characteristic parameters of the voice to be verified and of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on these characteristic parameters through a parameter recognition model to obtain the voice template of the speaker to be verified and the voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or, the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or, the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
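  • the cosine-distance template matching step can be sketched as follows; the embedding values and the decision threshold are invented for illustration, and in practice the templates would come from an i-vector or neural model with the threshold tuned on development data:

```python
import numpy as np

def cosine_score(template_a: np.ndarray, template_b: np.ndarray) -> float:
    """Cosine similarity between two voice templates (e.g., i-vectors
    or neural embeddings); 1.0 means identical direction."""
    return float(np.dot(template_a, template_b) /
                 (np.linalg.norm(template_a) * np.linalg.norm(template_b)))

def same_speaker(enrolled: np.ndarray, verified: np.ndarray,
                 threshold: float = 0.7) -> bool:
    # The threshold is a hypothetical operating point trading off
    # false accepts against false rejects.
    return cosine_score(enrolled, verified) >= threshold

# Toy 3-dimensional templates standing in for real embeddings.
enrolled = np.array([0.9, 0.1, 0.4])
probe = np.array([0.8, 0.2, 0.5])
match = same_speaker(enrolled, probe)
```

Because the enhanced registered voice and the voice to be verified share similar noise components, the score mostly reflects the speakers' effective speech, which is the point of the enhancement.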
  • an embodiment of the present application provides a voice enhancement method, including: a terminal device collects the voice to be verified and sends it to a server communicatively connected to the terminal device; the server determines the environmental noise and/or environmental characteristic parameters contained in the voice to be verified; the server enhances the registered voice based on the environmental noise and/or the environmental characteristic parameters; the server compares the voice to be verified with the enhanced registered voice and determines whether the voice to be verified and the registered voice are from the same user; and the server sends the determination result to the terminal device.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • the remaining difference between the two then lies mainly in their effective speech components.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the speaker recognition algorithm is implemented on the server, which can save local computing resources of the terminal device.
  • the registered speech is speech from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which improves the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is the sound picked up by the secondary microphone of the terminal device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the to-be-verified speech is less than the duration of the registered speech. In this way, the user can input a short voice to be verified, which is beneficial to improve user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • the registration speech is enhanced by superimposing template noise on the registration speech, so that the enhanced registration speech and the to-be-verified speech have noise components as close as possible, which is beneficial to improve the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined according to the scene recognition algorithm that recognizes the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device.
  • far-field simulation of the registered voice simulates extending the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.
  • performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model and that distance, the impulse response function of the site where the voice to be verified is acquired, and convolving the impulse response function with the audio signal of the registered voice.
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which helps improve the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices may be multiple; and, based on the environmental noise and/or environmental characteristic parameters, the server enhances the multiple registered voices respectively to obtain multiple enhanced registered voices.
  • with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and a final decision can be made by fusing these results.
  • in this way, the error of any single matching result is averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that they are from the same user includes: extracting the characteristic parameters of the voice to be verified and of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on these characteristic parameters through a parameter recognition model to obtain the voice template of the speaker to be verified and the voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or, the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or, the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
  • embodiments of the present application provide an electronic device, including: a memory for storing instructions executed by one or more processors of the electronic device; and a processor which, when it executes the instructions in the memory, causes the electronic device to execute the speech enhancement method provided by any embodiment of the first aspect of the present application.
  • an embodiment of the present application provides a speech enhancement system, including a terminal device and a server communicatively connected to the terminal device, wherein,
  • the terminal device collects the voice to be verified, and sends the voice to be verified to the server;
  • the server is used to determine the environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance the registered voice based on the environmental noise and/or the environmental characteristic parameters, compare the voice to be verified with the enhanced registered voice, and determine whether the voice to be verified and the registered voice come from the same user;
  • the server is also used to send the determination result of determining that the voice to be verified and the registered voice come from the same user to the terminal device.
  • the registration voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registration voice and the voice to be verified have similar noise components.
  • the remaining difference between the two then lies mainly in their effective speech components.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
  • the speaker recognition algorithm is implemented on the server, which can save local computing resources of the terminal device.
  • the registered speech is speech from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered speech, which improves the accuracy of recognition.
  • enhancing the registration speech based on the environmental noise includes superimposing the environmental noise on the registration speech.
  • the implementation method of the present application obtains the enhanced registration voice by superimposing the environmental noise on the registration voice, and the algorithm is simple.
  • the ambient noise is the sound picked up by the secondary microphone of the terminal device.
  • the embodiments of the present application can conveniently determine the noise contained in the speech to be verified.
  • the duration of the to-be-verified speech is less than the duration of the registered speech. In this way, the user can input a short voice to be verified, which is beneficial to improve user experience.
  • the environmental characteristic parameters include a scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
  • the registration speech is enhanced by superimposing template noise on the registration speech, so that the enhanced registration speech and the to-be-verified speech have noise components as close as possible, which is beneficial to improve the recognition accuracy.
  • the scene type corresponding to the voice to be verified is determined according to the scene recognition algorithm that recognizes the voice to be verified.
  • the scene recognition algorithm is any one of the following: GMM algorithm; DNN algorithm.
  • the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene.
  • the scene types of the embodiments of the present application cover the places where the user performs daily activities, which is beneficial to improve the user experience.
  • the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device.
  • far-field simulation of the registered voice simulates extending the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).
  • in this way, the attenuation of the voice to be verified during propagation is taken into account, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.
  • performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model and that distance, the impulse response function of the site where the voice to be verified is acquired, and convolving the impulse response function with the audio signal of the registered voice.
  • the voice to be verified and the enhanced registration voice are voices processed by the same front-end processing algorithm.
  • through front-end processing, interference factors in the speech can be removed, which helps improve the accuracy of voiceprint recognition.
  • the front-end processing algorithm includes at least one of the following processing algorithms: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional pickup.
  • the number of registered voices may be multiple; and, based on the environmental noise and/or environmental characteristic parameters, the server enhances the multiple registered voices respectively to obtain multiple enhanced registered voices.
  • with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and a final decision can be made by fusing these results.
  • in this way, the error of any single matching result is averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
  • comparing the voice to be verified with the enhanced registered voice and determining that they are from the same user includes: extracting the characteristic parameters of the voice to be verified and of the enhanced registered voice through a feature parameter extraction algorithm; performing parameter recognition on these characteristic parameters through a parameter recognition model to obtain the voice template of the speaker to be verified and the voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker, and determining, according to the matching result, whether the voice to be verified and the registered voice are from the same user.
  • the feature parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or, the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network model, or a ResNet model; and/or, the template matching algorithm is the cosine distance method, the linear discriminant method, or the probabilistic linear discriminant analysis method.
  • an embodiment of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are executed on a computer, the computer is caused to execute the method provided by any embodiment of the first aspect of the present application, or the method provided by any embodiment of the second aspect of the present application. For the beneficial effects that can be achieved in this fifth aspect, reference may be made to the foregoing aspects.
  • Fig. 1a shows an exemplary application scenario of the speech enhancement method provided by the embodiment of the present application
  • Fig. 1b shows another exemplary application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 2 shows a schematic structural diagram of a speech enhancement device provided by an embodiment of the present application
  • FIG. 3 shows a flowchart of a speech enhancement method provided by an embodiment of the present application
  • FIG. 4 shows a flowchart of a speech enhancement method provided by another embodiment of the present application.
  • FIG. 5 shows an application scenario of the speech enhancement method provided by the embodiment of the present application
  • FIG. 6 shows a structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 shows a block diagram of a system-on-chip (SoC) provided by an embodiment of the present application.
  • Speaker recognition technology is also known as voiceprint recognition technology.
  • voiceprint recognition technology uses the uniqueness of a speaker's voiceprint to identify the speaker's identity. Because each person's vocal organs (for example, tongue, teeth, larynx, lungs, nasal cavity, vocal tract, etc.) are innately different, and vocalization habits and the like differ through acquired experience, each person's voiceprint features are unique. By analyzing voiceprint features, the identity of the speaker can be identified.
  • the specific process of speaker identification is to collect the voice of the speaker whose identity is to be confirmed, and compare it with the voice of a specific speaker to confirm whether the speaker whose identity is to be confirmed is the specific speaker.
  • the voice of the speaker whose identity is to be confirmed is called “voice to be verified”
  • the speaker whose identity is to be confirmed is called “speaker to be verified”
  • the voice of a specific speaker is called “registered voice”
  • the specific speaker is called the "registered speaker".
  • the above process is described below by taking the voiceprint unlocking function of a mobile phone (i.e., unlocking the phone screen by means of voiceprint recognition) as an example.
  • the mobile phone owner records his own voice (the voice is the registered voice) in the mobile phone through the microphone on the mobile phone.
  • the current user of the mobile phone enters real-time voice (the voice to be verified) through the mobile phone microphone, and the mobile phone uses its built-in voiceprint recognition program to compare the voice to be verified with the registered voice, so as to determine whether the current user of the mobile phone is the owner.
  • if the voice to be verified matches the registered voice, it is judged that the current user of the mobile phone is the owner; the user passes identity authentication, and the mobile phone completes the subsequent screen unlocking action. If the voice to be verified does not match the registered voice, it is judged that the current user is not the owner; the user fails identity authentication, and the mobile phone can refuse the subsequent screen unlocking action.
  • voiceprint recognition technology can be applied in the field of family life, for voice control of smartphones, smart cars, and smart homes (e.g., smart audio and video equipment, smart lighting systems, smart door locks); it can also be applied in the field of payment, where voiceprint authentication is combined with other authentication methods (such as passwords or dynamic verification codes) for double or multiple authentication of the user's identity to improve payment security; it can also be applied in the field of information security, where voiceprint authentication serves as a way to log in to an account; and it can also be applied in the judicial field, where the voiceprint serves as auxiliary evidence for judging identity.
	• the main device for voiceprint recognition can be an electronic device other than a mobile phone, such as a mobile device, including wearable devices (such as wristbands, earphones, etc.) and vehicle terminals; or a fixed device, including smart home devices, network servers, etc.
  • the voiceprint recognition algorithm can be implemented in the cloud in addition to the terminal. For example, after the mobile phone collects the voice to be verified, the collected voice to be verified can be sent to the cloud, and the voice to be verified is recognized by the voiceprint recognition algorithm in the cloud. After the recognition is completed, the cloud returns the recognition result to the mobile phone. Through the cloud recognition mode, users can share the computing resources in the cloud to save the local computing resources of the mobile phone.
	• when the voice of the speaker to be verified is collected, if there is noisy human voice noise in the surrounding environment, the noise will be picked up by the microphone together with the speaker's voice and become part of the voice to be verified.
  • the voice to be verified not only includes the voice of the speaker to be verified, but also contains noise components, which will reduce the recognition rate of the voiceprint.
  • This embodiment does not limit the scene of the voiceprint recognition, for example, it may also be a home scene, a car scene, a meeting place scene, a cinema scene, and the like.
	• when the owner of the mobile phone needs to unlock the mobile phone through voiceprint recognition, if there is noise in the surrounding environment, the sound collected by the mobile phone microphone includes not only the owner's voice but also the noise in the environment. After this real-time voice is compared with the registered voice preset in the mobile phone by the owner, the result may be that the two do not match. That is, even if the current user of the mobile phone is the owner, the mobile phone may still determine that the user identity authentication fails, thus affecting the user experience.
  • some technical solutions remove noise components in the voice to be verified by performing denoising processing on the voice to be verified, so as to improve the recognition rate of the voiceprint.
	• the voice to be verified after the denoising process still contains some noise components, and some valid voice components (the voice components of the speaker to be verified) are also removed. As a result, the voice to be verified after the denoising process may still not be recognized correctly, and the voiceprint recognition rate is not significantly improved.
	• in some other technical solutions, the user records registration voices in multiple different scenarios (for example, home scenarios, cinema scenarios, outdoor noisy scenarios, etc.), and when performing voiceprint recognition, the voice to be verified is compared with the registered voice recorded in the corresponding scenario, in order to improve the voiceprint recognition rate.
	• however, the user needs to record registration voices separately in multiple different scenarios, so the user experience is poor.
  • the embodiments of the present application provide a voice enhancement method, which is used to improve the voiceprint recognition rate and the robustness of the voiceprint recognition method, and improve user experience.
	• a noise component corresponding to the noise component in the voice to be verified is superimposed on the registered voice, and the registered voice with the superimposed noise component is then compared with the voice to be verified to obtain the recognition result.
	• in other words, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components; in this way, the main difference between the voice to be verified and the enhanced registered voice is the difference between their two effective speech components, and after the two are compared through the voiceprint recognition algorithm, a more accurate recognition result can be obtained.
  • the user only needs to record the registration voice in a quiet environment, and there is no need to separately record the registration voice in multiple scenarios, so the user experience is better.
	• the "valid speech component" is the speech component from the speaker; for example, the valid speech component in the voice to be verified is the speech component of the speaker to be verified, and the valid speech component in the enhanced registered voice is the speech component of the registered speaker.
  • FIG. 2 shows the structure of the mobile phone 100 .
  • the mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, an antenna, a communication module 150, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a camera 193, a display screen 194, and the like.
  • the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the mobile phone 100 .
  • the mobile phone 100 may include more or less components than shown, or some components may be combined, or some components may be separated, or different component arrangements.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (application processor, AP), a modem processor, a controller, a digital signal processor (digital signal processor, DSP), baseband processor, etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the processor can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have just been used or recycled by the processor 110 . If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby increasing the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the interface may include an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, and/or a general-purpose input/output (general-purpose input/output, GPIO) interface, etc.
  • the I2S interface can be used for audio communication.
  • the processor 110 may contain multiple sets of I2S buses.
  • the processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 .
  • the PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals.
  • the GPIO interface can be configured by software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the audio module 170, and the like.
  • the GPIO interface can also be configured as an I2S interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the mobile phone 100 .
  • the mobile phone 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.
  • the wireless communication function of the mobile phone 100 may be implemented by an antenna, a communication module 150, a modem processor, a baseband processor, and the like.
  • Antennas are used to transmit and receive electromagnetic wave signals.
  • Each antenna in handset 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the antennas can be multiplexed into the diversity antennas of the wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the mobile phone 100 .
  • the communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like.
	• the communication module 150 can receive electromagnetic waves through the antenna, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then convert it into electromagnetic waves for radiation through the antenna.
  • at least part of the functional modules of the communication module 150 may be provided in the processor 110 .
  • at least some of the functional modules of the communication module 150 may be provided in the same device as at least some of the modules of the processor 110 .
  • the modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
  • the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modulation and demodulation processor may be independent of the processor 110, and may be provided in the same device as the communication module 150 or other functional modules.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100 .
	• the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), a voiceprint recognition program, a voice signal front-end processing program, and the like.
  • the storage data area can store data (such as audio data, phone book, etc.) created during the use of the mobile phone 100, and data required for voiceprint recognition, such as audio data of registered voice, trained voice parameter recognition model, etc.
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
  • the processor 110 executes various functional applications and data processing of the mobile phone 100 by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
	• the mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor.
  • the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
	• the speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the mobile phone 100 can listen to music through the speaker 170A, or listen to a hands-free call.
	• the receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be answered by placing the receiver 170B close to the human ear.
	• the microphone 170C, also called a "mic" or "microphone", is used to convert sound signals into electrical signals.
	• the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C.
  • the mobile phone 100 may be provided with at least one microphone 170C.
  • the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals.
	• the mobile phone 100 has two microphones at the top and bottom: one microphone 170C is provided on the bottom side of the mobile phone 100, and the other microphone 170C is provided on the top side of the mobile phone 100.
	• when the user speaks into the mobile phone 100, the mouth is usually close to the microphone 170C on the bottom side, so the user's voice generates a larger audio signal Va in this microphone, which is referred to herein as the "main mic"; the user's voice generates a smaller audio signal Vb in the microphone on the top side, which is referred to herein as the "secondary mic".
	• the distance between the noise sound source and the main mic is basically the same as the distance between the noise sound source and the secondary mic; that is, it can be considered that the noise intensity at the main mic and the secondary mic is basically the same.
	• the noise signal and the user speech signal can be separated by using the signal strength difference caused by the different positions of the two mics. For example, after the audio signal picked up by the main mic and the audio signal picked up by the secondary mic are differenced (that is, the signal in the secondary mic is subtracted from the signal in the main mic), the user's voice signal can be obtained (this is the principle of dual-mic active noise cancellation). Furthermore, after removing the user's voice signal from the main mic signal, the noise signal can be separated. Alternatively, since the audio signal Vb on the secondary mic is significantly smaller than the audio signal Va on the main mic, the signal picked up by the secondary mic can be treated as a noise signal.
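	• The dual-mic separation described above can be sketched as follows. This is a minimal illustration with synthetic stand-in signals; the signal names, mixing ratios, and sample counts are assumptions for illustration, not measured values.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 8000)
voice = np.sin(2 * np.pi * 220.0 * t)          # stands in for the user's speech
noise = 0.3 * rng.standard_normal(t.shape)     # stands in for ambient noise

# The voice dominates on the bottom (main) mic and is attenuated on the
# top (secondary) mic, while the noise reaches both with similar intensity.
main_mic = voice + noise                 # Va
secondary_mic = 0.1 * voice + noise      # Vb

# Differencing the two mic signals cancels the (roughly common) noise,
# leaving an estimate of the user's voice signal.
voice_estimate = main_mic - secondary_mic

# Removing the voice estimate from the main mic signal leaves a noise
# estimate (here it coincides with the voice-poor secondary mic signal).
noise_estimate = main_mic - voice_estimate
```

	• In this toy setup the recovered voice estimate is simply a scaled copy of the clean voice, which is why the method improves with larger level differences between the two mics.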
  • a setting method of dual mics of the mobile phone 100 is given above, but this is only an exemplary description, and other setting methods can be used for the microphones, for example, the main mic is arranged on the front of the mobile phone 100, and the secondary mic is arranged on the back of the mobile phone.
  • the mobile phone 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.
  • the earphone jack 170D is used to connect wired earphones.
	• the earphone interface 170D may be a universal serial bus (USB) interface, a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
	• this embodiment provides a voice enhancement method: after the voice to be verified is collected, the noise contained in the voice to be verified is separated from the voice to be verified, and the separated noise is then superimposed on the registered voice. In this way, the voice to be verified and the registered voice with the superimposed noise have similar noise components, and the main difference between the two is the difference between their effective voice components, which can improve the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the speech enhancement method provided by this embodiment includes the following steps:
  • S110 Collect registered voice.
  • the mobile phone 100 has a voiceprint unlocking application (which may be a system application or a third-party application).
	• when the owner of the mobile phone 100 registers a user account of the voiceprint unlocking application, he collects his own voice through the mobile phone 100, and the voiceprint unlocking application uses this voice as the reference voice for subsequent voiceprint recognition; this voice is the registered voice.
  • the present application is not limited to this.
	• the owner of the mobile phone 100 enters the registration voice through the setup wizard of the mobile phone 100, and the voiceprint unlocking application of the mobile phone 100 uses the voice as the reference voice for voiceprint recognition.
  • the registered voice is the voice recorded by the owner of the mobile phone 100 in a quiet environment, so that there is no obvious noise component in the registered voice.
	• in some embodiments, if the signal-to-noise ratio (ie, the ratio of the host voice signal strength to the noise signal strength) in the recording environment is higher than a set value (eg, 30dB), the recording environment is considered to be a quiet environment.
	• in other embodiments, if the intensity of the noise signal in the registered voice recording environment is lower than a set value (eg, 20dB), the recording environment is considered to be a quiet environment.
  • the registration voice from the host is collected through the microphone of the mobile phone 100 .
  • the registered voice is near-field voice.
	• the distance between the owner's mouth and the main mic of the mobile phone 100 should be kept within 30cm to 1m; for example, if the owner holds the mobile phone 100 and speaks to the main mic, the distance between the owner's mouth and the main mic is usually within 30cm, which can avoid attenuation of the host voice due to a long propagation distance.
	• when recording the registered voice, the owner enters 6 voice segments to form 6 registered voices; entering multiple voice segments helps to improve the flexibility of speech recognition and the richness of the voiceprint information.
  • the length of each registered voice is 10-30s. Further, each registered voice corresponds to different text content, so as to enrich the voiceprint information contained in the registered voice.
	• after collecting the registered voice, the mobile phone 100 stores the audio signal of the registered voice in the internal memory. However, the present application is not limited to this, and the mobile phone 100 may also upload the audio signal of the registered voice to the cloud, so as to recognize the voiceprint through the cloud recognition mode.
  • the above recording method, recording length, and quantity of the registered voice are only exemplary descriptions, and the present application is not limited thereto.
  • the registered voice may be recorded by other recording devices (eg, voice recorder, dedicated microphone, etc.), the number of registered voices may be one, and the length of the registered voice may be greater than 30s.
	• step S110 is mentioned first for convenience of description. It can be understood that step S110, as the data preparation process of the speech enhancement method, is relatively independent of the single speech enhancement process, and does not need to be performed every time together with the other steps of the speech enhancement method.
  • S120 Collect the voice to be verified, and the voice to be verified is the voice recorded by the current user of the mobile phone in a noisy human voice scene.
  • the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario.
  • the current user of the mobile phone is the person who currently operates the mobile phone 100 , which may be the owner himself or someone other than the owner himself.
  • the voice to be verified is collected through the microphone of the mobile phone 100 .
  • the microphone of the mobile phone 100 is turned on.
  • the current user of the mobile phone 100 can input the voice to be verified through the microphone of the mobile phone 100 to unlock the mobile phone through voiceprint recognition.
	• for example, in some scenarios the user needs to operate the mobile phone 100 from a distance (eg, to open an application in the mobile phone, such as a music application or a phone application), or needs to operate the mobile phone when both hands are occupied (eg, when doing housework).
  • the to-be-verified voice is a voice with specific content.
  • the voice to be verified may also be voice of any text content.
  • the length of the voice to be verified is 10-30 s, so that the voice to be verified can contain relatively rich voiceprint information, which is beneficial to improve the voiceprint recognition rate.
  • this application does not limit this.
	• in some embodiments, the length of the voice to be verified is less than 10s, and thus less than the length of the registered voice. In this case, the user only needs to enter a shorter voice to be verified, which is beneficial to improving the user experience.
	• if the length of the voice to be verified is less than the length of the registered voice, some voice fragments can be intercepted from the voice to be verified and spliced with the originally collected voice to be verified, so that the spliced voice has substantially the same length as the registered voice. In this way, in the subsequent steps of this embodiment (described in detail below), the feature parameters extracted from the registered voice and the feature parameters extracted from the voice to be verified have the same dimension, which is convenient for comparing the similarity of the two. In the description herein, the originally collected voice to be verified and the spliced voice are not distinguished; both are referred to as the voice to be verified.
  • the meaning of splicing the A voice and the B voice is to connect the A voice and the B voice end to end, so that the length of the spliced voice is the sum of the lengths of the A voice and the B voice.
  • the present application does not limit the connection order of the A voice and the B voice.
  • the A voice may be connected after the B voice, or the A voice may be connected before the B voice.
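	• The splicing step above can be sketched as follows: fragments copied from the short voice to be verified are connected end to end until its length matches the registered voice. The function name and the 16 kHz sample rate are illustrative assumptions.

```python
import numpy as np

def splice_to_length(verify: np.ndarray, target_len: int) -> np.ndarray:
    # Repeatedly append fragments copied from the beginning of the
    # voice to be verified until the target length is reached.
    spliced = verify.copy()
    while len(spliced) < target_len:
        fragment = verify[: target_len - len(spliced)]   # intercept a fragment
        spliced = np.concatenate([spliced, fragment])    # connect end to end
    return spliced

fs = 16000                                   # assumed sample rate
verify = np.arange(8 * fs, dtype=float)      # an 8 s voice to be verified
spliced = splice_to_length(verify, 20 * fs)  # match a 20 s registered voice
```

	• After splicing, feature extraction yields the same number of frames for both signals, which is what makes the later template comparison dimensionally consistent.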
	• the noise contained in the voice to be verified is the sound generated by sound sources other than the current user of the mobile phone 100 in the recognition scene, for example, the sound of household equipment (for example, a vacuum cleaner) in a home scene, the sound of the car radio and the engine in a car scene, and the sound projected in the theater and the voices of other audiences in a cinema scene, etc.
  • the sound picked up by the mic of the mobile phone 100 is determined as the noise contained in the voice to be verified, so that the noise contained in the voice to be verified can be easily determined.
  • the present application is not limited to this.
	• for example, in some embodiments, it is considered that the initial segment of the speech to be verified contains only noise components, so that after the initial segment is copied multiple times, the result is determined as the noise contained in the speech to be verified. For another example, in other embodiments, the speech to be verified is divided into multiple speech frames, and the energy of each speech frame is calculated; if the energy of a speech frame is lower than a set value, the speech frame can be determined as a noise frame, thereby simplifying the noise extraction process.
  • other methods in the prior art may also be used to determine the noise in the speech to be verified, which will not be described in detail.
	• the energy of a speech frame is the sum of the squares of the signal values of the speech signals included in the speech frame; that is, if the signal value of the i-th speech signal in the speech frame is x_i, and the number of speech signals in the speech frame is N, the energy of the speech frame is E = x_1^2 + x_2^2 + ... + x_N^2.
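	• The frame-energy computation and the low-energy noise-frame test described above can be sketched as follows; the frame length and threshold values here are illustrative tuning assumptions.

```python
import numpy as np

def frame_energy(frame) -> float:
    # Energy of a speech frame: sum of squared signal values.
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def noise_frames(signal: np.ndarray, frame_len: int, threshold: float):
    # Split the signal into non-overlapping frames and keep the frames
    # whose energy falls below the threshold as candidate noise frames.
    frames = [signal[i : i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [f for f in frames if frame_energy(f) < threshold]
```

	• Frames selected this way can then be concatenated to form the noise signal that step S140 superimposes on the registered voice.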
  • S140 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • the present application is not limited to this, and in other embodiments, the superposition of the registration speech signal and the noise signal may also be completed in the frequency domain.
  • the embodiment of the present application realizes the enhancement of the registered voice signal by simply superimposing the numerical value of the voice signal, and the algorithm is simple.
  • the length of the noise is equal to the length of the registered voice. In other embodiments, the length of the noise may be smaller than the length of the registered voice.
  • the number of registered voices is 6. Therefore, noises contained in the voices to be verified are respectively superimposed on the 6 registered voices to obtain 6 enhanced registered voices.
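	• The time-domain superposition of step S140 can be sketched as a sample-wise addition of the noise signal values and the registered voice signal values. Tiling a short noise signal to cover the full registered voice is an illustrative choice here, not a step mandated by the text.

```python
import numpy as np

def enhance_registered(registered, noise) -> np.ndarray:
    # Add the noise signal values to the registered voice signal values.
    registered = np.asarray(registered, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(registered):
        reps = -(-len(registered) // len(noise))  # ceiling division
        noise = np.tile(noise, reps)              # repeat the noise to cover
    return registered + noise[: len(registered)]
```

	• Applying this to each of the 6 registered voices with the noise extracted from the voice to be verified yields the 6 enhanced registered voices used in step S150.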
  • S150 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice. Since the MFCC method can better conform to the auditory perception characteristics of the human ear, in this embodiment, the feature parameters in the speech signal are extracted by the Mel-Frequency Cepstrum Coefficient (MFCC) method.
	• first, the audio signal S_T of the speech to be verified is divided into a series of speech frames x(n), where n is the sequence number of the speech frame.
  • the length of each speech frame is 10-30ms.
	• for example, an audio signal S_T with a length of 10s is divided into 500 speech frames.
  • the MFCC feature extraction method includes the steps of Fourier transform, Mel filtering, discrete cosine transform, etc. on the speech frame x(n).
  • the order of the discrete cosine transform is 20. Therefore, the MFCC feature parameter of each speech frame x(n) has 20 dimensions.
  • the extraction process can be adjusted as required. For example, differential calculation may be performed on the MFCC feature parameters extracted above. For example, after taking the first-order difference and the second-order difference of the MFCC feature parameters extracted above, for each speech frame, a set of 60-dimensional MFCC feature parameters is obtained.
  • other parameters of the extraction process such as the length and number of speech frames, the order of discrete cosine transform, etc., can also be adjusted according to the computing capability of the device and the requirements of recognition accuracy.
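	• The differential step above can be sketched as follows: starting from 20-dimensional MFCC features per frame (assumed to come from a Fourier transform, Mel filtering, and discrete cosine transform pipeline, which is not reproduced here), first- and second-order differences along the time axis are appended, giving 60 dimensions per frame.

```python
import numpy as np

def add_deltas(mfcc: np.ndarray) -> np.ndarray:
    # mfcc has shape (num_frames, 20); take differences along time,
    # repeating the first frame so the frame count is preserved.
    delta1 = np.diff(mfcc, axis=0, prepend=mfcc[:1])      # first-order difference
    delta2 = np.diff(delta1, axis=0, prepend=delta1[:1])  # second-order difference
    return np.concatenate([mfcc, delta1, delta2], axis=1)

mfcc = np.random.default_rng(0).standard_normal((500, 20))  # 500 frames x 20 dims
features = add_deltas(mfcc)   # shape (500, 60)
```

	• With 500 frames of 20-dimensional base MFCCs, this matches the 10,000 base feature values mentioned later in the text.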
  • the feature parameters in the speech signal can also be extracted by other methods, for example, the log mel method, the Linear Predictive Cepstrum Coefficient (LPCC) method, and the like.
  • the identification model for parameter identification is not limited in this application, and can be a probability model, such as an identity vector (I-vector) model; or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model, ResNet model, etc.
  • the 10,000-dimensional feature parameters of the speech to be verified are input into the recognition model, and the speech template of the current user of the mobile phone 100 is obtained after the dimensionality reduction and abstraction of the recognition model.
  • the speech template of the current user of the mobile phone 100 is a 512-dimensional feature vector, denoted as A.
	• similarly, the feature parameters of the 6 enhanced registered voices are input into the recognition model to obtain 6 host voice templates; each voice template is a 512-dimensional feature vector, and the 6 host voice templates are denoted as B1, B2, ..., B6.
  • the template matching method may be a cosine distance method, a linear discriminant method, or a probabilistic linear discriminant analysis method, or the like.
  • the cosine distance method is used as an example for description below.
	• the cosine distance method evaluates the similarity of two feature vectors by computing the cosine of the angle between them. Taking the feature vector A (the feature vector corresponding to the voice template of the current user of the mobile phone 100) and the feature vector B1 (the feature vector corresponding to a host voice template of the mobile phone 100) as an example, the cosine similarity can be expressed as cos θ1 = (Σ a_i·b_i) / (√(Σ a_i^2)·√(Σ b_i^2)), where a_i is the i-th coordinate of the feature vector A, b_i is the i-th coordinate of the feature vector B1, and θ1 is the angle between the feature vector A and the feature vector B1.
	• the larger the value of cos θ1, the closer the directions of the feature vector A and the feature vector B1, and the higher the similarity of the two feature vectors; the smaller the value of cos θ1, the lower the similarity between the two feature vectors.
	• if the similarity P between the current user's voice and the host's voice is greater than a set value (for example, 0.8), it is determined that the current user of the mobile phone 100 is the host, and the mobile phone 100 unlocks the screen; otherwise, it is determined that the current user of the mobile phone 100 is not the host, and the mobile phone 100 does not unlock the screen.
	• the to-be-verified voice is compared with the six enhanced registered voices to obtain six cosine similarity calculation results, and the six results are then averaged to obtain the final similarity P between the current user's voice and the host's voice.
  • the matching errors between the voice to be verified and the single enhanced registered voice can be averaged, which is beneficial to improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
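	• The cosine matching and averaging described above can be sketched as follows; the 0.8 threshold follows the example in the text, and the function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match(user_template, host_templates, threshold: float = 0.8):
    # Compare the user's template A with each host template B1..B6 and
    # average the cosine similarities into the final similarity P.
    p = float(np.mean([cosine_similarity(user_template, b)
                       for b in host_templates]))
    return p, p > threshold
```

	• Averaging over the six templates smooths out the matching error against any single enhanced registered voice, as noted above.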
  • the voiceprint recognition algorithm (the algorithms corresponding to steps S130 to S170 ) can be implemented on the mobile phone 100 to realize the offline recognition of the voiceprint; it can also be implemented in the cloud to save the mobile phone 100 local computing resources.
  • the voiceprint recognition algorithm is implemented in the cloud
  • the mobile phone 100 uploads the to-be-verified voice collected in step S120 to the cloud server, and the cloud server uses the voiceprint recognition algorithm to authenticate the identity of the current user of the mobile phone 100, and returns the authentication result.
  • the mobile phone 100 decides whether to unlock the screen according to the authentication result.
  • a reverberation component is added to the registration speech to obtain an enhanced registration speech.
  • the voice of the speaker to be verified will generate reverberation in the room, and the reverberation, as a part of the interference factor, will have a certain impact on the recognition rate of the voiceprint.
  • Reverberation prediction is performed on the registered voice based on the recognition scene, that is, the reverberation that the registered voice would produce in the recognition scene is simulated, and the reverberation components obtained from this simulation are added to the registered voice, so that the non-speech components of the voice to be verified and those of the enhanced registered voice are as close as possible, thereby improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the reverberation generated by the registered speech in the recognition scene is estimated.
  • the image source model method can simulate the reflection path of the sound wave in the room, and calculate the room impulse response function (RIR) of the sound field according to the delay and attenuation parameters of the sound wave.
  • the reverberation generated by the registered speech in the room is obtained by convolving the audio signal of the registered speech with the impulse response function.
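The convolution step can be sketched as below. The RIR here is a toy two-reflection response rather than one computed by an actual image source model, and all sample values are illustrative:

```python
def convolve(signal, rir):
    # Discrete convolution: y[n] = sum_k x[k] * h[n - k]
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += x * h
    return out

registered = [1.0, 0.5, 0.25]   # registered-speech samples (illustrative)
rir = [1.0, 0.0, 0.3]           # direct path plus one delayed, attenuated reflection
reverberant = convolve(registered, rir)  # registered speech with simulated reverberation
```

In practice the convolution would be run with an FFT-based routine over an RIR computed for the actual room geometry, but the operation is the same.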
  • The distance between the speaker to be verified and the microphone may be large (for example, more than 1 m), so the voice of the speaker to be verified is somewhat attenuated by the time it reaches the microphone. Therefore, in some embodiments, to account for the distance between the speaker to be verified and the microphone, far-field simulation is also performed on the registered voice when the reverberation of the registered voice is estimated with the image source model method.
  • The distance between the registered voice and the voice receiving device in the simulated sound field is set according to the distance between the speaker to be verified and the microphone, so that the acquisition distance of the registered voice is simulated to be the same as that of the voice to be verified. This further reduces the differences between the voice to be verified and the enhanced registered voice other than the effective speech components, improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • The voice to be verified is also subjected to front-end processing, for example, echo cancellation, de-reverberation, active noise reduction, dynamic gain, directional pickup, etc.
  • The enhanced registered voice is subjected to the same front-end processing as the voice to be verified (that is, the voice to be verified and the enhanced registered voice are passed through the same front-end processing algorithm module) to further improve the voiceprint recognition rate and the robustness of the voiceprint recognition method.
  • the feature parameter extraction step of the speech signal (ie, step S150 ) may be omitted, and the speech signal may be recognized directly through a deep neural network model.
  • this embodiment is used to provide another voice enhancement method.
  • The scene of the voice to be verified is also recognized to obtain the scene type corresponding to the voice to be verified.
  • the enhanced registration voice is also determined according to the above scene type.
  • the speech enhancement method performed by the mobile phone 100 according to this embodiment includes the following steps:
  • the registered voice is the voice recorded by the owner of the mobile phone 100 in a quiet environment, so that there is no obvious noise component in the registered voice.
  • the voice to be verified is the voice recorded by the current user of the mobile phone in the noisy human voice scene.
  • the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario.
  • The current user of the mobile phone is the person who currently operates the mobile phone 100, and may be the owner himself or someone other than the owner.
  • S230 Determine the noise contained in the speech to be verified.
  • the noise contained in the voice to be verified is the sound generated by other sound sources other than the current user of the mobile phone 100 in the recognition scene.
  • S240 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
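The superposition in step S240 is plain sample-wise addition; a minimal sketch follows. Tiling the noise to match the registration length is an assumption for illustration, not spelled out in the text:

```python
def superimpose(registration, noise):
    # Add the noise signal values to the registration-speech signal values.
    # The noise segment is tiled if it is shorter than the registration speech.
    return [s + noise[i % len(noise)] for i, s in enumerate(registration)]

registration = [0.5, -0.2, 0.1, 0.4]  # clean registered-speech samples (illustrative)
noise = [0.05, -0.03]                 # noise extracted from the voice to be verified
enhanced = superimpose(registration, noise)
```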
  • steps S210-S240 are substantially the same as steps S110-S140 in Embodiment 1, and detailed processes in the steps are not repeated.
  • The number of registered voices is the same as in the first embodiment, that is, six. Therefore, in step S240, the noise contained in the voice to be verified is superimposed on each of the six registered voices to obtain six enhanced registered voices.
  • S250 Determine the scene type corresponding to the voice to be verified. Specifically, after the voice to be verified is collected, the scene type corresponding to it is identified by a scene recognition algorithm, such as a GMM method or a DNN method.
  • the label value of the scene type can be a home scene; a car scene; an outdoor noisy scene; a venue scene; a cinema scene, etc.
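A toy stand-in for the GMM/DNN scene classifier mentioned above: it scores a single made-up 1-D acoustic feature against one Gaussian per scene label. A real classifier would use full mixture models or a neural network over spectral features; the parameters below are invented for illustration:

```python
import math

# Hypothetical per-scene Gaussian parameters (mean, std) of a 1-D feature.
SCENES = {
    "home scene": (0.2, 0.1),
    "car scene": (0.6, 0.15),
    "outdoor noisy scene": (0.9, 0.2),
}

def log_likelihood(x, mean, std):
    # Log density of a 1-D Gaussian.
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def classify_scene(feature):
    # Pick the scene label whose model gives the feature the highest likelihood.
    return max(SCENES, key=lambda s: log_likelihood(feature, *SCENES[s]))

label = classify_scene(0.55)
```

The chosen label is then used to look up the corresponding template noise.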
  • The template noise is noise corresponding to the scene type determined in step S250, for example, noise recorded in the scene determined in step S250.
  • One scene type can correspond to multiple groups of template noise.
  • For example, the scene type corresponding to the voice to be verified is determined in step S250 to be a home scene, and three groups of template noise are recorded in the home scene (for example, sound produced by home audio-visual equipment, background voices produced when family members talk, and/or noise from household appliances).
  • a total of 24 enhanced registration voices are formed.
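The count of 24 follows from combining each of the six registered voices with the extracted noise plus the three groups of template noise. A sketch of the bookkeeping, with string labels standing in for audio signals:

```python
registered_voices = [f"registered_{i}" for i in range(1, 7)]     # six registered voices
noises = ["extracted_noise"] + [f"template_noise_{j}" for j in range(1, 4)]

# Every (registered voice, noise) pair yields one enhanced registered voice.
enhanced = [(r, n) for r in registered_voices for n in noises]
count = len(enhanced)  # 6 voices x 4 noises = 24
```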
  • S270 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voices; refer to step S150 in the first embodiment. It can be understood that, in this embodiment, the feature parameters of each of the 24 enhanced registered voices are extracted.
  • S280 Perform parameter identification on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice to obtain the voice template of the current user of the mobile phone 100 and the voice template of the owner of the mobile phone 100 respectively, refer to S160 in the first embodiment.
  • The obtained 24 owner voice templates are respectively recorded as B1, B2, ..., B24.
  • S290 Match the voice template of the owner of the mobile phone 100 with the voice template of the current user of the mobile phone 100 to obtain a recognition result.
  • Referring to step S170 in the first embodiment, the cosine similarities between the 24 owner voice templates and the current-user voice template of the mobile phone 100 are cos θ1, cos θ2, ..., cos θ24, respectively.
  • If the similarity P between the current user's voice and the owner's voice is greater than the set value (for example, 0.8), it is determined that the current user of the mobile phone 100 is the owner himself, and the mobile phone 100 unlocks the screen; otherwise, it is determined that the current user of the mobile phone 100 is not the owner himself, and the mobile phone 100 will not unlock the screen.
  • steps S230 and S240 are omitted, that is, the step of enhancing the registered voice according to the noise contained in the voice to be verified is omitted, and the registered voice is only enhanced according to the template noise corresponding to the recognition scene.
  • In this case, 18 enhanced registered voices are obtained; the corresponding owner voice templates are recorded as B7, B8, ..., B24, and the similarity P between the current user's voice of the mobile phone 100 and the owner's voice is P = (cos θ7 + cos θ8 + ... + cos θ24)/18.
  • For the implementation body of the voiceprint recognition algorithm (implemented locally on the mobile phone 100 or in the cloud), other processing of the speech (for example, reverberation estimation, far-field simulation, front-end processing, etc.), and so on, refer to the introduction in Embodiment 1; details are not repeated here.
  • the scene type corresponding to the voice to be verified, the distance between the speaker to be verified and the microphone, etc. are all environmental characteristic parameters in the voice to be verified.
  • This embodiment changes the application scenario of the voice enhancement method on the basis of the first embodiment. Specifically, the voice enhancement method in this embodiment is applied to the scenario shown in FIG. 5 for controlling the smart speaker 200 .
  • the smart speaker 200 has a voice recognition function, and the user can interact with the smart speaker 200 through voice, so as to perform functions such as song on demand, weather query, schedule management, and smart home control through the smart speaker 200 .
  • the method authenticates the identity of the user to determine whether the current user is the owner of the smart speaker 200, and then determines whether the current user has the authority to control the smart speaker 200 to perform the operation.
  • the speech enhancement method of this embodiment includes:
  • S310 Collect registered voice.
  • the registration voice from the owner of the smart speaker 200 is collected through the microphone of the smart speaker 200, but the application is not limited to this.
  • the registered voice can be saved locally in the smart speaker 200 to recognize the user's voiceprint through the smart speaker 200 to realize offline recognition of the voiceprint; the registered voice can also be uploaded to the cloud to use The computing resources in the cloud recognize the user's voiceprint to save the local computing resources of the smart speaker 200 .
  • S320 Collect the voice to be verified.
  • the voice to be verified is collected through the microphone of the smart speaker 200 .
  • Acquisition parameters of the voice to be verified include, for example, the duration and text content of the voice to be verified.
  • S330 Determine the noise contained in the speech to be verified.
  • The voice to be verified is divided into a plurality of speech frames, and the energy of each frame is calculated. Since the energy of noise is generally lower than that of valid speech, a speech frame whose energy is below a predetermined value can be determined to be a noise frame, thereby simplifying the noise extraction process.
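The energy-gated noise-frame selection in step S330 can be sketched as below; the frame length and threshold are made-up values for illustration:

```python
def frame_energy(frame):
    # Sum of squared sample values in the frame.
    return sum(s * s for s in frame)

def noise_frames(samples, frame_len, threshold):
    # Split the voice to be verified into frames and keep the low-energy ones,
    # treating them as noise rather than valid speech.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [f for f in frames if frame_energy(f) < threshold]

samples = [0.9, -0.8, 0.7, 0.01, 0.02, -0.01]  # loud speech followed by quiet noise
noise = noise_frames(samples, frame_len=3, threshold=0.1)
```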
  • S340 Superimpose the noise contained in the voice to be verified on the registration voice to obtain an enhanced registration voice.
  • the signal value of the noise signal and the signal value of the registration speech signal are added to obtain the enhanced registration speech.
  • S350 Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice.
  • the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice are extracted by the MFCC method.
  • The recognition model used for parameter recognition is not limited in this embodiment, and may be a probability model, such as an identity vector (I-vector) model, or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model or a ResNet model.
  • The template matching method may be a cosine distance method, a linear discriminant method, a probabilistic linear discriminant analysis method, or the like. If the similarity between the current user's voice and the owner's voice is greater than the set value, it is determined that the current user of the smart speaker 200 is the owner himself, and the smart speaker 200 performs the corresponding operation in response to the user's voice command; otherwise, it is determined that the current user of the smart speaker 200 is not the owner himself, and the smart speaker 200 ignores the user's voice command.
  • the speech enhancement method in this embodiment is substantially the same as the speech enhancement method in Embodiment 1 except for the application scenario. Therefore, for technical details not described in this embodiment, reference may be made to the description in Embodiment 1.
  • The voiceprint recognition algorithm (the algorithms corresponding to steps S330 to S370) can be implemented on the smart speaker 200 to realize offline recognition of voiceprints; it can also be implemented in the cloud to save the local computing resources of the smart speaker 200.
  • the voiceprint recognition algorithm is implemented in the cloud
  • The smart speaker 200 uploads the voice to be verified collected in step S320 to the cloud server.
  • The cloud server uses the voiceprint recognition algorithm to authenticate the identity of the current user of the smart speaker 200 and returns the authentication result to the smart speaker 200, which determines whether to execute the user's voice command according to the result.
  • Electronic device 400 may include one or more processors 401 coupled to controller hub 403 .
  • The controller hub 403 communicates with the processor 401 via a multi-drop bus such as a Front Side Bus (FSB), a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection 406.
  • Processor 401 executes instructions that control general types of data processing operations.
  • The controller hub 403 includes, but is not limited to, a Graphics & Memory Controller Hub (GMCH) (not shown) and an Input/Output Hub (IOH) (which may be on a separate chip) (not shown), where the GMCH includes memory and graphics controllers and is coupled to the IOH.
  • Electronic device 400 may also include a coprocessor 402 and memory 404 coupled to controller hub 403 .
  • One or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 coupled directly to the processor 401; in this case, the controller hub 403 and the IOH are in a single chip.
  • the memory 404 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two.
  • Memory 404 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, in particular temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the speech enhancement method described in FIGS. 3 and 4 .
  • When the instructions are executed on a computer, the computer is caused to execute the method disclosed in the first embodiment and/or the second embodiment.
  • The coprocessor 402 is a special-purpose processor, such as a high-throughput Many Integrated Core (MIC) processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU (General-Purpose computing on Graphics Processing Units), an embedded processor, or the like.
  • the electronic device 400 may further include a network interface (NIC, Network Interface Controller) 406 .
  • The network interface 406 may include a transceiver for providing a radio interface for the electronic device 400 to communicate with any other suitable devices (e.g., front-end modules, antennas, etc.).
  • network interface 406 may be integrated with other components of electronic device 400 .
  • the network interface 406 can implement the functions of the communication unit in the above-mentioned embodiments.
  • the electronic device 400 may further include an input/output (I/O, Input/Output) device 405 .
  • The I/O device 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 400.
  • Figure 6 is exemplary only. That is, although FIG. 6 shows that the electronic device 400 includes multiple components such as the processor 401, the controller hub 403, and the memory 404, in practical applications a device using the methods of the present application may include only some of these components, for example, only the processor 401 and the network interface 406. The optional components in FIG. 6 are shown with dashed lines.
  • The SoC 500 includes: an interconnect unit 550 coupled to the processor 510; a system agent unit 580; a bus controller unit 590; an integrated memory controller unit 540; a set of one or more coprocessors 520, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random-Access Memory (SRAM) unit 530; and a Direct Memory Access (DMA) unit 560.
  • The coprocessor 520 includes a special-purpose processor, such as a network or communications processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.
  • Static random access memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions.
  • the computer-readable storage medium stores instructions, in particular temporary and permanent copies of the instructions.
  • the instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the speech enhancement method described in FIGS. 3 and 4 .
  • When the instructions are executed on a computer, the computer is caused to execute the method disclosed in the first embodiment and/or the second embodiment.
  • Program code may be applied to input instructions to perform the functions described herein and to generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • A processing system includes any system having a processor such as a Digital Signal Processor (DSP), a microcontroller, an Application-Specific Integrated Circuit (ASIC), or a microprocessor.
  • the program code may be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • the program code may also be implemented in assembly or machine language, if desired.
  • the mechanisms described herein are not limited to the scope of any particular programming language. In either case, the language may be a compiled language or an interpreted language.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium, the instructions representing various logic in a processor; the instructions, when read by a machine, cause the machine to fabricate logic that implements the techniques described herein.
  • These representations, referred to as "IP (Intellectual Property) cores," may be stored on tangible computer-readable storage media and provided to multiple customers or production facilities for loading into the manufacturing machines that actually manufacture the logic or processor.
  • an instruction converter may be used to convert instructions from a source instruction set to a target instruction set.
  • An instruction translator may transform (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core.
  • Instruction translators can be implemented in software, hardware, firmware, or a combination thereof.
  • the instruction translator may be on-processor, off-processor, or partially on-processor and partially off-processor.

Abstract

The present application provides an artificial intelligence (AI)-based speech enhancement method, a terminal device, a speech enhancement system, and a computer readable storage medium. An electronic device acquires speech to be verified; the electronic device determines at least one of environmental noise and an environment feature parameter comprised in the speech to be verified; the electronic device then enhances a registration speech on the basis of the environmental noise and/or the environment feature parameter; finally, the electronic device compares the speech to be verified with the enhanced registration speech to determine whether the speech to be verified and the registration speech are from the same user. In embodiments of the present application, the registration speech is enhanced according to a noise component in the speech to be verified so as to cause the enhanced registration speech and the speech to be verified to have similar noise components, so that a more accurate recognition result can be obtained.

Description

Speech Enhancement Method, Device, System and Storage Medium
This application claims priority to Chinese Patent Application No. 202010650893.X, entitled "Speech Enhancement Method, Device, System and Storage Medium", filed with the China Patent Office on July 8, 2020, the entire content of which is incorporated herein by reference.
Technical Field

The present application relates to the technical field of biometrics, and in particular to a speech enhancement method, device, system, and computer-readable storage medium.
Background

At present, biometric authentication technology based on biometric identification has gradually been popularized and applied in fields such as family life and public security. Biometric features that can be applied to biometric authentication include fingerprints, faces, irises, DNA, voiceprints, and the like. Among them, voiceprint recognition technology (also known as speaker recognition technology), which uses the voiceprint as the identification feature, collects sound samples without contact; the collection is more concealed and is therefore more easily accepted by users.

In the prior art, when there is noise in the environment in which a sound sample is collected, the voiceprint recognition rate is affected.
SUMMARY OF THE INVENTION

Some embodiments of the present application provide a speech enhancement method, a terminal device, a speech enhancement system, and a computer-readable storage medium. The present application is described below from several aspects, whose embodiments and beneficial effects may be referred to one another.

In a first aspect, an embodiment of the present application provides a speech enhancement method applied to an electronic device, including: collecting a voice to be verified; determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified; enhancing a registered voice based on the environmental noise and/or the environmental characteristic parameters; and comparing the voice to be verified with the enhanced registered voice to determine whether the voice to be verified and the registered voice are from the same user.

According to the embodiments of the present application, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components. In this way, the main difference between the voice to be verified and the enhanced registered voice lies in the difference between their effective speech components, and comparing the two with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in the embodiments of the present application, the user only needs to record the registered voice in a quiet environment rather than recording it separately in multiple scenarios, so the user experience is better.
In some embodiments, the registered voice is a voice from the registered speaker collected in a quiet environment. In this way, there is no obvious noise component in the registered voice, which can improve the recognition accuracy.

In some embodiments, enhancing the registered voice based on the environmental noise includes superimposing the environmental noise on the registered voice. The implementation of the present application obtains the enhanced registered voice by superimposing the environmental noise on the registered voice, and the algorithm is simple.

In some embodiments, the environmental noise is the sound picked up by a secondary microphone of the electronic device. This makes it convenient to determine the noise contained in the voice to be verified.

In some embodiments, the duration of the voice to be verified is shorter than that of the registered voice. In this way, the user can record a short voice to be verified, which helps improve the user experience.
In some embodiments, the environmental characteristic parameters include the scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to the scene type, and superimposing the template noise on the registered voice.

According to the embodiments of the present application, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve the recognition accuracy.

In some embodiments, the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm. In some embodiments, the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.

In some embodiments, the scene type of the voice to be verified is any one of the following: a home scene; a vehicle-mounted scene; an outdoor noisy scene; a venue scene; a cinema scene. These scene types cover the places of the user's daily activities, which helps improve the user experience.
In some embodiments, the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the electronic device; enhancing the registered voice based on the environmental characteristic parameters includes: performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device. The far-field simulation maps the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produces the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produces the voice to be verified).

According to the embodiments of the present application, the far-field simulation of the registered voice takes into account the attenuation of the voice to be verified during propagation, so that the enhanced registered voice and the voice to be verified have components that are as close as possible, which helps improve the recognition accuracy.

In some embodiments, performing far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the electronic device includes: establishing, based on the image source model method and according to that distance, an impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation.
In some embodiments, the voice to be verified and the enhanced registered voice are both speech processed by the same front-end processing algorithm. Front-end processing removes interfering factors from the speech, which helps improve the accuracy of voiceprint recognition.
In some embodiments, the front-end processing algorithm includes at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
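Of the front-end steps listed, dynamic gain is the simplest to illustrate. The sketch below assumes a plain RMS-normalizing automatic gain control; the function name and target level are illustrative, not taken from the application.

```python
import numpy as np

def dynamic_gain(signal, target_rms=0.1, eps=1e-8):
    """Scale the signal so its RMS level matches target_rms. Applying the
    same normalization to both the voice to be verified and the enhanced
    registered voice keeps their levels comparable before matching."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / (rms + eps))
```

Running both inputs through identical front-end steps like this is what guarantees that any remaining difference between them reflects the speakers rather than the processing.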
In some embodiments, there are multiple registered voices, and each registered voice is enhanced based on the environmental noise and/or the environmental characteristic parameters, so as to obtain multiple enhanced registered voices.
According to this embodiment of the present application, with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and the similarity between the voice of the speaker to be verified and the voice of the registered speaker can then be judged comprehensively from these results. The error of any single matching result is thereby averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
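The score-averaging idea can be sketched as follows, assuming feature templates have already been extracted as vectors; the cosine similarity and the threshold value are illustrative choices, not mandated by the text.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_against_templates(probe, templates, threshold=0.7):
    """Match the voice to be verified against every enhanced registered
    template and average the similarity scores, so the error of any
    single match is smoothed out."""
    scores = [cosine_score(probe, t) for t in templates]
    return np.mean(scores) >= threshold, scores
```

A single noisy match can then fall below threshold without causing a false rejection, as long as the average over all enhanced templates stays high.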
In some embodiments, comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice come from the same user includes: extracting characteristic parameters of the voice to be verified and of the enhanced registered voice through a characteristic parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on those of the enhanced registered voice through a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker through a template matching algorithm, and determining from the matching result that the voice to be verified and the registered voice come from the same user.
In some embodiments, the characteristic parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network (TDNN) model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
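As an illustration of the log-mel variant named above, a self-contained numpy sketch follows; the filterbank size, FFT length, and hop are arbitrary example values, and MFCCs would add a DCT over the log filterbank energies.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, fs=16000, n_fft=512, hop=160, n_mels=26):
    """Frame the signal, take magnitude spectra, and pool them through a
    triangular mel filterbank; the log of the filterbank energies is the
    log-mel feature mentioned in the text."""
    # build triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)
    # frame, window, FFT, mel pooling
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * np.hamming(n_fft)
        mag = np.abs(np.fft.rfft(frame))
        frames.append(np.log(fbank @ mag + 1e-8))
    return np.array(frames)          # shape: (n_frames, n_mels)
```

The resulting per-frame feature matrix is what the parameter recognition model (i-vector, TDNN, or ResNet) would consume to produce a voice template.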
In a second aspect, an embodiment of the present application provides a speech enhancement method, including: a terminal device collects the voice to be verified and sends it to a server communicatively connected to the terminal device; the server determines environmental noise and/or environmental characteristic parameters contained in the voice to be verified; the server enhances the registered voice based on the environmental noise and/or the environmental characteristic parameters; the server compares the voice to be verified with the enhanced registered voice and determines that the voice to be verified and the registered voice come from the same user; and the server sends, to the terminal device, the result of determining that the voice to be verified and the registered voice come from the same user.
According to this embodiment of the present application, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components. The main difference between the two then lies in the difference between their effective speech components, and comparing them with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in this embodiment the user only needs to record the registered voice in a quiet environment, rather than recording it separately in multiple scenarios, so the user experience is better. Moreover, because the speaker recognition algorithm runs on the server, local computing resources of the terminal device are saved.
In some embodiments, the registered voice is speech from the registered speaker collected in a quiet environment. In this way, the registered voice contains no obvious noise component, which improves recognition accuracy.
In some embodiments, enhancing the registered voice based on the environmental noise includes superimposing the environmental noise on the registered voice. This embodiment obtains the enhanced registered voice simply by superimposing the environmental noise on the registered voice, so the algorithm is simple.
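The superposition step can be sketched directly; the only detail to handle is that the captured noise may be shorter than the registered voice, so this illustrative helper tiles it (a simplification, not the application's implementation).

```python
import numpy as np

def superimpose_noise(registered, noise):
    """Add the ambient noise captured alongside the voice to be verified
    onto the clean registered voice, tiling or trimming the noise so it
    covers the full length of the registration."""
    if len(noise) < len(registered):
        reps = int(np.ceil(len(registered) / len(noise)))
        noise = np.tile(noise, reps)
    return registered + noise[:len(registered)]
```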
In some embodiments, the environmental noise is the sound picked up by a secondary microphone of the terminal device. This embodiment makes it convenient to determine the noise contained in the voice to be verified.
In some embodiments, the duration of the voice to be verified is shorter than the duration of the registered voice. The user can thus record a shorter voice for verification, which improves the user experience.
In some embodiments, the environmental characteristic parameters include the scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
According to this embodiment of the present application, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
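A minimal sketch of the template-noise lookup, with a hypothetical in-memory noise library standing in for recorded template clips; the scene names and noise arrays are placeholders, not part of the application.

```python
import numpy as np

# Hypothetical template-noise library: one representative noise clip per
# scene type. In practice these would be recorded or curated clips.
rng = np.random.default_rng(0)
TEMPLATE_NOISE = {
    "home":    0.01 * rng.standard_normal(16000),
    "vehicle": 0.05 * rng.standard_normal(16000),
    "outdoor": 0.08 * rng.standard_normal(16000),
    "venue":   0.06 * rng.standard_normal(16000),
    "cinema":  0.02 * rng.standard_normal(16000),
}

def enhance_with_scene_noise(registered, scene):
    """Look up the template noise for the recognized scene type and
    superimpose it on the registered voice, tiled or trimmed to length."""
    noise = TEMPLATE_NOISE[scene]
    reps = -(-len(registered) // len(noise))      # ceiling division
    return registered + np.tile(noise, reps)[:len(registered)]
```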
In some embodiments, the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm. In some embodiments, the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.
In some embodiments, the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a conference venue scene; a cinema scene. The scene types of this embodiment cover the places where users carry out daily activities, which helps improve the user experience.
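The GMM scene recognizer named above can be approximated, for illustration only, by a single diagonal Gaussian per scene fit to frame-level features; a real system would use a multi-component GMM or a DNN. Classification picks the scene with the highest total log-likelihood over the utterance's frames.

```python
import numpy as np

class DiagonalGaussianSceneClassifier:
    """Single diagonal Gaussian per scene, fit to per-frame feature
    vectors; a one-component stand-in for a GMM scene recognizer."""

    def fit(self, features_by_scene):
        self.params = {}
        for scene, feats in features_by_scene.items():   # feats: (n, d)
            mu = feats.mean(axis=0)
            var = feats.var(axis=0) + 1e-6               # avoid zero variance
            self.params[scene] = (mu, var)
        return self

    def log_likelihood(self, feats, scene):
        mu, var = self.params[scene]
        ll = -0.5 * (np.log(2 * np.pi * var) + (feats - mu) ** 2 / var)
        return float(ll.sum())

    def predict(self, feats):
        return max(self.params, key=lambda s: self.log_likelihood(feats, s))
```

The predicted scene label is then used to select the matching template noise for enhancing the registered voice.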
In some embodiments, the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to that distance. Herein, the far-field simulation of the registered voice serves to map the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produced the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produced the voice to be verified).
According to this embodiment of the present application, the far-field simulation of the registered voice accounts for the attenuation that the voice to be verified undergoes during propagation, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
In some embodiments, performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model method and that distance, an impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation.
In some embodiments, the voice to be verified and the enhanced registered voice are both speech processed by the same front-end processing algorithm. Front-end processing removes interfering factors from the speech, which helps improve the accuracy of voiceprint recognition.
In some embodiments, the front-end processing algorithm includes at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
In some embodiments, there are multiple registered voices, and the server enhances each registered voice based on the environmental noise and/or the environmental characteristic parameters, so as to obtain multiple enhanced registered voices.
According to this embodiment of the present application, with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and the similarity between the voice of the speaker to be verified and the voice of the registered speaker can then be judged comprehensively from these results. The error of any single matching result is thereby averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
In some embodiments, comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice come from the same user includes: extracting characteristic parameters of the voice to be verified and of the enhanced registered voice through a characteristic parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on those of the enhanced registered voice through a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker through a template matching algorithm, and determining from the matching result that the voice to be verified and the registered voice come from the same user.
In some embodiments, the characteristic parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network (TDNN) model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device; and a processor which, when executing the instructions in the memory, causes the electronic device to perform the speaker recognition method provided by any embodiment of the first aspect of the present application. For the beneficial effects achievable by the third aspect, reference may be made to the beneficial effects of the method provided by any embodiment of the first aspect, which are not repeated here.
In a fourth aspect, an embodiment of the present application provides a speech enhancement system, including a terminal device and a server communicatively connected to the terminal device, wherein:
the terminal device collects the voice to be verified and sends it to the server; the server is configured to determine environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance the registered voice based on the environmental noise and/or the environmental characteristic parameters, compare the voice to be verified with the enhanced registered voice, and determine that the voice to be verified and the registered voice come from the same user; and the server is further configured to send, to the terminal device, the result of determining that the voice to be verified and the registered voice come from the same user.
According to this embodiment of the present application, the registered voice is enhanced according to the noise components in the voice to be verified, so that the enhanced registered voice and the voice to be verified have similar noise components. The main difference between the two then lies in the difference between their effective speech components, and comparing them with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in this embodiment the user only needs to record the registered voice in a quiet environment, rather than recording it separately in multiple scenarios, so the user experience is better. Moreover, because the speaker recognition algorithm runs on the server, local computing resources of the terminal device are saved.
In some embodiments, the registered voice is speech from the registered speaker collected in a quiet environment. In this way, the registered voice contains no obvious noise component, which improves recognition accuracy.
In some embodiments, enhancing the registered voice based on the environmental noise includes superimposing the environmental noise on the registered voice. This embodiment obtains the enhanced registered voice simply by superimposing the environmental noise on the registered voice, so the algorithm is simple.
In some embodiments, the environmental noise is the sound picked up by a secondary microphone of the terminal device. This embodiment makes it convenient to determine the noise contained in the voice to be verified.
In some embodiments, the duration of the voice to be verified is shorter than the duration of the registered voice. The user can thus record a shorter voice for verification, which improves the user experience.
In some embodiments, the environmental characteristic parameters include the scene type corresponding to the voice to be verified; enhancing the registered voice based on the environmental characteristic parameters includes: determining, based on the scene type corresponding to the voice to be verified, the template noise corresponding to that scene type, and superimposing the template noise on the registered voice.
According to this embodiment of the present application, the registered voice is enhanced by superimposing template noise on it, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
In some embodiments, the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm. In some embodiments, the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.
In some embodiments, the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a conference venue scene; a cinema scene. The scene types of this embodiment cover the places where users carry out daily activities, which helps improve the user experience.
In some embodiments, the environmental characteristic parameters of the voice to be verified include the distance between the user who produces the voice to be verified and the terminal device; enhancing the registered voice based on the environmental characteristic parameters includes: performing a far-field simulation on the registered voice according to that distance. Herein, the far-field simulation of the registered voice serves to map the acquisition distance of the registered voice (the distance between the voice acquisition device and the user who produced the registered voice) to the acquisition distance of the voice to be verified (the distance between the voice acquisition device and the user who produced the voice to be verified).
According to this embodiment of the present application, the far-field simulation of the registered voice accounts for the attenuation that the voice to be verified undergoes during propagation, so that the enhanced registered voice and the voice to be verified have noise components that are as close as possible, which helps improve recognition accuracy.
In some embodiments, performing a far-field simulation on the registered voice according to the distance between the user who produces the voice to be verified and the terminal device includes: establishing, based on the image source model method and that distance, an impulse response function of the site where the voice to be verified is collected; and convolving the impulse response function with the audio signal of the registered voice to perform the far-field simulation.
In some embodiments, the voice to be verified and the enhanced registered voice are both speech processed by the same front-end processing algorithm. Front-end processing removes interfering factors from the speech, which helps improve the accuracy of voiceprint recognition.
In some embodiments, the front-end processing algorithm includes at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
In some embodiments, there are multiple registered voices, and the server enhances each registered voice based on the environmental noise and/or the environmental characteristic parameters, so as to obtain multiple enhanced registered voices.
According to this embodiment of the present application, with multiple enhanced registered voices, the voice to be verified can be matched against each of them to obtain multiple similarity matching results, and the similarity between the voice of the speaker to be verified and the voice of the registered speaker can then be judged comprehensively from these results. The error of any single matching result is thereby averaged out, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
In some embodiments, comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice come from the same user includes: extracting characteristic parameters of the voice to be verified and of the enhanced registered voice through a characteristic parameter extraction algorithm; performing parameter recognition on the characteristic parameters of the voice to be verified and on those of the enhanced registered voice through a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and matching the voice template of the speaker to be verified with the voice template of the registered speaker through a template matching algorithm, and determining from the matching result that the voice to be verified and the registered voice come from the same user.
In some embodiments, the characteristic parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or the parameter recognition model is an identity vector (i-vector) model, a time-delay neural network (TDNN) model, or a ResNet model; and/or the template matching algorithm is the cosine distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the method provided by any embodiment of the first aspect of the present application, or cause the computer to perform the method provided by any embodiment of the second aspect of the present application. For the beneficial effects achievable by the fifth aspect, reference may be made to the beneficial effects of the method provided by any embodiment of the first or second aspect, which are not repeated here.
Description of Drawings
Fig. 1a shows an exemplary application scenario of the speech enhancement method provided by an embodiment of the present application;
Fig. 1b shows another exemplary application scenario of the speech enhancement method provided by an embodiment of the present application;
Fig. 2 shows a schematic structural diagram of a speech enhancement device provided by an embodiment of the present application;
Fig. 3 shows a flowchart of a speech enhancement method provided by an embodiment of the present application;
Fig. 4 shows a flowchart of a speech enhancement method provided by another embodiment of the present application;
Fig. 5 shows an application scenario of the speech enhancement method provided by an embodiment of the present application;
Fig. 6 shows a structural diagram of an electronic device provided by an embodiment of the present application;
Fig. 7 shows a block diagram of a system-on-chip (SoC) provided by an embodiment of the present application.
Detailed Description
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
Speaker recognition technology (also known as voiceprint recognition technology) identifies a speaker's identity by exploiting the uniqueness of the speaker's voiceprint. Each person's vocal organs (for example, the tongue, teeth, larynx, lungs, nasal cavity, and vocal tract) differ innately, and speaking habits differ through acquired behavior, so each person's voiceprint features are unique; by analyzing these features, the speaker's identity can be recognized.
The specific process of speaker recognition is to collect the voice of a speaker whose identity is to be confirmed and compare it with the voice of a specific speaker, to confirm whether the speaker to be confirmed is that specific speaker. Herein, the voice of the speaker whose identity is to be confirmed is called the "voice to be verified", and that speaker is called the "speaker to be verified"; the voice of the specific speaker is called the "registered voice", and the specific speaker is called the "registered speaker".
Referring to Fig. 1a, the above process is described by taking the voiceprint unlocking function of a mobile phone (that is, unlocking the phone screen by means of voiceprint recognition) as an example. Before using the voiceprint unlocking function, the phone's owner records his or her own voice into the phone through the phone's microphone (this voice is the registered voice).
When the phone screen needs to be unlocked by means of voiceprint recognition, the current user of the phone records a real-time voice through the phone's microphone (this voice is the voice to be verified), and the phone compares the voice to be verified with the registered voice through a built-in voiceprint recognition program to judge whether the current user is the phone's owner. If the voice to be verified matches the registered voice, the current user is judged to be the owner, the user passes identity authentication, and the phone performs the subsequent screen unlocking action; if the voice to be verified does not match the registered voice, the current user is judged not to be the owner, the user fails identity authentication, and the phone can refuse the subsequent screen unlocking action.
以上以手机的声纹解锁功能为例对声纹识别技术的应用进行了说明,但本申请不限于此,声纹识别技术可应用于需要对说话人的身份进行识别的其他场景。例如,声纹识别技术可以应用于家庭生活领域,对智能手机、智能汽车、智能家居(例如,智能音视频设备、智能照明系统、智能门锁)等进行语音控制;声纹识别技术还可以应用于支付领域,将声纹认证与其他认证手段(例如,密码、动态验证码等)相结合对用户的身份进行双重或多重认证,以提高支付的安全性;声纹识别技术还可以应用于信息安全领域,将声纹认证作为登录账号的方式;声纹识别技术还可以应用于司法领域,将声纹作为判断身份的辅助证据等。The application of the voiceprint recognition technology is described above by taking the voiceprint unlocking function of a mobile phone as an example, but the present application is not limited to this, and the voiceprint recognition technology can be applied to other scenarios where the identity of the speaker needs to be recognized. For example, voiceprint recognition technology can be applied to the field of family life, and voice control of smart phones, smart cars, smart homes (eg, smart audio and video equipment, smart lighting systems, smart door locks), etc.; voiceprint recognition technology can also be applied In the field of payment, the voiceprint authentication is combined with other authentication methods (such as passwords, dynamic verification codes, etc.) to perform double or multiple authentication of the user's identity to improve the security of payment; voiceprint recognition technology can also be applied to information In the security field, voiceprint authentication is used as a way to log in to an account; voiceprint recognition technology can also be applied to the judicial field, using voiceprint as auxiliary evidence for judging identity.
Moreover, the device that performs voiceprint recognition may be an electronic device other than a mobile phone: for example, a mobile device, including a wearable device (e.g., a wristband or earphones) or an in-vehicle terminal; or a fixed device, including a smart home appliance or a network server. In addition, the voiceprint recognition algorithm may be implemented in the cloud as well as on the terminal. For example, after the mobile phone collects the speech to be verified, it may send the collected speech to the cloud, where a voiceprint recognition algorithm performs the recognition; after recognition is completed, the cloud returns the recognition result to the mobile phone. Through this cloud recognition mode, users can share the cloud's computing resources and thus save local computing resources on the mobile phone.
In the scenario shown in Figure 1b, if there is noisy babble in the surrounding environment while the speech of the speaker to be verified is being collected, this noise is picked up by the microphone together with the speech and becomes part of the speech to be verified. The speech to be verified then contains noise components in addition to the voice of the speaker to be verified, which lowers the voiceprint recognition rate.
This embodiment does not limit the voiceprint recognition scenario; for example, it may also be a home scenario, an in-vehicle scenario, a conference-hall scenario, a cinema scenario, and the like.
When the owner of the mobile phone needs to unlock it through voiceprint recognition and there is noise in the surrounding environment, the sound collected by the mobile phone's microphone contains not only the owner's voice but also the environmental noise. As a result, after the mobile phone compares the collected real-time speech of the owner with the registered speech preset in the phone, it may conclude that the two do not match. Even if the current user of the mobile phone is the owner, the phone may still report that identity authentication has failed, which degrades the user experience.
In the prior art, some technical solutions perform denoising on the speech to be verified to remove its noise components and thereby improve the voiceprint recognition rate. However, the denoised speech still contains residual noise components, and some valid speech components (the speech components of the speaker to be verified) are removed along with the noise. Consequently, the denoised speech may still fail to be recognized correctly, and the improvement in the voiceprint recognition rate is not significant.
Other prior-art solutions improve the voiceprint recognition rate by recording registered speech separately in different scenarios. Specifically, the user records registered speech in multiple different scenarios (e.g., a home scenario, a cinema scenario, a noisy outdoor scenario), and during voiceprint recognition the speech to be verified is compared with the registered speech recorded in the corresponding scenario, so as to improve the recognition rate. In this prior art, the user has to record registered speech separately in multiple scenarios, so the user experience is poor.
To this end, embodiments of the present application provide a speech enhancement method for improving the voiceprint recognition rate and the robustness of the voiceprint recognition method, while also improving the user experience. In the present application, after the speech to be verified is collected, a noise component corresponding to the noise component in the speech to be verified is superimposed on the registered speech, and the registered speech with the superimposed noise is then compared with the speech to be verified to obtain the recognition result. In other words, the registered speech is enhanced according to the noise component of the speech to be verified, so that the enhanced registered speech and the speech to be verified have similar noise components. The main difference between the speech to be verified and the enhanced registered speech then lies in the difference between their valid speech components, and comparing the two with a voiceprint recognition algorithm yields a more accurate recognition result. In addition, in the embodiments of the present application, the user only needs to record the registered speech in a quiet environment, and there is no need to record registered speech separately in multiple scenarios, so the user experience is better.
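The core idea above — superimposing the separated noise on the registered speech before comparison — can be sketched as follows. This is a minimal illustration, assuming speech signals are sample arrays; the tiling of the noise to cover the registered speech, and the cosine-similarity comparison of feature vectors, are illustrative choices rather than details taken from the disclosure:

```python
import numpy as np

def enhance_registered_speech(registered: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Superimpose the noise separated from the speech to be verified onto the
    registered speech, so both signals share similar noise components."""
    # Tile (or truncate) the noise so it covers the registered speech exactly.
    reps = int(np.ceil(len(registered) / len(noise)))
    noise = np.tile(noise, reps)[: len(registered)]
    return registered + noise

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compare two voiceprint feature vectors; higher means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In a full pipeline, `cosine_similarity` would be applied to feature parameters extracted from the two signals (by a speaker-embedding model), not to the raw waveforms.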
Here, a "valid speech component" is the speech component that originates from the speaker; for example, the valid speech component of the speech to be verified is the speech component of the speaker to be verified, and the valid speech component of the enhanced registered speech is the speech component of the registered speaker.
The technical solution of the present application is introduced below, still with reference to the voiceprint unlocking function of the mobile phone in Figure 1b, but it can be understood that the present application is not limited thereto.
FIG. 2 shows the structure of the mobile phone 100. The mobile phone 100 may include a processor 110, an external memory interface 120, an internal memory 121, an antenna, a communication module 150, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a camera 193, a display screen 194, and the like.

It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a controller, a digital signal processor (DSP), a baseband processor, and the like. The different processing units may be independent devices or may be integrated in one or more processors.

The processor can generate operation control signals according to instruction opcodes and timing signals, completing the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs those instructions or data again, it can call them directly from this memory, which avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, and/or a general-purpose input/output (GPIO) interface, among others.

The I2S interface can be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170. The PCM interface can also be used for audio communication, sampling, quantizing, and encoding analog signals.

The GPIO interface can be configured by software. It can be configured as a control signal or as a data signal. In some embodiments, the GPIO interface may be used to connect the processor 110 with the camera 193, the display screen 194, the audio module 170, and the like. The GPIO interface can also be configured as an I2S interface, etc.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present invention are only schematic and do not constitute a structural limitation on the mobile phone 100. In other embodiments of the present application, the mobile phone 100 may also adopt interface connection manners different from those in the foregoing embodiment, or a combination of multiple interface connection manners.
The wireless communication function of the mobile phone 100 may be implemented by the antenna, the communication module 150, the modem processor, the baseband processor, and the like.

The antenna is used to transmit and receive electromagnetic wave signals. Each antenna in the mobile phone 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization; for example, an antenna may be multiplexed as a diversity antenna for a wireless local area network. In other embodiments, an antenna may be used in combination with a tuning switch.

The communication module 150 may provide wireless communication solutions applied on the mobile phone 100, including 2G/3G/4G/5G. The communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The communication module 150 can receive electromagnetic waves through the antenna, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The communication module 150 can also amplify a signal modulated by the modem processor and convert it into electromagnetic waves radiated through the antenna. In some embodiments, at least some of the functional modules of the communication module 150 may be provided in the processor 110. In some embodiments, at least some of the functional modules of the communication module 150 and at least some of the modules of the processor 110 may be provided in the same device.

The modem processor may include a modulator and a demodulator. The modulator modulates a low-frequency baseband signal to be transmitted into a medium- or high-frequency signal. The demodulator demodulates a received electromagnetic wave signal into a low-frequency baseband signal, and then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the communication module 150 or other functional modules.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the mobile phone 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement data storage functions, for example, saving files such as music and videos on the external memory card.

The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store an operating system, application programs required for at least one function (e.g., a sound playback function or an image playback function), a voiceprint recognition program, a speech-signal front-end processing program, and the like. The data storage area can store data created during use of the mobile phone 100 (e.g., audio data, a phone book) as well as data required for voiceprint recognition, such as the audio data of the registered speech and a trained speech-parameter recognition model. In addition, the internal memory 121 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the mobile phone 100 by running instructions stored in the internal memory 121 and/or instructions stored in the memory provided in the processor.
The mobile phone 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.

The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.

The speaker 170A, also called a "loudspeaker," is used to convert an audio electrical signal into a sound signal. The mobile phone 100 can play music or conduct a hands-free call through the speaker 170A.

The receiver 170B, also called an "earpiece," is used to convert an audio electrical signal into a sound signal. When the mobile phone 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the ear.

The microphone 170C, also called a "mic" or "mouthpiece," is used to convert a sound signal into an electrical signal. When recording the registered speech or the speech to be verified, the user can speak with the mouth close to the microphone 170C so that the sound signal is input into the microphone 170C. The mobile phone 100 may be provided with at least one microphone 170C.
In other embodiments, the mobile phone 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. Specifically, the mobile phone 100 has one microphone at the top and one at the bottom: one microphone 170C is provided on the bottom edge of the mobile phone 100, and the other microphone 170C is provided on the top edge. When a user makes a call or sends a voice message, the mouth is usually close to the microphone on the bottom edge; the user's voice therefore produces a relatively large audio signal Va in this microphone, which is referred to herein as the "main mic." At the same time, the user's voice also produces a certain amount of audio signal Vb in the microphone 170C on the top edge, but because this microphone is farther from the user's mouth, its audio signal Vb is significantly smaller than the signal Va on the main mic; this microphone is referred to herein as the "secondary mic."
As for the noise in the environment, since the noise source is usually far from the mobile phone 100, the distances from the noise source to the main mic and to the secondary mic can be regarded as essentially equal; that is, the noise intensity collected by the main mic and by the secondary mic can be regarded as essentially the same.
The difference in signal strength caused by the different positions of the two mics can be used to separate the noise signal from the user's speech signal. For example, by differencing the audio signal picked up by the main mic and the audio signal picked up by the secondary mic (i.e., subtracting the secondary-mic signal from the main-mic signal), the user's speech signal can be obtained (this is the principle of dual-mic active noise reduction). Then, after removing the user's speech signal from the main-mic signal, the noise signal can be separated out. Alternatively, since the audio signal Vb on the secondary mic is significantly smaller than the audio signal Va on the main mic, the signal picked up by the secondary mic can itself be regarded as the noise signal.
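Under the simplifying assumptions stated above — the noise arrives at both mics with essentially equal intensity, and the secondary mic picks up little of the user's voice — the separation can be sketched as follows. The function name and sample arrays are illustrative, not part of the disclosure:

```python
import numpy as np

def separate_dual_mic(main: np.ndarray, secondary: np.ndarray):
    """Split the main-mic signal into speech and noise estimates.

    Assumes the noise component is essentially identical on both mics and
    the user's voice on the secondary mic is negligible. Under these
    assumptions the noise estimate reduces to the secondary-mic signal.
    """
    speech = main - secondary   # the difference cancels the shared noise
    noise = main - speech       # remove the speech estimate to leave noise
    return speech, noise
```

In practice the two channels would first need gain and delay calibration; this sketch only mirrors the idealized description in the text.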
The above gives one arrangement of the dual mics of the mobile phone 100, but this is only an exemplary description; the microphones may be arranged in other ways, for example, with the main mic on the front of the mobile phone 100 and the secondary mic on the back.

In other embodiments, the mobile phone 100 may further be provided with three, four, or more microphones 170C to collect sound signals and reduce noise, and also to identify sound sources and implement a directional recording function, among other things.

The earphone interface 170D is used to connect wired earphones. The earphone interface 170D may be a universal serial bus (USB) interface, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
[Embodiment 1]
The technical solution of this embodiment is described below with reference to the mobile phone voiceprint unlocking scenario in Figure 1b. It can be understood that the present application is not limited thereto, and the speech enhancement method of the present application can also be applied to scenarios other than the one shown in Figure 1b.

Referring to FIG. 3, this embodiment provides a speech enhancement method. After the speech to be verified is collected, the noise contained in the speech to be verified is separated from it, and the separated noise is then superimposed on the registered speech. In this way, the speech to be verified and the noise-superimposed registered speech have similar noise components, and the main difference between them lies in the difference between their valid speech components, which improves the voiceprint recognition rate and the robustness of the voiceprint recognition method. Specifically, the speech enhancement method provided by this embodiment includes the following steps:
S110: Collect the registered speech. To provide the voiceprint unlocking function, the mobile phone 100 has a voiceprint unlocking application (which may be a system application or a third-party application). To use the voiceprint unlocking function of the mobile phone 100, when the owner registers a user account with the voiceprint unlocking application, the owner's own speech is collected through the mobile phone 100, and the voiceprint unlocking application uses this speech as the reference speech for subsequent voiceprint recognition; this speech is the registered speech. However, the present application is not limited thereto. For example, in other embodiments, when the mobile phone 100 is powered on for the first time, the owner records the registered speech through the setup wizard of the mobile phone 100, and the voiceprint unlocking application of the mobile phone 100 uses this speech as the reference speech for voiceprint recognition.
Here, the registered speech is speech recorded by the owner of the mobile phone 100 in a quiet environment, so that the registered speech contains no obvious noise component.
Whether the environment is quiet can be characterized by the signal-to-noise ratio of the registered-speech recording environment (i.e., the ratio of the strength of the owner's speech signal to the strength of the noise signal): when the signal-to-noise ratio of the recording environment is higher than a set value (e.g., 30 dB), the recording environment is considered quiet. Alternatively, when the intensity of the noise signal in the registered-speech recording environment is lower than a set value (e.g., 20 dB), the recording environment is considered quiet.
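The SNR-based quietness check described above can be sketched as follows, assuming the speech and noise signals are available as separate sample arrays (the function names and the mean-power formulation are illustrative; the 30 dB threshold is the example value from the text):

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels, computed from mean sample powers."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

def is_quiet_environment(speech: np.ndarray, noise: np.ndarray,
                         snr_threshold_db: float = 30.0) -> bool:
    """The environment counts as quiet when the SNR exceeds the threshold."""
    return snr_db(speech, noise) > snr_threshold_db
```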
In this embodiment, the registered speech from the owner is collected through the microphone of the mobile phone 100. The registered speech is near-field speech. When recording the registered speech, the distance between the owner's mouth and the main mic of the mobile phone 100 is kept within 30 cm to 1 m; for example, the owner holds the mobile phone 100 and speaks directly toward the main mic, keeping the mouth within 30 cm of the main mic. This avoids attenuation of the owner's speech due to a long propagation distance.
When recording the registered speech, the owner records six utterances to form six registered speech entries. Recording multiple utterances helps improve the flexibility of speech recognition and the richness of the voiceprint information.
To balance the user's operating experience with ensuring that each registered speech entry contains sufficient voiceprint information, the length of each registered speech entry is 10 to 30 s. Further, each registered speech entry corresponds to different text content, so as to enrich the voiceprint information contained in the registered speech. After collecting the registered speech, the mobile phone 100 stores the audio signal of the registered speech in the internal memory. However, the present application is not limited to this; the mobile phone 100 may also upload the audio signal of the registered speech to the cloud, so as to recognize the voiceprint through the cloud recognition mode.

The above recording manner, length, and number of registered speech entries are only exemplary, and the present application is not limited thereto. For example, in other examples, the registered speech may be recorded by other recording devices (e.g., a voice recorder or a dedicated microphone), the number of registered speech entries may be one, and the length of a registered speech entry may be greater than 30 s.
For coherence of the description, step S110 is mentioned first. It can be understood that step S110, as the data preparation process of the speech enhancement method, is relatively independent of any single speech enhancement run and does not need to occur together with the other steps of the method every time.
S120: Collect the speech to be verified. The speech to be verified is speech recorded by the current user of the mobile phone in a noisy-babble scenario. In other words, the mobile phone user can unlock the screen of the mobile phone by means of voiceprint recognition in this scenario. In addition, the current user of the mobile phone is the person currently operating the mobile phone 100, who may be the owner or someone other than the owner.
In this embodiment, the speech to be verified is collected through the microphone of the mobile phone 100. When the screen of the mobile phone 100 is locked, the microphone of the mobile phone 100 is on; at this time, the current user of the mobile phone 100 can record the speech to be verified through the microphone to unlock the phone through voiceprint recognition. For example, when the user needs to operate the mobile phone 100 from a distance (e.g., to open an application on the phone, such as a music application or a phone application), or needs to operate the mobile phone 100 while both hands are occupied (e.g., while doing housework), the user inputs the speech to be verified through the microphone of the mobile phone 100 to unlock the phone through voiceprint recognition.

The speech to be verified is speech with specific content. In other embodiments, the speech to be verified may also be speech with arbitrary text content.
In this embodiment, the length of the speech to be verified is 10 to 30 s, so that it can contain relatively rich voiceprint information, which helps improve the voiceprint recognition rate. However, the present application does not limit this. For example, in other embodiments, the length of the speech to be verified is less than 10 s and thus shorter than the length of the registered speech; in this case, the user can record a shorter speech to be verified, which helps improve the user experience. When the length of the speech to be verified is less than that of the registered speech, a partial speech segment can be cut from the speech to be verified and spliced with the originally collected speech to be verified, so that the spliced speech has essentially the same length as the registered speech. In this way, in the subsequent steps of this embodiment (described in detail below), the feature parameters extracted from the registered speech and those extracted from the speech to be verified have the same dimensions, which facilitates comparing their similarity. In the description herein, no distinction is made between the originally collected speech to be verified and the spliced speech to be verified; both are referred to as the speech to be verified.
Herein, splicing voice A with voice B means connecting voice A and voice B end to end, so that the length of the spliced voice is the sum of the lengths of voice A and voice B. On this basis, this application does not limit the connection order of voice A and voice B; for example, voice A may be connected after voice B, or before it.
S130: Determine the noise contained in the voice to be verified. In this embodiment, the noise contained in the voice to be verified is the sound generated, in the recognition scene, by sound sources other than the current user of the mobile phone 100: for example, the sound of household appliances (e.g., a vacuum cleaner) or of running water while washing dishes in a home scene; the sound of the car radio or the engine in an in-vehicle scene; the sound of the projection audio system or the voices of other audience members in a cinema scene.
In this embodiment, the sound picked up by the secondary microphone of the mobile phone 100 is determined as the noise contained in the voice to be verified, so that this noise can be determined conveniently. However, this application is not limited thereto. For example, in some embodiments, the initial segment of the voice to be verified is considered to contain only noise components, so the initial segment is copied multiple times and the result is determined as the noise contained in the voice to be verified. As another example, in other embodiments, the voice to be verified is divided into multiple speech frames, and the energy of each speech frame is calculated. Since the energy of noise is usually lower than that of valid speech, when the energy of a speech frame is lower than a predetermined value, the frame can be determined as a noise frame, which simplifies the noise extraction process. In addition, other methods in the prior art may also be used to determine the noise in the voice to be verified, which are not described one by one here.
The energy of a speech frame is the sum of the squares of the signal values of the speech signals included in that frame. Illustratively, let the signal value of the i-th speech signal in a frame be $x_i$ and the number of speech signals in the frame be $N$; the energy of the frame is then

$$E = \sum_{i=1}^{N} x_i^2$$
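The frame-energy criterion can be expressed as a short Python sketch. This is illustrative only: the threshold value and the list-of-frames representation are assumptions, not values fixed by the patent.

```python
def frame_energy(frame):
    # E = sum of x_i^2 over the sample values x_i in the frame
    return sum(x * x for x in frame)

def noise_frames(frames, threshold):
    """Label a frame as noise when its energy falls below `threshold`
    (the threshold is application-dependent)."""
    return [f for f in frames if frame_energy(f) < threshold]
```

For example, with frames `[0.1, 0.1]` and `[1.0, 1.0]` and a threshold of 0.5, only the first frame is classified as noise.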
S140: Superimpose the noise contained in the voice to be verified on the registered voice to obtain an enhanced registered voice. In this embodiment, the signal values of the noise signal are added to the signal values of the registered voice signal in the time domain to obtain the enhanced registered voice. However, this application is not limited thereto; in other embodiments, the superposition of the registered voice signal and the noise signal may also be performed in the frequency domain. The embodiment of this application enhances the registered voice signal by simply adding signal values, so the algorithm is simple.
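The time-domain superposition of step S140 amounts to sample-wise addition. A minimal sketch, assuming both signals are lists of sample values; when the noise is shorter than the registered voice (as permitted below), the tail of the registered voice is left unchanged, which is one possible convention the patent does not fix:

```python
def superimpose(registered, noise):
    """Add noise sample values onto registered-voice sample values
    in the time domain to form the enhanced registered voice."""
    out = list(registered)
    for i, n in enumerate(noise):
        out[i] += n
    return out
```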
In this embodiment, the length of the noise is equal to the length of the registered voice; in other embodiments, the length of the noise may be smaller than that of the registered voice.
In this embodiment, there are 6 registered voices. Therefore, the noise contained in the voice to be verified is superimposed on each of the 6 registered voices to obtain 6 enhanced registered voices.
S150: Extract the feature parameters of the voice to be verified and of the enhanced registered voices. Since the Mel-frequency cepstrum coefficient (MFCC) method matches the auditory perception characteristics of the human ear well, this embodiment extracts the feature parameters of the speech signals using the MFCC method.
First, the extraction of feature parameters is introduced taking the voice to be verified as an example. For ease of description, the audio signal of the voice to be verified is denoted S_T. Before feature extraction, the audio signal S_T is divided into a series of speech frames x(n), where n is the number of speech frames. Considering that the motion model of the vocal organs remains basically stable within 10-30 ms, the length of each speech frame is 10-30 ms. Specifically, in this embodiment, the 10 s audio signal S_T is divided into 500 speech frames.
After framing the audio signal S_T, the feature parameters of each speech frame x(n) are extracted by the MFCC method. MFCC feature extraction includes applying a Fourier transform, Mel filtering, and a discrete cosine transform to the speech frame x(n); the feature parameters of a frame are the coefficients of the cosine functions of each order after the discrete cosine transform. In this embodiment, the order of the discrete cosine transform is 20, so the MFCC feature parameters of each speech frame x(n) have 20 dimensions.
After concatenating the feature parameters of all speech frames x(n), the MFCC feature parameters of the audio signal S_T of the voice to be verified are obtained; it can be understood that their dimension is 20 × 500 = 10000.
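The size bookkeeping above can be checked with a small sketch (illustrative only; the function name is an assumption): a 10 s utterance cut into 20 ms frames gives 500 frames, and with 20 DCT coefficients per frame the concatenated feature vector is 10000-dimensional.

```python
def mfcc_dimensions(duration_ms, frame_ms, dct_order):
    """Number of frames and total concatenated MFCC dimension
    for a signal of duration_ms split into frame_ms frames."""
    n_frames = duration_ms // frame_ms
    return n_frames, n_frames * dct_order

# 10 s signal, 20 ms frames, 20 coefficients per frame
assert mfcc_dimensions(10_000, 20, 20) == (500, 10_000)
```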
The extraction of the feature parameters of the enhanced registered voices follows the above process and is not repeated here. It can be understood that a separate set of MFCC feature parameters is obtained for each enhanced registered voice.
It should be noted that the above is a description of the principle of the MFCC method; in an actual implementation, the extraction process can be adjusted as required. For example, differential calculation may be performed on the extracted MFCC feature parameters: after taking the first-order and second-order differences of the MFCC feature parameters extracted above, a set of 60-dimensional MFCC feature parameters is obtained for each speech frame. In addition, other parameters of the extraction process, such as the length and number of speech frames and the order of the discrete cosine transform, can also be adjusted according to the computing capability of the device and the required recognition accuracy.
In addition to the MFCC method, the feature parameters of a speech signal can also be extracted by other methods, for example, the log-mel method or the linear predictive cepstrum coefficient (LPCC) method.
S160: Perform parameter recognition on the feature parameters of the voice to be verified and of the enhanced registered voices, to obtain the voice template of the current user of the mobile phone 100 and the voice templates of the owner of the mobile phone 100, respectively. This application does not limit the recognition model used for parameter recognition: it may be a probabilistic model, for example, an identity-vector (i-vector) model, or a deep neural network model, for example, a time-delay neural network (TDNN) model or a ResNet model.
The 10000-dimensional feature parameters of the voice to be verified are input into the recognition model; after dimensionality reduction and abstraction by the model, the voice template of the current user of the mobile phone 100 is obtained. In this embodiment, the voice template of the current user of the mobile phone 100 is a 512-dimensional feature vector, denoted A.
Correspondingly, the feature parameters of the 6 enhanced registered voices are input into the recognition model to obtain 6 voice templates of the owner of the mobile phone 100. Each voice template is a 512-dimensional feature vector; the 6 owner voice templates are denoted B1, B2, ..., B6.
It can be understood that the dimensions of the above feature vectors are merely exemplary and can be adjusted in practice according to the computing capability of the device and the required recognition accuracy.
S170: Match the voice templates of the owner of the mobile phone 100 against the voice template of the current user of the mobile phone 100 to obtain a recognition result. In this application, the template matching method may be the cosine distance method, linear discriminant analysis, probabilistic linear discriminant analysis, or the like. The cosine distance method is used as an example below.
The cosine distance method evaluates the similarity of two feature vectors by computing the cosine of the angle between them. Taking feature vector A (corresponding to the voice template of the current user of the mobile phone 100) and feature vector B1 (corresponding to an owner voice template of the mobile phone 100) as an example, their cosine similarity can be expressed as:
$$\cos\theta_1 = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}$$

where $a_i$ is the i-th coordinate of feature vector A, $b_i$ is the i-th coordinate of feature vector B1, and $\theta_1$ is the angle between feature vectors A and B1. The larger the value of $\cos\theta_1$, the closer the directions of feature vectors A and B1, and the higher the similarity of the two feature vectors; conversely, the smaller the value of $\cos\theta_1$, the lower their similarity.
For the 6 enhanced registered voices, 6 owner voice templates B1, B2, ..., B6 are obtained, whose cosine similarities with the voice template of the current user of the mobile phone 100 are cosθ1, cosθ2, ..., cosθ6, respectively. Averaging the 6 cosine similarities gives the similarity between the current user's voice and the owner's voice: P = (cosθ1 + cosθ2 + ... + cosθ6)/6.
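The scoring in steps S170 and above can be sketched as follows; a minimal Python illustration, assuming the templates are plain lists of coordinates (function names are not from the patent):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (sum a_i * b_i) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def average_similarity(user_template, owner_templates):
    # P = mean of cos(theta_k) over all owner voice templates B1..Bk
    sims = [cosine_similarity(user_template, t) for t in owner_templates]
    return sum(sims) / len(sims)
```

For instance, a user template identical in direction to one owner template and orthogonal to another yields similarities 1 and 0, so P = 0.5.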
If the similarity P between the current user's voice and the owner's voice is greater than a set value (for example, 0.8), the current user of the mobile phone 100 is determined to be the owner, and the mobile phone 100 unlocks the screen; otherwise, the current user is determined not to be the owner, and the mobile phone 100 does not unlock the screen.
In this embodiment, the voice to be verified is compared with each of the 6 enhanced registered voices to obtain 6 cosine similarity results, which are then averaged to obtain the final similarity P between the current user's voice and the owner's voice. This averages out the matching error between the voice to be verified and any single enhanced registered voice, which helps improve the accuracy of voiceprint recognition and the robustness of the voiceprint recognition algorithm.
It should be noted that, in this embodiment, the voiceprint recognition algorithm (corresponding to steps S130-S170) can be implemented on the mobile phone 100 to realize offline voiceprint recognition, or in the cloud to save local computing resources of the mobile phone 100. When the voiceprint recognition algorithm is implemented in the cloud, the mobile phone 100 uploads the voice to be verified collected in step S120 to a cloud server; the cloud server authenticates the identity of the current user of the mobile phone 100 using the voiceprint recognition algorithm and returns the authentication result to the phone, and the mobile phone 100 decides whether to unlock the screen according to the result.
The implementation of the speech enhancement method of this embodiment has been described above. It can be understood that the above is merely exemplary; on the premise of conforming to the inventive concept of this application, those skilled in the art can make other variations on the basis of the above embodiment.
For example, in some embodiments, in addition to enhancing the registered voice according to the noise in the voice to be verified, a reverberation component is also added to the registered voice to obtain the enhanced registered voice.
When sound waves propagate indoors, they are reflected multiple times by the walls of the room and by indoor obstacles. As a result, after the sound source stops, several sound waves remain superimposed and mixed together, so that the sound is perceived to persist for a while after the source has stopped. This persistence of sound due to multiple reflections of sound waves is called reverberation.
When the recognition scene of voiceprint recognition is an indoor scene, the voice of the speaker to be verified produces reverberation in the room; as part of the interference factors, this reverberation affects the voiceprint recognition rate. Therefore, in some embodiments, reverberation estimation is performed on the registered voice based on the recognition scene; that is, the reverberation of the registered voice in the recognition scene is simulated, and, based on this simulation, the reverberation component that the registered voice would produce in the recognition scene is added to the registered voice. In this way, the non-speech components of the voice to be verified and of the enhanced registered voice are made as close as possible, thereby improving the voiceprint recognition rate and the robustness of the voiceprint recognition method.
Optionally, the reverberation that the registered voice would produce in the recognition scene is estimated based on the image source model (ISM) method. The image source model method can simulate the reflection paths of sound waves in a room and calculate the room impulse response (RIR) of the room's sound field from the delay and attenuation parameters of the sound waves. After the room impulse response is obtained, the audio signal of the registered voice is convolved with the impulse response to obtain the reverberation that the registered voice would produce in the room.
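The final convolution step can be sketched directly; the ISM simulation that produces the RIR is not reproduced here, and the short example RIR below is an assumption for illustration only:

```python
def convolve(signal, rir):
    """Naive time-domain convolution of the registered voice with a
    room impulse response (RIR), yielding the reverberant signal."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out
```

In practice an FFT-based convolution would be used for efficiency; the direct form above only shows the operation.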
In addition, in some cases, for example, when voice-controlling an intelligent robot or a smart home, the distance between the speaker to be verified and the microphone may be large (for example, more than 1 m), so that the voice of the speaker to be verified is attenuated by the time it reaches the microphone. Therefore, in some embodiments, to account for the distance between the voice to be verified and the microphone, far-field simulation is also performed on the registered voice when estimating its reverberation by the image source model method. That is, when calculating the room impulse response according to the image source model method, the distance between the registered voice and the voice receiving device in the simulated sound field is set according to the distance between the speaker to be verified and the microphone. In this way, the collection distance of the registered voice is simulated to be the same as that of the voice to be verified, which further reduces the differences, other than the valid speech components, between the voice to be verified and the enhanced registered voice, and improves the voiceprint recognition rate and the robustness of the voiceprint recognition method.
As another example, in some embodiments, before the voice to be verified is compared with the enhanced voice (that is, before step S50), front-end processing is also performed on the voice to be verified, for example, echo cancellation, dereverberation, active noise reduction, dynamic gain, directional sound pickup, and the like. To reduce the differences, other than the valid speech components, between the voice to be verified and the enhanced registered voice, the same front-end processing is performed on the enhanced registered voice as on the voice to be verified (that is, the voice to be verified and the enhanced registered voice pass through the same front-end processing algorithm module), which further improves the voiceprint recognition rate and the robustness of the voiceprint recognition method.
As yet another example, in some embodiments, the feature parameter extraction step (that is, step S150) may be omitted, and the speech signal may be recognized directly by a deep neural network model.
[Embodiment 2]
Referring to FIG. 4, this embodiment provides another speech enhancement method. Unlike Embodiment 1, in this embodiment, after the voice to be verified is collected, the scene in which it was collected is also recognized to obtain the scene type corresponding to the voice to be verified. Then, in addition to determining enhanced registered voices according to the noise contained in the voice to be verified, enhanced registered voices are also determined according to that scene type. Specifically, the speech enhancement method performed by the mobile phone 100 according to this embodiment includes the following steps:
S210: Collect the registered voice. Here, the registered voice is recorded by the owner of the mobile phone 100 in a quiet environment, so that it contains no obvious noise component.
S220: Collect the voice to be verified. Here, the voice to be verified is recorded by the current user of the mobile phone in a scene with noisy human voices; in other words, the user can unlock the phone screen by means of voiceprint recognition in this scene. The current user of the mobile phone is the person currently operating the mobile phone 100, who may be the owner or someone other than the owner.
S230: Determine the noise contained in the voice to be verified. In this embodiment, the noise contained in the voice to be verified is the sound generated, in the recognition scene, by sound sources other than the current user of the mobile phone 100.
S240: Superimpose the noise contained in the voice to be verified on the registered voices to obtain enhanced registered voices. In this embodiment, the signal values of the noise signal are added to the signal values of the registered voice signal in the time domain to obtain the enhanced registered voices.
In this embodiment, steps S210-S240 are substantially the same as steps S110-S140 in Embodiment 1, and their details are not repeated. The number of registered voices is the same as in Embodiment 1, that is, 6; therefore, in step S240, the noise contained in the voice to be verified is superimposed on each of the 6 registered voices to obtain 6 enhanced registered voices.
S250: Determine the scene type corresponding to the voice to be verified. Specifically, after the voice to be verified is collected, the scene type corresponding to it is recognized by a speech recognition algorithm, for example, a GMM method or a DNN method. In the recognition algorithm, the label values of the scene type may include a home scene, an in-vehicle scene, a noisy outdoor scene, a conference scene, a cinema scene, and so on.
S260: Superimpose template noise on the registered voices. The template noise is noise corresponding to the scene type determined in step S250; for example, it is noise recorded in the scene determined in step S250. Each scene type may correspond to multiple groups of template noise. In this embodiment, it is assumed that the scene type determined in step S250 for the voice to be verified is a home scene, and that 3 groups of template noise have been recorded for the home scene (for example, sound produced by home audio/video equipment, background speech of family members in conversation, and/or noise produced by household appliances).
Then, each of the 3 groups of template noise is superimposed on each of the 6 registered voices, forming 3 × 6 = 18 enhanced registered voices. Together with the 6 enhanced registered voices formed in step S240, a total of 24 enhanced registered voices are formed in this embodiment.
S270: Extract the feature parameters of the voice to be verified and of the enhanced registered voices; refer to step S150 in Embodiment 1. It can be understood that, in this embodiment, the feature parameters of each of the 24 enhanced registered voices are extracted.
S280: Perform parameter recognition on the feature parameters of the voice to be verified and of the enhanced registered voices, to obtain the voice template of the current user of the mobile phone 100 and the voice templates of the owner of the mobile phone 100, respectively; refer to S160 in Embodiment 1. It can be understood that, in this embodiment, 24 owner voice templates are obtained, denoted B1, B2, ..., B24.
S290: Match the voice templates of the owner of the mobile phone 100 against the voice template of the current user of the mobile phone 100 to obtain a recognition result; refer to step S170 in Embodiment 1. It can be understood that, in this embodiment, the cosine similarities between the 24 owner voice templates and the voice template of the current user of the mobile phone 100 are cosθ1, cosθ2, ..., cosθ24, respectively. Averaging the 24 cosine similarities gives the similarity between the current user's voice and the owner's voice: P = (cosθ1 + cosθ2 + ... + cosθ24)/24.
If the similarity P between the current user's voice and the owner's voice is greater than a set value (for example, 0.8), the current user of the mobile phone 100 is determined to be the owner, and the mobile phone 100 unlocks the screen; otherwise, the current user is determined not to be the owner, and the mobile phone 100 does not unlock the screen.
It can be understood that the above is merely an exemplary description of the technical solution of this application; on this basis, those skilled in the art can make other variations. For example, steps S230 and S240 may be omitted, that is, the step of enhancing the registered voices according to the noise contained in the voice to be verified is omitted, and the registered voices are enhanced only according to the template noise corresponding to the recognition scene. In this case, there are 18 enhanced registered voices, whose corresponding owner voice templates are B7, B8, ..., B24; correspondingly, the similarity between the voice of the current user of the mobile phone 100 and the owner's voice is P = (cosθ7 + cosθ8 + ... + cosθ24)/18.
In addition, for technical details not mentioned in this embodiment, for example, where the voiceprint recognition algorithm is implemented (locally on the mobile phone 100 or in the cloud) and other processing of the voice (for example, reverberation estimation, far-field simulation, and front-end processing), refer to the description in Embodiment 1; they are not repeated here.
Herein, the scene type corresponding to the voice to be verified, the distance between the speaker to be verified and the microphone, and the like are all environmental feature parameters of the voice to be verified.
[Embodiment 3]
On the basis of Embodiment 1, this embodiment changes the application scenario of the speech enhancement method. Specifically, the speech enhancement method of this embodiment is applied to the scenario of controlling a smart speaker 200 shown in FIG. 5. The smart speaker 200 has a speech recognition function, and the user can interact with the smart speaker 200 by voice to perform functions such as song requests, weather queries, schedule management, and smart home control.
In this embodiment, when the user issues a voice command to make the smart speaker 200 perform an operation (for example, playing the day's schedule, playing songs from a specific playlist, or controlling a smart home), the smart speaker authenticates the identity of the user based on the voiceprint recognition method, to determine whether the current user is the owner of the smart speaker 200 and thus whether the current user has the authority to make the smart speaker 200 perform that operation.
Specifically, the speech enhancement method of this embodiment includes:
S310: Collect the registered voice. In this embodiment, the registered voice of the owner of the smart speaker 200 is collected through the microphone of the smart speaker 200, but this application is not limited thereto; in other embodiments, the registered voice may also be collected through a mobile phone, a dedicated microphone, or the like. After the registered voice is collected, it can be stored locally on the smart speaker 200, so that the smart speaker 200 recognizes the user's voiceprint and offline recognition is realized; or it can be uploaded to the cloud, so that the user's voiceprint is recognized using cloud computing resources, saving local computing resources of the smart speaker 200.
S320: Collect the voice to be verified. In this embodiment, the voice to be verified is collected through the microphone of the smart speaker 200. For its acquisition parameters (for example, duration and text content), refer to the description in Embodiment 1; they are not repeated here.
S330: Determine the noise contained in the voice to be verified. In this embodiment, the voice to be verified is divided into a plurality of speech frames, and the energy of each frame is calculated. Because the energy of noise is generally lower than that of valid speech, a frame whose energy falls below a predetermined value can be classified as a noise frame, which simplifies the noise-extraction process.
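The energy thresholding of S330 can be sketched in a few lines of NumPy. The frame length, hop size, and threshold ratio below are illustrative choices, not values specified by the patent:

```python
import numpy as np

def split_noise_frames(signal, frame_len=400, hop=160, ratio=0.1):
    """Frame a waveform and flag low-energy frames as noise (S330 sketch).

    Frames whose energy falls below `ratio` times the mean frame energy
    are treated as noise frames. All parameter values are illustrative.
    Assumes len(signal) >= frame_len.
    """
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    energy = (frames ** 2).sum(axis=1)          # per-frame energy
    noise_mask = energy < ratio * energy.mean()  # low energy -> noise frame
    return frames, noise_mask
```

The frames flagged by `noise_mask` would then be concatenated to form the noise estimate used in the next step.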
S340: Superimpose the noise contained in the voice to be verified onto the registration voice to obtain an enhanced registration voice. In this embodiment, the sample values of the noise signal are added to the sample values of the registration-voice signal in the time domain to obtain the enhanced registration voice.
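The time-domain superposition of S340 is a sample-wise addition. Tiling the noise up to the enrolment length is an assumption for the case where the two signals differ in length; the patent does not specify how that case is handled:

```python
import numpy as np

def augment_enrollment(enroll, noise):
    """Add verification-time noise onto the enrolment signal (S340 sketch).

    The noise is repeated/truncated to the enrolment length (an assumption)
    before the sample-wise addition in the time domain.
    """
    reps = int(np.ceil(len(enroll) / len(noise)))
    noise_full = np.tile(noise, reps)[: len(enroll)]
    return enroll + noise_full
```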
S350: Extract the feature parameters of the voice to be verified and the feature parameters of the enhanced registration voice, for example with the MFCC method.
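A production system would typically call a library routine (e.g. librosa's MFCC) for S350. As a self-contained illustration, a minimal NumPy MFCC pipeline — framing, power spectrum, triangular mel filterbank, log, DCT — looks roughly like this; all sizes are illustrative defaults, not values from the patent:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch; assumes len(signal) >= n_fft."""
    # frame the signal and apply a Hann window
    n = 1 + max(0, (len(signal) - n_fft) // hop)
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank: filter centers equally spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for j in range(n_mels):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2.0 * n_mels)))
    return logmel @ dct.T  # shape: (num_frames, n_ceps)
```

The same routine would be applied to both the voice to be verified and the enhanced registration voice, so that the two feature streams are directly comparable.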
S360: Perform parameter recognition on the feature parameters of the voice to be verified and the feature parameters of the enhanced registration voice to obtain, respectively, a voice template of the current user of the smart speaker 200 and a voice template of the owner of the smart speaker 200. This embodiment does not limit the recognition model: it may be a probabilistic model, such as an identity-vector (i-vector) model, or a deep neural network model, such as a Time-Delay Neural Network (TDNN) model or a ResNet model.
S370: Match the voice template of the owner of the smart speaker 200 against the voice template of the current user of the smart speaker 200 to obtain a recognition result. In this embodiment, the template-matching method may be the cosine-distance method, linear discriminant analysis, probabilistic linear discriminant analysis, or the like. If the similarity between the current user's voice and the owner's voice exceeds a set value, the current user of the smart speaker 200 is judged to be the owner, and the smart speaker 200 performs the corresponding operation in response to the user's voice command; otherwise, the current user is judged not to be the owner, and the smart speaker 200 ignores the voice command.
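The cosine-distance matching named in S370 reduces to a normalized dot product between the two templates. The acceptance threshold below is illustrative; the patent does not specify a value:

```python
import numpy as np

def cosine_match(template_user, template_owner, threshold=0.7):
    """Cosine-similarity template matching (S370 sketch).

    Returns (accept, score): accept is True when the current user's template
    is close enough to the owner's. The 0.7 threshold is an assumption.
    """
    a = np.asarray(template_user, dtype=float)
    b = np.asarray(template_owner, dtype=float)
    score = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return score >= threshold, score
```

With this decision rule, the speaker would execute the command when `accept` is True and ignore it otherwise.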
It should be noted that, apart from the application scenario, the speech enhancement method of this embodiment is substantially the same as that of Embodiment 1; for technical details not described in this embodiment, refer to the description of Embodiment 1.
As in Embodiment 1, the voiceprint recognition algorithm (the algorithm corresponding to steps S330 to S370) can be implemented on the smart speaker 200 for offline voiceprint recognition, or in the cloud to save the local computing resources of the smart speaker 200. When the voiceprint recognition algorithm is implemented in the cloud, the smart speaker 200 uploads the voice to be verified collected in step S320 to a cloud server; the cloud server authenticates the identity of the current user of the smart speaker 200 with the voiceprint recognition algorithm and returns the authentication result to the smart speaker 200, which then decides according to that result whether to execute the user's voice command.
In addition, those skilled in the art may also apply the speech enhancement method of Embodiment 2 to the smart-speaker control scenario shown in FIG. 5; details are not repeated here.
Referring now to FIG. 6, a block diagram of an electronic device 400 according to one embodiment of the present application is shown. The electronic device 400 may include one or more processors 401 coupled to a controller hub 403. In at least one embodiment, the controller hub 403 communicates with the processors 401 via a multidrop bus such as a Front Side Bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 406. The processors 401 execute instructions that control general types of data-processing operations. In one embodiment, the controller hub 403 includes, but is not limited to, a Graphics & Memory Controller Hub (GMCH) (not shown) and an Input/Output Hub (IOH) (which may be on a separate chip; not shown), where the GMCH includes memory and graphics controllers and is coupled to the IOH.
The electronic device 400 may also include a coprocessor 402 and a memory 404 coupled to the controller hub 403. Alternatively, one or both of the memory and the GMCH may be integrated within the processor (as described in this application), with the memory 404 and the coprocessor 402 coupled directly to the processor 401 and to the controller hub 403, and the controller hub 403 and the IOH residing in a single chip.
The memory 404 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two. The memory 404 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions; in particular, the computer-readable storage medium stores temporary and permanent copies of the instructions. The instructions may include instructions that, when executed by at least one of the processors, cause the electronic device 400 to implement the speech enhancement method described with reference to FIG. 3 and FIG. 4. When the instructions run on a computer, they cause the computer to execute the method disclosed in Embodiment 1 and/or Embodiment 2.
In one embodiment, the coprocessor 402 is a special-purpose processor, such as, for example, a high-throughput Many Integrated Core (MIC) processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose computing on graphics processing units), or an embedded processor. The optional nature of the coprocessor 402 is indicated in FIG. 6 by dashed lines.
In one embodiment, the electronic device 400 may further include a Network Interface Controller (NIC) 406. The network interface 406 may include a transceiver that provides a radio interface for the electronic device 400 to communicate with any other suitable devices (such as front-end modules, antennas, and the like). In various embodiments, the network interface 406 may be integrated with other components of the electronic device 400. The network interface 406 can implement the functions of the communication unit in the above embodiments.
The electronic device 400 may further include an input/output (I/O) device 405. The I/O device 405 may include: a user interface designed to enable a user to interact with the electronic device 400; a peripheral-component interface designed to enable peripheral components to interact with the electronic device 400; and/or sensors designed to determine environmental conditions and/or location information associated with the electronic device 400.
It is worth noting that FIG. 6 is merely exemplary. That is, although FIG. 6 shows the electronic device 400 as including multiple components such as the processor 401, the controller hub 403, and the memory 404, in practical applications a device using the methods of the present application may include only some of these components, for example only the processor 401 and the network interface 406. The optional nature of components in FIG. 6 is shown with dashed lines.
Referring now to FIG. 7, a block diagram of a System on Chip (SoC) 500 according to an embodiment of the present application is shown. In FIG. 7, similar components bear the same reference numerals, and the dashed boxes are optional features of more advanced SoCs. In FIG. 7, the SoC 500 includes: an interconnect unit 550 coupled to a processor 510; a system agent unit 580; a bus controller unit 590; an integrated memory controller unit 540; a set of one or more coprocessors 520, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random-Access Memory (SRAM) unit 530; and a Direct Memory Access (DMA) unit 560. In one embodiment, the coprocessor 520 includes a special-purpose processor, such as a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, or an embedded processor.
The Static Random-Access Memory (SRAM) unit 530 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions; in particular, the computer-readable storage medium stores temporary and permanent copies of the instructions. The instructions may include instructions that, when executed by at least one of the processors, cause the SoC to implement the speech enhancement method described with reference to FIG. 3 and FIG. 4. When the instructions run on a computer, they cause the computer to execute the method disclosed in Embodiment 1 and/or Embodiment 2.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone.
The method implementations of the present application may be implemented in software, magnetic components, firmware, and the like.
Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application-Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly or machine language if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language; in any case, the language may be a compiled or an interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium. The instructions represent various logic in the processor and, when read by a machine, cause the machine to fabricate logic that performs the techniques described herein. Such representations, known as "IP (Intellectual Property) cores", may be stored on a tangible computer-readable storage medium and supplied to various customers or production facilities to be loaded into the fabrication machines that actually manufacture the logic or processor.
In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may transform (for example, using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof, and may be on-processor, off-processor, or partly on and partly off the processor.

Claims (26)

  1. A speech enhancement method, applied to an electronic device, comprising:
    collecting a voice to be verified;
    determining environmental noise and/or environmental characteristic parameters contained in the voice to be verified;
    enhancing a registered voice based on the environmental noise and/or the environmental characteristic parameters; and
    comparing the voice to be verified with the enhanced registered voice, and determining that the voice to be verified and the registered voice are from the same user.
  2. The method according to claim 1, wherein enhancing the registered voice based on the environmental noise comprises: superimposing the environmental noise on the registered voice.
  3. The method according to claim 1, wherein the environmental noise is sound picked up by a secondary microphone of the electronic device.
  4. The method according to claim 1, wherein the duration of the voice to be verified is shorter than the duration of the registered voice.
  5. The method according to claim 1, wherein the environmental characteristic parameters comprise a scene type corresponding to the voice to be verified; and
    enhancing the registered voice based on the environmental characteristic parameters comprises: determining, based on the scene type corresponding to the voice to be verified, template noise corresponding to the scene type, and superimposing the template noise on the registered voice.
  6. The method according to claim 5, wherein the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm.
  7. The method according to claim 6, wherein the scene recognition algorithm is either of the following: a GMM algorithm; a DNN algorithm.
  8. The method according to claim 7, wherein the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a venue scene; a cinema scene.
  9. The method according to claim 1, wherein the voice to be verified and the enhanced registered voice are voices processed by the same front-end processing algorithm.
  10. The method according to claim 9, wherein the front-end processing algorithm comprises at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
  11. The method according to claim 1, wherein there are a plurality of registered voices, and the plurality of registered voices are separately enhanced based on the environmental noise and/or the environmental characteristic parameters to obtain a plurality of enhanced registered voices.
  12. The method according to claim 1, wherein comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user comprises:
    extracting feature parameters of the voice to be verified and feature parameters of the enhanced registered voice with a feature-parameter extraction algorithm;
    performing parameter recognition on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice with a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and
    matching the voice template of the speaker to be verified against the voice template of the registered speaker with a template matching algorithm, and determining according to the matching result that the voice to be verified and the registered voice are from the same user.
  13. The method according to claim 12, wherein:
    the feature-parameter extraction algorithm is the MFCC algorithm, the log-mel algorithm, or the LPCC algorithm; and/or
    the parameter recognition model is an identity-vector model, a time-delay neural network model, or a ResNet model; and/or
    the template matching algorithm is the cosine-distance method, linear discriminant analysis, or probabilistic linear discriminant analysis.
  14. A speech enhancement system, comprising a terminal device and a server communicatively connected to the terminal device, wherein:
    the terminal device is configured to collect a voice to be verified and send the voice to be verified to the server;
    the server is configured to determine environmental noise and/or environmental characteristic parameters contained in the voice to be verified, enhance a registered voice based on the environmental noise and/or the environmental characteristic parameters, and compare the voice to be verified with the enhanced registered voice to determine that the voice to be verified and the registered voice are from the same user; and
    the server is further configured to send, to the terminal device, the determination result that the voice to be verified and the registered voice are from the same user.
  15. The system according to claim 14, wherein enhancing the registered voice based on the environmental noise comprises: superimposing the environmental noise on the registered voice.
  16. The system according to claim 14, wherein the environmental noise is sound picked up by a secondary microphone of the terminal device.
  17. The system according to claim 14, wherein the duration of the voice to be verified is shorter than the duration of the registered voice.
  18. The system according to claim 14, wherein the environmental characteristic parameters comprise a scene type corresponding to the voice to be verified; and
    enhancing the registered voice based on the environmental characteristic parameters comprises: determining, based on the scene type corresponding to the voice to be verified, template noise corresponding to the scene type, and superimposing the template noise on the registered voice.
  19. The system according to claim 18, wherein the scene type corresponding to the voice to be verified is determined by recognizing the voice to be verified with a scene recognition algorithm.
  20. The system according to claim 18, wherein the scene type of the voice to be verified is any one of the following: a home scene; an in-vehicle scene; a noisy outdoor scene; a venue scene; a cinema scene.
  21. The system according to claim 14, wherein the voice to be verified and the enhanced registered voice are voices processed by the same front-end processing algorithm.
  22. The system according to claim 21, wherein the front-end processing algorithm comprises at least one of the following: echo cancellation; de-reverberation; active noise reduction; dynamic gain; directional sound pickup.
  23. The system according to claim 14, wherein there are a plurality of registered voices, and the server separately enhances the plurality of registered voices based on the environmental noise and/or the environmental characteristic parameters to obtain a plurality of enhanced registered voices.
  24. The system according to claim 14, wherein comparing the voice to be verified with the enhanced registered voice and determining that the voice to be verified and the registered voice are from the same user comprises:
    extracting feature parameters of the voice to be verified and feature parameters of the enhanced registered voice with a feature-parameter extraction algorithm;
    performing parameter recognition on the feature parameters of the voice to be verified and the feature parameters of the enhanced registered voice with a parameter recognition model, to obtain a voice template of the speaker to be verified and a voice template of the registered speaker, respectively; and
    matching the voice template of the speaker to be verified against the voice template of the registered speaker with a template matching algorithm, and determining according to the matching result that the voice to be verified and the registered voice are from the same user.
  25. An electronic device, comprising:
    a memory configured to store instructions to be executed by one or more processors of the electronic device; and
    a processor which, when executing the instructions in the memory, causes the electronic device to perform the speech enhancement method according to any one of claims 1 to 13.
  26. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions which, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 13.
PCT/CN2021/105003 2020-07-08 2021-07-07 Speech enhancement method, device, system, and storage medium WO2022007846A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010650893.XA CN113921013A (en) 2020-07-08 2020-07-08 Speech enhancement method, apparatus, system, and storage medium
CN202010650893.X 2020-07-08

Publications (1)

Publication Number Publication Date
WO2022007846A1 true WO2022007846A1 (en) 2022-01-13

Family

ID=79231704


Country Status (2)

Country Link
CN (1) CN113921013A (en)
WO (1) WO2022007846A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117268796A (en) * 2023-11-16 2023-12-22 天津大学 Vehicle fault acoustic event detection method
CN117725187A (en) * 2024-02-08 2024-03-19 人和数智科技有限公司 Question-answering system suitable for social assistance
CN117725187B (en) * 2024-02-08 2024-04-30 人和数智科技有限公司 Question-answering system suitable for social assistance

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101051463A (en) * 2006-04-06 2007-10-10 株式会社东芝 Verification method and device identified by speaking person
WO2010049695A1 (en) * 2008-10-29 2010-05-06 British Telecommunications Public Limited Company Speaker verification
CN106384588A (en) * 2016-09-08 2017-02-08 河海大学 Additive noise and short time reverberation combined compensation method based on vector Taylor series
CN108022591A (en) * 2017-12-30 2018-05-11 北京百度网讯科技有限公司 The processing method of speech recognition, device and electronic equipment in environment inside car
CN108257606A (en) * 2018-01-15 2018-07-06 江南大学 A kind of robust speech personal identification method based on the combination of self-adaptive parallel model



Also Published As

Publication number Publication date
CN113921013A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN111091828B (en) Voice wake-up method, device and system
WO2021143599A1 (en) Scene recognition-based speech processing method and apparatus, medium and system
CN107240405B (en) Sound box and alarm method
WO2014117722A1 (en) Speech processing method, device and terminal apparatus
WO2021013255A1 (en) Voiceprint recognition method and apparatus
CN114141230A (en) Electronic device, and voice recognition method and medium thereof
CN115482830B (en) Voice enhancement method and related equipment
WO2022007846A1 (en) Speech enhancement method, device, system, and storage medium
CN113830026A (en) Equipment control method and computer readable storage medium
CN114067782A (en) Audio recognition method and device, medium and chip system thereof
CN113539290B (en) Voice noise reduction method and device
WO2022199405A1 (en) Voice control method and apparatus
WO2021031811A1 (en) Method and device for voice enhancement
CN113611318A (en) Audio data enhancement method and related equipment
WO2023124248A1 (en) Voiceprint recognition method and apparatus
CN115312068B (en) Voice control method, equipment and storage medium
CN116386623A (en) Voice interaction method of intelligent equipment, storage medium and electronic device
US11783809B2 (en) User voice activity detection using dynamic classifier
CN109922397A (en) Audio intelligent processing method, storage medium, intelligent terminal and smart bluetooth earphone
WO2022052691A1 (en) Multi-device voice processing method, medium, electronic device, and system
CN115116458A (en) Voice data conversion method and device, computer equipment and storage medium
CN115424628B (en) Voice processing method and electronic equipment
CN115331672B (en) Device control method, device, electronic device and storage medium
CN114093380B (en) Voice enhancement method, electronic equipment, chip system and readable storage medium

Legal Events

Date Code Title Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (ref document number: 21837111; country of ref document: EP; kind code of ref document: A1)
NENP  Non-entry into the national phase (ref country code: DE)
122   Ep: pct application non-entry in european phase (ref document number: 21837111; country of ref document: EP; kind code of ref document: A1)