CN110021307B - Audio verification method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110021307B
Authority
CN
China
Prior art keywords
audio, processor, sub, audio signals, audio signal
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN201910273077.9A
Other languages
Chinese (zh)
Other versions
CN110021307A (en
Inventor
陈岩 (Chen Yan)
Current Assignee (listed assignees may be inaccurate)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910273077.9A
Publication of CN110021307A (application publication)
Application granted; publication of CN110021307B (granted publication)
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225: Feedback of the input speech
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165: Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Embodiments of the application disclose an audio verification method and apparatus, a storage medium, and an electronic device. The electronic device includes a processor, a dedicated voice recognition chip, and two microphones, where the dedicated voice recognition chip consumes less power than the processor. While the processor is in a sleep state, the low-power dedicated voice recognition chip verifies an external audio signal; if the verification passes, the processor is woken. The processor then denoises two external audio signals to obtain a denoised audio signal, and verifies the denoised audio signal to obtain the corresponding verification result. Interference from external noise can thus be eliminated, and the audio signal can be verified more accurately.

Description

Audio verification method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of speech processing, and in particular to an audio verification method and apparatus, a storage medium, and an electronic device.
Background
At present, audio verification allows a user to control an electronic device by speaking a voice command when direct manipulation of the device is inconvenient. However, real usage environments often contain various noises, making it difficult for the electronic device to verify an input audio signal accurately.
Disclosure of Invention
The embodiments of the application provide an audio verification method and apparatus, a storage medium, and an electronic device, which can improve the accuracy with which the electronic device verifies an audio signal.
In a first aspect, an embodiment of the present application provides an audio verification method applied to an electronic device, where the electronic device includes a processor, a dedicated voice recognition chip, and two microphones, and the power consumption of the dedicated voice recognition chip is less than that of the processor. The audio verification method includes:
when the processor is in a sleep state, acquiring an external audio signal through either microphone and providing the audio signal to the dedicated voice recognition chip;
verifying the audio signal through the dedicated voice recognition chip, waking the processor when the verification passes, and putting the dedicated voice recognition chip to sleep after the processor is woken;
acquiring two external audio signals through the two microphones and providing the two audio signals to the processor; and
denoising the two audio signals through the processor to obtain a denoised audio signal, and verifying the denoised audio signal to obtain a verification result.
In a second aspect, an embodiment of the present application provides an audio verification apparatus applied to an electronic device, where the electronic device includes a processor, a dedicated voice recognition chip, and two microphones. The audio verification apparatus includes:
a first acquisition module, configured to acquire an external audio signal through either microphone when the processor is in a sleep state and provide the audio signal to the dedicated voice recognition chip;
a first verification module, configured to verify the audio signal through the dedicated voice recognition chip, wake the processor when the verification passes, and put the dedicated voice recognition chip to sleep after the processor is woken;
a second acquisition module, configured to acquire two external audio signals through the two microphones and provide the two audio signals to the processor; and
a second verification module, configured to denoise the two audio signals through the processor to obtain a denoised audio signal, and verify the denoised audio signal to obtain a verification result.
In a third aspect, the present application provides a storage medium storing a computer program which, when run on an electronic device including a processor, a dedicated voice recognition chip, and two microphones, causes the electronic device to perform the steps of the audio verification method provided by the present application.
In a fourth aspect, an embodiment of the present application further provides an electronic device including an audio acquisition unit, a processor, a dedicated voice recognition chip, two microphones, and a screen, where the power consumption of the dedicated voice recognition chip is less than that of the processor, and where:
the audio acquisition unit is configured to acquire an external audio signal through either microphone when the processor is in a sleep state and provide the audio signal to the dedicated voice recognition chip;
the dedicated voice recognition chip is configured to verify the audio signal, wake the processor when the verification passes, and sleep after waking the processor;
the audio acquisition unit is further configured to acquire two external audio signals through the two microphones after the processor is woken and provide the two audio signals to the processor; and
the processor is configured to denoise the two audio signals to obtain a denoised audio signal, and verify the denoised audio signal to obtain a verification result.
In the embodiments of the application, the electronic device includes a processor, a dedicated voice recognition chip, and two microphones, where the dedicated voice recognition chip consumes less power than the processor. While the processor is in a sleep state, the low-power dedicated voice recognition chip verifies an external audio signal; if the verification passes, the processor is woken. The processor then denoises two external audio signals to obtain a denoised audio signal, and verifies the denoised audio signal to obtain the corresponding verification result. Interference from external noise can thus be eliminated, and the audio signal can be verified more accurately.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flow chart of an audio verification method according to an embodiment of the present application.
Fig. 2 is a schematic diagram of the arrangement positions of two microphones in the embodiment of the present application.
Fig. 3 is a schematic diagram of noise suppression according to two audio signals collected by two microphones in the embodiment of the present application.
Fig. 4 is a schematic flow chart of training a voiceprint feature extraction model in the embodiment of the present application.
Fig. 5 is a schematic diagram of a spectrogram extracted in the embodiment of the present application.
Fig. 6 is another schematic flowchart of an audio verification method according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an audio verification apparatus according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the application and should not be taken as limiting the application with respect to other embodiments that are not detailed herein.
The embodiment of the present application first provides an audio verification method, whose execution subject may be the electronic device provided in the embodiment of the present application. The electronic device includes a processor, a dedicated voice recognition chip, and two microphones, where the power consumption of the dedicated voice recognition chip is less than that of the processor. The electronic device may be any device with processing capability that is configured with a processor, such as a smartphone, tablet computer, palmtop computer, notebook computer, or desktop computer.
Referring to fig. 1, fig. 1 is a schematic flowchart of an audio verification method according to an embodiment of the present disclosure. The audio verification method is applied to the electronic device provided by the present application, where the electronic device includes a processor, a dedicated speech recognition chip and two microphones, as shown in fig. 1, a flow of the audio verification method provided by the embodiment of the present application may be as follows:
in 101, when the processor is in a sleep state, an external audio signal is acquired through any microphone and provided to the dedicated voice chip.
It should be noted that the dedicated voice recognition chip in the embodiment of the present application is a chip designed specifically for speech recognition, such as a digital signal processing chip or an application-specific integrated circuit chip designed for speech, which has lower power consumption than a general-purpose processor. The dedicated voice recognition chip, the processor, and the audio acquisition unit are communicatively connected in pairs through a communication bus (such as an I2C bus) for data exchange.
In the embodiment of the application, the processor is in a sleep state while the screen of the electronic device is off, and the dedicated voice recognition chip is in a sleep state while the screen is on. In addition, the two microphones included in the electronic device may be built-in microphones or external microphones (wired or wireless).
When the processor is in a sleep state (and the dedicated voice recognition chip is awake), the electronic device collects external sound through either microphone. If the microphone is an analog microphone, an analog audio signal is collected, which must first undergo analog-to-digital conversion to obtain a digitized audio signal for subsequent processing. For example, after the external analog audio signal is collected by the microphone, the electronic device may sample it at a sampling frequency of 16 kHz to obtain a digital audio signal.
It will be appreciated by those skilled in the art that if the microphone included in the electronic device is a digital microphone, the digitized audio signal will be directly acquired without analog-to-digital conversion.
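The digitization step described above can be sketched as follows. This is a toy illustration, not the device's actual conversion path: it simply clips a float signal (as might follow 16 kHz sampling of an analog microphone's output) and quantizes it to 16-bit integers.

```python
import numpy as np

def digitize(analog_samples):
    # Clip the analog signal to [-1, 1] and quantize to signed 16-bit PCM,
    # the usual digital representation after analog-to-digital conversion.
    clipped = np.clip(np.asarray(analog_samples, dtype=np.float64), -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)
```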
After the external audio signal is collected, the electronic device provides the collected audio signal to the dedicated voice recognition chip.
In 102, the audio signal is verified through the dedicated voice recognition chip; the processor is woken when the verification passes, and the dedicated voice recognition chip is put to sleep after the processor is woken.
In the embodiment of the application, after providing the collected external audio signal to the dedicated voice recognition chip, the electronic device verifies the audio signal through a first verification algorithm running on the dedicated voice recognition chip to obtain a verification result. Verification includes, but is not limited to, verifying the text feature and/or the voiceprint feature of the audio signal.
In plain terms, verifying the text feature of the audio signal means determining whether the audio signal contains a preset wake-up word; as long as it does, the text feature passes verification regardless of who speaks the wake-up word. For example, if the audio signal contains a preset wake-up word set by a preset user (e.g., the owner of the electronic device, or another user authorized by the owner to use it), but the wake-up word was spoken by user A rather than the preset user, the dedicated voice recognition chip still passes the verification when verifying only the text feature based on the first verification algorithm.
Verifying both the text feature and the voiceprint feature means determining whether the audio signal contains the preset wake-up word spoken by the preset user: if it does, the text and voiceprint features pass verification; otherwise, verification fails. For example, if the audio signal contains a preset wake-up word set by the preset user and the wake-up word was spoken by the preset user, the text and voiceprint features pass verification; conversely, if the wake-up word was spoken by a user other than the preset user, or the audio signal contains no preset wake-up word at all, the verification fails.
In the embodiment of the application, when the audio signal passes verification on the dedicated voice recognition chip, the electronic device sends a preset interrupt signal to the processor over the communication connection between the dedicated voice recognition chip and the processor to wake the processor. After waking the processor, the electronic device puts the dedicated voice recognition chip to sleep, and the screen is switched from the off-screen state to the on-screen state.
It should be noted that, if the audio signal fails verification, the electronic device continues to provide external audio signals acquired through either microphone to the dedicated voice recognition chip for verification until verification passes.
In 103, two external audio signals are acquired by two microphones and provided to a processor.
After waking the processor, the electronic device synchronously acquires two external audio signals of the same duration through the two microphones and provides the two acquired audio signals to the processor.
It will be appreciated by those skilled in the art from the foregoing description that the two audio signals provided to the processor are likewise digitized audio signals.
At 104, the two audio signals are denoised by the processor to obtain a denoised audio signal, and the denoised audio signal is verified to obtain a verification result.
In the embodiment of the application, after providing the two acquired audio signals to the processor, the electronic device denoises the two audio signals through a dual-microphone noise reduction algorithm run by the processor to obtain a denoised audio signal. The embodiment of the present application does not specifically limit which dual-microphone noise reduction algorithm is selected; a person of ordinary skill in the art may choose according to actual needs, including but not limited to a dual-microphone beamforming noise reduction algorithm, a dual-microphone blind source separation noise reduction algorithm, and the like.
After the two audio signals are denoised by the processor to obtain the denoised audio signal, the electronic device further verifies the denoised audio signal through a second verification algorithm run by the processor to obtain a verification result; verification includes but is not limited to verifying the text feature and/or the voiceprint feature of the denoised audio signal. When the denoised audio signal passes verification, the electronic device may further perform the operation corresponding to it, including but not limited to unlocking the screen, starting a voice assistant, and the like.
It should be noted that the first verification algorithm run by the dedicated voice recognition chip may be the same as or different from the second verification algorithm run by the processor; the embodiment of the present application does not specifically limit this.
As can be seen from the above, in the embodiment of the application, when the processor of the electronic device is in a sleep state, the low-power dedicated voice recognition chip verifies the external audio signal; if the verification passes, the processor is woken. The processor then denoises the two external audio signals to obtain a denoised audio signal, and verifies the denoised audio signal to obtain the corresponding verification result. Interference from external noise can thus be eliminated, and the audio signal can be verified more accurately.
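The two-stage flow of steps 101-104 can be sketched as follows. This is a hypothetical illustration, not a real device API: the dicts, the wake word "hi phone", and the `chip_verify`/`processor_verify` functions are stand-ins for the chip's speaker-independent text check and the processor's post-denoising text-plus-voiceprint check.

```python
def chip_verify(audio, wake_word="hi phone"):
    # First verification (dedicated chip): only the wake-word text is
    # checked, so any speaker saying the wake word passes this stage.
    return wake_word in audio["text"]

def processor_verify(denoised, wake_word="hi phone", owner="preset_user"):
    # Second verification (processor): wake-word text AND the preset
    # user's voiceprint must both match.
    return wake_word in denoised["text"] and denoised["speaker"] == owner

def audio_verification(audio, denoise):
    if not chip_verify(audio):        # step 102: chip rejects, CPU stays asleep
        return "asleep"
    denoised = denoise(audio)         # steps 103-104: dual-mic noise reduction
    return "pass" if processor_verify(denoised) else "fail"
```

Note the asymmetry this sketch captures: a stranger saying the wake word wakes the processor (first stage passes) but fails the second, voiceprint-bound stage.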
In one embodiment, "denoising the two audio signals through the processor to obtain a denoised audio signal" includes:
(1) vectorizing the two audio signals through the processor to obtain an audio vector;
(2) performing blind source separation on the audio vector through the processor to obtain a speech signal, and setting the speech signal as the denoised audio signal.
In this embodiment, the electronic device may reduce noise of the two audio signals by the processor in a blind source separation manner to obtain a noise reduction audio signal.
The electronic device first vectorizes the two audio signals through the processor to obtain an audio vector. For example, assuming the two acquired audio signals are x_1 and x_2, vectorizing them through the processor yields the audio vector

x = [x_1, x_2]^T
Assuming that the speech component in the audio vector x is s_1 and the noise component is s_2, the relationship between the speech component, the noise component, and the audio vector can be expressed as:

[s_1, s_2]^T = w x

where w denotes the separation coefficient (unmixing matrix) for blind source separation.
When performing blind source separation on the audio vector through the processor to obtain a noise signal and a speech signal, the electronic device first acquires a separation coefficient for blind source separation of the audio vector. Then, based on the acquired separation coefficient, it separates the audio vector through the processor into the noise signal (i.e., the noise component of the audio vector) and the speech signal (i.e., the speech component of the audio vector), and sets the separated speech signal as the denoised audio signal.
It should be noted that blind source separation of the audio vector through the processor yields two audio signals; due to the output-order ambiguity of blind source separation, the electronic device may identify which of the two separated signals is the speech signal and which is the noise signal through an endpoint detection algorithm run by the processor.
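A crude stand-in for that identification step can be sketched as follows. This is an illustrative heuristic under an assumption the text does not spell out (speech energy is bursty while stationary noise energy is steady); a real endpoint detector would be more elaborate.

```python
import numpy as np

def pick_speech_channel(y1, y2, frame_len=320):
    """Of the two blind-source-separation outputs, treat the one whose
    short-time energy fluctuates more as speech. Illustrative only."""
    def energy_fluctuation(x):
        n = len(x) // frame_len
        e = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                      for i in range(n)])
        # Normalize by mean energy so the measure is scale-invariant.
        return np.var(e / (np.mean(e) + 1e-12))
    if energy_fluctuation(y1) >= energy_fluctuation(y2):
        return y1, y2   # (speech, noise)
    return y2, y1
```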
In one embodiment, "performing blind source separation on the audio vector through the processor to obtain a speech signal" includes:
(1) framing the audio vector through the processor to obtain a plurality of audio frames;
(2) obtaining, through the processor, a separation coefficient for blind source separation of each audio frame;
(3) separating each audio frame through the processor by blind source separation based on its separation coefficient to obtain a sub-speech signal;
(4) combining the sub-speech signals of the audio frames through the processor to obtain the speech signal.
In the embodiment of the application, when performing blind source separation on the audio vector through the processor to obtain a noise signal and a speech signal, the electronic device first frames the audio vector through the processor to obtain a plurality of audio frames of equal length.
For example, when framing the audio vector through the processor, the electronic device splits the audio vector into frames of 20 milliseconds, the audio frames being denoted

x_m, m = 1, 2, ..., M

where M is the number of frames.
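The framing step can be sketched as follows, assuming the 16 kHz sampling rate and 20 ms frame length mentioned earlier (320 samples per frame); a trailing partial frame is simply dropped in this sketch.

```python
import numpy as np

def frame_audio_vector(x, fs=16000, frame_ms=20):
    """Split a (2, N) audio vector into equal-length frames of frame_ms
    milliseconds, as in the framing step above."""
    frame_len = fs * frame_ms // 1000          # 320 samples at 16 kHz, 20 ms
    num_frames = x.shape[1] // frame_len
    return [x[:, m * frame_len:(m + 1) * frame_len] for m in range(num_frames)]
```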
After framing the audio vector into a plurality of audio frames, the electronic device obtains, through the processor, a separation coefficient for blind source separation of each audio frame; the separation coefficient for x_m can be expressed as w_m.
After obtaining the separation coefficients, the electronic device separates each audio frame through the processor by blind source separation based on its separation coefficient, obtaining a sub-speech signal and a sub-noise signal, which can be expressed as:

[s_m^(1), s_m^(2)]^T = w_m x_m

where x_m denotes the m-th audio frame, w_m denotes the separation coefficient corresponding to the m-th audio frame, s_m^(1) denotes the sub-speech signal separated from the m-th audio frame, and s_m^(2) denotes the sub-noise signal separated from the m-th audio frame.
After blind source separation of each audio frame is completed, the electronic device combines, through the processor, the sub-speech signals of the audio frames in their time order to obtain the speech signal, and likewise combines the sub-noise signals of the audio frames to obtain the noise signal.
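The per-frame separation and recombination can be sketched as follows. The fixed row assignment (row 0 = speech) is a simplification introduced here for brevity; as the text notes, the actual speech/noise ordering is resolved per frame with endpoint detection.

```python
import numpy as np

def separate_and_combine(frames, coeffs):
    """Apply each frame's 2x2 separation coefficient w_m to the frame x_m
    and concatenate the per-frame outputs in time order, yielding the
    full-length speech and noise signals."""
    separated = [w_m @ x_m for w_m, x_m in zip(coeffs, frames)]
    full = np.concatenate(separated, axis=1)
    return full[0], full[1]   # (speech signal, noise signal)
```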
It should be noted that, the blind source separation of any audio frame by the processor will result in two sub-audio signals, and due to the uncertainty of the blind source separation output signal, the electronic device may identify the sub-speech signal and the sub-noise signal in the two separated sub-audio signals by an endpoint detection algorithm run by the processor.
In one embodiment, "obtaining, through the processor, a separation coefficient for blind source separation of each audio frame" includes:
(1) whitening the current audio frame through the processor;
(2) setting the separation coefficient corresponding to the previous audio frame as the initial separation coefficient of the current audio frame, and iterating, through the processor, the separation coefficient for blind source separation of the current audio frame based on the whitened current audio frame and the initial separation coefficient.
It should be noted that, in the embodiment of the present application, for each audio frame obtained by framing, the electronic device obtains, by the processor, the separation coefficient for blind source separation of each audio frame on a frame-by-frame basis. The current audio frame does not refer to a specific audio frame, but refers to an audio frame currently acquiring a corresponding separation coefficient, which may be any audio frame. For example, if the electronic device is currently acquiring the separation coefficient of the first frame of audio frame, the first frame of audio frame is the current audio frame.
When acquiring the separation coefficients for blind source separation of each audio frame through the processor, the electronic device may first whiten the current audio frame through the processor, thereby reducing the correlation between the components of the current audio frame. Assuming the current audio frame is the m-th frame, the whitening process can be expressed as

\tilde{x}_m = E D^{-1/2} E^T x_m

where \tilde{x}_m denotes the whitened m-th audio frame; E and D are obtained from the eigendecomposition V = E D E^T of the covariance matrix V corresponding to the m-th audio frame; D^{-1/2} denotes the inverse square root of D; T denotes matrix transposition; and x_m denotes the m-th audio frame.
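The whitening step can be sketched as follows, reconstructed from the usual ICA preprocessing formulation (eigendecompose the frame's covariance and apply E D^{-1/2} E^T); the mean-centering line is a standard detail assumed here rather than stated in the text.

```python
import numpy as np

def whiten_frame(x_m):
    """Whiten one (2, frame_len) audio frame so its components become
    decorrelated with unit variance."""
    centered = x_m - x_m.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]    # covariance V
    d, e = np.linalg.eigh(cov)                         # V = E D E^T
    return e @ np.diag(1.0 / np.sqrt(d)) @ e.T @ centered
```

After whitening, the frame's sample covariance is the identity matrix, which is what lets the subsequent fixed-point iteration work with a simple renormalization instead of a full decorrelation at every step.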
After whitening processing of the current audio frame is completed, the electronic device sets a separation coefficient corresponding to a previous audio frame as an initial separation coefficient of the current audio frame, and iterates through a processor a separation coefficient for blind source separation of the current audio frame based on the whitened current audio frame and the initial separation coefficient.
Each step of the iterative process can be expressed as:

\tilde{w}_{m,n} = E\{ \tilde{x}_m g(w_{m,n-1}^T \tilde{x}_m) \} - E\{ g'(w_{m,n-1}^T \tilde{x}_m) \} w_{m,n-1}

w_{m,n} = \tilde{w}_{m,n} / \| \tilde{w}_{m,n} \|

where n denotes the n-th iteration and takes values in [1, N], N being the total number of iterations; an empirical value may be chosen by a person of ordinary skill in the art according to actual needs (for example, N is set to 10 in the embodiment of the present application, i.e., ten iterations are performed). w_{m,n-1} denotes the separation coefficient of the m-th audio frame after n-1 iterations (when n = 1 it is the initial separation coefficient, i.e., the separation coefficient obtained by the previous audio frame converging through N iterations; thus when the m-th audio frame is the second frame, its initial separation coefficient is the result of the first frame's initial coefficient converging through N iterations). E{·} denotes taking the mean, g(u) = u exp(-a u^2 / 2) is a Gaussian nonlinearity in which a takes an empirical value, g'(u) denotes its first derivative, and w_{m,n} denotes the separation coefficient after the n-th iteration of the m-th audio frame.
After N iterations from the initial separation coefficient are completed, the converged separation coefficient for blind source separation of the m-th audio frame is obtained.
It should be noted that, when the m-th audio frame is the first frame, since there is no previous frame, the initial separation coefficient w_{1,0} is set to a preset initial vector.
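The warm-started iteration can be sketched as follows. This is a one-unit FastICA-style fixed-point update matching the reconstructed formulas above; the constant a = 1.0 and the warm-start convention are assumptions consistent with, but not fixed by, the text.

```python
import numpy as np

def iterate_separation_coefficient(z_m, w_init, n_iters=10, a=1.0):
    """Iterate the separation coefficient on a whitened frame z_m of shape
    (2, frame_len), starting from the previous frame's converged
    coefficient w_init, with g(u) = u * exp(-a*u^2/2)."""
    w = w_init / np.linalg.norm(w_init)
    for _ in range(n_iters):
        u = w @ z_m
        g = u * np.exp(-a * u ** 2 / 2)
        g_prime = (1.0 - a * u ** 2) * np.exp(-a * u ** 2 / 2)  # g'(u)
        w = (z_m * g).mean(axis=1) - g_prime.mean() * w          # fixed point
        w = w / np.linalg.norm(w)                                # renormalize
    return w
```

Warm-starting each frame from the previous frame's converged coefficient is what lets a small fixed iteration count (N = 10 here) suffice, since neighboring 20 ms frames have similar mixing conditions.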
In one embodiment, "denoising the two audio signals through the processor to obtain a denoised audio signal" includes:
(1) converting the current audio frame of each of the two audio signals from the time domain to the frequency domain through the processor, and extracting, in the frequency domain, the sub-audio signal from each frame's expected direction to obtain two sub-audio signals, the expected directions corresponding to the two current audio frames being opposite;
(2) dividing the two sub-audio signals into frequency bands through the processor, and beamforming the resulting sub-bands according to the corresponding beamforming filter coefficients to obtain a plurality of beamformed signals;
(3) obtaining through the processor, in each of the sub-bands, a gain factor for noise suppression of the corresponding beamformed signal, based on the corresponding beamforming filter coefficients and the respective autocorrelation coefficients of the two sub-audio signals;
(4) suppressing noise in the beamformed signals according to the gain factors through the processor, splicing the noise-suppressed sub-bands back together, and converting the result to the time domain to obtain a noise-suppressed current audio frame;
(5) obtaining the denoised audio signal from the noise-suppressed current audio frames through the processor.
It should be noted that, in the embodiment of the present application, the two microphones are arranged back to back, that is, the sound pickup holes of the two microphones face in opposite directions. For example, referring to fig. 2, the electronic device includes two microphones: a microphone 1 disposed on the lower side of the electronic device with its sound pickup hole facing downward, and a microphone 2 disposed on the upper side with its sound pickup hole facing upward. Further, the two microphones provided in the electronic device may be nondirectional (omnidirectional) microphones.
It should be noted that, after the electronic device acquires two audio signals of the same duration through the two microphones, the processor performs framing processing on each of the two audio signals, dividing each into the same number of audio frames, so that noise suppression can be performed frame by frame.
For example, referring to fig. 3, the two collected audio signals are denoted as audio signal 1 and audio signal 2. The electronic device may frame audio signal 1 into n audio frames of 20 milliseconds each, and similarly frame audio signal 2 into n audio frames of 20 milliseconds each. Noise suppression is then performed on the first audio frame of audio signal 1 together with the first audio frame of audio signal 2 to obtain a first noise-suppressed audio frame, on the second audio frame of audio signal 1 together with the second audio frame of audio signal 2 to obtain a second noise-suppressed audio frame, and so on, until the nth audio frames of the two signals yield an nth noise-suppressed audio frame. Thus, a complete noise-suppressed audio signal, i.e. the noise reduction audio signal, can be obtained from the noise-suppressed audio frames.
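The framing step can be sketched as follows; the 16 kHz sampling rate and 20 ms frame length are the example values mentioned in this application, and the function name is illustrative:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20):
    """Split a 1-D audio signal into equal-length, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz / 20 ms
    n_frames = len(signal) // frame_len          # drop any trailing partial frame
    return np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
```

Both microphone signals are framed the same way, so frame i of audio signal 1 is paired with frame i of audio signal 2 for noise suppression.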
It should be noted that the current audio frame is not used to refer to a specific audio frame, but is used to refer to an audio frame used for noise suppression at the current time, for example, if noise suppression is performed according to a fifth audio frame of two audio signals at the current time, a fifth audio frame of the two audio signals is the current audio frame, and if noise suppression is performed according to a sixth audio frame of the two audio signals at the current time, a sixth audio frame of the two audio signals is the current audio frame, and so on.
In the embodiment of the application, the electronic device transforms, by using the processor, the current audio frame of each of the two audio signals from the time domain to the frequency domain, and extracts, in the frequency domain, the sub-audio signals from the respective desired directions (the desired directions of the microphones) in the two current audio frames, so as to obtain two sub-audio signals. Wherein the desired directions of the two microphones are opposite, wherein the desired direction of the microphone closer to the target sound source is a direction towards the target sound source, and the desired direction of the microphone farther from the target sound source is a direction away from the target sound source.
For example, when the electronic device performs sound collection during an owner call, the owner is a target sound source, the two microphones of the electronic device are denoted as a microphone 1 and a microphone 2, if the microphone 1 is closer to the owner, the expected direction of the microphone 1 is a direction toward the owner, and the expected direction of the microphone 2 is a direction away from the owner.
It will be appreciated by a person skilled in the art from the above description that the electronic device extracts two sub-audio signals from the two current audio frames, wherein one sub-audio signal carries more of the "target sound" and the other carries more "noise".
After extracting the two sub-audio signals from the two current audio frames, the electronic device performs frequency band division on the two sub-audio signals according to the same band division mode to obtain a plurality of sub-bands. Then, for each sub-band, beam forming is performed according to the beam forming filter coefficient corresponding to that sub-band to obtain a beam forming signal of the sub-band; thus, for the plurality of sub-bands obtained by division, the electronic device correspondingly obtains a plurality of beam forming signals.
For example, the electronic device performs band division on two sub-audio signals according to the same band division manner to obtain i sub-bands, and performs beam forming on the i sub-bands according to corresponding beam forming filter coefficients to obtain i beam forming signals.
After obtaining the plurality of beam forming signals, the electronic device performs autocorrelation calculation on the two sub-audio signals in each sub-band through the processor, so as to obtain autocorrelation coefficients of the two sub-audio signals in each sub-band. Then, for each sub-band, a gain factor for noise suppression of the beamformed signals of the sub-band is obtained according to the beamforming filter coefficient corresponding to the sub-band and the autocorrelation coefficient of each of the two sub-audio signals in the sub-band. In this way, for a plurality of beamformed signals, the electronic device will obtain gain factors for noise suppression of the plurality of beamformed signals, respectively.
After obtaining the plurality of gain factors for performing noise suppression on the plurality of beamforming signals, the electronic device may perform noise suppression on the plurality of beamforming signals according to the plurality of gain factors, respectively, by using the processor, to obtain the plurality of beamforming signals after noise suppression. And then, the electronic equipment performs frequency band splicing on the plurality of beam forming signals subjected to noise suppression through the processor and converts the signals into a time domain to obtain a current audio frame subjected to noise suppression.
And for each audio frame from the two audio signals, the electronic equipment performs noise reduction to obtain a corresponding audio frame, and the electronic equipment further splices the audio frames obtained by noise reduction to obtain the noise-reduced audio signal.
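The per-frame pipeline of steps (1)-(4) can be sketched as below. The patent does not give the beamforming filter coefficients or the exact gain-factor formula, so here the filter weights are taken as inputs and the gain uses a Wiener-style placeholder derived from the sub-band energies; both are assumptions for illustration only:

```python
import numpy as np

def noise_suppress_frame(frame1, frame2, band_filters, n_bands=8):
    """Sketch of steps (1)-(4): FFT, sub-band beamforming, gain-based noise
    suppression, band concatenation, inverse FFT back to the time domain.

    frame1, frame2 : time-domain current frames of the two sub-audio signals
    band_filters   : list of (w1, w2) complex beamforming weights per sub-band
                     (placeholder values; the patent does not specify them)
    """
    F1, F2 = np.fft.rfft(frame1), np.fft.rfft(frame2)     # time -> frequency
    bands = np.array_split(np.arange(len(F1)), n_bands)   # frequency-band division
    out = np.zeros(len(F1), dtype=complex)
    for (w1, w2), idx in zip(band_filters, bands):
        beam = w1 * F1[idx] + w2 * F2[idx]                # beamforming per sub-band
        # Gain factor from the sub-band autocorrelations (energies) of the two
        # sub-audio signals -- a Wiener-style placeholder, not the patent's formula.
        r1 = np.mean(np.abs(F1[idx]) ** 2)
        r2 = np.mean(np.abs(F2[idx]) ** 2)
        gain = r1 / (r1 + r2 + 1e-12)
        out[idx] = gain * beam                            # noise suppression
    return np.fft.irfft(out, n=len(frame1))               # band splicing + iFFT
```

Running this for every paired frame and concatenating the results gives the noise reduction audio signal described in step (5).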
In one embodiment, "verifying the noise reduction audio signal" includes:
(1) carrying out endpoint detection on the noise reduction audio signal through a processor, and dividing the noise reduction audio signal into a plurality of sub noise reduction audio signals according to an endpoint detection result;
(2) calling a voiceprint feature extraction model related to a preset text through a processor to extract a voiceprint feature vector of each sub-noise reduction audio signal;
(3) acquiring similarity between a voiceprint feature vector of each sub-noise reduction audio signal and a target voiceprint feature vector through a processor, wherein the target voiceprint feature vector is the voiceprint feature vector of an audio signal of a preset text spoken by a preset user;
(4) and according to the corresponding similarity of each sub noise reduction audio signal, checking the text characteristic and the voiceprint characteristic of the noise reduction audio signal through a processor.
In the embodiment of the present application, it is considered that the noise reduction audio signal is generally continuous speech, and the noise reduction audio signal needs to be divided. The processor firstly carries out endpoint detection on the noise reduction audio signal by adopting a preset endpoint detection algorithm, and then divides the noise reduction audio signal into a plurality of sub audio signals according to an endpoint detection result, and records the sub audio signals as the sub noise reduction audio signals. It should be noted that, for the endpoint Detection algorithm adopted by the processor, no specific limitation is imposed in the embodiment of the present application, and a person having ordinary skill in the art may select the endpoint Detection algorithm according to actual needs, for example, in the embodiment of the present application, the processor adopts a VAD (Voice Activity Detection) algorithm to perform endpoint Detection on the noise reduction audio signal. In addition, when the noise reduction audio signal is divided into a plurality of sub noise reduction audio signals according to the endpoint detection result, the processor divides the audio data corresponding to the adjacent endpoints whose time intervals are less than the preset time duration (for example, set to 200 milliseconds) into one sub noise reduction audio signal according to the endpoint detection result.
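The endpoint-grouping rule above (audio between adjacent endpoints less than 200 ms apart belongs to the same sub noise reduction audio signal) can be sketched as follows; the function name and the representation of VAD endpoints as millisecond timestamps are illustrative assumptions:

```python
def split_by_endpoints(endpoints_ms, max_gap_ms=200):
    """Group detected speech endpoints into sub-signals: adjacent endpoints
    closer than max_gap_ms are merged into one sub noise-reduced audio signal.

    endpoints_ms : sorted list of endpoint times in milliseconds (VAD output)
    Returns a list of (start_ms, end_ms) segments.
    """
    if not endpoints_ms:
        return []
    segments = []
    start = prev = endpoints_ms[0]
    for t in endpoints_ms[1:]:
        if t - prev < max_gap_ms:        # gap below threshold: same segment
            prev = t
        else:                            # gap too large: close current segment
            segments.append((start, prev))
            start = prev = t
    segments.append((start, prev))
    return segments
```

Each returned segment then delimits one sub noise reduction audio signal within the continuous speech.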
It should be noted that, in the embodiment of the present application, a voiceprint feature extraction model related to a preset text (for example, a preset wake-up word) is also trained in advance. For example, in the embodiment of the present application, a voiceprint feature extraction model based on a convolutional neural network is trained. Referring to fig. 4, audio signals of a plurality of people (e.g., 200 people) speaking the preset wake-up word may be collected in advance; endpoint detection is performed on these audio signals and the preset wake-up word portions are segmented out; the segmented audio signals are preprocessed and windowed, then Fourier transformed (e.g., by short-time Fourier transform); the energy density of the Fourier-transformed audio signals is calculated to generate grayscale spectrograms (as shown in fig. 5, where the horizontal axis represents time, the vertical axis represents frequency, and the gray level represents the energy value); finally, a convolutional neural network is trained on the generated spectrograms to produce the voiceprint feature extraction model related to the preset text. In addition, in the embodiment of the application, a spectrogram of an audio signal of the preset user speaking the preset wake-up word (i.e., the preset text) is extracted and input into the previously trained voiceprint feature extraction model; after passing through the model's convolutional layers, pooling layers, and fully connected layers, a corresponding set of feature vectors is output and recorded as the target voiceprint feature vector.
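The spectrogram-generation steps (windowing, short-time Fourier transform, energy density, grayscale mapping) can be sketched as below; the 25 ms window and 10 ms hop are illustrative choices not specified in the patent:

```python
import numpy as np

def grayscale_spectrogram(signal, win_len=400, hop=160):
    """Short-time-Fourier-transform energy spectrogram mapped to [0, 255] gray
    levels (horizontal axis = time, vertical axis = frequency, gray = energy).
    win_len=400 / hop=160 correspond to 25 ms / 10 ms at 16 kHz (assumed values).
    """
    window = np.hanning(win_len)                       # windowing before FFT
    n_frames = 1 + (len(signal) - win_len) // hop
    spec = np.empty((win_len // 2 + 1, n_frames))
    for i in range(n_frames):
        seg = signal[i * hop : i * hop + win_len] * window
        spec[:, i] = np.abs(np.fft.rfft(seg)) ** 2     # energy density per bin
    log_spec = 10 * np.log10(spec + 1e-10)             # log energy in dB
    lo, hi = log_spec.min(), log_spec.max()
    gray = (log_spec - lo) / (hi - lo + 1e-12) * 255   # normalize to gray levels
    return np.rint(gray).astype(np.uint8)
```

The resulting image is what would be fed to the convolutional neural network for training or feature extraction.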
Accordingly, the processor extracts spectrogram of the plurality of sub noise reduction audio signals respectively after dividing the noise reduction audio signal into the plurality of sub noise reduction audio signals. For how to extract the spectrogram, details are not repeated here, and specific reference may be made to the above related description. After the speech spectrograms of the sub noise reduction audio signals are extracted, the processor respectively inputs the speech spectrograms of the sub noise reduction audio signals into a previously trained voiceprint feature extraction model, and therefore a voiceprint feature vector of each sub noise reduction audio signal is obtained.
After extracting the voiceprint feature vectors of the sub noise reduction audio signals, the processor respectively obtains the similarity between the voiceprint feature vectors of the sub noise reduction audio signals and the target voiceprint feature vectors, and then the text features and the voiceprint features of the noise reduction audio signals are verified according to the similarity corresponding to the sub noise reduction audio signals. For example, the processor may determine whether there is a sub noise reduction audio signal whose similarity between the voiceprint feature vector and the target voiceprint feature vector reaches a preset similarity (an empirical value may be taken by a person of ordinary skill in the art according to actual needs, and may be set to 75%, for example), and if there is, determine that the text feature and the voiceprint feature of the noise reduction audio signal pass verification.
In one embodiment, "checking, by the processor, the text feature and the voiceprint feature of the noise reduction audio signal according to the similarity corresponding to each sub-noise reduction audio signal" includes:
according to the similarity corresponding to each sub noise reduction audio signal and a preset identification function, verifying the text characteristic and the voiceprint characteristic of the noise reduction audio signal through a processor;
wherein the recognition function is γ_n = γ_{n-1} + f(l_n), where γ_n represents the state value of the recognition function corresponding to the n-th sub noise reduction audio signal, γ_{n-1} represents the state value of the recognition function corresponding to the (n-1)-th sub noise reduction audio signal, f(l_n) is a piecewise function of l_n defined in terms of a and b, a is a correction value of the recognition function, b is the predetermined similarity, and l_n is the similarity between the voiceprint feature vector of the n-th sub noise reduction audio signal and the target voiceprint feature vector;
processor in presence of gamma greater than preset discrimination function state valuenAnd judging that the text characteristic and the voiceprint characteristic of the noise reduction audio signal pass verification.
It should be noted that the value of a in the recognition function can be an empirical value according to actual needs by those skilled in the art, for example, a can be set to 1.
In addition, the value of b in the recognition function is positively correlated with the recognition rate of the voiceprint feature extraction model, and the value of b is determined according to the recognition rate of the voiceprint feature extraction model obtained through actual training.
In addition, the preset recognition function state value can also be an empirical value obtained by a person skilled in the art according to actual needs, and the higher the value is, the higher the accuracy of the verification of the first audio signal is.
Therefore, through the recognition function, the first audio signal can be accurately recognized even when it includes other information besides the preset wake-up word (for example, when the user speaks the preset wake-up word followed by additional speech such as a weather query).
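The recognition-function check can be sketched as follows. The piecewise function f(l_n) appears in the source only as an unrendered equation image, so this sketch assumes f(l) = a when l ≥ b and f(l) = -a otherwise; a = 1 and b = 0.75 follow the empirical values mentioned in the text, while the state-value threshold is illustrative:

```python
def verify_with_recognition_function(similarities, a=1.0, b=0.75, threshold=1.0):
    """Accumulate gamma_n = gamma_{n-1} + f(l_n) over the per-segment
    similarities; pass as soon as some gamma_n exceeds the preset state value.

    ASSUMPTION: f(l) = a if l >= b else -a (the patent's piecewise definition
    is not rendered in the source text).
    """
    gamma = 0.0
    for l in similarities:
        gamma += a if l >= b else -a     # correction value per sub-signal
        if gamma > threshold:            # gamma_n exceeds the preset state value
            return True
    return False
```

Under this assumption, isolated high-similarity segments (the wake-up word) can outweigh surrounding low-similarity segments (other speech), which matches the behavior described in the paragraph above.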
In one embodiment, "obtaining, by the processor, a similarity between a voiceprint feature vector of each sub-noise reduction audio signal and a target voiceprint feature vector" includes:
calculating the similarity between the vocal print characteristic vector of each sub noise reduction audio signal and the target vocal print characteristic vector through a processor according to a dynamic time warping algorithm;
or, calculating, by the processor, a feature distance between the voiceprint feature vector of each sub-noise reduction audio signal and the target voiceprint feature vector as a similarity.
In the embodiment of the application, when obtaining the similarity between the voiceprint feature vector of each sub noise reduction audio signal and the target voiceprint feature vector, the processor may calculate the similarity according to a dynamic time warping algorithm.
Or, the processor may calculate a feature distance between the voiceprint feature vector of each sub noise reduction audio signal and the target voiceprint feature vector as a similarity, where what feature distance is used to measure the similarity between the two vectors is not specifically limited in this embodiment of the application, for example, an euclidean distance may be used to measure the similarity between the voiceprint feature vector of the sub noise reduction audio signal and the target voiceprint feature vector.
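A minimal sketch of the feature-distance variant, using the Euclidean distance suggested above; the mapping from distance to a bounded similarity score is an illustrative choice, since the patent only states that the feature distance serves as the similarity measure:

```python
import numpy as np

def euclidean_similarity(v1, v2):
    """Map the Euclidean distance between two voiceprint feature vectors to a
    similarity in (0, 1]: identical vectors give 1, distant vectors approach 0.
    (The mapping 1/(1+d) is an assumption for illustration.)
    """
    d = np.linalg.norm(np.asarray(v1, float) - np.asarray(v2, float))
    return 1.0 / (1.0 + d)
```

The dynamic time warping alternative would instead align the two feature sequences before measuring distance, which tolerates differences in speaking rate.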
Fig. 6 is another schematic flow chart of an audio verification method according to an embodiment of the present application. The audio verification method is applied to the electronic device provided by the present application, where the electronic device includes a processor, a dedicated voice recognition chip, and two microphones, as shown in fig. 6, a flow of the audio verification method provided by the embodiment of the present application may be as follows:
in 201, when the processor is in a sleep state, the electronic device acquires an external audio signal through any microphone and provides the audio signal to the dedicated voice chip.
It should be noted that the dedicated voice recognition chip in the embodiment of the present application is a dedicated chip designed for voice recognition, such as a digital signal processing chip designed for voice, an application specific integrated circuit chip designed for voice, etc., which has lower power consumption than a general-purpose processor. Any two of the special voice recognition chip, the processor and the audio acquisition unit are in communication connection through a communication bus (such as an I2C bus) to realize data interaction.
In the embodiment of the application, the processor is in a dormant state when the screen of the electronic equipment is in a screen-off state, and the special voice recognition chip is in a dormant state when the screen is in a screen-on state. In addition, the two microphones included in the electronic device may be internal microphones or external microphones (may be wired microphones or wireless microphones).
When the processor is in a sleep state (the special voice recognition chip is in an awakening state), the electronic device collects external sound through any microphone, if the microphone is an analog microphone, the analog audio signal is collected, and at the moment, the analog audio signal needs to be subjected to analog-to-digital conversion to obtain a digitized audio signal for subsequent processing. For example, the electronic device may sample an external analog audio signal at a sampling frequency of 16KHz after the external analog audio signal is collected by the microphone, so as to obtain a digital audio signal.
It will be appreciated by those skilled in the art that if the microphone included in the electronic device is a digital microphone, the digitized audio signal will be directly acquired without analog-to-digital conversion.
After the external audio signal is collected, the electronic device provides the collected audio signal to the dedicated voice recognition chip.
In 202, the electronic device verifies the audio signal through the dedicated voice chip, wakes up the processor when the verification is passed, and controls the dedicated voice chip to sleep after waking up the processor.
In the embodiment of the application, after the collected external audio signal is provided to the dedicated voice chip, the electronic device further checks the audio signal through a first checking algorithm running on the dedicated voice chip to obtain a checking result. The checking includes, but is not limited to, verifying the text feature and/or the voiceprint feature of the aforementioned audio signal.
In plain terms, checking the text feature of the audio signal means determining whether the audio signal includes the preset wake-up word; as long as it does, the text feature passes verification regardless of who speaks the wake-up word. For example, if the audio signal includes a preset wake-up word set by a preset user (e.g., the owner of the electronic device, or another user authorized by the owner to use the electronic device), but the wake-up word is spoken by a user A rather than the preset user, the dedicated voice recognition chip will still pass the verification when checking the text feature of the audio signal based on the first checking algorithm.
And checking whether the text characteristic and the voiceprint characteristic of the audio signal comprise the preset awakening words spoken by the preset user, if the audio signal comprises the preset awakening words spoken by the preset user, the text characteristic and the voiceprint characteristic of the audio signal are checked to be passed, and if not, the check is not passed. For example, if the audio signal includes a preset wake-up word set by a preset user, and the preset wake-up word is spoken by the preset user, the text feature and the voiceprint feature of the audio signal are verified; for another example, if the audio signal includes a preset wake-up word spoken by a user other than the preset user, or the audio signal does not include any preset wake-up word spoken by the user, the text feature and the voiceprint feature of the audio signal will fail to be verified (or will not pass verification).
In the embodiment of the application, when the audio signal passes the check by the dedicated voice recognition chip, the electronic device sends a preset interrupt signal to the processor through the communication connection between the dedicated voice recognition chip and the processor so as to wake up the processor. After waking up the processor, the electronic device puts the dedicated voice recognition chip to sleep, and meanwhile the screen is switched from the screen-off state to the screen-on state.
It should be noted that, if the audio signal is not verified, the electronic device continues to provide the external audio signal acquired through any microphone to the dedicated voice recognition chip for verification until the verification is passed.
At 203, the electronic device acquires two external audio signals through two microphones and provides the two audio signals to the processor.
After waking up the processor, the electronic device synchronously acquires two external audio signals with the same duration through the two set microphones and provides the two acquired audio signals to the processor.
It will be appreciated by those skilled in the art from the foregoing description that the two audio signals provided to the processor are likewise digitized audio signals.
At 204, the electronic device characterizes the two audio signals by processor vectorization, resulting in an audio vector.
In this embodiment, the electronic device may reduce noise of the two audio signals by the processor in a blind source separation manner to obtain a noise reduction audio signal.
The electronic device first vectorizes the two audio signals through the processor to obtain an audio vector. For example, supposing the two acquired audio signals are x_1 and x_2, the processor vectorizes the two audio signals to obtain the audio vector x = [x_1, x_2]ᵀ.
At 205, the electronic device frames the audio vector by the processor to obtain a plurality of audio frames, and obtains, by the processor, a separation coefficient for blind source separation of the audio frames.
The electronic equipment frames the audio vector through the processor to obtain a plurality of audio frames, wherein the lengths of the audio frames obtained through framing are the same.
For example, when the electronic device frames the audio vector through the processor to obtain a plurality of audio frames, it frames the audio vector according to a frame length of 20 milliseconds, and the resulting audio frames are represented as x_m, where m denotes the frame index.
At 206, the electronic device blind-source separates the corresponding audio frame by the processor based on the separation coefficients to obtain sub-speech signals.
After obtaining the plurality of audio frames by framing the audio vector, the electronic device obtains, through the processor, a separation coefficient for blind source separation of each audio frame; the separation coefficient for x_m may be expressed as w_m.
After obtaining the separation coefficients for blind source separation of each audio frame, the electronic device further performs, through the processor, blind source separation on the corresponding audio frame based on each separation coefficient to obtain a sub-noise signal and a sub-voice signal, expressed as:

[s_m; v_m] = w_m x_m,

wherein x_m represents the m-th audio frame, w_m represents the separation coefficient corresponding to the m-th audio frame, s_m represents the sub-voice signal separated from the m-th audio frame, and v_m represents the sub-noise signal separated from the m-th audio frame.
It should be noted that, the blind source separation of any audio frame by the processor will result in two sub-audio signals, and due to the uncertainty of the blind source separation output signal, the electronic device may identify the sub-speech signal and the sub-noise signal in the two separated sub-audio signals by an endpoint detection algorithm run by the processor.
At 207, the electronic device combines the sub-speech signals of the audio frames by the processor to obtain a noise-reduced audio signal.
After blind source separation of each audio frame is completed, the electronic device combines, through the processor, the sub-voice signals of the audio frames in their time order to obtain the noise reduction audio signal, and likewise combines the sub-noise signals of the audio frames to obtain the noise signal.
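Steps 205-207 (per-frame separation followed by time-ordered combination) can be sketched together as below; the assumption that row 0 of each separated frame is the sub-voice signal stands in for the endpoint-detection identification mentioned above, and the function name is illustrative:

```python
import numpy as np

def blind_source_denoise(frames, separation_coeffs):
    """Separate each two-channel frame with its converged separation matrix,
    then concatenate the speech outputs in time order to form the noise
    reduction audio signal (and the noise outputs likewise).

    frames            : list of (2, T) arrays, the framed audio vector x_m
    separation_coeffs : list of (2, 2) separation matrices w_m
    ASSUMES row 0 of each separated frame is the sub-voice signal; the patent
    identifies it with an endpoint detection algorithm instead.
    """
    speech_parts, noise_parts = [], []
    for x_m, w_m in zip(frames, separation_coeffs):
        y = w_m @ x_m                    # [s_m; v_m] = w_m x_m
        speech_parts.append(y[0])        # sub-voice signal of frame m
        noise_parts.append(y[1])         # sub-noise signal of frame m
    return np.concatenate(speech_parts), np.concatenate(noise_parts)
```

The concatenated speech output is what step 208 then passes to the second checking algorithm.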
At 208, the electronic device verifies the noise reduction audio signal via the processor to obtain a verification result.
After the two audio signals are denoised by the processor to obtain the noise reduction audio signal, the electronic device further checks the noise reduction audio signal through a second checking algorithm run by the processor to obtain a checking result; the checking includes, but is not limited to, verifying the text feature and/or the voiceprint feature of the noise reduction audio signal. When the noise reduction audio signal passes the check, the electronic device may further execute the operation corresponding to the noise reduction audio signal, including but not limited to unlocking the screen, starting a voice assistant, and the like.
It should be noted that the first checking algorithm executed by the dedicated speech recognition chip may be the same as or different from the second checking algorithm executed by the processor, and the embodiment of the present application does not specifically limit this.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an audio verification apparatus according to an embodiment of the present disclosure. The audio verification device can be applied to electronic equipment which comprises a processor, a special voice recognition chip and two microphones. The audio verification apparatus may include a first capture module 401, a first verification module 402, a second capture module 403, and a second verification module 404, wherein,
the first acquisition module 401 is configured to acquire an external audio signal through any one of the microphones when the processor is in a sleep state, and provide the audio signal to the dedicated voice chip;
the first checking module 402 is configured to check the audio signal through the dedicated voice chip, wake up the processor when the check is passed, and control the dedicated voice chip to sleep after the processor is woken up;
a second collecting module 403, configured to obtain two external audio signals through two microphones and provide the two audio signals to the processor;
and a second checking module 404, configured to reduce the noise of the two audio signals through the processor to obtain a noise reduction audio signal, and check the noise reduction audio signal to obtain a check result.
In an embodiment, when the two audio signals are denoised by the processor to obtain the denoised audio signal, the second check module 404 may be configured to:
vectorizing and representing the two audio signals through a processor to obtain an audio vector;
and performing blind source separation on the audio vector through the processor to obtain a voice signal, and setting the voice signal as the noise reduction audio signal.
In an embodiment, when the audio vector is blind source separated by the processor to obtain the speech signal, the second check module 404 may be configured to:
framing the audio vector by a processor to obtain a plurality of audio frames;
obtaining, by a processor, a separation coefficient for blind source separation of each audio frame;
based on each separation coefficient, performing blind source separation on the corresponding audio frame through the processor to obtain a sub-voice signal;
and combining the sub-voice signals of the audio frames through a processor to obtain the voice signal.
In an embodiment, when the separation coefficients for blind source separation of audio frames are obtained by the processor, the second check module 404 may be configured to:
whitening, by a processor, a current audio frame;
and setting the separation coefficient corresponding to the previous audio frame as the initial separation coefficient of the current audio frame, and iterating, through the processor, based on the whitened current audio frame and the initial separation coefficient, to obtain the separation coefficient for blind source separation of the audio frame.
In an embodiment, when the two audio signals are denoised by the processor to obtain the denoised audio signal, the second check module 404 may be configured to:
converting the current audio frame of each of the two audio signals from a time domain to a frequency domain through a processor, and extracting sub-audio signals from respective expected directions in the two current audio frames in the frequency domain to obtain two sub-audio signals, wherein the expected directions corresponding to the two current audio frames are opposite;
performing frequency band division on the two sub-audio signals through a processor, and performing beam forming on a plurality of sub-frequency bands obtained through division according to corresponding beam forming filter coefficients to obtain a plurality of beam forming signals;
obtaining, by a processor, a plurality of gain factors for performing noise suppression on the plurality of beamformed signals, respectively, in the plurality of subbands, based on the corresponding beamforming filter coefficients and the respective autocorrelation coefficients of the two sub-audio signals, respectively;
respectively carrying out noise suppression on the plurality of beam forming signals according to the plurality of gain factors through a processor, carrying out frequency band splicing on the plurality of beam forming signals after noise suppression, and converting the signals into a time domain to obtain a current audio frame after noise suppression;
and obtaining a noise reduction audio signal according to the current audio frame after noise suppression through a processor.
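The per-frame pipeline above (transform to frequency domain, split into sub-bands, beamform each band, apply a per-band gain, splice, transform back) can be sketched as follows. The patent does not give its beamforming filter coefficients or gain-factor formula, so the delay-and-sum average and Wiener-style gain below are stand-ins, not the claimed method.

```python
import numpy as np

def denoise_frame(front, back, n_bands=8):
    """Per-frame sub-band beamforming and noise suppression sketch."""
    spec_f = np.fft.rfft(front)          # time domain -> frequency domain
    spec_b = np.fft.rfft(back)
    bands = np.array_split(np.arange(len(spec_f)), n_bands)
    out = np.zeros_like(spec_f)
    for idx in bands:
        # beamform this sub-band: average the two channels
        beam = 0.5 * (spec_f[idx] + spec_b[idx])
        # estimate noise from the channel difference and build a
        # Wiener-style gain factor for the sub-band
        noise = 0.5 * (spec_f[idx] - spec_b[idx])
        p_beam = np.mean(np.abs(beam) ** 2)
        p_noise = np.mean(np.abs(noise) ** 2)
        gain = p_beam / (p_beam + p_noise + 1e-12)
        out[idx] = gain * beam           # noise suppression
    return np.fft.irfft(out, n=len(front))  # splice + back to time domain
```

When the two channels agree (target sound), the difference term is small and the gain stays near 1; when they disagree (noise), the gain shrinks the band.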
In an embodiment, in verifying the noise reduction audio signal, the second verification module 404 may be configured to:
carrying out endpoint detection on the noise reduction audio signal through a processor, and dividing the noise reduction audio signal into a plurality of sub noise reduction audio signals according to an endpoint detection result;
calling a voiceprint feature extraction model related to a preset text through a processor to extract a voiceprint feature vector of each sub-noise reduction audio signal;
acquiring similarity between a voiceprint feature vector of each sub-noise reduction audio signal and a target voiceprint feature vector through a processor, wherein the target voiceprint feature vector is the voiceprint feature vector of an audio signal of a preset text spoken by a preset user;
and according to the corresponding similarity of each sub noise reduction audio signal, checking the text characteristic and the voiceprint characteristic of the noise reduction audio signal through a processor.
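The endpoint-detection step that divides the noise reduction audio signal into sub-signals can be realized in many ways; a minimal energy-based version is sketched below. The frame length and threshold ratio are illustrative assumptions, not values from the patent.

```python
import numpy as np

def split_by_endpoints(signal, frame_len=256, threshold_ratio=0.1):
    """Energy-based endpoint detection: frames whose energy exceeds a
    fraction of the peak frame energy count as speech, and each
    contiguous speech run becomes one sub-signal."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = energy > threshold_ratio * energy.max()
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # speech onset (start endpoint)
        elif not a and start is not None:  # speech offset (end endpoint)
            segments.append(signal[start * frame_len : i * frame_len])
            start = None
    if start is not None:                  # speech runs to the end
        segments.append(signal[start * frame_len : n * frame_len])
    return segments
```

Each returned segment then goes through the voiceprint feature extraction model independently, which is what makes the per-sub-signal similarity check possible.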
In an embodiment, when the text feature and the voiceprint feature of the noise reduction audio signal are checked by the processor according to the corresponding similarity of each sub noise reduction audio signal, the second check module 404 may be configured to:
according to the similarity corresponding to each sub noise reduction audio signal and a preset identification function, verifying the text characteristic and the voiceprint characteristic of the noise reduction audio signal through a processor;
wherein the identification function is γ_n = γ_{n-1} + f(l_n), γ_n representing the state value of the recognition function corresponding to the nth sub-noise reduction audio signal and γ_{n-1} representing the state value of the recognition function corresponding to the (n-1)th sub-noise reduction audio signal, f(l_n) being the piecewise formula shown in Figure BDA0002019042810000201,
where a is a correction value of the recognition function, b is a preset similarity, and l_n is the similarity between the voiceprint feature vector of the nth sub-noise reduction audio signal and the target voiceprint feature vector;
and when there exists a γ_n greater than a preset recognition function state value, the processor determines that the text feature and the voiceprint feature of the noise reduction audio signal pass verification.
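The recognition-function accumulation can be sketched as a simple running sum. Because f(l_n) is defined only in a formula image not reproduced in this text, the piecewise form below (add a when the similarity l_n reaches the preset similarity b, subtract a otherwise) is an assumption, as are the numeric thresholds.

```python
def verify(similarities, a=1.0, b=0.7, pass_threshold=2.0):
    """Accumulate gamma_n = gamma_{n-1} + f(l_n) over the per-sub-signal
    similarities; pass as soon as some gamma_n exceeds the preset
    recognition-function state value. f is an assumed piecewise form."""
    gamma = 0.0
    for l in similarities:
        gamma += a if l >= b else -a   # assumed f(l_n)
        if gamma > pass_threshold:
            return True
    return False
```

The accumulator tolerates an occasional low-similarity sub-signal (e.g. a noisy word) as long as enough sub-signals match, rather than requiring every segment to pass individually.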
In an embodiment, when the processor obtains the similarity between the voiceprint feature vector of each sub-noise reduction audio signal and the target voiceprint feature vector, the second check module 404 may be configured to:
calculating the similarity between the vocal print characteristic vector of each sub noise reduction audio signal and the target vocal print characteristic vector through a processor according to a dynamic time warping algorithm;
or, calculating, by the processor, a feature distance between the voiceprint feature vector of each sub-noise reduction audio signal and the target voiceprint feature vector as a similarity.
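The dynamic time warping option mentioned above is the classic alignment algorithm; a textbook implementation is sketched below (the patent does not specify its local distance, so Euclidean distance between frame feature vectors is assumed).

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping distance between two voiceprint feature
    sequences (frames x dims); a smaller distance means the sequences
    are more similar under the best time alignment."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local distance
            cost[i, j] = d + min(cost[i - 1, j],     # insertion
                                 cost[i, j - 1],     # deletion
                                 cost[i - 1, j - 1]) # match
    return cost[n, m]
```

DTW is a natural fit here because the user may speak the preset text faster or slower than in enrollment; the warping path absorbs that timing difference before the distances are compared.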
An embodiment of the present application provides a storage medium, on which an audio verification program is stored, and when the stored audio verification program is executed on an electronic device provided in an embodiment of the present application, the electronic device is enabled to execute steps in an audio verification method provided in an embodiment of the present application. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
Referring to fig. 8, the electronic device includes an audio acquisition unit 101, a processor 102, a dedicated speech recognition chip 103, two microphones 104, and a memory 105, where power consumption of the dedicated speech recognition chip 103 is less than power consumption of the processor 102, where any two of the dedicated speech recognition chip 103, the processor 102, and the audio acquisition unit 101 establish a communication connection through a communication bus (such as an I2C bus) to implement data interaction.
It should be noted that the dedicated speech recognition chip 103 in the embodiment of the present application is a dedicated chip designed for speech recognition, such as a digital signal processing chip designed for speech, an application specific integrated circuit chip designed for speech, etc., which has lower power consumption than a general-purpose processor.
The processor in the embodiments of the present application is a general purpose processor, such as an ARM architecture processor.
The memory 105 stores an audio verification program. The memory 105 may be a high-speed random access memory, or a non-volatile memory such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 105 may also include a memory controller to provide the processor 102 and the dedicated speech recognition chip 103 with access to the memory 105.
In this embodiment of the application, the audio acquisition unit 101 is configured to acquire an external audio signal through either of the microphones 104 when the processor 102 is in a sleep state, and provide the audio signal to the dedicated speech recognition chip 103;
the dedicated speech recognition chip 103 is configured to verify the audio signal, wake up the processor 102 when the verification passes, and sleep after waking up the processor 102;
the audio acquisition unit 101 is further configured to acquire two external audio signals through the two microphones 104 after the processor 102 is woken up, and provide the two audio signals to the processor 102;
the processor 102 is configured to denoise the two audio signals to obtain a noise reduction audio signal, and verify the noise reduction audio signal to obtain a verification result.
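The two-stage architecture described above can be summarized in a few lines: the low-power dedicated chip screens a single-microphone signal first, and only on success is the higher-power processor woken to run the full dual-microphone denoise-and-verify path. The callables and stage names below are illustrative, not API names from the patent.

```python
from enum import Enum, auto

class Stage(Enum):
    DSP_LISTENING = auto()   # low-power chip screens single-mic audio
    CPU_VERIFYING = auto()   # main processor denoises + verifies dual-mic audio

def wake_pipeline(dsp_check, cpu_check, mono, stereo):
    """Two-stage verification sketch. dsp_check and cpu_check are
    assumed callables returning True on a passing verification."""
    if not dsp_check(mono):    # stage 1: cheap always-on check
        return False           # processor stays asleep
    # processor is woken here; the dedicated chip would now sleep
    return cpu_check(stereo)   # stage 2: full dual-microphone verification
```

The point of the split is power: the expensive beamforming and voiceprint matching only run on audio that has already passed the cheap first-stage check.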
In an embodiment, when the two audio signals are denoised to obtain a denoised audio signal, the processor 102 may be configured to:
vectorizing and representing the two audio signals to obtain an audio vector;
and performing blind source separation on the audio vector to obtain a voice signal, and setting the voice signal as the noise reduction audio signal.
In one embodiment, when blind source separation of audio vectors results in a speech signal, the processor 102 may be configured to:
framing the audio vectors to obtain a plurality of audio frames;
acquiring a separation coefficient for separating each audio frame by a blind source;
performing blind source separation on the corresponding audio frame based on each separation coefficient to obtain a sub-voice signal;
and combining the sub-voice signals of the audio frames to obtain the voice signal.
In one embodiment, in obtaining separation coefficients for blind source separation of audio frames, the processor 102 may be configured to:
whitening the current audio frame;
and setting the separation coefficient corresponding to the previous audio frame as the initial separation coefficient of the current audio frame, and iterating the separation coefficient for blind source separation of the current audio frame based on the whitened current audio frame and the initial separation coefficient.
In an embodiment, when the two audio signals are denoised to obtain a denoised audio signal, the processor 102 may be configured to:
transforming, by the processor 102, a current audio frame of each of the two audio signals from a time domain to a frequency domain, and extracting sub-audio signals from respective desired directions in the two current audio frames in the frequency domain to obtain two sub-audio signals, where the desired directions corresponding to the two current audio frames are opposite;
performing frequency band division on the two sub-audio signals, and performing beam forming on a plurality of sub-frequency bands obtained by division according to corresponding beam forming filter coefficients to obtain a plurality of beam forming signals;
acquiring a plurality of gain factors for respectively carrying out noise suppression on a plurality of beam forming signals in a plurality of sub-frequency bands according to the corresponding beam forming filter coefficients and the respective autocorrelation coefficients of the two sub-audio signals;
respectively carrying out noise suppression on the plurality of beam forming signals according to the plurality of gain factors, carrying out frequency band splicing on the plurality of beam forming signals after noise suppression, and converting the signals into a time domain to obtain a current audio frame after noise suppression;
and obtaining a noise reduction audio signal according to the current audio frame after the noise suppression.
In one embodiment, in verifying the noise reduction audio signal, the processor 102 may be configured to:
carrying out endpoint detection on the noise reduction audio signal, and dividing the noise reduction audio signal into a plurality of sub noise reduction audio signals according to an endpoint detection result;
calling a voiceprint feature extraction model related to a preset text to extract a voiceprint feature vector of each sub-noise reduction audio signal;
acquiring similarity between a voiceprint feature vector of each sub-noise reduction audio signal and a target voiceprint feature vector, wherein the target voiceprint feature vector is a voiceprint feature vector of an audio signal of a preset text spoken by a preset user;
and checking the text characteristic and the voiceprint characteristic of the noise reduction audio signal according to the corresponding similarity of each sub noise reduction audio signal.
In an embodiment, when checking the text feature and the voiceprint feature of the noise reduction audio signal according to the similarity corresponding to each sub-noise reduction audio signal, the processor 102 may be configured to:
checking the text characteristic and the voiceprint characteristic of the noise reduction audio signal according to the corresponding similarity of each sub noise reduction audio signal and a preset identification function;
wherein the identification function is γ_n = γ_{n-1} + f(l_n), γ_n representing the state value of the recognition function corresponding to the nth sub-noise reduction audio signal and γ_{n-1} representing the state value of the recognition function corresponding to the (n-1)th sub-noise reduction audio signal, f(l_n) being the piecewise formula shown in Figure BDA0002019042810000231,
where a is a correction value of the recognition function, b is a preset similarity, and l_n is the similarity between the voiceprint feature vector of the nth sub-noise reduction audio signal and the target voiceprint feature vector;
and when there exists a γ_n greater than a preset recognition function state value, the processor 102 determines that the text feature and the voiceprint feature of the noise reduction audio signal pass verification.
In one embodiment, in obtaining the similarity between the voiceprint feature vector of each sub-noise reduction audio signal and the target voiceprint feature vector, the processor 102 may be configured to:
calculating the similarity between the vocal print characteristic vector of each sub noise reduction audio signal and the target vocal print characteristic vector according to a dynamic time warping algorithm;
or, calculating a feature distance between the voiceprint feature vector of each sub-noise reduction audio signal and the target voiceprint feature vector as a similarity.
It should be noted that the electronic device provided in the embodiment of the present application and the audio verification method in the foregoing embodiment belong to the same concept; any method provided in the embodiments of the audio verification method may be run on the electronic device, and its specific implementation process is described in detail in the embodiments of the audio verification method and is not repeated here.
It should be noted that, for the audio verification method in the embodiment of the present application, it can be understood by a person skilled in the art that all or part of the process for implementing the audio verification method in the embodiment of the present application can be completed by controlling the relevant hardware through a computer program, where the computer program can be stored in a computer readable storage medium, such as a memory of an electronic device, and executed by a processor and a dedicated voice recognition chip in the electronic device, and the process of executing the process can include, for example, the process of the embodiment of the audio verification method. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, etc.
The foregoing describes in detail an audio verification method, a storage medium, and an electronic device provided in an embodiment of the present application, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the foregoing embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. An audio verification method is applied to electronic equipment, and is characterized in that the electronic equipment comprises a processor, a special voice recognition chip and two microphones, the directions of sound pickup holes of the two microphones are opposite, and the power consumption of the special voice recognition chip is smaller than that of the processor, and the audio verification method comprises the following steps:
when the processor is in a dormant state, acquiring an external audio signal through either microphone and providing the audio signal to the dedicated speech recognition chip;
verifying the audio signal through the dedicated speech recognition chip, waking up the processor when the verification passes, controlling the dedicated speech recognition chip to sleep after the processor is woken up, and switching the screen from a screen-off state to a screen-on state;
the method comprises the steps that external two audio signals with the same time length are obtained through two microphones respectively, and the two audio signals with the same time length are provided for a processor;
the two audio signals with the same time length are respectively subjected to framing processing through the processor, and the two audio signals with the same time length are respectively divided into a plurality of audio frames with the same number;
determining an nth audio frame of the two audio signals from a plurality of audio frames of the two audio signals, and extracting sub audio signals from respective expected directions in the nth audio frame of the two audio signals in a frequency domain, wherein the expected directions corresponding to the two nth audio frames are opposite, the expected directions are the pickup directions of corresponding microphones, and one of the two nth audio frames carries more target sounds and one carries more noise;
performing frequency band division on the two sub-audio signals according to the same division mode to obtain a plurality of sub-frequency bands of each sub-audio signal;
performing noise suppression on beamforming signals of a plurality of sub-bands of the two sub-audio signals; splicing the plurality of sub-frequency bands after noise suppression, and converting the sub-frequency bands into a time domain to obtain two nth audio frames after noise suppression;
obtaining noise reduction audio signals according to the audio frames after noise suppression, and checking the noise reduction audio signals to obtain a checking result;
and when the processor passes the verification, executing the operation corresponding to the noise reduction audio signal.
2. The audio verification method of claim 1, wherein denoising the two audio signals by the processor to obtain a denoised audio signal comprises:
vectorizing and representing the two audio signals through the processor to obtain an audio vector;
and performing, by the processor, blind source separation on the audio vector to obtain a voice signal, and setting the voice signal as the noise reduction audio signal.
3. The audio verification method of claim 2, wherein said blind source separating the audio vectors by the processor to obtain a speech signal comprises:
framing, by the processor, the audio vectors to obtain a plurality of audio frames;
obtaining, by the processor, separation coefficients for blind source separation of each of the audio frames;
performing, by the processor, blind source separation on the corresponding audio frame based on each separation coefficient to obtain a sub-voice signal;
and combining the sub-voice signals of the audio frames through the processor to obtain the voice signal.
4. The audio verification method of claim 3, wherein obtaining, by the processor, a separation factor for blind source separation of each of the audio frames comprises:
whitening, by the processor, a current audio frame;
setting a separation coefficient corresponding to a previous audio frame as an initial separation coefficient of a current audio frame, and iterating, by the processor, the separation coefficient for blind source separation of the current audio frame based on the whitened current audio frame and the initial separation coefficient.
5. The audio verification method of any one of claims 1 to 4, wherein the verifying the noise reduced audio signal comprises:
performing endpoint detection on the noise reduction audio signal through the processor, and dividing the noise reduction audio signal into a plurality of sub noise reduction audio signals according to an endpoint detection result;
calling a voiceprint feature extraction model related to a preset text through the processor to extract a voiceprint feature vector of each sub-noise reduction audio signal;
obtaining, by the processor, a similarity between a voiceprint feature vector of each of the sub-noise reduction audio signals and a target voiceprint feature vector, where the target voiceprint feature vector is a voiceprint feature vector of an audio signal of a preset text spoken by a preset user;
and according to the corresponding similarity of the sub noise reduction audio signals, checking the text characteristic and the voiceprint characteristic of the noise reduction audio signals through the processor.
6. The audio verification method of claim 5, wherein the processor verifies the text feature and the voiceprint feature of the noise-reduced audio signal according to the similarity corresponding to each of the sub-noise-reduced audio signals, comprising:
checking the text characteristic and the voiceprint characteristic of the noise reduction audio signal through the processor according to the corresponding similarity of each sub noise reduction audio signal and a preset identification function;
wherein the identification function is γ_n = γ_{n-1} + f(l_n), γ_n representing the state value of the recognition function corresponding to the nth sub-noise reduction audio signal and γ_{n-1} representing the state value of the recognition function corresponding to the (n-1)th sub-noise reduction audio signal, f(l_n) being the piecewise formula shown in Figure FDA0003145792980000031,
where a is a correction value of the recognition function, b is a preset similarity, and l_n is the similarity between the voiceprint feature vector of the nth sub-noise reduction audio signal and the target voiceprint feature vector;
and when there exists a γ_n greater than a preset recognition function state value, the processor determines that the text feature and the voiceprint feature of the noise reduction audio signal pass verification.
7. The audio verification method of claim 5, wherein the obtaining, by the processor, a similarity between the voiceprint feature vector of each of the sub-noise reduction audio signals and a target voiceprint feature vector comprises:
calculating the similarity between the vocal print characteristic vector of each sub noise reduction audio signal and the target vocal print characteristic vector according to a dynamic time warping algorithm through the processor;
or, calculating, by the processor, a feature distance between the voiceprint feature vector of each of the sub-noise reduction audio signals and the target voiceprint feature vector as a similarity.
8. An audio verification apparatus, applied to an electronic device, wherein the electronic device comprises a processor, a dedicated speech recognition chip, and two microphones, the sound pickup holes of the two microphones being oppositely oriented, the audio verification apparatus comprising:
a first acquisition module, configured to acquire an external audio signal through either microphone when the processor is in a dormant state, and provide the audio signal to the dedicated speech recognition chip;
a first verification module, configured to verify the audio signal through the dedicated speech recognition chip, wake up the processor when the verification passes, control the dedicated speech recognition chip to sleep after the processor is woken up, and switch the screen from a screen-off state to a screen-on state;
the second acquisition module is used for respectively acquiring two external audio signals with the same time length through two microphones and providing the two audio signals with the same time length to the processor;
a second checking module, configured to perform framing processing on the two audio signals with the same duration through the processor, divide the two audio signals with the same duration into a plurality of audio frames with the same number, determine an nth audio frame of the two audio signals from the plurality of audio frames of the two audio signals, extract sub-audio signals from respective desired directions in the nth audio frame of the two audio signals in a frequency domain, where the desired directions corresponding to the two nth audio frames are opposite, the desired direction is a pickup direction of a corresponding microphone, one of the two nth audio frames carries more target sounds and one carries more noise, perform frequency band division on the two sub-audio signals according to the same division manner, obtain a plurality of sub-frequency bands of each sub-audio signal, and perform noise suppression on beam-formed signals of the plurality of sub-frequency bands of the two sub-audio signals, splicing the plurality of sub-frequency bands after noise suppression, converting the sub-frequency bands into a time domain to obtain two nth audio frames after noise suppression, obtaining noise reduction audio signals according to the audio frames after noise suppression, checking the noise reduction audio signals to obtain a check result, and executing operation corresponding to the noise reduction audio signals when the processor passes the check.
9. An electronic device, comprising an audio acquisition unit, a processor, a dedicated speech recognition chip, and two microphones, wherein the sound pickup holes of the two microphones are oppositely oriented, and the power consumption of the dedicated speech recognition chip is less than the power consumption of the processor, wherein,
the audio acquisition unit is configured to acquire an external audio signal through either microphone when the processor is in a dormant state, and provide the audio signal to the dedicated speech recognition chip;
the dedicated speech recognition chip is configured to verify the audio signal, wake up the processor when the verification passes, sleep after waking up the processor, and switch the screen from a screen-off state to a screen-on state after the processor is woken up;
the audio acquisition unit is used for acquiring external two audio signals with the same time length through two microphones respectively after waking up the processor, and providing the two audio signals with the same time length for the processor;
the processor is configured to perform frame division processing on the two audio signals with the same duration, divide the two audio signals with the same duration into a plurality of audio frames with the same number, determine an nth audio frame of the two audio signals from the plurality of audio frames of the two audio signals, extract sub-audio signals from respective desired directions in the nth audio frame of the two audio signals in a frequency domain, where the desired directions corresponding to the two nth audio frames are opposite, the desired direction is a pickup direction of a corresponding microphone, one of the two nth audio frames carries more target sounds and one carries more noise, perform frequency band division on the two sub-audio signals according to the same division manner, obtain a plurality of sub-frequency bands of each sub-audio signal, and perform noise suppression on beam forming signals of the plurality of sub-frequency bands of the two sub-audio signals, splicing the plurality of sub-frequency bands after noise suppression, converting the sub-frequency bands into a time domain to obtain two nth audio frames after noise suppression, obtaining noise reduction audio signals according to the audio frames after noise suppression, checking the noise reduction audio signals to obtain a check result, and executing operation corresponding to the noise reduction audio signals when the processor passes the check.
10. A storage medium, characterized in that, when a computer program stored in the storage medium is run on an electronic device comprising a processor, a dedicated speech recognition chip and two microphones, the electronic device is caused to perform the steps in the audio verification method according to any one of claims 1 to 7.
CN201910273077.9A 2019-04-04 2019-04-04 Audio verification method and device, storage medium and electronic equipment Active CN110021307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910273077.9A CN110021307B (en) 2019-04-04 2019-04-04 Audio verification method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN110021307A CN110021307A (en) 2019-07-16
CN110021307B (en) 2022-02-01



Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101828335A (en) * 2007-10-18 2010-09-08 摩托罗拉公司 Robust two microphone noise suppression system
CN102347027A (en) * 2011-07-07 2012-02-08 瑞声声学科技(深圳)有限公司 Double-microphone speech enhancer and speech enhancement method thereof
US8131541B2 (en) * 2008-04-25 2012-03-06 Cambridge Silicon Radio Limited Two microphone noise reduction system
WO2015195482A1 (en) * 2014-06-18 2015-12-23 Cypher, Llc Multi-aural mmse analysis techniques for clarifying audio signals
CN105469785A (en) * 2015-11-25 2016-04-06 南京师范大学 Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof
CN108447500A (en) * 2018-04-27 2018-08-24 深圳市沃特沃德股份有限公司 The method and apparatus of speech enhan-cement

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882370A (en) * 2010-06-30 2010-11-10 中山大学 Voice recognition remote controller
CN102300140B (en) * 2011-08-10 2013-12-18 歌尔声学股份有限公司 Speech enhancing method and device of communication earphone and noise reduction communication earphone
CN102510426A (en) * 2011-11-29 2012-06-20 安徽科大讯飞信息科技股份有限公司 Personal assistant application access method and system
CN103686962A (en) * 2013-12-05 2014-03-26 深圳市中兴移动通信有限公司 Low-power-consumption mobile terminal awakening method and device
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
KR102299330B1 (en) * 2014-11-26 2021-09-08 삼성전자주식회사 Method for voice recognition and an electronic device thereof
CN104598192B (en) * 2014-12-29 2018-08-07 联想(北京)有限公司 Information processing method and electronic equipment
CN105244031A (en) * 2015-10-26 2016-01-13 北京锐安科技有限公司 Speaker identification method and device
CN105913850B (en) * 2016-04-20 2019-05-28 上海交通大学 Text correlation vocal print method of password authentication
US10262673B2 (en) * 2017-02-13 2019-04-16 Knowles Electronics, Llc Soft-talk audio capture for mobile devices
US10311870B2 (en) * 2017-05-10 2019-06-04 Ecobee Inc. Computerized device with voice command input capability
CN107464565B (en) * 2017-09-20 2020-08-04 百度在线网络技术(北京)有限公司 Far-field voice awakening method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101828335A (en) * 2007-10-18 2010-09-08 Motorola Inc. Robust two microphone noise suppression system
US8131541B2 (en) * 2008-04-25 2012-03-06 Cambridge Silicon Radio Limited Two microphone noise reduction system
CN102347027A (en) * 2011-07-07 2012-02-08 AAC Acoustic Technologies (Shenzhen) Co., Ltd. Double-microphone speech enhancer and speech enhancement method thereof
WO2015195482A1 (en) * 2014-06-18 2015-12-23 Cypher, Llc Multi-aural MMSE analysis techniques for clarifying audio signals
CN105469785A (en) * 2015-11-25 2016-04-06 Nanjing Normal University Voice activity detection method in communication-terminal double-microphone denoising system and apparatus thereof
CN108447500A (en) * 2018-04-27 2018-08-24 Shenzhen Water World Co., Ltd. Method and apparatus for speech enhancement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Two-Microphone Noise Reduction Method in Highly Non-stationary Multiple-Noise-Source Environments; Junfeng Li et al; IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences; 20080601; full text *
Two-microphone subband noise reduction scheme with a new noise subtraction parameter for speech quality enhancement; Nisachon Tangsangiumvisai et al; IET Signal Processing; 20150430; full text *
Research and application of a dual-microphone-based speech enhancement algorithm; Zhang Yanfang; China Master's Theses Full-text Database, Information Science and Technology; 20140715; full text *

Also Published As

Publication number Publication date
CN110021307A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN110021307B (en) Audio verification method and device, storage medium and electronic equipment
CN106486131B (en) Method and device for speech denoising
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
US10504539B2 (en) Voice activity detection systems and methods
CN110232933B (en) Audio detection method and device, storage medium and electronic equipment
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN110211599B (en) Application awakening method and device, storage medium and electronic equipment
Shao et al. An auditory-based feature for robust speech recognition
CN110400571B (en) Audio processing method and device, storage medium and electronic equipment
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
US10242677B2 (en) Speaker dependent voiced sound pattern detection thresholds
US20140200890A1 (en) Methods, systems, and circuits for speaker dependent voice recognition with a single lexicon
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
CN110600048B (en) Audio verification method and device, storage medium and electronic equipment
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
US11308946B2 (en) Methods and apparatus for ASR with embedded noise reduction
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
US9953633B2 (en) Speaker dependent voiced sound pattern template mapping
JP7383122B2 (en) Method and apparatus for normalizing features extracted from audio data for signal recognition or modification
CN110689887B (en) Audio verification method and device, storage medium and electronic equipment
CN112599148A (en) Voice recognition method and device
López-Espejo et al. Dual-channel spectral weighting for robust speech recognition in mobile devices
Venkatesan et al. Binaural classification-based speech segregation and robust speaker recognition system
CN112908310A (en) Voice instruction recognition method and system in intelligent electric appliance
CN112233657A (en) Speech enhancement method based on low-frequency syllable recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant