WO2024093460A9 - Speech detection method and related device - Google Patents

Speech detection method and related device

Info

Publication number
WO2024093460A9
WO2024093460A9 (PCT/CN2023/114481)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
time domain
domain signal
frame number
voice
Prior art date
Application number
PCT/CN2023/114481
Other languages
English (en)
French (fr)
Other versions
WO2024093460A1 (zh)
Inventor
常文蕾
高欢
王志超
Original Assignee
荣耀终端有限公司 (Honor Device Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 荣耀终端有限公司 (Honor Device Co., Ltd.)
Publication of WO2024093460A1 publication Critical patent/WO2024093460A1/zh
Publication of WO2024093460A9 publication Critical patent/WO2024093460A9/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Definitions

  • the present application relates to the field of audio processing, and in particular to a speech detection method and related equipment.
  • the present application provides a speech detection method and related equipment, which performs VAD detection and wind noise detection on the multi-channel audio signals captured by multiple microphones, thereby avoiding any impact on speech quality while improving detection accuracy.
  • a voice detection method is provided, which is applied to an electronic device including a first microphone and a second microphone, the method comprising:
  • Acquire audio data, where the audio data is collected by the first microphone and the second microphone in the same environment;
  • Perform VAD detection on the audio data to determine and filter out a voice signal;
  • Perform wind noise detection on the voice signal detected by the VAD to determine and filter out the final voice signal.
  • When a user uses an electronic device including multiple microphones to make a voice call or perform a voice operation, the electronic device can first perform VAD detection on the audio data received by the multiple microphones to distinguish voice signals from other signals; then, wind noise detection is performed on the screened voice signals, which is equivalent to screening the voice signals again, so that real voice signals can be distinguished from wind noise signals mistakenly judged as voice signals; the voice signal that passes wind noise detection is the final detection result.
  • In this way, the signals to be tested collected by the multiple microphones are combined, and after the two-stage detection of VAD and wind noise, real voice signals, wind noise signals and other signals can be distinguished.
  • Such a simple detection method involves no hardware changes, so it not only avoids any impact on voice quality but also improves detection accuracy.
  • The other signals referred to in this application are signals other than speech signals and wind noise signals. A minimal sketch of the overall two-stage flow is given below.
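  • To make the two-stage flow concrete, the following is a minimal sketch of how such a pipeline could be organized; the function names (vad_is_speech, wind_is_noise) are hypothetical placeholders for the stage-1 and stage-2 per-frame decisions described in this application, not the actual implementation.

```python
# Minimal two-stage detection sketch (illustrative only; the per-frame
# decision functions are hypothetical placeholders).
def detect_speech(frames_mic1, frames_mic2, vad_is_speech, wind_is_noise):
    """Classify each same-order frame pair as 'speech', 'wind', or 'other'."""
    results = []
    for x1, x2 in zip(frames_mic1, frames_mic2):
        if not vad_is_speech(x1):          # stage 1: VAD on the main microphone
            results.append("other")
        elif wind_is_noise(x1, x2):        # stage 2: wind noise check, both mics
            results.append("wind")
        else:
            results.append("speech")       # passed both stages: real voice
    return results
```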
  • When the audio data is data in the time domain, the method further includes:
  • preprocessing the audio data, wherein the preprocessing at least includes frame segmentation (framing) and time-frequency conversion.
  • The number of frames of first time domain signals obtained is the same as the number of frames of second time domain signals, and the frames correspond one-to-one in order. Therefore, after time-frequency conversion is performed on the framed first and second time domain signals, the numbers of frames of first frequency domain signals and second frequency domain signals obtained are also the same, and they likewise correspond one-to-one in order.
  • Preprocessing puts the audio data into a form that is easier to handle in the subsequent detection steps.
  • the audio data includes a first signal stream to be tested collected by the first microphone and a second signal stream to be tested collected by the second microphone;
  • Preprocessing the audio data includes:
  • multiple frames of the first time domain signal correspond one-to-one to multiple frames of the first frequency domain signal
  • multiple frames of the second time domain signal correspond one-to-one to multiple frames of the second frequency domain signal
  • multiple frames of first time domain signals and multiple frames of first frequency domain signals can be obtained based on the first signal stream to be tested, and multiple frames of second time domain signals and multiple frames of second frequency domain signals can be obtained based on the second signal stream to be tested, so that multiple signals of the same order can be combined for voice detection subsequently.
  • performing VAD detection on the audio data to determine and filter out the voice signal includes:
  • For each frame of the first time domain signal, determine first data corresponding to the first time domain signal according to the first time domain signal and the first frequency domain signal corresponding to it, wherein the first data at least includes a zero-crossing rate, a spectral entropy, and a flatness;
  • VAD detection is performed on the first time domain signal to determine and filter out the voice signal.
  • The differences in how voice signals and other signals manifest in the first data can be used as a distinguishing criterion, so that the first time domain signal can be identified as a voice signal or another signal.
  • performing VAD detection on the first time domain signal to determine and filter out the voice signal includes:
  • the value of the first frame number flag is increased by 1, and it is determined whether the value of the first frame number flag is greater than a first preset frame number threshold;
  • the value of the second frame number flag is increased by 1, and it is determined whether the value of the second frame number flag is greater than a second preset frame number threshold;
  • the first time domain signal whose modified current state is a speech signal is determined and screened out.
  • the first time domain signal of each frame is set with a tentative state and a current state.
  • the tentative state and the current state can be divided into three states: speech signal, wind noise signal and other signals.
  • When the tentative state is different from the current state, the two judgments are inconsistent and at least one of them may be wrong, so the number of frames can be accumulated.
  • When the accumulated number of frames exceeds the frame number threshold, the corresponding current state is modified. This is equivalent to relying on the continuity of the multiple frames of the signal to be tested preceding the current frame to predict and determine the state corresponding to the first time domain signal of that frame.
  • the method further includes:
  • If the value of the first frame number flag is less than or equal to the first preset frame number threshold, the first time domain signal whose current state is a voice signal is determined and screened out; or,
  • if the value of the second frame number flag is less than or equal to the second preset frame number threshold, the first time domain signal whose current state is a voice signal is determined and screened out.
  • In these cases the corresponding current state is not modified. This is equivalent to ignoring the anomaly of these few frames and still treating them as voice signals, in order to ensure the integrity of the sentence and prevent it from being cut off in the middle; or, it is equivalent to still treating a small number of other signals as other signals, in order to avoid mistakenly identifying them as voice signals.
  • Before determining whether the first data satisfies the first condition, the method also includes: performing a first initialization process, the first initialization process at least including resetting the value of the first frame number flag and the value of the second frame number flag to zero.
  • When the first data includes the zero-crossing rate, the spectral entropy, and the flatness, the first condition includes:
  • the zero-crossing rate is greater than a zero-crossing rate threshold, the spectral entropy is less than a spectral entropy threshold, and the flatness is less than a flatness threshold.
  • performing wind noise detection on the voice signal detected by the VAD to determine and filter out the voice signal includes:
  • For a first time domain signal detected by VAD as a speech signal, determine second data corresponding to the first time domain signal according to the first time domain signal, the first frequency domain signal corresponding to it, and the second frequency domain signal having the same order as that first frequency domain signal, wherein the second data at least includes a spectral centroid, low-frequency energy, and correlation;
  • After the second data is determined, wind noise detection is performed on the first time domain signal, and the voice signal is determined and screened out.
  • Because the characteristics of wind noise signals and voice signals are similar, VAD detection alone cannot distinguish them very accurately, and a wind noise signal may be mistaken for a voice signal. That is to say, after VAD detection, the voice signal in the first detection result is only a suspected voice signal, which may include a wind noise signal. Continuing with wind noise detection further distinguishes the real voice signal from the false voice signal (i.e., the wind noise signal). Therefore, after successive VAD detection and wind noise detection, the accuracy of detection can be greatly improved.
  • performing wind noise detection on the first time domain signal to determine and filter out the voice signal includes:
  • the value of the third frame number flag is increased by 1, and it is determined whether the value of the third frame number flag is greater than a third preset frame number threshold;
  • the value of the first frame number flag is increased by 1, and it is determined whether the value of the first frame number flag is greater than a fourth preset frame number threshold;
  • the first time domain signal whose modified current state is a speech signal is determined and screened out.
  • When the tentative state is different from the current state, the two judgments are inconsistent. At this time, at least one of the judgments may be wrong, or it may be the gap between words when the user speaks, so the number of frames can be accumulated.
  • When the accumulated number of frames exceeds the frame number threshold, the corresponding current state is modified. This is equivalent to relying on the continuity of the multiple frames of the signal to be tested preceding the current frame to predict and determine the state corresponding to the first time domain signal of that frame.
  • the method further includes:
  • If the value of the third frame number flag is less than or equal to the third preset frame number threshold, the first time domain signal whose current state is a voice signal is determined and screened out; or,
  • if the value of the first frame number flag is less than or equal to the fourth preset frame number threshold, the first time domain signal whose current state is a voice signal is determined and screened out.
  • In these cases the corresponding current state is not modified. This is equivalent to ignoring the anomaly of these few frames and still treating them as voice signals, in order to ensure the integrity of the sentence and prevent it from being cut off in the middle; or, it is equivalent to still treating a small amount of wind noise signal as wind noise, in order to avoid mistakenly identifying it as a voice signal.
  • Before determining whether the second data satisfies the second condition, the method also includes: performing a second initialization process, the second initialization process at least including resetting the value of the first frame number flag and the value of the third frame number flag to zero.
  • When the second data includes the spectral centroid, the low-frequency energy, and the correlation, the second condition includes:
  • the spectral centroid is less than a spectral centroid threshold, the low-frequency energy is greater than a low-frequency energy threshold, and the correlation is less than a correlation threshold.
  • the first microphone includes one or more first microphones, and/or the second microphone includes one or more second microphones.
  • the first microphone is a microphone disposed at the bottom of the electronic device
  • the second microphone is a microphone disposed at the top or back of the electronic device.
  • an electronic device comprising: one or more processors, a memory and a display screen; the memory is coupled to the one or more processors, the memory is used to store computer program code, the computer program code comprises computer instructions, and the one or more processors call the computer instructions so that the electronic device executes any one of the speech detection methods in the first aspect.
  • a speech detection device comprising a unit for executing any one of the speech detection methods in the first aspect.
  • the processing unit may be a processor, and the input unit may be a communication interface; the electronic device may also include a memory, which is used to store computer program code, and when the processor executes the computer program code stored in the memory, the electronic device executes any one of the methods in the first aspect.
  • a chip system wherein the chip is applied to an electronic device, and the chip includes one or more processors, and the processor is used to call computer instructions so that the electronic device executes any one of the speech detection methods in the first aspect.
  • a computer-readable storage medium stores a computer program code.
  • When the computer program code is executed by an electronic device, the electronic device executes any one of the speech detection methods in the first aspect.
  • a computer program product comprising: a computer program code, when the computer program code is executed by an electronic device, the electronic device executes any one of the speech detection methods in the first aspect.
  • the embodiment of the present application provides a voice detection method and related equipment.
  • the electronic device can first perform pre-processing such as framing and time-frequency conversion on the multi-channel test signals received by the multiple microphones, and then perform VAD detection to distinguish the voice signals and other signals therein; then, the screened voice signals are subjected to wind noise detection, so that the voice signals can be screened again to distinguish the real voice signals from the wind noise signals mistakenly judged as voice signals.
  • the accuracy of the detection can be greatly improved, and the real voice signals, wind noise signals and other signals can be distinguished.
  • the method is simple, which can avoid the impact on the voice quality and improve the accuracy of the detection.
  • Since the speech detection method provided in this application involves only methods, not hardware improvements, and does not require the addition of complex acoustic structures, it is, compared with related technologies, friendlier to small electronic devices and more broadly applicable.
  • FIG. 1 is a schematic diagram of a microphone layout provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application scenario applicable to the present application.
  • FIG. 3 is a schematic diagram of another application scenario applicable to the present application.
  • FIG. 4 is a flow chart of a voice detection method provided in an embodiment of the present application.
  • FIG. 5 is a flow chart of another voice detection method provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a VAD detection process provided in an embodiment of the present application.
  • FIG. 7 is a schematic flow chart of wind noise detection provided in an embodiment of the present application.
  • FIG. 8 is an example of VAD detection provided by an embodiment of the present application.
  • FIG. 9 is an example of data for wind noise detection provided by an embodiment of the present application.
  • FIG. 10 is an example of wind noise detection provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a related interface provided in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a hardware system of an electronic device applicable to the present application.
  • FIG. 13 is a schematic diagram of a software system of an electronic device applicable to the present application.
  • FIG. 14 is a schematic diagram of the structure of a speech detection device provided by the present application.
  • FIG. 15 is a schematic diagram of the structure of an electronic device provided by the present application.
  • Noise, generally speaking, refers to the sound produced by other sound sources in the background of the target sound source.
  • Noise reduction refers to the process of reducing noise in audio data.
  • Wind noise is the sound produced by air turbulence near the microphone, including the sound produced by air turbulence caused by wind. It should be understood that the source of wind noise is near the microphone.
  • Speech recognition refers to the technology by which an electronic device processes collected speech signals according to a pre-configured speech recognition algorithm to obtain a recognition result that represents the meaning of the speech signal.
  • Framing serves subsequent batch processing: the audio data is segmented according to a specified length (a time period or a number of samples) so that the entire stream is organized into a regular data structure. It should be understood that the signal after framing is still a time domain signal.
  • Time-frequency transformation that is, converting audio data from the time domain (the relationship between time and amplitude) to the frequency domain (the relationship between frequency and amplitude).
  • the time-frequency transformation can be performed using methods such as Fourier transform and fast Fourier transform.
  • Fourier transform is a linear integral transform used to represent the transformation of signals between the time domain (or spatial domain) and the frequency domain.
  • FFT: fast Fourier transform, an efficient algorithm for computing the Fourier transform.
  • Voice activity detection (VAD) is a technology used in speech processing to detect whether a speech signal exists.
  • the voice detection method provided in the embodiment of the present application can be applied to various electronic devices.
  • The electronic device can be a mobile phone, a smart screen, a tablet computer, a wearable electronic device, an in-vehicle electronic device, an augmented reality (AR) device, a virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a projector, a smart dictionary pen, a smart voice recorder, a smart translator, a smart speaker, headphones, hearing aids, conference phone equipment, or any other device that includes at least two microphones.
  • the embodiments of the present application do not impose any restrictions on the specific type of electronic device.
  • FIG. 1 shows a schematic diagram of the layout of the microphones provided on a mobile phone.
  • the electronic device 10 has two microphones (MIC).
  • A microphone, also called a "mic" or "sound pickup device", is used to convert a sound signal into an electrical signal.
  • the electronic device can receive a sound signal based on multiple microphones and convert the sound signal into an electrical signal that can be subsequently processed.
  • the electronic device 10 includes two microphones, one is arranged at the bottom of the mobile phone, and the other is arranged at the top of the mobile phone.
  • The microphone arranged at the bottom of the mobile phone is close to the user's mouth and can also be called the main microphone; the other can be called the auxiliary microphone.
  • the main microphone can also be called the bottom microphone, and the auxiliary microphone can also be called the top microphone.
  • the voice detection method provided in the present application performed by the electronic device can also be called a dual-microphone voice detection method.
  • FIG. 1 is only an example of a microphone layout.
  • the locations of the two microphones can also be adjusted as needed.
  • one microphone can be arranged at the bottom of the mobile phone and the other at the back of the mobile phone.
  • the electronic device 10 may also include three or more microphones, and the present embodiment of the application does not impose any restrictions on this.
  • When the electronic device is a mobile phone with two foldable display screens, it may be provided with a bottom microphone and a top microphone on one display screen and a bottom microphone on the other; or a bottom microphone and a top microphone on each display screen; or multiple bottom microphones and multiple top microphones on each display screen. This can be set and adjusted as needed, and the embodiments of the present application impose no restrictions on it.
  • FIG. 2 and FIG. 3 are schematic diagrams of two application scenarios provided by an embodiment of the present application.
  • the microphone In addition to receiving the voice generated by the user, the microphone generally also receives other sounds in the surrounding environment, such as the sound of car horns, the sound of metal hitting, the sound of footsteps on the ground when walking, etc.
  • the related technologies usually process the audio data received by electronic devices, including noise reduction, speech recognition using trained neural network models, etc.
  • During noise reduction, part of the voice content may also be removed, resulting in subsequent voice distortion.
  • The samples used to train a neural network model are usually limited and its learning is therefore incomplete, so the trained neural network model may be unable to recognize speech accurately when used.
  • The cost of deploying a neural network model on electronic equipment is also relatively high.
  • an embodiment of the present application provides a voice detection method.
  • The electronic device can first perform preprocessing such as framing on the multi-channel signals to be tested received by the multiple microphones, and then perform VAD detection to distinguish the voice signals from other signals; then, wind noise detection is performed on the screened voice signals, which is equivalent to screening the voice signals again, so that real voice signals can be distinguished from wind noise signals mistakenly judged as voice signals; the voice signal that passes wind noise detection is the final detection result.
  • In this way, the signals to be tested collected by the multiple microphones are combined, and after the two-stage detection of VAD and wind noise, real voice signals, wind noise signals and other signals can be distinguished.
  • Such a simple detection method involves no hardware changes, so it not only avoids any impact on voice quality but also improves detection accuracy.
  • The other signals referred to in this application are signals other than speech signals and wind noise signals.
  • FIG4 is a flow chart of a voice detection method provided by an embodiment of the present application.
  • the voice detection method 100 can be performed by the electronic device 10 shown in FIG1 , and the two microphones are used to collect sounds in the same environment.
  • the voice detection method includes the following S110 to S150, and S110 to S150 are described in detail below.
  • the microphones are used to collect sounds in the same environment, which may mean that when a user makes a call with a mobile phone outdoors, both microphones on the mobile phone collect the user's call voice, wind noise, and other sounds in the surrounding environment.
  • the microphone is used to collect sounds in the same environment, which may mean that when multiple users hold a meeting indoors using a conference phone device, multiple microphones on the conference phone device collect the voices, wind noise, and other sounds in the indoor environment of the multiple users.
  • the signal stream to be tested refers to a signal sequence including speech, wind noise and other sounds and having a certain time sequence.
  • One microphone is used to obtain one signal stream to be tested, so two microphones can obtain two signal streams to be tested; for example, the first microphone obtains the first signal stream to be tested, and the second microphone obtains the second signal stream to be tested.
  • the multiple signal streams to be tested should have the same start time and end time.
  • One signal stream can also be understood as one channel.
  • the electronic device in response to a user's operation, the electronic device enables a voice call application; during the process of running the voice call application to make a voice call, the electronic device can obtain audio data such as the user's call content.
  • the electronic device in response to the user's operation, the electronic device enables a recording application; during the process of running the recording application to record, the electronic device can obtain audio data such as the user's singing voice.
  • the electronic device in response to the user's operation, the electronic device enables a voice assistant application; in the process of running the voice assistant application for human-computer interaction, the electronic device obtains audio data such as the user's keyword commands.
  • the audio data may also be audio data such as other people's voices received by the electronic device when the electronic device is running a third-party application (such as WeChat).
  • S120 Preprocess multiple signal streams to be tested.
  • the preprocessing includes at least framing and time-frequency conversion, and in the execution order, framing comes first and time-frequency conversion comes later.
  • the preprocessing may also include other steps, and the embodiment of the present application does not impose any limitation on this.
  • the frame length may be 20 ms.
  • the first signal stream to be tested obtained from the first microphone can be framed and divided into multiple frames of first time domain signals, and time-frequency transformation is performed on the multiple frames of first time domain signals to obtain multiple frames of first frequency domain signals.
  • the first time domain signal is in the time domain
  • the first frequency domain signal is in the frequency domain.
  • the first time domain signal and the first frequency domain signal have a one-to-one correspondence.
  • the second signal stream to be tested obtained by the second microphone can be framed and divided into multiple frames of second time domain signals, and time-frequency transformation is performed on the multiple frames of second time domain signals to obtain multiple frames of second frequency domain signals.
  • the second time domain signal is in the time domain
  • the second frequency domain signal is in the frequency domain.
  • the second time domain signal and the second frequency domain signal have a one-to-one correspondence.
  • The number of frames of first time domain signals obtained is the same as the number of frames of second time domain signals, and the frames correspond one-to-one in order. Therefore, after time-frequency conversion is performed on the framed first and second time domain signals, the numbers of frames of first frequency domain signals and second frequency domain signals obtained are also the same, and they likewise correspond one-to-one in order.
  • multiple frames of first time domain signals and multiple frames of second time domain signals generated by frame division can all be stored in order to improve the efficiency of subsequent processing.
  • the VAD detection is used to detect whether the signal stream to be tested includes a speech signal, and the first detection result includes multiple frames of speech signals and/or other signals.
  • The VAD detection may be performed repeatedly several times, and the results of the multiple detections may be combined to distinguish the speech signal from other signals, with the combined result serving as the first detection result.
  • two VAD tests can be performed on a signal stream to be tested after preprocessing, and the signal determined as a speech signal twice and the signal determined as a speech signal once and determined as other signals another time are both regarded as speech signals in the first detection result; and the signal determined as other signals twice are regarded as other signals in the first detection result.
  • a signal determined as a speech signal twice may be regarded as a speech signal in the first detection result, while a signal determined as a speech signal once and as other signals another time, and a signal determined as other signals twice may be regarded as other signals in the first detection result.
  • VAD detection can be performed in real time for both pre-processed signal streams to be tested.
  • One of the signal streams to be tested is used as the main detection signal stream, and the other signal stream to be tested is used as the auxiliary detection signal stream.
  • the detection result of the auxiliary detection signal stream can be used to assist the detection result in the main detection signal stream. For example, when the signals to be tested with the same order in the two signal streams are both voice signals, it is determined that the signal in the main detection stream is a voice signal.
  • wind noise detection is performed on the voice signal in the first detection result to obtain a second detection result.
  • Wind noise detection is used to distinguish between speech signals and wind noise signals, and the second detection result includes multiple frames of speech signals and/or wind noise signals.
  • After VAD detection, it can be determined whether the signals to be tested include voice signals, so voice signals can be distinguished from other signals. However, because the characteristics of wind noise signals and voice signals are similar, the first stage of VAD detection alone cannot distinguish wind noise from voice very accurately, and a wind noise signal may be mistaken for a voice signal. That is, after VAD detection, the voice signal in the first detection result is only a suspected voice signal and may include a wind noise signal. Continuing with wind noise detection further distinguishes real voice signals from false voice signals (i.e., wind noise signals). Therefore, performing VAD detection followed by wind noise detection greatly improves detection accuracy. Moreover, because the VAD detection and wind noise detection provided by the present application do not affect the quality of the signal itself, there is no loss of quality in the signal to be tested.
  • step S140 may not be performed.
  • The wind noise detection may also be performed repeatedly several times, and the results of the multiple detections may be combined to distinguish the speech signal from the wind noise signal in the second detection result.
  • three wind noise detections are performed on the speech signal in the first detection result, and signals that are determined to be speech signals at any two of the three times are used as the speech signal in the second detection result.
  • the number of times VAD detection and wind noise detection are performed may be different, and the specific number of repetitions may be set and modified as needed, and the embodiment of the present application does not impose any limitation on this.
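  • As an illustration of the repeated-detection combination described above (for example, 2-out-of-3 voting over three wind noise detections), a minimal sketch follows; the data layout is an assumption made for illustration.

```python
def merge_detections(results):
    """results: list of per-pass boolean lists, True = frame judged as speech.
    A frame counts as speech if at least 2 of the 3 passes agree."""
    n_frames = len(results[0])
    merged = []
    for f in range(n_frames):
        votes = sum(passes[f] for passes in results)
        merged.append(votes >= 2)
    return merged

# Example: three wind noise detection passes over 4 frames.
passes = [[True, True, False, False],
          [True, False, False, True],
          [False, True, False, True]]
print(merge_detections(passes))  # [True, True, False, True]
```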
  • VAD detection and wind noise detection may be repeated on multiple frames of test signals within a next period of time, and so on.
  • VAD detection and wind noise detection may be repeatedly performed on the next frame of the test signal, and so on.
  • VAD detection and wind noise detection may be performed on a frame of the signal to be tested. While the wind noise detection is performed on the frame of the signal to be tested, VAD detection may be performed on the next frame of the signal to be tested.
  • this method has a relatively fast response speed and processing speed, and can detect voice signals, wind noise signals and other signals in the signal in real time while collecting.
  • the embodiment of the present application provides a voice detection method.
  • the electronic device can first perform pre-processing such as framing and time-frequency conversion on the multi-channel test signals received by the multiple microphones, and then perform VAD detection to distinguish the voice signals and other signals therein; then, the screened voice signals are subjected to wind noise detection, so that the voice signals can be screened again to distinguish the real voice signals from the wind noise signals mistakenly judged as voice signals.
  • the accuracy of the detection can be greatly improved, and the real voice signals, wind noise signals and other signals can be distinguished.
  • the method is simple, which can avoid the impact on the voice quality and improve the accuracy of the detection.
  • the speech detection method provided in this application since the speech detection method provided in this application only involves methods and does not involve hardware improvements, and does not require the addition of complex acoustic structures, compared with related technologies, the speech detection method provided in this application is more friendly to small electronic devices and has stronger applicability.
  • VAD detection may be performed on a first signal stream to be tested among the preprocessed multiple signal streams to be tested to obtain a first detection result, and VAD detection is not performed on the other multiple signal streams to be tested after preprocessing.
  • the voice signal in the first detection result is combined with the test signals of other test signal streams preprocessed in the corresponding order to perform wind noise detection to determine whether the voice signal in the first detection result remains as a voice signal or is changed into a wind noise signal.
  • the first channel of the signal to be tested is equivalent to the main signal to be detected, and the other channels of the signal to be tested are used to assist in detecting the voice signal in the first channel of the signal to be tested.
  • FIG. 5 shows a schematic diagram of another speech detection process provided by an embodiment of the present application.
  • the speech detection method may include the following S210 to S250, and steps S210 to S250 are described below respectively.
  • S210 Obtain a first signal stream to be tested and a second signal stream to be tested.
  • The first signal stream to be tested and the second signal stream to be tested are both audio data.
  • The method of the present application processes audio data within a period of time.
  • For example, the duration of the first signal stream to be tested and the second signal stream to be tested is 600 ms.
  • S220 preprocessing the first signal stream to be tested and the second signal stream to be tested to obtain a plurality of frames of first time domain signals and a plurality of frames of first frequency domain signals corresponding to the first signal stream to be tested, and a plurality of frames of second time domain signals and a plurality of frames of second frequency domain signals corresponding to the second signal stream to be tested.
  • the preprocessing includes framing and time-frequency conversion.
  • the above S220 may include:
  • S221 frame the first signal to be tested to obtain multiple frames of first time domain signals; frame the second signal to be tested to obtain multiple frames of second time domain signals.
  • a first channel of the signal to be tested of 600 ms is framed to obtain 30 frames of the first time domain signal; and a second channel of the signal to be tested of 600 ms is framed to obtain 30 frames of the second time domain signal.
  • the multiple frames of first time domain signals and the multiple frames of second time domain signals are both time domain signals.
  • time-frequency transformation is performed on 30 frames of first time domain signals to obtain 30 frames of first frequency domain signals
  • time-frequency transformation is performed on 30 frames of second time domain signals to obtain 30 frames of second frequency domain signals.
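  • A minimal sketch of this preprocessing, assuming a 16 kHz sample rate, non-overlapping 20 ms frames, and a plain FFT per frame (the sample rate and the absence of windowing/overlap are assumptions made for illustration):

```python
import numpy as np

FS = 16000                    # assumed sample rate (Hz)
FRAME_LEN = int(0.020 * FS)   # 20 ms -> 320 samples per frame

def preprocess(stream):
    """Split a 1-D signal into 20 ms time-domain frames and FFT each frame."""
    n_frames = len(stream) // FRAME_LEN              # 600 ms -> 30 frames
    frames = stream[:n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    spectra = np.fft.rfft(frames, axis=1)            # one frequency-domain frame per row
    return frames, spectra

stream1 = np.random.randn(int(0.6 * FS))             # 600 ms of audio from mic 1
time_frames, freq_frames = preprocess(stream1)
print(time_frames.shape, freq_frames.shape)          # (30, 320) (30, 161)
```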
  • S230 Perform VAD detection on the preprocessed first signal stream to be tested.
  • the above S230 can also be expressed as: performing VAD detection in combination with the multi-frame first time domain signal and the multi-frame first frequency domain signal corresponding to the first signal stream to be tested, wherein the multi-frame first time domain signal and the multi-frame first frequency domain signal have a one-to-one correspondence.
  • VAD detection is not performed on the second signal stream to be tested after preprocessing.
  • the above S230 may include:
  • The zero-crossing rate refers to the proportion of points at which the signal crosses zero (changing from positive to negative or from negative to positive) within each frame of the first time domain signal. Generally speaking, the zero-crossing rate of noise or other sounds is relatively small, while the zero-crossing rate of speech signals is relatively large.
  • the value of the zero-crossing rate of the first time domain signal can be determined by the following formula (1).
  • Formula (1) is:
  • $ZCR = \frac{1}{T-1}\sum_{t=1}^{T-1}\Delta(t)$, where $\Delta(t)=1$ if $S(t)\cdot S(t+1)<0$ and $\Delta(t)=0$ otherwise.
  • t is the time point in the frame, T is the length of each frame, and S(t) represents the amplitude of the signal at time t (S can be positive or negative). If the amplitudes of two adjacent time points are both positive or both negative, then Δ is 0; if one is positive and the other is negative, then Δ is 1.
  • The Δ values of the T-1 pairs of adjacent points in the frame are counted and summed, then divided by T-1; this is the proportion of zero-crossing points in a frame, referred to as the zero-crossing rate.
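  • A short sketch of formula (1) in code, assuming the frame is a NumPy array of amplitudes:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Proportion of adjacent sample pairs whose amplitudes change sign."""
    signs = np.sign(frame)
    crossings = signs[:-1] * signs[1:] < 0   # True where one point is positive, the other negative
    return crossings.sum() / (len(frame) - 1)

print(zero_crossing_rate(np.array([1.0, -0.5, 0.3, 0.7, -0.2])))  # 3 crossings / 4 pairs = 0.75
```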
  • Spectral entropy describes the relationship between the power spectrum and the entropy rate. In this application, it describes how dispersed the signal is: if the signal is noise, its spectrum is relatively dispersed, corresponding to a higher spectral entropy; if the signal is speech, its energy is relatively concentrated, corresponding to a lower spectral entropy. Flatness describes how flat the spectrum of the signal is: the flatness of noise is relatively large, and the flatness of a speech signal is relatively small.
  • The value of the spectral entropy of the first time domain signal can be determined by the following set of formulas (2).
  • Set of formulas (2) is:
  • $P(i,m) = \frac{X_{power}(i,m)}{\sum_{k=1}^{N/2+1} X_{power}(k,m)}$ and $H(m) = -\sum_{i=1}^{N/2+1} P(i,m)\log P(i,m)$
  • r(n) represents the short-time autocorrelation function of each frame signal, L is the window length, and N is the FFT transform length.
  • X(k, m) represents the power spectrum amplitude of the kth frequency point of the mth frame; X(k, m) is symmetric about N/2+1, so X_power(k, m) is equal to X(k, m), where X_power(k, m) represents the power spectrum energy.
  • P(i, m) represents the proportion that the power spectrum energy of each frequency component contributes to the power spectrum energy of the entire frame.
  • The power spectrum entropy corresponding to each frame is denoted H(m).
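  • A sketch of the spectral entropy of formulas (2); here the power spectrum is taken directly as |FFT|² of the frame, which is a simplification of the autocorrelation-based description above:

```python
import numpy as np

def spectral_entropy(frame):
    """Entropy of the normalized per-bin power spectrum of one frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2           # X_power(k): power at each bin
    p = power / (power.sum() + 1e-12)                 # P(i): per-bin probability
    p = p[p > 0]
    return -(p * np.log(p)).sum()                     # H = -sum P log P

rng = np.random.default_rng(0)
noise = rng.standard_normal(320)                      # dispersed spectrum -> higher entropy
tone = np.sin(2 * np.pi * 50 * np.arange(320) / 320)  # concentrated spectrum -> lower entropy
print(spectral_entropy(noise) > spectral_entropy(tone))  # True
```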
  • The flatness value of the first time domain signal can be determined by the following formula (3).
  • Formula (3) is:
  • $Flatness = \frac{\exp\left(\frac{1}{N}\sum_{L=1}^{N}\ln Y(L)\right)}{\frac{1}{N}\sum_{L=1}^{N} Y(L)}$
  • L indexes the Lth frequency point after FFT transformation, N is the number of frequency points after FFT transformation, and Y(L) is the energy of the Lth frequency point, calculated in the same way as X_power(k); exp(x) is e raised to the power of x.
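  • A matching sketch of the flatness of formula (3), under the geometric-mean over arithmetic-mean reading reconstructed above:

```python
import numpy as np

def spectral_flatness(frame):
    """Geometric mean over arithmetic mean of the per-bin energy Y(L)."""
    energy = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    geometric = np.exp(np.mean(np.log(energy)))   # exp of the mean log-energy
    arithmetic = np.mean(energy)
    return geometric / arithmetic                 # near 1 for noise, near 0 for tonal speech
```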
  • S233 Determine whether the first time domain signal of each frame is a speech signal or other signal by at least combining the zero-crossing rate, spectral entropy and flatness values corresponding to the first time domain signal of the frame.
  • S234 Filter out the first time domain signal that is determined to be a speech signal.
  • If the first time domain signal is determined to be a speech signal, the first time domain signal can be extracted; at the same time, the first frequency domain signal corresponding to it after time-frequency transformation can also be extracted to facilitate subsequent detection.
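  • Tying the three features together, a hedged sketch of the per-frame first-condition check of S233; the threshold values are placeholders, since the application leaves them to be set and modified as needed:

```python
def tentatively_speech(zcr, entropy, flatness,
                       zcr_th=0.1, entropy_th=4.0, flatness_th=0.5):
    """First condition: ZCR above its threshold, entropy and flatness below theirs.
    The default threshold values are illustrative placeholders only."""
    return zcr > zcr_th and entropy < entropy_th and flatness < flatness_th
```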
  • the above S240 can also be expressed as: combining the multi-frame second frequency domain signal corresponding to the second signal stream to be tested, performing wind noise detection on the first time domain signal determined as a voice signal from the preprocessed first signal stream to be tested.
  • the first frequency domain signal corresponding to the first time domain signal determined as a voice signal in VAD detection can be used as the object to be detected.
  • the above S240 may include:
  • The spectral centroid describes the position of the center of gravity of the signal's spectrum.
  • The spectral centroid of a wind noise signal is low, while the spectral centroid of a speech signal is high.
  • Low-frequency energy describes the amount of low-frequency energy in the signal.
  • The low-frequency energy of a wind noise signal is high, while the low-frequency energy of a speech signal is low.
  • The value of the spectral centroid of the first time domain signal can be determined by the following formula (4).
  • Formula (4) is:
  • $r = \frac{\sum_{i} i \cdot fndata(i)}{\sum_{i} fndata(i)}$
  • r is the spectral centroid, i is the coordinate value of each point on the spectrum, and fndata(i) is the amplitude of each point on the spectrum.
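  • A sketch of formula (4) in code:

```python
import numpy as np

def spectral_centroid(spectrum_amplitudes):
    """Amplitude-weighted mean bin index: sum(i * a(i)) / sum(a(i))."""
    a = np.asarray(spectrum_amplitudes)
    i = np.arange(len(a))
    return (i * a).sum() / (a.sum() + 1e-12)
```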
  • the value of the low-frequency energy of the first time domain signal can be determined by the following formula (5).
  • Formula (5) is:
  • $E = \sum_{f \in \text{low band}} |X(f)|^{2}$
  • E is the low-frequency energy, X(f) is the FFT result corresponding to frequency f, and the energy is calculated by taking the absolute value and squaring it; the summation runs over the low-frequency band.
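  • A sketch of formula (5), assuming (for illustration only) a low-frequency band up to 300 Hz at a 16 kHz sample rate:

```python
import numpy as np

def low_frequency_energy(frame, fs=16000, f_max=300.0):
    """Sum of |X(f)|^2 over bins below f_max (the band edge is an assumed value)."""
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return (np.abs(spectrum[freqs <= f_max]) ** 2).sum()
```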
  • correlation is used to describe the similarity between two signals.
  • the correlation of wind noise is relatively low, while the correlation of speech signals is relatively high.
  • the value of the correlation of the first time domain signal can be determined by the following formula (6).
  • Formula (6) is:
  • $r(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{D(X)}\sqrt{D(Y)}}$
  • X is the first frequency domain signal, Y is the second frequency domain signal, r(X, Y) is the correlation between the two, Cov(X, Y) is the covariance of X and Y, and D(X) and D(Y) are the variances of X and Y respectively.
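  • A sketch of formula (6), computed here on spectral magnitudes so the result is real-valued (an assumption, since the application does not specify this):

```python
import numpy as np

def spectral_correlation(spec1, spec2):
    """Pearson correlation r(X, Y) = Cov(X, Y) / sqrt(D(X) D(Y))."""
    x = np.abs(spec1)
    y = np.abs(spec2)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (np.sqrt(x.var() * y.var()) + 1e-12)
```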
  • S243 Determine whether the first time domain signal of each frame is a speech signal or a wind noise signal by at least combining the correlation, the spectral center of gravity, and the low-frequency energy value corresponding to the first time domain signal of the frame.
  • In some embodiments, the spectral centroid and the low-frequency energy alone can also be used to distinguish whether the first time domain signal is a speech signal or a wind noise signal.
  • the relevant data can be set and modified as needed, and this application does not impose any restrictions on this.
  • S244 Filter out the first time domain signal that is again determined to be a speech signal.
  • If the first time domain signal is again determined to be a speech signal, the first time domain signal may be extracted as the final detected speech signal.
  • For each frame, the detection result indicates whether the first time domain signal of that frame is determined to be a speech signal, another signal, or a wind noise signal.
  • The detection result includes, for each frame of the multiple frames of first time domain signals, information on whether it is a speech signal, another signal, or a wind noise signal, together with the extracted signals determined to be speech signals.
  • the first signal stream to be tested is the signal obtained by the mobile phone using the bottom microphone
  • the second signal stream to be tested is the signal obtained by the mobile phone using the top microphone.
  • the signal to be tested received by the bottom microphone is equivalent to the main signal to be detected
  • the signal to be tested received by the top microphone is used to assist in detecting the voice signal in the signal to be tested received by the bottom microphone.
  • In this way, each frame of the signal received by the bottom microphone can be determined to be a voice signal, a wind noise signal or another signal, and the voice signal can be extracted at the same time.
  • When the user uses an electronic device including two microphones to make a voice call or perform a voice operation, the electronic device can first perform preprocessing such as framing and time-frequency conversion on the two signal streams to be tested received by the two microphones. It then determines the zero-crossing rate, spectral entropy and flatness by combining the multiple frames of first time domain signals and first frequency domain signals generated when preprocessing the first signal stream to be tested; determines whether each first time domain signal is a voice signal or another signal based on the zero-crossing rate, spectral entropy and flatness; and screens out the first time domain signals determined to be voice signals together with their corresponding first frequency domain signals. Next, for each screened voice signal, the correlation, spectral centroid and low-frequency energy are determined from its first frequency domain signal and the same-order second frequency domain signal obtained by preprocessing the second signal stream to be tested; these values are then used to determine whether the voice signal found in the VAD detection stage is a true voice signal or a wind noise signal.
  • the real voice signal, wind noise signal and other signals can be distinguished.
  • the method is simple, which can avoid the impact on the voice quality and improve the accuracy of detection.
  • FIG. 6 shows a flow chart, provided by an embodiment of the present application, of determining whether each frame of the first time domain signal is a speech signal or another signal by combining the zero-crossing rate, spectral entropy and flatness values corresponding to that frame (i.e., S233).
  • the determination method 300 may include the following S301 to S310.
  • the multi-frame first time domain signal may include, in addition to the signal data itself, three frame number flags (i, j and k) and two signal flags (int, SF) corresponding to each frame of the first time domain signal.
  • the signal flag int is used to indicate the tentative state of the first time domain signal; when int is equal to 1, it indicates that the first time domain signal of the frame is tentatively a speech signal; when int is equal to 0, it indicates that the first time domain signal of the frame is tentatively other signals; when int is equal to -1, it indicates that the first time domain signal of the frame is tentatively a wind noise signal.
  • the signal flag SF is used to indicate the current state of the first time domain signal; when SF is equal to 1, it indicates that the first time domain signal of the frame is currently determined to be a speech signal; when SF is equal to 0, it indicates that the first time domain signal of the frame is currently determined to be other signals; when SF is equal to -1, it indicates that the first time domain signal of the frame is currently determined to be a wind noise signal.
  • The first frame number flag i is used to indicate the accumulated number of frames whose tentative state is a voice signal. For example, i equal to 1 indicates that the cumulative number of frames whose tentative state is a voice signal is 1.
  • The second frame number flag j is used to indicate the accumulated number of frames whose tentative state is other signals. For example, j equal to 2 indicates that the cumulative number of frames whose tentative state is other signals is 2.
  • the third frame number flag k is used to indicate the number of accumulated frames corresponding to the provisional state of the wind noise signal. For example, k equals 3, indicating that the cumulative number of signals in the provisional state of the wind noise signal is 3 frames.
  • Performing the first initialization process is equivalent to resetting the three frame number flags and the two signal flags corresponding to each first time domain signal to zero, so that they are all 0 and cannot cause interference.
  • S302 Determine whether the spectral entropy, flatness and zero-crossing rate corresponding to the first time domain signal meet the first condition.
  • the first condition includes: the zero-crossing rate is greater than a zero-crossing rate threshold, the spectral entropy is less than a spectral entropy threshold, and the flatness is less than a flatness threshold.
  • The above S302 can also be expressed as: determining whether the zero-crossing rate corresponding to the first time domain signal is greater than the zero-crossing rate threshold, whether the spectral entropy determined from the first frequency domain signal converted from the first time domain signal is less than the spectral entropy threshold, and whether the flatness is less than the flatness threshold.
  • the zero-crossing rate threshold, the spectral entropy threshold, and the flatness threshold can be set and modified as needed, and the embodiments of the present application do not impose any limitations on this.
  • each frame of the first time domain signal is set with a tentative state and a current state.
  • the tentative state and the current state can be divided into three states: speech signal, wind noise signal and other signals.
  • When the zero-crossing rate corresponding to the first time domain signal is greater than the zero-crossing rate threshold, the spectral entropy determined from the converted first frequency domain signal is less than the spectral entropy threshold, and the flatness is also less than the flatness threshold, the first time domain signal can be considered to meet the characteristics of a speech signal, and its tentative state can be determined to be a speech signal; the signal flag int used to indicate the tentative state is then equal to 1, that is, X is equal to 1.
  • Otherwise, the first time domain signal does not meet the characteristics of a speech signal, and its tentative state can be determined to be another signal.
  • the signal flag int corresponding to the first time domain signal representing the tentative state is equal to 0, that is, Y is equal to 0.
  • S305: After the tentative state corresponding to the first time domain signal is determined, regardless of whether that tentative state is a speech signal or another signal, determine whether the tentative state is the same as the corresponding current state.
  • the signal flag bit used to indicate the current state is SF. Therefore, whether the provisional state determined by the first time domain signal is the same as the corresponding current state can be determined by comparing the value of the signal flag bit int with the value of the signal flag bit SF.
  • the current state is modified, that is, the corresponding current state is modified from a voice signal to other signals, or from other signals to a voice signal.
  • the number of frames can be accumulated.
  • the number of frames accumulated is greater than the frame number threshold, the corresponding current state is modified, which is equivalent to relying on the continuity between the multiple frames of the signal to be tested before the first time domain signal of the frame determined by the algorithm to predict and determine the state corresponding to the first time domain signal of the frame.
  • For example, suppose the tentative state of the first time domain signal of the 6th frame is a speech signal while its current state is other signals, and the accumulated number of frames whose tentative state is a speech signal has reached 6, meaning the previous 5 frames were also tentatively judged to be speech signals and the 6th frame is still a speech signal. If this count exceeds the first preset frame number threshold, the original current state is no longer trusted, and the current state is changed from other signals to a speech signal.
  • first preset frame number threshold and the second preset frame number threshold can be set and modified as needed, and the embodiment of the present application does not impose any limitation on this.
  • If the tentative state is different from the current state but the corresponding accumulated number of frames does not exceed the preset frame number threshold, the number of first time domain signals in that tentative state can be considered too small to matter, so no modification is required and the current state continues to be maintained as a voice signal or another signal.
  • If the tentative state differs from the current state and the threshold was exceeded, the current state here refers to the modified current state; if the tentative state is the same as the current state, the current state here refers to the original current state. A sketch of this state machine is given below.
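  • A hedged sketch of the S301 to S310 state machine using the flags named above (int, SF, i, j); the threshold values and the exact counter-reset behavior are assumptions where the text leaves them open:

```python
# Hypothetical sketch of the S301-S310 hysteresis; thresholds are placeholders.
SPEECH, OTHER = 1, 0            # values used for the signal flags int and SF

FIRST_FRAME_THRESHOLD = 5       # first preset frame number threshold (assumed)
SECOND_FRAME_THRESHOLD = 5      # second preset frame number threshold (assumed)

def vad_update(tentative, state):
    """One per-frame step. state holds the current-state flag SF and counters i, j."""
    if tentative == state["SF"]:        # S305: tentative state matches current state
        state["i"] = state["j"] = 0     # assumed: counters reset on agreement
    elif tentative == SPEECH:           # disagreement, tentatively speech
        state["i"] += 1                 # first frame number flag i
        if state["i"] > FIRST_FRAME_THRESHOLD:
            state["SF"], state["i"] = SPEECH, 0   # enough consecutive evidence: flip
    else:                               # disagreement, tentatively other signal
        state["j"] += 1                 # second frame number flag j
        if state["j"] > SECOND_FRAME_THRESHOLD:
            state["SF"], state["j"] = OTHER, 0
    return state

state = {"SF": OTHER, "i": 0, "j": 0}
for _ in range(6):                      # six frames tentatively judged to be speech
    state = vad_update(SPEECH, state)
print(state["SF"])                      # 1: current state flipped to speech
```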
  • FIG. 7 shows a flow chart, provided by an embodiment of the present application, of determining whether each frame of the first time domain signal is a speech signal or a wind noise signal by combining the correlation, spectral centroid and low-frequency energy values corresponding to that frame (i.e., S242).
  • the determination method 400 may include the following S401 to S410.
  • Since the signal flag SF used to indicate the current state has already been determined to indicate a voice signal (equal to 1) in the method shown in FIG6 , the signal flag SF may be left unprocessed, and the second frame number flag j corresponding to a provisional state of other signals may also be left unprocessed; only the signal flag int, the first frame number flag i used to indicate that the provisional state corresponds to a voice signal, and the third frame number flag k used to indicate that the provisional state corresponds to a wind noise signal are reset to zero, so that they are all 0.
  • As for the third frame number flag k: since it was reset to zero during the first initialization in the VAD detection stage and has not been used since, it does not need to be reset to zero again here during wind noise detection. If it was not reset to zero during the first initialization, it can be reset to zero before wind noise detection to avoid calculation errors.
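To make this selective second initialization concrete, the following minimal Python sketch (the flags dictionary and the function name are illustrative assumptions, not part of the present application) resets only int, i and k while leaving SF and j untouched:

```python
# Hypothetical flag layout carried over from the VAD stage; names are illustrative.
flags = {"int": 1, "SF": 1, "i": 4, "j": 2, "k": 0}

def second_initialization(flags):
    """Reset only the provisional-state flag int and the voice/wind-noise
    frame counters i and k; keep the current state SF (already 'voice')
    and the other-signal counter j from the VAD stage untouched."""
    flags["int"] = 0
    flags["i"] = 0
    flags["k"] = 0
    return flags

second_initialization(flags)  # -> {'int': 0, 'SF': 1, 'i': 0, 'j': 2, 'k': 0}
```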
  • S402 Determine whether the correlation, spectrum center of gravity and low-frequency energy corresponding to the first time domain signal meet the second condition.
  • the second condition includes: the correlation is less than a correlation threshold, the spectrum center of gravity is less than a spectrum center of gravity threshold, and the low-frequency energy is greater than a low-frequency energy threshold.
  • The above S402 can also be expressed as: in combination with the first frequency domain signal corresponding to the time-frequency transformation of the first time domain signal, and the second frequency domain signal of the same order determined from the multiple frames of second frequency domain signals included in the preprocessed second signal stream to be tested, determine the correlation, spectral center of gravity and low-frequency energy of these two signals, the first frequency domain signal and the second frequency domain signal, as the values of the correlation, spectral center of gravity and low-frequency energy corresponding to the first time domain signal.
  • the correlation threshold, the spectrum center of gravity threshold and the low-frequency energy threshold can be set and modified as needed, and the embodiments of the present application do not impose any limitations on this.
  • When the correlation is less than the correlation threshold, the spectral center of gravity is less than the spectral center of gravity threshold, and the low-frequency energy is greater than the low-frequency energy threshold, the first time domain signal matches the characteristics of a wind noise signal, and its tentative state is determined to be a wind noise signal.
  • When any one of the correlation, spectral center of gravity and low-frequency energy does not meet its corresponding condition, the first time domain signal does not meet the characteristics of a wind noise signal; the tentative state of the first time domain signal can then be determined to be a speech signal, and the signal flag int of the first time domain signal is set to 1, that is, X is equal to 1.
  • S405 After the provisional state corresponding to the first time domain signal is determined, regardless of whether that provisional state is a speech signal or a wind noise signal, determine whether the provisional state determined for the first time domain signal is the same as its corresponding current state.
  • the signal flag bit used to indicate the current state is SF. Therefore, whether the provisional state determined by the first time domain signal is the same as the corresponding current state can be determined by comparing the value of the signal flag bit int with the value of the signal flag bit SF.
  • the frame number is accumulated. If the tentative state is a voice signal, the first frame number flag i is accumulated by 1; if the tentative state is a wind noise signal, the third frame number flag k is accumulated by 1.
  • When the accumulated frame count exceeds the preset frame number threshold, the current state is modified, that is, the corresponding current state is changed from a voice signal to a wind noise signal, or from a wind noise signal to a voice signal.
  • When the tentative state is different from the current state, it means that the two judgments are inconsistent. At this time, at least one of the judgments may be wrong, or it may be the interval between words when the user speaks. Therefore, the number of frames can be accumulated. When the accumulated number of frames is less than the frame number threshold, the corresponding current state is not modified: in order to ensure the integrity of the sentence and prevent it from being interrupted in the middle, the anomaly of these few short frames can be ignored and the signal can still be regarded as a voice signal.
  • For example, the tentative state of the first time domain signal of the 7th frame is a wind noise signal, while the current state is a speech signal.
  • After frame counting, the number of frames with a tentative state of a speech signal is 6 frames, and the number of frames with a tentative state of a wind noise signal is 1 frame.
  • The latter number is relatively small, indicating that the first time domain signals of the previous 6 frames are all speech signals.
  • In that case, the first time domain signal of the 7th frame is quite likely still a speech signal; or, even though it may be a wind noise signal, in order to ensure the integrity of the sentence and prevent it from being interrupted in the middle, the current state can continue to be a speech signal without modification.
  • When the accumulated frame count is greater than the frame number threshold, the corresponding current state is modified, which is equivalent to relying on the continuity between the multiple frames of the signal to be tested preceding this frame, as determined by the algorithm, to predict the state corresponding to the first time domain signal of this frame.
  • The third preset frame number threshold and the fourth preset frame number threshold can be set and modified as needed, and the embodiments of the present application impose no limitation on this.
  • Alternatively, the provisional state may be different from the current state while the corresponding cumulative number of frames does not exceed the preset frame number threshold. In that case the number of first time domain signals in the same provisional state is considered too small to matter, so no modification is required, and the current state continues to be maintained as a speech signal or wind noise signal.
  • If the provisional state is different from the current state and the current state was modified, the current state here refers to the modified current state. If the provisional state is the same as the current state, the current state here refers to the current state determined by the VAD detection.
  • Figures 8 to 10 are examples of a voice detection method provided in an embodiment of the present application.
  • VAD detection is performed starting from the first time domain signal of the first frame to determine the zero crossing rate corresponding to the first time domain signal of the first frame, as well as the spectral entropy and flatness corresponding to the first frequency domain signal obtained by time-frequency transformation of that frame, and to determine whether the values of the zero crossing rate, spectral entropy and flatness meet the first condition.
  • VAD detection is then performed on the first time domain signal of the third frame: the zero crossing rate corresponding to that frame is determined by the above method, as well as the spectral entropy and flatness corresponding to its transformed first frequency domain signal, and it is determined whether the values of the zero crossing rate, spectral entropy and flatness meet the first condition.
  • The second VAD detection can be performed in combination with the voice signals detected in the first VAD detection. It should be noted that when the first initialization is performed at the start of the second VAD detection, the signal flag bit of the current state does not need to be reset to zero; the current-state results of the first VAD detection should be retained as the initial current-state data of the second VAD detection.
  • Although the current state of the 5th to 8th frames among the first 9 frames of the first time domain signal is a voice signal, they may include wind noise signals misjudged as voice signals. Therefore, as shown in (b) in FIG9 , the first frequency domain signals corresponding to the first time domain signals of the 5th to 8th frames in the first channel of the signal stream to be tested can be screened out.
  • The wind noise detection is then continued in combination with the first frequency domain signals and the second frequency domain signals to distinguish the real voice signals from the wind noise signals.
  • The current-state signal flag SF involved in the first time domain signals of the 5th to 8th frames determined for the first signal stream to be tested is not processed, and only the signal flag int corresponding to the tentative state is reset to zero; at the same time, the second frame number flag j corresponding to a tentative state of other signals may be left unprocessed, and only the first frame number flag i used to indicate that the tentative state corresponds to a voice signal and the third frame number flag k used to indicate that the tentative state corresponds to a wind noise signal are subjected to the second initialization, so that both are 0.
  • Wind noise detection is performed on the first time domain signal of the sixth frame: the correlation, spectral center of gravity and low-frequency energy values corresponding to that frame are determined from the first frequency domain signal and the second frequency domain signal associated with it, and it is determined whether those values meet the second condition.
  • the "voice detection” function can be set to be turned on in the setting interface of the electronic device. After the application for calls in the electronic device is run, the "voice detection" function can be automatically turned on to execute the voice detection method of the embodiment of the present application.
  • a "voice detection” function may be set to be enabled in a recording application of an electronic device. According to the setting, the "voice detection” function may be enabled when recording audio to execute the voice detection method of the embodiment of the present application.
  • the “voice detection” function may be automatically enabled to execute the voice detection method of the embodiment of the present application.
  • FIG11 is a schematic diagram of an interface of an electronic device provided in an embodiment of the present application.
  • the electronic device displays a lock screen interface 501, as shown in (a) of FIG11 .
  • the smart assistant application is run to automatically execute the voice detection method of the present application; then, keywords can be further determined based on the detection results, and appropriate content can be selected from the text library based on the keywords to broadcast a reply, such as "I'm here"; at the same time, an interface 502 as shown in (b) of FIG11 is displayed.
  • the electronic device When the electronic device receives the user's audio data again, such as "open the map", it can display the interface 503 shown in (c) of Figure 11; at the same time, it automatically executes the voice detection method of the present application, further determines the keywords based on the detection results, and then, in response to the keywords, runs the map application, and loads and displays the home page 504 in the map application as shown in (d) of Figure 11.
  • Fig. 12 shows a hardware system of an electronic device applicable to the present application.
  • the electronic device 600 can be used to implement the voice detection method described in the above method embodiment.
  • the electronic device 600 may include a processor 610, an external memory interface 620, an internal memory 621, a universal serial bus (USB) interface 630, a charging management module 640, a power management module 641, a battery 642, an antenna 1, an antenna 2, a mobile communication module 650, a wireless communication module 660, an audio module 670, a speaker 670A, a receiver 670B, a microphone 670C, an earphone interface 670D, a sensor module 680, a button 690, a motor 691, an indicator 692, a camera 693, a display screen 694, and a subscriber identification module (SIM) card interface 695, etc.
  • the sensor module 680 may include a pressure sensor 680A, a gyroscope sensor 680B, an air pressure sensor 680C, a magnetic sensor 680D, an acceleration sensor 680E, a distance sensor 680F, a proximity light sensor 680G, a fingerprint sensor 680H, a temperature sensor 680J, a touch sensor 680K, an ambient light sensor 680L, a bone conduction sensor 680M, etc.
  • the audio module 670 is used to convert digital audio information into analog audio signal output, and can also be used to convert analog audio input into digital audio signal.
  • the audio module 670 can also be used to encode and decode audio signals.
  • the audio module 670 or some functional modules of the audio module 670 can be arranged in the processor 610.
  • the audio module 670 may send audio data collected by a microphone to the processor 610 .
  • the structure shown in FIG12 does not constitute a specific limitation on the electronic device 600.
  • the electronic device 600 may include more or fewer components than those shown in FIG12, or the electronic device 600 may include a combination of some of the components shown in FIG12, or the electronic device 600 may include sub-components of some of the components shown in FIG12.
  • the components shown in FIG12 may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 610 may include one or more processing units.
  • the processor 610 may include at least one of the following processing units: an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and a neural-network processing unit (NPU).
  • Different processing units may be independent devices or integrated devices.
  • the controller can generate operation control signals according to the instruction operation code and timing signal to complete the control of instruction fetching and execution.
  • the processor 610 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 610 is a cache memory.
  • the memory may store instructions or data that the processor 610 has just used or cyclically used. If the processor 610 needs to use the instruction or data again, it may be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 610, and thus improves the efficiency of the system.
  • the processor 610 may include one or more interfaces.
  • the processor 610 may include at least one of the following interfaces: an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM interface, and a USB interface.
  • the processor 610 can be used to execute the voice detection method of the embodiments of the present application; for example, obtain audio data, where the audio data is data collected by a first microphone and a second microphone in the same environment; perform VAD detection on the audio data to determine and filter out a voice signal; perform wind noise detection on the voice signal detected by VAD to determine and filter out the voice signal.
  • connection relationship between the modules shown in Fig. 12 is only a schematic illustration and does not constitute a limitation on the connection relationship between the modules of the electronic device 600.
  • the modules of the electronic device 600 may also adopt a combination of multiple connection modes in the above embodiments.
  • the wireless communication function of the electronic device 600 can be implemented by components such as antenna 1, antenna 2, mobile communication module 650, wireless communication module 660, modulation and demodulation processor, and baseband processor.
  • Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the electronic device 600 can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve the utilization rate of the antenna.
  • antenna 1 of electronic device 600 is coupled to mobile communication module 650, and antenna 2 of electronic device 600 is coupled to wireless communication module 660, so that electronic device 600 can communicate with the network and other electronic devices through wireless communication technology.
  • the electronic device 600 can realize the display function through the GPU, the display screen 694 and the application processor.
  • the GPU is a microprocessor for image processing, which connects the display screen 694 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • the processor 610 may include one or more GPUs, which execute program instructions to generate or change display information.
  • Display screen 694 may be used to display images or videos.
  • the electronic device 600 can realize the shooting function through the ISP, the camera 693, the video codec, the GPU, the display screen 694 and the application processor.
  • the ISP is used to process the data fed back by the camera 693. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converts it into an image visible to the naked eye.
  • the ISP can perform algorithmic optimization on the noise, brightness and color of the image. The ISP can also optimize the exposure and color temperature of the shooting scene and other parameters. In some embodiments, the ISP can be set in the camera 693.
  • the camera 693 is used to capture still images or videos.
  • the object generates an optical image through the lens and projects it onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to be converted into a digital image signal.
  • the ISP outputs the digital image signal to the DSP for processing.
  • the DSP converts the digital image signal into an image signal in a standard red green blue (RGB), YUV or other format.
  • the electronic device 600 may include 1 or N cameras 693, where N is a positive integer greater than 1.
  • the voice detection method may be executed in the processor 610 .
  • the digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 600 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital videos.
  • the electronic device 600 may support one or more video codecs.
  • the electronic device 600 may play or record videos in a variety of coding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
  • the external memory interface 620 can be used to connect an external memory card, such as a secure digital (SD) card, to expand the storage capacity of the electronic device 600.
  • the external memory card communicates with the processor 610 through the external memory interface 620 to implement a data storage function. For example, files such as music and videos can be stored in the external memory card.
  • the internal memory 621 may be used to store computer executable program codes, which include instructions.
  • the internal memory 621 may include a program storage area and a data storage area.
  • the electronic device 600 can implement audio functions, such as music playback and recording, through the audio module 670, the speaker 670A, the receiver 670B, the microphone 670C, the headphone jack 670D, and the application processor.
  • the speaker 670A also known as a horn, is used to convert an audio electrical signal into a sound signal.
  • the electronic device 600 can listen to music or make a hands-free call through the speaker 670A.
  • the receiver 670B also known as a handset, is used to convert an audio electrical signal into a sound signal.
  • the fingerprint sensor 680H is used to collect fingerprints.
  • the electronic device 600 can use the collected fingerprint characteristics to realize functions such as unlocking, accessing application locks, taking photos, and answering calls.
  • the touch sensor 680K is also called a touch control device.
  • the touch sensor 680K can be set on the display screen 694.
  • the touch sensor 680K and the display screen 694 form a touch screen, which is also called a touch control screen.
  • the touch sensor 680K is used to detect touch operations acting on or near it.
  • the touch sensor 680K can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 694.
  • the touch sensor 680K can also be set on the surface of the electronic device 600 and set at a different position from the display screen 694.
  • the above describes in detail the hardware system of the electronic device 600, and the following describes the software system of the electronic device 600.
  • the software system may adopt a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture.
  • the embodiment of the present application takes the layered architecture as an example to exemplarily describe the software system of the electronic device 600.
  • a software system using a layered architecture is divided into several layers, each with clear roles and division of labor.
  • the layers communicate with each other through software interfaces.
  • the software system can be divided into four layers, from top to bottom, namely, the application layer, the application framework layer, the Android Runtime and system library, and the kernel layer.
  • the application layer may include applications such as calling, navigation, recording, and voice assistant.
  • the voice detection method provided in the embodiments of the present application can be applied to a call application; for example, run the call application to obtain audio data, where the audio data is data collected by a first microphone and a second microphone in the same environment; perform VAD detection on the audio data to determine and filter out the voice signal; perform wind noise detection on the voice signal detected by VAD to determine and filter out the voice signal.
  • the speech detection method provided in the embodiments of the present application can be applied to a recording application; for example, running the recording application to obtain audio data, where the audio data is data collected by a first microphone and a second microphone in the same environment; performing VAD detection on the audio data to determine and filter out the speech signal; performing wind noise detection on the speech signal detected by VAD to determine and filter out the speech signal.
  • the voice detection method provided in the embodiments of the present application can be applied to a navigation assistant application; for example, run the navigation assistant application to obtain audio data, where the audio data is data collected by a first microphone and a second microphone in the same environment; perform VAD detection on the audio data to determine and filter out the voice signal; perform wind noise detection on the voice signal detected by VAD to determine and filter out the voice signal.
  • the voice detection method provided in the embodiments of the present application can be applied to a voice assistant application; for example, run the voice assistant application to obtain audio data, where the audio data is data collected by a first microphone and a second microphone in the same environment; perform VAD detection on the audio data to determine and filter out the voice signal; perform wind noise detection on the voice signal detected by VAD to determine and filter out the voice signal.
  • the application framework layer provides an application programming interface (API) and a programming framework for applications in the application layer.
  • the application framework layer may include some predefined functions.
  • the application framework layer includes the window manager, content provider, view system, telephony manager, resource manager, and notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, and capture the screen.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • the data may include video, images, audio, calls made and received, browsing history and bookmarks, and phone books.
  • the view system includes visual controls, such as controls for displaying text and controls for displaying images.
  • the view system can be used to build applications.
  • a display interface can be composed of one or more views.
  • a display interface including a text notification icon can include a view for displaying text and a view for displaying images.
  • the phone manager is used to provide communication functions of the electronic device, such as management of call status (connected or hung up).
  • the resource manager provides various resources for applications, such as localized strings, icons, images, layout files, and video files.
  • the notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages and disappear automatically after a short stay without user interaction.
  • Android Runtime includes core libraries and virtual machines. Android Runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the function that needs to be called by the Java language, and the other part is the Android core library.
  • the application layer and the application framework layer run in the virtual machine.
  • the virtual machine executes the Java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules, such as a surface manager, a media library, a 3D graphics processing library (for example, the open graphics library for embedded systems (OpenGL ES)), and a 2D graphics engine (for example, the skia graphics library (SGL)).
  • the surface manager is used to manage the display subsystem and provide the fusion of 2D layers and 3D layers for multiple applications.
  • the media library supports playback and recording of multiple audio formats, playback and recording of multiple video formats, and still image files.
  • the media library can support multiple audio and video coding formats, such as: MPEG4, H.264, moving picture experts group audio layer III (MP3), advanced audio coding (AAC), adaptive multi-rate (AMR), joint photographic experts group (JPG) and portable network graphics (PNG).
  • the 3D graphics processing library can be used to implement 3D graphics drawing, image rendering, compositing and layer processing.
  • a 2D graphics engine is a drawing engine for 2D drawings.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer can include driver modules such as audio driver and display driver.
  • FIG14 shows a schematic diagram of the structure of a speech detection device provided by the present application; the speech detection device 700 includes an acquisition unit 710 and a processing unit 720.
  • the acquisition unit 710 is used to acquire audio data, where the audio data is data collected by the first microphone and the second microphone in the same environment.
  • the processing unit 720 is used to perform VAD detection on the audio data to determine and filter out the voice signal; perform wind noise detection on the voice signal detected by VAD to determine and filter out the voice signal.
  • the above-mentioned speech detection device 700 is embodied in the form of a functional unit.
  • the term “unit” here can be implemented in the form of software and/or hardware, and is not specifically limited to this.
  • a "unit” may be a software program, a hardware circuit, or a combination of the two that implements the above functions.
  • the hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor, or a group processor, etc.) and a memory for executing one or more software or firmware programs, a combined logic circuit, and/or other suitable components that support the described functions.
  • Fig. 15 shows a schematic diagram of the structure of an electronic device provided by the present application.
  • the dotted line in Fig. 15 indicates that the unit or the module is optional, and the electronic device 800 can be used to implement the voice detection method described in the above method embodiment.
  • the electronic device 800 may be a chip
  • the communication unit 805 may be an input and/or output circuit of the chip
  • the communication unit 805 may be a communication interface of the chip
  • the chip may be a component of a terminal device or other electronic devices.
  • the electronic device 800 may be a terminal device, and the communication unit 805 may be a transceiver of the terminal device, or the communication unit 805 may be a transceiver circuit of the terminal device.
  • the electronic device 800 may include one or more memories 802 on which a program 804 is stored.
  • the program 804 can be executed by the processor 801 to generate instructions 803, so that the processor 801 executes the speech detection method described in the above method embodiment according to the instructions 803.
  • data may be stored in the memory 802.
  • the processor 801 may read data stored in the memory 802.
  • the data may be stored at the same storage address as the program 804, or may be stored at a different storage address than the program 804.
  • the processor 801 and the memory 802 may be provided separately or integrated together; for example, integrated on a system on chip (SOC) of a terminal device.
  • the memory 802 can be used to store the relevant program 804 of the voice detection method provided in the embodiment of the present application
  • the processor 801 can be used to call the relevant program 804 of the voice detection method stored in the memory 802 to execute the voice detection method of the embodiments of the present application. For example: obtain audio data, where the audio data is data collected by the first microphone and the second microphone in the same environment; perform VAD detection on the audio data to determine and filter out the voice signal; perform wind noise detection on the voice signal detected by VAD to determine and filter out the voice signal.
  • the present application also provides a computer program product, which, when executed by the processor 801, implements the speech detection method described in any method embodiment of the present application.
  • the computer program product may be stored in the memory 802 , for example, a program 804 , which is converted into an executable target file that can be executed by the processor 801 after preprocessing, compiling, assembling, linking and other processing processes.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a computer, the speech detection method described in any method embodiment of the present application is implemented.
  • the computer program can be a high-level language program or an executable target program.
  • the computer-readable storage medium is, for example, a memory 802.
  • the memory 802 may be a volatile memory or a nonvolatile memory, or the memory 802 may include both a volatile memory and a nonvolatile memory.
  • the nonvolatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM), which is used as an external cache.
  • By way of example and not limitation, many forms of RAM are available, such as synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the embodiments of the electronic device described above are only schematic.
  • the division of the modules is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the sequence numbers of the processes do not imply an order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium, including several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in each embodiment of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), disk or optical disk, and other media that can store program codes.


Abstract

The present application provides a voice detection method and related devices, relating to the field of audio processing. The voice detection method includes: acquiring audio data, the audio data being data collected by a first microphone and a second microphone in the same environment; performing VAD detection on the audio data to determine and filter out a voice signal; and performing wind noise detection on the voice signal detected by the VAD to determine and filter out the voice signal. By performing VAD detection and wind noise detection in combination with multiple channels of audio signals acquired by multiple microphones, the present application can both avoid affecting voice quality and improve detection accuracy.

Description

Voice detection method and related devices
This application claims priority to Chinese patent application No. 202211350590.1, entitled "Voice detection method and related devices", filed with the State Intellectual Property Office on October 31, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of audio processing, and in particular to a voice detection method and related devices.
Background
With the popularization and development of electronic devices, electronic devices have become an indispensable part of our daily life and entertainment. Usually, during a voice call or a voice operation, the audio data input to the electronic device may be affected by interference from external sounds. Therefore, in order to improve audio quality, the electronic device needs to perform some processing on the input audio data.
In the related art, noise reduction is usually performed, or speech recognition is performed using a neural network model, and so on. However, when the noise reduction effect is good, the speech may in turn be distorted; a neural network model, for its part, needs to be trained in advance, and the training samples are usually rather limited, so that the speech cannot be accurately recognized in use, which affects the quality of detection. Therefore, a new voice detection method is urgently needed that can both avoid affecting voice quality and improve detection accuracy.
Summary
The present application provides a voice detection method and related devices, which perform VAD detection and wind noise detection in combination with multiple channels of audio signals acquired by multiple microphones, so as both to avoid affecting voice quality and to improve detection accuracy.
In a first aspect, a voice detection method is provided, applied to an electronic device including a first microphone and a second microphone, the method including:
acquiring audio data, where the audio data is data collected by the first microphone and the second microphone in the same environment;
performing VAD detection on the audio data to determine and filter out a voice signal;
performing wind noise detection on the voice signal detected by the VAD to determine and filter out a voice signal.
In the embodiments of the present application, when a user makes a voice call or performs a voice operation using an electronic device including multiple microphones, the electronic device can first perform VAD detection on the audio data received by the multiple microphones to distinguish the voice signals from the other signals among them; then, wind noise detection is performed on the filtered-out voice signals, which is equivalent to screening the voice signals again, so that the real voice signals can be distinguished from the wind noise signals misjudged as voice signals, and the voice signals detected by the wind noise detection are the final detection result. In this way, by combining the signals to be tested generated by multiple microphones and passing them through the two detection stages of VAD and wind noise, the real voice signals, wind noise signals and other signals can be distinguished. Such a simple detection method involves no hardware changes, and can both avoid affecting voice quality and improve detection accuracy.
The other signals referred to in the present application are signals other than voice signals and wind noise signals.
With reference to the first aspect, in one implementation of the first aspect, when the audio data is data in the time domain, the method further includes:
preprocessing the audio data, where the preprocessing includes at least framing and time-frequency transformation.
Optionally, the preprocessing includes at least framing and time-frequency transformation.
It should be understood that after the multiple channels of signal streams to be tested are framed with the same length, the obtained multiple frames of first time domain signals and multiple frames of second time domain signals are equal in number and correspond one-to-one in order. Therefore, after the framed multiple frames of first time domain signals and second time domain signals are converted to the frequency domain, the obtained multiple frames of first frequency domain signals and multiple frames of second frequency domain signals are also equal in number and also correspond one-to-one in order.
In the embodiments of the present application, preprocessing makes the audio data convenient for subsequent detection.
With reference to the first aspect, in one implementation of the first aspect, the audio data includes a first channel of signal stream to be tested collected by the first microphone and a second channel of signal stream to be tested collected by the second microphone;
preprocessing the audio data includes:
performing the framing on the first channel of signal stream to be tested to obtain multiple frames of first time domain signals;
performing the time-frequency transformation on the multiple frames of first time domain signals to obtain multiple frames of first frequency domain signals;
performing the framing on the second channel of signal stream to be tested to obtain multiple frames of second time domain signals;
performing the time-frequency transformation on the multiple frames of second time domain signals to obtain multiple frames of second frequency domain signals;
where the multiple frames of first time domain signals correspond one-to-one to the multiple frames of first frequency domain signals, and the multiple frames of second time domain signals correspond one-to-one to the multiple frames of second frequency domain signals.
In the embodiments of the present application, multiple frames of first time domain signals and multiple frames of first frequency domain signals can be obtained from the first channel of signal stream to be tested, and multiple frames of second time domain signals and multiple frames of second frequency domain signals can be obtained from the second channel of signal stream to be tested, so that multiple signals of the same order can subsequently be combined for voice detection.
With reference to the first aspect, in one implementation of the first aspect, performing VAD detection on the audio data to determine and filter out a voice signal includes:
for the first time domain signal, determining, according to the first time domain signal and the first frequency domain signal corresponding to the first time domain signal, first data corresponding to the first time domain signal, the first data including at least a zero crossing rate, a spectral entropy and a flatness;
performing VAD detection on the first time domain signal based on the first data, and determining and filtering out a voice signal.
In the embodiments of the present application, the different behavior of voice signals and other signals with respect to the first data can serve as the criterion for distinguishing them, so that each first time domain signal can be identified as a voice signal or another signal.
With reference to the first aspect, in one implementation of the first aspect, performing VAD detection on the first time domain signal based on the first data, and determining and filtering out a voice signal, includes:
when the first data satisfies a first condition, determining that the provisional state of the first time domain signal is a voice signal;
when the first data does not satisfy the first condition, determining that the provisional state of the first time domain signal is another signal, where the other signal indicates a signal other than a voice signal and a wind noise signal;
for the first time domain signal, determining whether the provisional state is the same as the current state;
when they are different and the provisional state is a voice signal, adding 1 to the value of a first frame number flag bit, and determining whether the value of the first frame number flag bit is greater than a first preset frame number threshold;
when the value of the first frame number flag bit is greater than the first preset frame number threshold, modifying the current state: when the current state is a voice signal, modifying it to another signal, and when the current state is another signal, modifying it to a voice signal;
when they are different and the provisional state is another signal, adding 1 to the value of a second frame number flag bit, and determining whether the value of the second frame number flag bit is greater than a second preset frame number threshold;
when the value of the second frame number flag bit is greater than the second preset frame number threshold, modifying the current state;
determining and filtering out the first time domain signals whose modified current state is a voice signal.
Since spoken words usually last several frames and there are intervals between words, in order to judge the beginning and end of a sentence completely and prevent the sentence from being cut off in the middle, each frame of the first time domain signal is provided with a provisional state and a current state. Both the provisional state and the current state can take one of three values: voice signal, wind noise signal and other signal.
In the embodiments of the present application, when the provisional state is different from the current state, the two judgments are inconsistent and at least one of them may be wrong; therefore, frame counting can be performed. When the accumulated frame count exceeds the frame number threshold, the corresponding current state is modified, which is equivalent to relying on the continuity between the multiple frames of signals to be tested preceding this frame, as determined by the algorithm, to predict the state corresponding to the first time domain signal of this frame.
With reference to the first aspect, in one implementation of the first aspect, the method further includes:
when they are the same, determining and filtering out the first time domain signals whose current state is a voice signal; or,
when they are different and the value of the first frame number flag bit is less than or equal to the first preset frame number threshold, determining and filtering out the first time domain signals whose current state is a voice signal; or,
when they are different and the value of the second frame number flag bit is less than or equal to the second preset frame number threshold, determining and filtering out the first time domain signals whose current state is a voice signal.
In the embodiments of the present application, when the provisional state is the same as the current state, or when they are different but the accumulated frame count is below the frame number threshold, the corresponding current state is not modified. This is equivalent to ignoring the anomaly of these few short frames and still treating them as a voice signal, in order to keep the sentence complete and prevent it from being cut off in the middle; or, equivalently, to avoiding wrongly identifying a small number of other signals as voice signals, by still treating them as other signals.
With reference to the first aspect, in one implementation of the first aspect, before the first data satisfies the first condition, the method further includes: performing a first initialization process, where the first initialization process includes at least resetting the value of the first frame number flag bit and the value of the second frame number flag bit to zero.
In the embodiments of the present application, performing the first initialization process can avoid data errors, or interference from some detection results of other stages.
With reference to the first aspect, in one implementation of the first aspect, when the first data includes the zero crossing rate, the spectral entropy and the flatness, the first condition includes:
the zero crossing rate being greater than a zero crossing rate threshold, the spectral entropy being less than a spectral entropy threshold, and the flatness being less than a flatness threshold.
With reference to the first aspect, in one implementation of the first aspect, performing wind noise detection on the voice signal detected by the VAD to determine and filter out a voice signal includes:
for a first time domain signal detected as a voice signal by the VAD, determining, according to the first time domain signal, the first frequency domain signal corresponding to the first time domain signal, and the second frequency domain signal of the same order as the first frequency domain signal, second data corresponding to the first time domain signal, the second data including at least a spectral center of gravity, a low-frequency energy and a correlation;
performing wind noise detection on the first time domain signal based on the second data, and determining and filtering out a voice signal.
In the embodiments of the present application, since the characteristics of wind noise signals are similar to those of voice signals, the first-stage VAD detection alone cannot distinguish wind noise signals from voice signals very accurately, and a wind noise signal may be mistaken for a voice signal; that is to say, the voice signals in the first detection result obtained after VAD detection are only suspected voice signals and may include wind noise signals. Continuing with wind noise detection can then further distinguish the real voice signals from the false voice signals (i.e., wind noise signals). After the successive VAD detection and wind noise detection, the detection accuracy can thus be greatly improved.
With reference to the first aspect, in one implementation of the first aspect, performing wind noise detection on the first time domain signal based on the second data, and determining and filtering out a voice signal, includes:
when the second data satisfies a second condition, determining that the provisional state of the first time domain signal is a wind noise signal;
when the second data does not satisfy the second condition, determining that the provisional state of the first time domain signal is a voice signal;
for the first time domain signal, determining whether the provisional state is the same as the current state;
when they are different and the provisional state is a wind noise signal, adding 1 to the value of a third frame number flag bit, and determining whether the value of the third frame number flag bit is greater than a third preset frame number threshold;
when the value of the third frame number flag bit is greater than the third preset frame number threshold, modifying the current state: when the current state is a voice signal, modifying it to a wind noise signal, and when the current state is a wind noise signal, modifying it to a voice signal;
when they are different and the provisional state is a voice signal, adding 1 to the value of the first frame number flag bit, and determining whether the value of the first frame number flag bit is greater than a fourth preset frame number threshold;
when the value of the first frame number flag bit is greater than the fourth preset frame number threshold, modifying the current state;
determining and filtering out the first time domain signals whose modified current state is a voice signal.
In the embodiments of the present application, when the provisional state is different from the current state, the two judgments are inconsistent; at least one of them may be wrong, or it may be the interval between words when the user speaks. Therefore, frame counting can be performed. When the accumulated frame count exceeds the frame number threshold, the corresponding current state is modified, which is equivalent to relying on the continuity between the multiple frames of signals to be tested preceding this frame, as determined by the algorithm, to predict the state corresponding to the first time domain signal of this frame.
With reference to the first aspect, in one implementation of the first aspect, the method further includes:
when they are the same, determining and filtering out the first time domain signals whose current state is a voice signal; or,
when they are different and the value of the third frame number flag bit is less than or equal to the third preset frame number threshold, determining and filtering out the first time domain signals whose current state is a voice signal; or,
when they are different and the value of the first frame number flag bit is less than or equal to the fourth preset frame number threshold, determining and filtering out the first time domain signals whose current state is a voice signal.
In the embodiments of the present application, when the provisional state is the same as the current state, or when they are different but the accumulated frame count is below the frame number threshold, the corresponding current state is not modified. This is equivalent to ignoring the anomaly of these few short frames and still treating them as a voice signal, in order to keep the sentence complete and prevent it from being cut off in the middle; or, equivalently, to avoiding wrongly identifying a small number of wind noise signals as voice signals, by still treating them as wind noise signals.
With reference to the first aspect, in one implementation of the first aspect, before the second data satisfies the second condition, the method further includes: performing a second initialization process, where the second initialization process includes at least resetting the value of the first frame number flag bit and the value of the third frame number flag bit to zero.
In the embodiments of the present application, performing the second initialization process can avoid data errors, or interference from some detection results of other stages.
With reference to the first aspect, in one implementation of the first aspect, when the second data includes the spectral center of gravity, the low-frequency energy and the correlation, the second condition includes:
the spectral center of gravity being less than a spectral center of gravity threshold, the low-frequency energy being greater than a low-frequency energy threshold, and the correlation being less than a correlation threshold.
With reference to the first aspect, in one implementation of the first aspect, the first microphone includes one or more first microphones, and/or the second microphone includes one or more second microphones.
With reference to the first aspect, in one implementation of the first aspect, the first microphone is a microphone arranged at the bottom of the electronic device, and the second microphone is a microphone arranged at the top or on the back of the electronic device.
In a second aspect, an electronic device is provided, the electronic device including: one or more processors, a memory and a display screen; the memory is coupled to the one or more processors and is used to store computer program code, the computer program code including computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to execute any one of the voice detection methods of the first aspect.
In a third aspect, a voice detection apparatus is provided, including units for executing any one of the voice detection methods of the first aspect.
In a possible implementation, when the voice detection apparatus is an electronic device, the processing unit may be a processor and the input unit may be a communication interface; the electronic device may further include a memory, the memory being used to store computer program code, and when the processor executes the computer program code stored in the memory, the electronic device is caused to execute any one of the methods of the first aspect.
In a fourth aspect, a chip system is provided; the chip is applied to an electronic device, and the chip includes one or more processors, the processors being used to invoke computer instructions to cause the electronic device to execute any one of the voice detection methods of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided; the computer-readable storage medium stores computer program code, and when the computer program code is run by an electronic device, the electronic device is caused to execute any one of the voice detection methods of the first aspect.
In a sixth aspect, a computer program product is provided; the computer program product includes computer program code, and when the computer program code is run by an electronic device, the electronic device is caused to execute any one of the voice detection methods of the first aspect.
The embodiments of the present application provide a voice detection method and related devices. When a user makes a voice call or performs a voice operation using an electronic device including at least two microphones, the electronic device can first perform preprocessing such as framing and time-frequency transformation on the multiple channels of signals to be tested received by the microphones, and then perform VAD detection to distinguish the voice signals from the other signals among them; then, wind noise detection is performed on the filtered-out voice signals, so that the voice signals can be screened again and the real voice signals distinguished from the wind noise signals misjudged as voice signals. By combining the signals to be tested generated by multiple microphones, after the successive VAD detection and wind noise detection, the detection accuracy can be greatly improved, and the real voice signals, wind noise signals and other signals can be distinguished. The method is simple, and can both avoid affecting voice quality and improve detection accuracy.
In addition, since the voice detection method provided by the present application involves only the method itself, involves no hardware improvement, and requires no additional complex acoustic structure, it is, compared with the related art, more friendly to small electronic devices and more widely applicable.
Brief Description of the Drawings
FIG1 is a schematic diagram of a microphone layout provided by an embodiment of the present application;
FIG2 is a schematic diagram of an application scenario applicable to the present application;
FIG3 is a schematic diagram of another application scenario applicable to the present application;
FIG4 is a schematic flow chart of a voice detection method provided by an embodiment of the present application;
FIG5 is a schematic flow chart of another voice detection method provided by an embodiment of the present application;
FIG6 is a schematic flow chart of VAD detection provided by an embodiment of the present application;
FIG7 is a schematic flow chart of wind noise detection provided by an embodiment of the present application;
FIG8 is an example of VAD detection provided by an embodiment of the present application;
FIG9 is an example of data used for wind noise detection provided by an embodiment of the present application;
FIG10 is an example of wind noise detection provided by an embodiment of the present application;
FIG11 is a schematic diagram of a related interface provided by an embodiment of the present application;
FIG12 is a schematic diagram of a hardware system of an electronic device applicable to the present application;
FIG13 is a schematic diagram of a software system of an electronic device applicable to the present application;
FIG14 is a schematic diagram of the structure of a voice detection apparatus provided by the present application;
FIG15 is a schematic diagram of the structure of an electronic device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
First, some of the terms used in the embodiments of the present application are explained, to facilitate understanding by those skilled in the art.
1. Noise generally refers to the sound produced by other sound sources in the background of a sound source.
2. Noise reduction refers to the process of reducing the noise in audio data.
3. Wind noise is the sound produced by air turbulence near a microphone, including the sound produced by air turbulence caused by wind; it should be understood that the source of wind noise is right next to the microphone.
4. Speech recognition refers to the technology by which an electronic device processes a collected speech signal according to a preconfigured speech recognition algorithm, so as to obtain a recognition result representing the meaning of the speech signal.
5. Framing divides a whole segment of audio data into segments according to a specified length (a time period or a number of samples) and structures it into a certain data structure, for subsequent batch processing. It should be understood that the framed signal is a time domain signal.
6. Time-frequency transformation converts audio data from the time domain (the relationship between time and amplitude) into the frequency domain (the relationship between frequency and amplitude). For example, the Fourier transform, the fast Fourier transform and other methods can be used for time-frequency transformation.
7. Fourier transform: the Fourier transform is a linear integral transform, used to represent the transformation of a signal between the time domain (or spatial domain) and the frequency domain.
8. Fast Fourier transform (FFT): the FFT refers to a fast algorithm for the discrete Fourier transform, which can transform a signal from the time domain to the frequency domain.
9. Voice activity detection (VAD): voice activity detection is a technique used in speech processing, whose purpose is to detect whether a speech signal is present.
The above is a brief introduction to the terms involved in the embodiments of the present application, which will not be repeated below.
The voice detection method provided by the embodiments of the present application is applicable to various electronic devices.
In some embodiments of the present application, the electronic device may be a mobile phone, a smart screen, a tablet computer, a wearable electronic device, an in-vehicle electronic device, an augmented reality (AR) device, a virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a projector, a smart dictionary pen, a smart recording pen, a smart translator, a smart speaker, an earphone, a hearing aid, a conference telephone device, or any other device including at least two microphones; the embodiments of the present application place no restriction on the specific type of the electronic device.
Taking a mobile phone as an example of the electronic device, FIG1 shows a schematic layout of the microphones arranged on the phone.
Illustratively, as shown in FIG1 , the electronic device 10 has two microphones (MIC). A microphone, also called a "mouthpiece", "transmitter" or "sound pickup device", is used to convert a sound signal into an electrical signal. In the embodiments of the present application, the electronic device can receive sound signals through multiple microphones and convert the sound signals into electrical signals that can be processed subsequently.
Usually, of the two microphones included in the electronic device 10, one is arranged at the bottom of the phone and the other at the top. When the user holds the phone during a call, the microphone arranged at the bottom of the phone is close to the user's mouth; this microphone may also be called the primary microphone, while the other may be called the secondary microphone. The primary microphone may also be called the bottom microphone, and the secondary microphone may also be called the top microphone. In the case of only one bottom microphone and one top microphone, the voice detection method provided by the present application and executed by the electronic device may also be called a dual-microphone voice detection method.
FIG1 is only an example of a microphone layout. When the electronic device 10 includes two microphones, the positions of the two microphones can also be adjusted as needed; for example, one microphone may be arranged at the bottom of the phone and the other on the back.
Of course, the electronic device 10 may also include three or more microphones, and the embodiments of the present application place no restriction on this. For example, when the electronic device is a phone with two foldable display screens, one bottom microphone and one top microphone may be arranged on one display screen and one bottom microphone on the other display screen; or one bottom microphone and one top microphone may be arranged on each display screen; or multiple bottom microphones and multiple top microphones may be arranged on each display screen. This can be set and adjusted as needed, and the embodiments of the present application place no restriction on it.
In combination with the above electronic device 10, FIG2 and FIG3 are schematic diagrams of two application scenarios provided by the embodiments of the present application.
As shown in FIG2 , when a user makes a voice call with an electronic device, the exhalation involved in pronunciation may cause the user, while speaking, to blow at the microphone in the electronic device, so that the audio data received by the electronic device includes not only the speech content but possibly also the wind noise caused by the blowing.
As shown in FIG3 , when a user performs a voice operation with an electronic device while running (for example, waking up the voice assistant to open the map application on the electronic device), because the user is running fast, the electronic device carried by the user also moves quickly; at this time, a relatively fast airflow forms around the electronic device, so that the audio data received by the electronic device includes not only the speech content but also the wind noise produced by the fast airflow near the microphone. Since some characteristics of wind noise are rather similar to those of speech, for example both are low-frequency, non-stationary signals, the voice assistant in the electronic device may mistake the wind noise for speech, which may in turn lead to false wake-ups, erroneous operations and the like.
In addition, besides receiving the speech produced by the user, the microphone generally also receives other sounds in the surrounding environment, for example the sound of car horns, the sound of metal impacts, the sound of footsteps on the ground, and so on.
At present, the processing usually performed in the related art on the audio data received by an electronic device includes noise reduction, speech recognition using a trained neural network model, and the like.
However, when noise reduction is applied to the audio data and the noise reduction effect is good, the speech content may also be attenuated to some extent at the same time, causing later speech distortion. When a trained neural network model is used for speech recognition of the audio data, the samples available when training the neural network model are usually rather limited and the learning is imperfect, so that the trained neural network model cannot accurately recognize speech in use; in addition, the cost of deploying a neural network model on an electronic device is relatively high.
Furthermore, for small electronic devices such as mobile phones and earphones, the limited size of the device also makes it impossible to use complex acoustic structures to attenuate or eliminate wind noise.
In view of these problems, a new voice detection method is urgently needed to solve them.
In view of this, the embodiments of the present application provide a voice detection method. When a user makes a voice call or performs a voice operation using an electronic device including multiple microphones, the electronic device can first perform preprocessing such as framing on the multiple channels of signals to be tested received by the multiple microphones, and then perform VAD detection to distinguish the voice signals from the other signals among them; then, wind noise detection is performed on the filtered-out voice signals, which is equivalent to screening the voice signals again, so that the real voice signals can be distinguished from the wind noise signals misjudged as voice signals, and the voice signals detected by the wind noise detection are the final detection result. In this way, by combining the signals to be tested generated by multiple microphones and passing them through the two detection stages of VAD and wind noise, the real voice signals, wind noise signals and other signals can be distinguished. Such a simple detection method involves no hardware changes, and can both avoid affecting voice quality and improve detection accuracy.
The other signals referred to in the present application are signals other than voice signals and wind noise signals.
The voice detection method provided by the embodiments of the present application is described below with reference to FIG4 to FIG10 .
FIG4 is a schematic flow chart of a voice detection method provided by an embodiment of the present application. The voice detection method 100 can be executed by the electronic device 10 shown in FIG1 , whose two microphones are used to collect sound in the same environment. The voice detection method includes the following S110 to S150, each of which is described in detail below.
Illustratively, the microphones collecting sound in the same environment may mean that, when a user makes a phone call outdoors with a mobile phone, both microphones on the phone collect the user's call voice, the wind noise, and the other sounds in the surrounding environment.
Illustratively, it may also mean that, when multiple users hold a meeting indoors with a conference telephone device, the multiple microphones on the device all collect the users' speech, the wind noise, and the other sounds in the indoor environment.
S110: Acquire audio data. The audio data includes multiple channels of signal streams to be tested.
A signal stream to be tested refers to a signal sequence with a certain time order that includes speech, wind noise and other sounds.
For example, one microphone is used to acquire one channel of signal stream to be tested, and two microphones can acquire two channels; for instance, the first microphone is used to acquire the first channel of signal to be tested, and the second microphone is used to acquire the second channel. It should be understood that the multiple channels of signal streams to be tested should have the same start time and end time. One channel may also be understood as one path.
Illustratively, taking an earphone as the electronic device: in response to a user operation, the electronic device starts a voice call application; while the voice call application is running for a voice call, the electronic device can acquire audio data such as the content of the user's call.
Illustratively, taking a smart recording pen as the electronic device: in response to a user operation, the electronic device starts a recording application; while the recording application is running for recording, the electronic device can acquire audio data such as the user's singing.
Illustratively, taking a smart speaker as the electronic device: in response to a user operation, the electronic device starts a voice assistant application; while the voice assistant application is running for human-computer interaction, the electronic device acquires audio data such as the user's keyword commands.
Illustratively, taking a tablet computer as the electronic device: the audio data may also be audio data such as other people's speech received by the electronic device while it is running a third-party application (for example, WeChat).
S120: Preprocess the multiple channels of signal streams to be tested.
Optionally, the preprocessing includes at least framing and time-frequency transformation, performed in that order: framing first, then time-frequency transformation. Of course, the preprocessing may also include other steps, and the embodiments of the present application place no restriction on this.
For example, framing may be performed with a frame length of 20 ms.
For example, the first channel of signal stream to be tested acquired by the first microphone can be framed into multiple frames of first time domain signals, and performing time-frequency transformation on these frames yields multiple frames of first frequency domain signals. The first time domain signals are in the time domain, the first frequency domain signals are in the frequency domain, and the first time domain signals and the first frequency domain signals correspond one-to-one.
Similarly, the second channel of signal stream to be tested acquired by the second microphone can be framed into multiple frames of second time domain signals, and performing time-frequency transformation on these frames yields multiple frames of second frequency domain signals. The second time domain signals are in the time domain, the second frequency domain signals are in the frequency domain, and the second time domain signals and the second frequency domain signals correspond one-to-one.
It should be understood that after the multiple channels of signal streams to be tested are framed with the same length, the obtained multiple frames of first time domain signals and multiple frames of second time domain signals are equal in number and correspond one-to-one in order; therefore, after frequency domain conversion, the obtained multiple frames of first frequency domain signals and multiple frames of second frequency domain signals are also equal in number and also correspond one-to-one in order.
It should also be understood that the multiple frames of first and second time domain signals generated by framing, as well as the multiple frames of first and second frequency domain signals generated by time-frequency transformation, can all be stored in order, so as to improve the efficiency of subsequent processing.
S130: Perform VAD detection on at least one of the preprocessed multiple channels of signal streams to be tested, to obtain a first detection result.
VAD detection is used to detect whether a signal stream to be tested includes a voice signal; the first detection result includes multiple frames of voice signals and/or other signals.
Optionally, VAD detection can be repeated multiple times, and the voice signals and other signals distinguished from the intersection of the multiple detection results, to serve as the first detection result.
For example, two VAD detections can be performed on a preprocessed channel of signal stream to be tested: the signals determined to be voice signals both times, as well as the signals determined to be a voice signal once and another signal the other time, are all taken as the voice signals in the first detection result, while the signals determined to be other signals both times are taken as the other signals in the first detection result.
Alternatively, only the signals determined to be voice signals both times may be taken as the voice signals in the first detection result, while the signals determined to be a voice signal once and another signal the other time, as well as the signals determined to be other signals both times, are all taken as the other signals in the first detection result.
As another example, VAD detection may be performed in real time on both preprocessed channels of signal streams to be tested, with one channel serving as the main detection stream and the other as the auxiliary detection stream; after VAD detection, the detection result of the auxiliary stream can be used to assist the detection result of the main stream. For example, when the signals of the same order in the two streams are both voice signals, the signal in the main stream is determined to be a voice signal.
S140: In combination with the preprocessed multiple channels of signal streams to be tested, perform wind noise detection on the voice signals in the first detection result, to obtain a second detection result.
Wind noise detection is used to distinguish voice signals from wind noise signals; the second detection result includes multiple frames of voice signals and/or wind noise signals.
S150: Determine the voice signal as the detection result.
It should be understood that performing VAD detection on the preprocessed multiple channels of signals to be tested can determine whether the signals to be tested include voice signals, so that voice signals and other signals can be distinguished. Moreover, since the characteristics of wind noise signals are similar to those of voice signals, the first-stage VAD detection alone cannot distinguish wind noise signals from voice signals very accurately, and a wind noise signal may be mistaken for a voice signal; that is to say, the voice signals in the first detection result obtained after VAD detection are only suspected voice signals and may include wind noise signals. Continuing with wind noise detection can then further distinguish the real voice signals from the false voice signals (i.e., wind noise signals). After the successive VAD detection and wind noise detection, the detection accuracy can thus be greatly improved. In addition, because the VAD detection and wind noise detection provided by the present application do not affect the quality of the signal itself, there is no problem of losing quality of the signal to be tested.
Optionally, when the first detection result includes no voice signal, step S140 need not be executed.
Optionally, wind noise detection can be repeated multiple times, and the voice signals and wind noise signals distinguished from the intersection of the multiple second detection results.
For example, three wind noise detections are performed on the voice signals in the first detection result, and the signals determined to be voice signals in any two of the three detections are taken as the voice signals in the second detection result.
It should be understood that, in the course of executing the whole method, the numbers of VAD detections and wind noise detections may differ, and the specific number of repetitions can be set and modified as needed; the embodiments of the present application place no restriction on this.
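As an illustration of combining repeated detection passes, the sketch below is a minimal reading of the intersection idea; the helper names are hypothetical, and each pass is assumed to yield one boolean per frame (True meaning the frame was judged a voice signal):

```python
def intersect_two_passes(pass1, pass2):
    # Strict variant: keep a frame as voice only if both VAD passes agree.
    return [a and b for a, b in zip(pass1, pass2)]

def majority_of_three(pass1, pass2, pass3):
    # Wind-noise variant from the example: voice if any two of three passes agree.
    return [sum(votes) >= 2 for votes in zip(pass1, pass2, pass3)]
```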
Optionally, after VAD detection and wind noise detection are performed on the multiple frames of signals to be tested within one time period of the preprocessed signal stream, VAD detection and wind noise detection can be repeated for the multiple frames within the next time period, and so on.
It should be understood that this approach places relatively lower requirements on the performance of the hardware executing the method, and is easier to implement.
Optionally, after VAD detection and wind noise detection are performed on one frame of the preprocessed signal stream to be tested, they can be repeated on the next frame, and so on.
Optionally, VAD detection and wind noise detection may also be performed on one frame of the signal to be tested, and while that frame undergoes wind noise detection, VAD detection can be performed on the next frame.
It should be understood that this approach has a fast response and processing speed, and the voice signals, wind noise signals and other signals in the signal can be detected in real time while it is being collected.
An embodiment of the present application provides a voice detection method. When a user makes a voice call or performs a voice operation using an electronic device including at least two microphones, the electronic device can first perform preprocessing such as framing and time-frequency transformation on the multiple channels of signals to be tested received by the microphones, and then perform VAD detection to distinguish the voice signals from the other signals among them; then, wind noise detection is performed on the filtered-out voice signals, so that the voice signals can be screened again and the real voice signals distinguished from the wind noise signals misjudged as voice signals. By combining the signals to be tested generated by multiple microphones, after the successive VAD detection and wind noise detection, the detection accuracy can be greatly improved, and the real voice signals, wind noise signals and other signals can be distinguished. The method is simple, and can both avoid affecting voice quality and improve detection accuracy.
In addition, since the voice detection method provided by the present application involves only the method itself, involves no hardware improvement, and requires no additional complex acoustic structure, it is, compared with the related art, more friendly to small electronic devices and more widely applicable.
Illustratively, VAD detection may be performed on the first channel of the preprocessed multiple channels of signal streams to be tested, to obtain the first detection result; no VAD detection is performed on the other preprocessed channels of signals to be tested.
Then, for the voice signals in the first detection result, wind noise detection is performed in combination with the signals to be tested of the corresponding order in the other preprocessed channels, to determine whether each voice signal in the first detection result remains a voice signal or is changed to a wind noise signal.
It should be understood that in this approach the first channel of signal to be tested is effectively the main signal being detected, and the other channels of signals to be tested are used to assist in detecting the voice signals in the first channel.
This example is described in detail below with reference to FIG5 . FIG5 shows a schematic flow chart of another voice detection method provided by an embodiment of the present application; the voice detection method may include the following S210 to S250, each of which is described below.
S210: Acquire the first channel of signal stream to be tested and the second channel of signal stream to be tested.
It should be understood that the first and second channels of signal streams to be tested constitute the audio data, and the present application processes the audio data within a period of time; for example, the durations of the first and second signal streams are 600 ms.
S220: Preprocess the first and second channels of signal streams to be tested, to obtain the multiple frames of first time domain signals and multiple frames of first frequency domain signals corresponding to the first channel, and the multiple frames of second time domain signals and multiple frames of second frequency domain signals corresponding to the second channel. The preprocessing includes framing and time-frequency transformation.
Optionally, as shown in FIG5 , the above S220 may include:
S221: Frame the first channel of signal to be tested to obtain multiple frames of first time domain signals; frame the second channel of signal stream to be tested to obtain multiple frames of second time domain signals.
For example, framing the 600 ms first channel of signal to be tested yields 30 frames of first time domain signals; framing the 600 ms second channel yields 30 frames of second time domain signals.
It should be understood that the multiple frames of first time domain signals and the multiple frames of second time domain signals are all time domain signals.
S222: Perform time-frequency transformation on the multiple frames of first time domain signals obtained in S221, to obtain the corresponding number of frames of first frequency domain signals; perform time-frequency transformation on the multiple frames of second time domain signals, to obtain the corresponding number of frames of second frequency domain signals.
For example, performing time-frequency transformation on the 30 frames of first time domain signals yields 30 frames of first frequency domain signals; performing time-frequency transformation on the 30 frames of second time domain signals yields 30 frames of second frequency domain signals.
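A minimal sketch of this framing and time-frequency transformation step is shown below; the 16 kHz sampling rate is an assumption (the present application does not fix one), and a plain FFT without windowing is used for simplicity:

```python
import numpy as np

FRAME_MS = 20  # frame length from the example above

def preprocess(stream, sample_rate=16000):
    """Frame one channel into 20 ms time domain frames and apply an FFT
    to each frame, returning the time domain and frequency domain frames."""
    frame_len = sample_rate * FRAME_MS // 1000           # 320 samples at 16 kHz
    n_frames = len(stream) // frame_len                  # 600 ms -> 30 frames
    time_frames = stream[:n_frames * frame_len].reshape(n_frames, frame_len)
    freq_frames = np.fft.rfft(time_frames, axis=1)       # one spectrum per frame
    return time_frames, freq_frames
```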
S230: Perform VAD detection on the preprocessed first channel of signal stream to be tested.
The above S230 can also be expressed as: perform VAD detection in combination with the multiple frames of first time domain signals and multiple frames of first frequency domain signals corresponding to the first channel of signal stream to be tested, where the multiple frames of first time domain signals and the multiple frames of first frequency domain signals correspond one-to-one.
Here, no VAD detection is performed on the preprocessed second channel of signal stream to be tested.
Optionally, as shown in FIG5 , the above S230 may include:
S231: For the first time domain signal, determine the corresponding zero crossing rate (ZCR).
The zero crossing rate is the rate at which the signal passes through zero (changing from positive to negative or from negative to positive) within each frame of the first time domain signal. Generally speaking, the zero crossing rate of noise or other sounds is relatively low, while the zero crossing rate of a voice signal is relatively high.
For example, the value of the zero crossing rate of the first time domain signal can be determined by the following formula (1):

$$\mathrm{ZCR}=\frac{1}{T-1}\sum_{t=1}^{T-1}\pi\{S_t\,S_{t+1}<0\}\qquad(1)$$

where t is the time point within the frame, T is the length of each frame, and S denotes the amplitude of the signal (which can be positive or negative). If the amplitudes at two adjacent time points are both positive or both negative, π{A} is 0; if one is positive and the other negative, π{A} is 1. The π values of the T−1 pairs of adjacent points within the frame are counted and summed, and the sum is then divided by T−1 to give the rate of zero crossings within one frame, called the zero crossing rate for short.
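The following short Python sketch computes formula (1) directly from the definition above; the function name is illustrative:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Formula (1): fraction of adjacent sample pairs whose amplitudes
    have opposite signs, i.e. whose product is negative."""
    s = np.asarray(frame, dtype=float)
    opposite_signs = s[:-1] * s[1:] < 0   # pi{A} for each of the T-1 pairs
    return np.mean(opposite_signs)        # sum of pi values divided by T-1
```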
S232: For the first frequency domain signal corresponding to the first time domain signal, determine the corresponding spectral entropy and flatness.
It should be understood that spectral entropy describes the relationship between the power spectrum and the entropy rate; in the present application it can describe how dispersed a signal is. If the signal is noise, the signal is relatively dispersed, corresponding to a higher spectral entropy; if the signal is speech, the signal is relatively concentrated, corresponding to a lower spectral entropy. Flatness describes how flat a signal is: the flatness of noise is relatively high, and the flatness of a voice signal is relatively low.
For example, the value of the spectral entropy of the first time domain signal can be determined by the following group of formulas (2):

$$X_{\mathrm{power}}(k,m)=X(k,m),\quad 1\le k\le N/2$$
$$P(i,m)=\frac{X_{\mathrm{power}}(i,m)}{\sum_{k=1}^{N/2}X_{\mathrm{power}}(k,m)}$$
$$H(m)=-\sum_{k=1}^{N/2}P(k,m)\,\ln P(k,m)\qquad(2)$$

where r(n) denotes the short-time autocorrelation function of each frame of the signal, L is the window length, N is the FFT length, and X(k,m) denotes the power spectral amplitude at the k-th frequency point of the m-th frame. For a real signal, X(k,m) is symmetric about N/2+1, so Xpower(k,m), which denotes the power spectral energy, is equal to X(k,m). P(i,m) denotes the proportion of the power spectral energy of each frequency component in the power spectral energy of the whole frame, and the power spectral entropy corresponding to each frame is denoted H(m).
For example, the value of the flatness of the first time domain signal can be determined by the following formula (3):

$$\mathrm{Flatness}=\frac{\exp\!\left(\frac{1}{N}\sum_{L=1}^{N}\ln Y(L)\right)}{\frac{1}{N}\sum_{L=1}^{N}Y(L)}\qquad(3)$$

where L is the L-th frequency point after the FFT, N is the N-th frequency point after the FFT, and Y(L) is the energy of the L-th frequency point, calculated by the same formula as Xpower(k); exp(x) is e raised to the power x.
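The two quantities of S232 can be sketched as follows, with formula (2) computed from a per-bin power spectrum and formula (3) as the ratio of the geometric mean to the arithmetic mean of the bin energies; the small clipping constant is an implementation guard against log(0), not part of the formulas:

```python
import numpy as np

def spectral_entropy(power_spectrum):
    """Formulas (2): normalise the per-bin power into probabilities P and
    take the entropy H; dispersed (noise-like) spectra give a larger H."""
    p = np.clip(power_spectrum / np.sum(power_spectrum), 1e-12, None)
    return -np.sum(p * np.log(p))

def spectral_flatness(power_spectrum):
    """Formula (3): geometric mean over arithmetic mean of the bin
    energies Y(L); flat (noise-like) spectra give values closer to 1."""
    y = np.clip(power_spectrum, 1e-12, None)
    return np.exp(np.mean(np.log(y))) / np.mean(y)
```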
S233: In combination with at least the values of the zero crossing rate, spectral entropy and flatness corresponding to each frame of the first time domain signal, judge whether that frame of the first time domain signal is a voice signal or another signal.
It should be understood that, in addition to the zero crossing rate, spectral entropy and flatness, other relevant data can also be determined in order to distinguish whether the first time domain signal is a voice signal or another signal; the relevant data can be set and modified as needed, and the present application places no restriction on this.
S234: Filter out the first time domain signals determined to be voice signals.
If the first time domain signal is a voice signal, that first time domain signal can be extracted; at the same time, the first frequency domain signal corresponding to its time-frequency transformation can also be extracted, to facilitate subsequent detection.
S240: In combination with the preprocessed second channel of signal stream to be tested, perform wind noise detection on the signals determined to be voice signals in S230.
The above S240 can also be expressed as: in combination with the multiple frames of second frequency domain signals corresponding to the second channel of signal stream to be tested, perform wind noise detection on the first time domain signals determined to be voice signals in the preprocessed first channel of signal stream to be tested. During wind noise detection, the first frequency domain signals corresponding to the first time domain signals determined to be voice signals in the VAD detection can be used as the objects under inspection.
Optionally, as shown in FIG5 , the above S240 may include:
S241: Based on the multiple frames of first frequency domain signals corresponding to the multiple frames of first time domain signals determined to be voice signals in the VAD detection, determine the spectral center of gravity and the low-frequency energy corresponding to each frame of the first frequency domain signal.
It should be understood that the spectral center of gravity describes the position of the center of gravity of the signal: the spectral center of gravity of a wind noise signal is relatively low, while the spectral center of gravity of a voice signal is relatively high. The low-frequency energy describes the amount of low-frequency energy in the signal: the low-frequency energy of a wind noise signal is relatively high, while the low-frequency energy of a voice signal is relatively low.
For example, the value of the spectral center of gravity of the first time domain signal can be determined by the following formula (4):

$$r=\frac{\sum_{i} i\cdot \mathrm{fndata}(i)}{\sum_{i} \mathrm{fndata}(i)}\qquad(4)$$

where r is the spectral center of gravity, i is the coordinate of each point on the spectrum, and fndata(i) is the amplitude of each point on the spectrum.
For example, the value of the low-frequency energy of the first time domain signal can be determined by the following formula (5):

$$E=\sum_{f=f_1}^{f_2}\lvert X(f)\rvert^{2}\qquad(5)$$

where E is the low-frequency energy, and X(f) is the FFT result corresponding to frequency f, whose energy is computed by taking the absolute value and then squaring it. f1 and f2 denote the start and end frequencies of the selected low-frequency range; for example, if the selected low-frequency range is 100-500 Hz, then f1 = 100 and f2 = 500.
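A sketch of formulas (4) and (5) follows; the 16 kHz sampling rate and the helper names are assumptions for illustration, while the 100-500 Hz band matches the example above:

```python
import numpy as np

def spectral_centroid(magnitudes):
    """Formula (4): amplitude-weighted mean bin index; lower for wind noise."""
    i = np.arange(len(magnitudes))
    return np.sum(i * magnitudes) / np.sum(magnitudes)

def low_frequency_energy(rfft_frame, sample_rate=16000, f1=100, f2=500):
    """Formula (5): sum of |X(f)|^2 over the selected low band f1..f2."""
    n_samples = (len(rfft_frame) - 1) * 2                # original frame length
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    band = (freqs >= f1) & (freqs <= f2)
    return np.sum(np.abs(rfft_frame[band]) ** 2)
```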
S242: Based on the multiple frames of first frequency domain signals corresponding to the multiple frames of first time domain signals determined to be voice signals in the VAD detection, and the multiple frames of second frequency domain signals filtered out in the corresponding order from the preprocessed second channel of signal stream to be tested, determine the correlation corresponding to each group of first frequency domain signal and second frequency domain signal of the same order.
It should be understood that the correlation describes the similarity between the two channels of signals: the correlation of wind noise is relatively low, while the correlation of a voice signal is relatively high.
For example, the value of the correlation of the first time domain signal can be determined by the following formula (6):

$$r(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}}\qquad(6)$$

where X is the first frequency domain signal, Y is the second frequency domain signal, and r(X,Y) is the correlation between the two; Cov(X,Y) is the covariance of X and Y, and D(X) and D(Y) are the variances of X and Y respectively.
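Formula (6) can be sketched as below; correlating the magnitude spectra of the two channels is one reasonable reading of the formula, used here purely for illustration:

```python
import numpy as np

def channel_correlation(x_spectrum, y_spectrum):
    """Formula (6): Pearson correlation Cov(X, Y) / sqrt(D(X) * D(Y))
    between the two channels; wind noise is largely uncorrelated
    across microphones, while speech is strongly correlated."""
    x = np.abs(x_spectrum)
    y = np.abs(y_spectrum)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())
```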
S243: In combination with at least the values of the correlation, spectral center of gravity and low-frequency energy corresponding to each frame of the first time domain signal, judge whether that frame of the first time domain signal is a voice signal or a wind noise signal.
It should be understood that, in addition to the correlation, spectral center of gravity and low-frequency energy, other relevant data can also be determined in order to distinguish whether the first time domain signal is a voice signal or a wind noise signal; the relevant data can be set and modified as needed, and the present application places no restriction on this.
S244: Filter out the first time domain signals determined once again to be voice signals.
If the first time domain signal is a voice signal, that first time domain signal can be extracted as the finally detected voice signal.
S250: Obtain the detection result.
When the above detection is performed on one frame of the first time domain signal, the obtained detection result is that this frame of the first time domain signal is determined to be a voice signal, another signal or a wind noise signal. When the above detection is performed on multiple frames of first time domain signals, the obtained detection result includes, for each frame, the information of whether it is a voice signal, another signal or a wind noise signal, as well as the extracted signals determined to be voice signals.
Illustratively, the first channel of signal stream to be tested is the signal acquired by the phone through the bottom microphone, and the second channel of signal stream to be tested is the signal acquired by the phone through the top microphone. In the above flow, the signal received by the bottom microphone is effectively the main signal being detected, and the signal received by the top microphone is used to assist in detecting the voice signals in the signal received by the bottom microphone. In combination with the signal received by the top microphone, every signal in the bottom microphone can be determined to be a voice signal, a wind noise signal or another signal, and at the same time the voice signals can be extracted.
It should be understood that the determined multiple frames of voice signals can be re-ordered and stored in sequence, or subjected to recognition or other processing; the embodiments of the present application place no restriction on this.
In the voice detection method provided by the embodiments of the present application, when a user makes a voice call or performs a voice operation using an electronic device including two microphones, the electronic device can first perform preprocessing such as framing and time-frequency transformation on the two channels of signals to be tested received by the two microphones; then, in combination with the multiple frames of first time domain signals and first frequency domain signals produced when the first channel of signal stream to be tested is preprocessed, determine the zero crossing rate, spectral entropy and flatness; then judge, in combination with the zero crossing rate, spectral entropy and flatness, whether each first time domain signal is a voice signal or another signal, and filter out the first time domain signals determined to be voice signals together with their corresponding first frequency domain signals; then, for the first frequency domain signals corresponding to the filtered-out voice signals, and the second frequency domain signals of the same order obtained by preprocessing the second channel of signal stream to be tested, determine the correlation, spectral center of gravity and low-frequency energy; and then judge, in combination with the correlation, spectral center of gravity and low-frequency energy, whether each voice signal determined in the VAD detection stage is a real voice signal or a wind noise signal misjudged as a voice signal. In this way, through the cooperation of the two channels of signals to be tested, and the successive detection of signal characteristics in the two stages of VAD detection and wind noise detection, the real voice signals, wind noise signals and other signals can be distinguished. The method is simple, and can both avoid affecting voice quality and improve detection accuracy.
Optionally, FIG6 shows a schematic flow chart, provided by an embodiment of the present application, of combining the values of the zero crossing rate, spectral entropy and flatness corresponding to each frame of the first time domain signal to judge whether that frame of the first time domain signal is a voice signal or another signal (i.e., S233). As shown in FIG6 , the judgment method 300 may include the following S301 to S310.
S301: Perform the first initialization process.
It should be understood that, in addition to the signal data itself, the multiple frames of first time domain signals may also involve three frame number flag bits (i, j and k), and each frame of the first time domain signal corresponds to two signal flag bits (int and SF).
For example, the signal flag bit int is used to represent the provisional state of the first time domain signal: when int equals 1, the frame of the first time domain signal is provisionally a voice signal; when int equals 0, it is provisionally another signal; when int equals -1, it is provisionally a wind noise signal.
The signal flag bit SF is used to represent the current state of the first time domain signal: when SF equals 1, the frame of the first time domain signal is currently determined to be a voice signal; when SF equals 0, it is currently determined to be another signal; when SF equals -1, it is currently determined to be a wind noise signal.
The first frame number flag bit i is used to represent the accumulated number of frames whose provisional state is a voice signal; for example, i equal to 1 means that the accumulated number of frames whose provisional state is a voice signal is 1. The second frame number flag bit j is used to represent the accumulated number of frames whose provisional state is another signal; for example, j equal to 2 means that the accumulated number of frames whose provisional state is another signal is 2. The third frame number flag bit k is used to represent the accumulated number of frames whose provisional state is a wind noise signal; for example, k equal to 3 means that the accumulated number of frames whose provisional state is a wind noise signal is 3.
On this basis, for the multiple frames of first time domain signals, performing the first initialization process is equivalent to resetting the three frame number flag bits, as well as the two signal flag bits corresponding to each first time domain signal, to zero, so that they are all 0 and interference is avoided.
S302: Determine whether the spectral entropy, flatness and zero crossing rate corresponding to the first time domain signal meet the first condition.
The first condition includes: the zero crossing rate is greater than the zero crossing rate threshold, the spectral entropy is less than the spectral entropy threshold, and the flatness is less than the flatness threshold.
The above S302 can also be expressed as: determine whether the zero crossing rate corresponding to the first time domain signal is greater than the zero crossing rate threshold, whether the spectral entropy determined from the first frequency domain signal converted from this first time domain signal is less than the spectral entropy threshold, and whether the flatness is less than the flatness threshold.
It should be understood that the zero crossing rate threshold, the spectral entropy threshold and the flatness threshold can all be set and modified as needed; the embodiments of the present application place no restriction on this.
S303: When the spectral entropy, flatness and zero crossing rate corresponding to the first time domain signal meet the first condition, determine that the provisional state of the first time domain signal is a voice signal, and modify the value of the first signal flag bit to X.
It should be understood that, since spoken words usually last several frames and there are intervals between words, in order to judge the beginning and end of a sentence completely and prevent the sentence from being cut off in the middle, each frame of the first time domain signal is provided with a provisional state and a current state, each of which can take one of three values: voice signal, wind noise signal and other signal.
S304: When the spectral entropy, flatness and zero crossing rate corresponding to the first time domain signal do not meet the first condition, determine that the provisional state of the first time domain signal is another signal, and modify the first signal flag bit to Y.
That is, when the zero crossing rate corresponding to the first time domain signal is greater than the zero crossing rate threshold, the spectral entropy determined from the converted first frequency domain signal is less than the spectral entropy threshold, and the flatness is also less than the flatness threshold, the first time domain signal can be considered to match the characteristics of a voice signal, and the provisional state of the first time domain signal can be determined to be a voice signal; the signal flag bit int used to represent the provisional state of this first time domain signal equals 1, that is, X equals 1.
Otherwise, when any one of the zero crossing rate, spectral entropy and flatness corresponding to the first time domain signal does not satisfy its corresponding condition, the first time domain signal can be considered not to match the characteristics of a voice signal, and the provisional state of the first time domain signal can be determined to be another signal; the signal flag bit int used to represent the provisional state equals 0, that is, Y equals 0.
S305: After the provisional state corresponding to the first time domain signal is determined, regardless of whether the provisional state is a voice signal or another signal, determine whether the provisional state judged for this first time domain signal is the same as its corresponding current state.
The signal flag bit used to represent the current state is SF; therefore, whether the judged provisional state is the same as the corresponding current state can be determined by comparing the value of the signal flag bit int with the value of the signal flag bit SF.
S306: When the provisional state is different from the current state, perform frame counting. If the provisional state is a voice signal, the first frame number flag bit i is incremented by 1; if the provisional state is another signal, the second frame number flag bit j is incremented by 1.
S307: When the number of frames accumulated in the first frame number flag bit i is greater than the first preset frame number threshold, modify the current state, that is, change the corresponding current state from a voice signal to another signal, or from another signal to a voice signal.
Similarly, when the number of frames accumulated in the second frame number flag bit j is greater than the second preset frame number threshold, modify the current state, that is, change the corresponding current state from a voice signal to another signal, or from another signal to a voice signal.
It should be understood that when the provisional state is different from the current state, the two judgments are inconsistent and at least one of them may be wrong; therefore, frame counting can be performed. When the accumulated count exceeds the frame number threshold, the corresponding current state is modified, which is equivalent to relying on the continuity between the multiple frames of signals to be tested preceding this frame, as determined by the algorithm, to predict the state corresponding to this frame of the first time domain signal.
For example, the provisional state of the 6th frame of the first time domain signal is a voice signal while its current state is another signal, and after frame counting the number of frames whose provisional state is a voice signal has reached 6, which means that the previous 5 frames of the first time domain signal were all voice signals. In this case it is quite likely that the 6th frame of the first time domain signal is also a voice signal, so the original current state is no longer trusted, and the current state is changed from another signal to a voice signal.
It should be understood that the first preset frame number threshold and the second preset frame number threshold can be set and modified as needed; the embodiments of the present application place no restriction on this.
S308: In the above S305, when the provisional state is the same as the current state, continue to determine whether the current state is a voice signal; or, after S306, when the first frame number flag bit i is less than or equal to the first preset frame number threshold, or the second frame number flag bit j is less than or equal to the second preset frame number threshold, continue to determine whether the current state is a voice signal; or, in S307, after the current state is modified, continue to determine whether the current state is a voice signal.
It should be understood that two consistent judgments are more reliable than a single judgment. Therefore, when the provisional state is the same as the current state, the judged state corresponding to the first time domain signal is relatively reliable, and the current state need not be modified.
Alternatively, the provisional state may be different from the current state while the corresponding accumulated frame count does not exceed the preset frame number threshold; in this case the number of consecutive first time domain signals in the same provisional state is considered too small to matter, so no modification is made and the current state remains a voice signal or another signal.
S309: If the current state corresponds to another signal, discard the first time domain signals whose corresponding signal flag bit SF equals 0; SF equal to 0 means that the first time domain signal is determined to be another signal.
S310: If the current state corresponds to a voice signal, filter out the first time domain signals whose corresponding signal flag bit SF equals 1, as the first detection result; SF equal to 1 means that the first time domain signal is determined to be a voice signal.
Here, if the provisional state is different from the current state and the current state was modified, the current state refers to the modified current state; if the provisional state is the same as the current state, the current state refers to the original current state.
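Putting S301 to S310 together, the following compact sketch implements the VAD-stage state machine; the threshold values, the resetting of a counter after a state flip, and the function signature are illustrative assumptions rather than details fixed by the present application:

```python
def vad_stage(features, th, n1=2, n2=2):
    """Method 300 sketch. `features` holds one (zcr, entropy, flatness)
    tuple per frame; `th` holds the three thresholds; n1/n2 stand in for
    the first and second preset frame number thresholds."""
    SF, i, j = 0, 0, 0              # S301: current state and frame counters
    voice_frames = []
    for idx, (zcr, ent, flat) in enumerate(features):
        # S302-S304: first condition -> provisional state int
        intf = 1 if (zcr > th["zcr"] and ent < th["entropy"]
                     and flat < th["flatness"]) else 0
        if intf != SF:              # S305-S307: hysteresis on disagreement
            if intf == 1:
                i += 1
                if i > n1:
                    SF, i = 1 - SF, 0   # flip voice <-> other signal
            else:
                j += 1
                if j > n2:
                    SF, j = 1 - SF, 0
        if SF == 1:                 # S308-S310: keep frames judged as voice
            voice_frames.append(idx)
    return voice_frames
```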
可选地,图7示出了本申请实施例提供的一种结合每帧第一时域信号对应的相关性、频谱重心和低频能量的值,判断该帧第一时域信号是语音信号还是风噪信号(即S242)的流程示意图。如图7所示,该判断方法400可以包括以下S401至S410。
S401、针对S310确定出的为语音信号的多帧第一时域信号,进行第二初始化处理。
应理解,由于用于表示当前状态的信号标志位SF已在图6所示的方法中确定出了为语音信号,等于1。此处,进行第二初始化处理时,可以对信号标志位SF不进行处理,同时对暂定状态对应为其他信号的第二帧数标志位j不进行处理;仅对信号标志位int、用于表示暂定状态对应为语音信号的第一帧数标志位i,以及用于表示暂定状态对应风噪信号的第三帧数标志位k进行归零处理,使其均为0。
当然,由于在VAD检测阶段进行第一初始化时,对第三帧数标志位k进行了归零处理,且没有用到,此处进行风噪检测时对第三帧数标志位k可以不用进行归零处理。若在进行第一初始化处理时,没有对第三帧数标志位k进行归零处理,此时,在进行风噪检测前,可以对第三帧数标志位k进行归零处理,以避免计算错误。
S402、确定第一时域信号对应的相关性、频谱重心和低频能量是否符合第二条件?
第二条件包括:相关性小于相关性阈值、频谱重心小于频谱重心阈值,并且低频能量大于低频能量阈值。
上述S402还可以表述为:结合第一时域信号时频变换后对应的第一频域信号,以及从预处理后的第二路待测信号流包括的多帧第二频域信号中,确定出的次序一致的第二频域信号,确定该两个第一频域信号和第二频域信号的相关性、频域重心和低频能量,以作为该第一时域信号对应的相关性、频谱重心和低频能量的值。
应理解,相关性阈值、频谱重心阈值和低频能量阈值都可以根据需要进行设置和修改,本申请实施例对此不进行任何限制。
S403. When the correlation, spectral centroid and low-frequency energy corresponding to the first time domain signal meet the second condition, determine that the tentative state of the first time domain signal is wind noise signal, and modify the value of the first signal flag bit to Z.

S404. When the correlation, spectral centroid and low-frequency energy corresponding to the first time domain signal do not meet the second condition, determine that the tentative state of the first time domain signal is voice signal, and modify the value of the first signal flag bit to X.

That is, when the correlation determined from the first frequency domain signal corresponding to the first time domain signal and the second frequency domain signal of the same order is less than the correlation threshold, the spectral centroid is less than the spectral centroid threshold, and the low-frequency energy is greater than the low-frequency energy threshold, the first time domain signal can be considered to match the characteristics of a wind noise signal, and its tentative state can be determined to be wind noise signal; the signal flag bit int of this frame is then equal to -1, that is, Z equals -1.

Otherwise, when any one of the correlation, spectral centroid and low-frequency energy corresponding to the first time domain signal fails its respective condition, the first time domain signal can be considered not to match the characteristics of a wind noise signal, and its tentative state can be determined to be voice signal; the signal flag bit int of this frame is then equal to 1, that is, X equals 1.

S405. After the tentative state of the first time domain signal is determined, regardless of whether it is voice signal or wind noise signal, determine whether the tentative state determined for the first time domain signal is the same as its corresponding current state.

Since the signal flag bit indicating the current state is SF, whether the tentative state is the same as the current state can be determined by comparing the value of the signal flag bit int with the value of the signal flag bit SF.

S406. When the tentative state differs from the current state, accumulate the frame count. If the tentative state is voice signal, the first frame number flag bit i is incremented by 1; if the tentative state is wind noise signal, the third frame number flag bit k is incremented by 1.

S407. When the frame count accumulated in the third frame number flag bit k is greater than a third preset frame number threshold, modify the current state, that is, change the corresponding current state from voice signal to wind noise signal, or from wind noise signal to voice signal.

When the frame count accumulated in the first frame number flag bit i is greater than a fourth preset frame number threshold, modify the current state, that is, change the corresponding current state from voice signal to wind noise signal, or from wind noise signal to voice signal.

It should be understood that when the tentative state differs from the current state, the two determinations are inconsistent; at least one of them may be wrong, or the disagreement may simply be a gap between words while the user is speaking. Therefore, the frame count can be accumulated. As long as the accumulated frame count remains below the frame number threshold, the corresponding current state is not modified, which amounts to ignoring the brief anomaly of these few frames and still treating them as voice signals, so as to keep the utterance complete and prevent it from being cut off in the middle.

For example, the tentative state of the 7th frame of first time domain signal is wind noise signal while its current state is voice signal; after frame counting, the number of frames whose tentative state is voice signal is 6 and the number whose tentative state is wind noise signal is 1, which is rather small, indicating that the preceding 6 frames are all voice signals. The 7th frame is then quite likely still a voice signal; or, even if it may be a wind noise signal, the current state can remain voice signal without modification in order to keep the utterance complete and prevent it from being cut off.

When the accumulated frame count exceeds the frame number threshold, the corresponding current state is modified, which amounts to predicting the state of this frame from the continuity, determined by the algorithm, among the multiple preceding frames of the signal to be detected.

It should be understood that the third preset frame number threshold and the fourth preset frame number threshold can be set and modified as required; the embodiments of the present application place no limitation on this.

S408. In the above S405, when the tentative state is the same as the current state, continue to determine whether the current state is wind noise signal; or, after S406, when the third frame number flag bit k is less than or equal to the third preset frame number threshold, or the first frame number flag bit i is less than or equal to the fourth preset frame number threshold, continue to determine whether the current state is wind noise signal; or, in S407, after the current state has been modified, continue to determine whether the current state is wind noise signal.

It should be understood that two consistent determinations are more reliable than a single one. Therefore, when the tentative state is the same as the current state, the state determined for the first time domain signal is relatively accurate and the current state need not be modified.

Alternatively, although the tentative state differs from the current state, the corresponding accumulated frame count has not exceeded the preset frame number threshold. In that case, the number of consecutive first time domain signals with the same tentative state can be considered too small to matter, so no modification is made and the current state remains voice signal or wind noise signal.

S409. If the current state is wind noise signal, discard the first time domain signals whose signal flag bit SF equals -1, where SF equal to -1 indicates that the first time domain signal is determined to be a wind noise signal.

S410. If the current state is voice signal, screen out the first time domain signals whose signal flag bit SF equals 1 as the second detection result, where SF equal to 1 indicates that the first time domain signal is determined to be a voice signal.

Here, if the tentative state differs from the current state and the current state has been modified, the current state refers to the modified current state; if the tentative state is the same as the current state, the current state refers to the current state determined by VAD detection.
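Combining S401 to S410, the wind noise stage can be sketched as a second pass over the VAD-screened frames; th_wind and th_voice stand in for the third and fourth preset frame number thresholds (placeholder values):

```python
def wind_noise_detect(voice_idx, freq1, freq2, freqs, counters,
                      corr_th, centroid_th, low_energy_th,
                      th_wind=4, th_voice=2):
    """S401-S410: re-screen the VAD-detected voice frames against the
    same-order second-channel spectra."""
    second_initialization(counters)
    state, kept = VOICE, []
    for n in voice_idx:
        is_wind = meets_second_condition(freq1[n], freq2[n], freqs,
                                         corr_th, centroid_th, low_energy_th)
        tentative = WIND if is_wind else VOICE
        if tentative != state:
            key, th, new = (("k", th_wind, WIND) if tentative == WIND
                            else ("i", th_voice, VOICE))
            counters[key] += 1
            if counters[key] > th:
                state = new
        if state == VOICE:           # SF == 1: part of the second detection result
            kept.append(n)
    return kept
```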
With reference to FIG. 5 to FIG. 7, FIG. 8 to FIG. 10 show an example of the voice detection method provided by an embodiment of the present application.

As shown in (a) of FIG. 8, 30 frames of first time domain signals are obtained after framing the first signal stream to be detected. First initialization processing is performed on the three frame number flag bits involved in these 30 frames and on the two signal flag bits corresponding to each first time domain signal, so that all of them are 0.

Then, as shown in (b) of FIG. 8, VAD detection starts from the 1st frame of first time domain signal: the zero-crossing rate of the 1st frame is determined, together with the spectral entropy and flatness of the first frequency domain signal obtained by time-frequency transforming the 1st frame, and it is determined whether these values meet the first condition. Since the values determined for the 1st frame do not meet the first condition, the tentative state of the 1st frame is determined to be other signal, int = 0; the frame number flag bit indicating the accumulated number of frames whose tentative state is other signal is updated to 1, j = 1.

At this point, since the signal flag bit of the 1st frame is SF = 0, the tentative state is the same as the current state; it is then determined whether the current state is voice signal, which it is not here. Therefore, the signal flag bit SF of the current state of the 1st frame remains 0, that is, SF = 0.

Next, VAD detection is performed on the 2nd frame of first time domain signal. Using the above method, the tentative state of the 2nd frame is determined to be other signal, int = 0; the tentative state is the same as the current state, and the current state is determined not to be voice signal here. Therefore, the signal flag bit SF of the current state of the 2nd frame remains 0, that is, SF = 0.

VAD detection then proceeds to the 3rd frame of first time domain signal. Using the above method, the zero-crossing rate of the 3rd frame, and the spectral entropy and flatness of its time-frequency transformed first frequency domain signal, are determined, and it is determined whether these values meet the first condition. Since the values determined for the 3rd frame meet the first condition, the tentative state of the 3rd frame is determined to be voice signal, int = 1. Since the signal flag bit SF of the current state is 0 after initialization, the tentative state is found to differ from the current state; the frame number flag bit indicating the accumulated number of frames whose tentative state is voice signal is updated to 1, that is, i = 1. The value of i is less than the first preset frame number threshold (for example, 2 frames), so the number of frames tentatively determined as voice signal is considered too small and the determination unreliable. Since the current state corresponds to other signal, the value of the signal flag bit SF of the current state is kept, that is, SF = 0.

VAD detection is performed on the 4th frame of first time domain signal. Using the above method, the zero-crossing rate of the 4th frame, and the spectral entropy and flatness of its time-frequency transformed first frequency domain signal, are determined, and it is determined whether these values meet the first condition. Since the values determined for the 4th frame meet the first condition, the tentative state of the 4th frame of first time domain signal is determined to be voice signal, int = 1. Since the signal flag bit SF of the current state is 0 after initialization, the tentative state still differs from the current state; the frame number flag bit indicating the accumulated number of frames whose tentative state is voice signal is updated to 2, that is, i = 2. The value of i is still not greater than the first preset frame number threshold, so the number of frames tentatively determined as voice signal is still considered insufficient and the determination unreliable. Since the current state corresponds to other signal, the signal flag bit SF of the current state is kept at 0, that is, SF = 0.
Similarly, after VAD detection is performed on the 5th to 8th frames of first time domain signals, each of which is tentatively determined to be voice signal, the accumulated count i exceeds the first preset frame number threshold at the 5th frame; the current state is therefore modified from other signal to voice signal, and the 5th to 8th frames keep the current state of voice signal, with the signal flag bit SF equal to 1, that is, SF = 1.

Next, VAD detection is performed on the 9th frame of first time domain signal. Using the above method, the tentative state of the 9th frame is determined to be other signal, int = 0; after comparison with the current state and the corresponding frame counting, the current state of the 9th frame is determined to be other signal, so its signal flag bit SF is 0, that is, SF = 0.
Subsequent frames are handled in the same manner and are not described again here.
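For intuition, the FIG. 8 walkthrough can be replayed against the vad_update sketch above. With these placeholder thresholds the trailing frame keeps the voice state until j exceeds its own threshold, so the trace is indicative rather than an exact reproduction of the figure:

```python
# Tentative states for frames 1-9: other, other, then six voice frames, then other.
tentatives = [OTHER, OTHER, VOICE, VOICE, VOICE, VOICE, VOICE, VOICE, OTHER]
counters = {"i": 0, "j": 0, "k": 0}
state, trace = OTHER, []
for t in tentatives:
    state = vad_update(state, counters, t, th_voice=2, th_other=2)
    trace.append(state)
print(trace)  # the current state flips to VOICE once i exceeds 2, i.e. from the 5th frame on
```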
Optionally, a second round of VAD detection may further be performed on the voice signals detected in the first round of VAD detection. It should be noted that when the first initialization is performed at the start of the second round of VAD detection, the signal flag bit of the current state does not need to be zeroed; the current state results of the first round of VAD detection should be retained as the initial current state data for the second round.

On this basis, as shown in (a) of FIG. 9, taking the voice signals detected in the first 9 frames of first time domain signals as an example: although the current state of the 5th to 8th frames of first time domain signals included in the first signal stream to be detected is voice signal, they may include wind noise signals misjudged as voice signals. Therefore, as shown in (b) of FIG. 9, the first frequency domain signals corresponding to the 5th to 8th frames of first time domain signals in the first signal stream to be detected can be screened out. At the same time, the second frequency domain signals corresponding to the 5th to 8th frames of second time domain signals in the second signal stream to be detected, which have the same order as the 5th to 8th frames of first time domain signals, also need to be determined. Wind noise detection then continues with the first frequency domain signals and the second frequency domain signals, so as to distinguish real voice signals from wind noise signals.

As shown in (a) of FIG. 10, for the 5th to 8th frames of first time domain signals determined from the first signal stream to be detected, the current-state signal flag bit SF is left untouched and only the tentative-state signal flag bit int is zeroed; at the same time, the second frame number flag bit j (for the tentative state of other signal) may be left untouched, and only the frame number flag bit i (for the tentative state of voice signal) and the third frame number flag bit k (for the tentative state of wind noise signal) undergo the second initialization processing, so that both are 0.

As shown in (b) of FIG. 10, wind noise detection starts from the 5th frame of first time domain signal: the values of the correlation, spectral centroid and low-frequency energy corresponding to the 5th frame are determined from the first frequency domain signal and the second frequency domain signal associated with the 5th frame, and it is determined whether these values meet the second condition. Since the values determined for the 5th frame do not meet the second condition, the tentative state of the 5th frame is determined to be voice signal, int = 1.

At this point, since the signal flag bit of the 5th frame is SF = 1, the tentative state is the same as the current state; it is then determined whether the current state is voice signal, which it is here. Therefore, the signal flag bit SF of the current state of the 5th frame remains 1, that is, SF = 1.

Next, wind noise detection is performed on the 6th frame of first time domain signal: the values of the correlation, spectral centroid and low-frequency energy corresponding to the 6th frame are determined from the first frequency domain signal and the second frequency domain signal associated with the 6th frame, and it is determined whether these values meet the second condition. Since the values determined for the 6th frame meet the second condition, the tentative state of the 6th frame is determined to be wind noise signal, int = -1. The current state is voice signal, SF = 1, so the tentative state differs from the current state; the third frame number flag bit k indicating the accumulated number of frames whose tentative state is wind noise signal is updated to 1, that is, k = 1. The value of k is less than the third preset frame number threshold (for example, 4 frames), so the number of frames tentatively determined as wind noise is considered too small and the determination unreliable; alternatively, the wind noise is regarded as a gap between words while the user is speaking. Since the current state corresponds to voice signal, the value of the signal flag bit SF of the current state is kept, that is, SF = 1.

Wind noise detection is performed on the 7th frame of first time domain signal. Using the above method, the tentative state of the 7th frame is determined to be wind noise signal, int = -1. Since the tentative state differs from the current state, the third frame number flag bit k is updated to 2, that is, k = 2; the value of k is still less than the third preset frame number threshold, so the value of the signal flag bit SF of the current state is kept, that is, SF = 1.

Wind noise detection is performed on the 8th frame of first time domain signal: the values of the correlation, spectral centroid and low-frequency energy corresponding to the 8th frame are determined from the first frequency domain signal and the second frequency domain signal associated with the 8th frame, and it is determined whether these values meet the second condition. Since the values determined for the 8th frame do not meet the second condition, the tentative state of the 8th frame is determined to be voice signal, int = 1. Since the tentative state is the same as the current state, it is determined whether the current state is voice signal, which it is here. Therefore, the signal flag bit SF of the current state of the 8th frame remains 1, that is, SF = 1.
An example of the interfaces of the electronic device is described below with reference to FIG. 11.

In one possible implementation, a "voice detection" function can be enabled in the settings interface of the electronic device; after an application used for calls runs on the electronic device, the "voice detection" function can be enabled automatically to execute the voice detection method of the embodiments of the present application.

In another possible implementation, the "voice detection" function can be enabled in a recording application of the electronic device; according to this setting, the "voice detection" function can be enabled while audio is being recorded, to execute the voice detection method of the embodiments of the present application.

In yet another possible implementation, the "voice detection" function can be enabled automatically to execute the voice detection method of the embodiments of the present application.
With reference to the third implementation, taking the electronic device automatically enabling the "voice detection" function as an example, FIG. 11 is a schematic diagram of an interface of an electronic device provided by an embodiment of the present application.
For example, as shown in FIG. 11, taking a mobile phone as the electronic device, the electronic device displays a lock screen interface 501, as shown in (a) of FIG. 11. When the electronic device receives the user's audio data, for example "Hello, YoYo!", it runs the smart assistant application and automatically executes the voice detection method of the present application; keywords can then be further determined from the detection result, and according to the keywords, suitable content is selected from a text library for a spoken reply, for example "I'm here"; at the same time, the interface 502 shown in (b) of FIG. 11 is displayed.

When the electronic device receives the user's audio data again, for example "open the map", it can display the interface 503 shown in (c) of FIG. 11; at the same time, it automatically executes the voice detection method of the present application and further determines keywords from the detection result; then, in response to the keywords, it runs the map application, and loads and displays the home page 504 of the map application shown in (d) of FIG. 11.

It should be understood that the above examples are intended to help those skilled in the art understand the embodiments of the present application, rather than to limit the embodiments to the specific values or scenarios illustrated. Those skilled in the art can obviously make various equivalent modifications or changes based on the above examples, and such modifications or changes also fall within the scope of the embodiments of the present application.
The voice detection method and related display interfaces of the embodiments of the present application have been described above with reference to FIG. 1 to FIG. 11. The software system, hardware system, apparatus and chip of an electronic device to which the present application is applicable will be described in detail below with reference to FIG. 12 to FIG. 15. It should be understood that the software systems, hardware systems, apparatuses and chip systems in the embodiments of the present application can execute the various methods of the foregoing embodiments; that is, for the specific working processes of the following products, reference may be made to the corresponding processes in the foregoing method embodiments.

FIG. 12 shows a hardware system of an electronic device applicable to the present application. The electronic device 600 can be used to implement the voice detection method described in the above method embodiments.

The electronic device 600 may include a processor 610, an external memory interface 620, an internal memory 621, a universal serial bus (USB) interface 630, a charging management module 640, a power management module 641, a battery 642, an antenna 1, an antenna 2, a mobile communication module 650, a wireless communication module 660, an audio module 670, a speaker 670A, a receiver 670B, a microphone 670C, a headset jack 670D, a sensor module 680, a button 690, a motor 691, an indicator 692, a camera 693, a display 694, a subscriber identification module (SIM) card interface 695, and the like. The sensor module 680 may include a pressure sensor 680A, a gyroscope sensor 680B, a barometric pressure sensor 680C, a magnetic sensor 680D, an acceleration sensor 680E, a distance sensor 680F, a proximity light sensor 680G, a fingerprint sensor 680H, a temperature sensor 680J, a touch sensor 680K, an ambient light sensor 680L, a bone conduction sensor 680M, and the like.

Exemplarily, the audio module 670 is configured to convert digital audio information into an analog audio signal for output, and can also convert an analog audio input into a digital audio signal. The audio module 670 can also encode and decode audio signals. In some embodiments, the audio module 670, or some functional modules of the audio module 670, may be disposed in the processor 610.

For example, in the embodiments of the present application, the audio module 670 can send the audio data collected by the microphones to the processor 610.

It should be noted that the structure shown in FIG. 12 does not constitute a specific limitation on the electronic device 600. In other embodiments of the present application, the electronic device 600 may include more or fewer components than those shown in FIG. 12, a combination of some of the components shown in FIG. 12, or sub-components of some of the components shown in FIG. 12. The components shown in FIG. 12 may be implemented in hardware, software, or a combination of software and hardware.

The processor 610 may include one or more processing units. For example, the processor 610 may include at least one of the following processing units: an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and a neural-network processing unit (NPU). Different processing units may be independent devices or integrated devices.

The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of fetching and executing instructions.

A memory may also be provided in the processor 610 for storing instructions and data. In some embodiments, the memory in the processor 610 is a cache, which can hold instructions or data that the processor 610 has just used or uses cyclically. If the processor 610 needs the instructions or data again, it can call them directly from this memory, avoiding repeated accesses, reducing the waiting time of the processor 610, and thereby improving the efficiency of the system.

In some embodiments, the processor 610 may include one or more interfaces. For example, the processor 610 may include at least one of the following interfaces: an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM interface, and a USB interface.

Exemplarily, the processor 610 can be configured to execute the voice detection method of the embodiments of the present application, for example: acquiring audio data, where the audio data is data collected by the first microphone and the second microphone in the same environment; performing VAD detection on the audio data to determine and screen out voice signals; and performing wind noise detection on the voice signals detected by the VAD to determine and screen out voice signals.
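As a rough end-to-end sketch, the two stages chain together as follows, reusing the helper sketches introduced earlier; the 512-sample frame length and the 16 kHz sample rate are assumptions, and `thresholds` simply gathers the six tunable thresholds:

```python
def voice_detection(mic1_stream, mic2_stream, thresholds):
    """Two-stage pipeline: preprocess both channels, VAD-screen the first
    channel, then wind-noise-screen the survivors against the second channel."""
    t1, f1 = preprocess(mic1_stream)
    _, f2 = preprocess(mic2_stream)
    freqs = np.fft.rfftfreq(512, d=1 / 16000)   # assumed frame length / sample rate
    voice_idx = vad_detect(t1, f1, thresholds["zcr"],
                           thresholds["entropy"], thresholds["flatness"])
    counters = {"i": 0, "j": 0, "k": 0}
    return wind_noise_detect(voice_idx, f1, f2, freqs, counters,
                             thresholds["corr"], thresholds["centroid"],
                             thresholds["low_energy"])
```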
The connection relationships among the modules shown in FIG. 12 are only schematic and do not limit the connection relationships among the modules of the electronic device 600. Optionally, the modules of the electronic device 600 may also adopt a combination of the connection manners in the above embodiments.

The wireless communication function of the electronic device 600 can be implemented by components such as the antenna 1, the antenna 2, the mobile communication module 650, the wireless communication module 660, the modem processor and the baseband processor. The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 600 can cover a single communication frequency band or multiple communication frequency bands, and different antennas can be multiplexed to improve antenna utilization.

In some embodiments, the antenna 1 of the electronic device 600 is coupled to the mobile communication module 650, and the antenna 2 is coupled to the wireless communication module 660, so that the electronic device 600 can communicate with networks and other electronic devices through wireless communication technologies.

The electronic device 600 can implement a display function through the GPU, the display 694 and the application processor. The GPU is a microprocessor for image processing and connects the display 694 and the application processor. The GPU performs mathematical and geometric calculations for graphics rendering. The processor 610 may include one or more GPUs that execute program instructions to generate or change display information.

The display 694 can be used to display images or videos.

The electronic device 600 can implement a shooting function through the ISP, the camera 693, the video codec, the GPU, the display 694, the application processor, and the like.

The ISP is used to process the data fed back by the camera 693. For example, when a photo is taken, the shutter opens, light is transmitted through the lens onto the photosensitive element of the camera, the optical signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP, which converts it into an image visible to the naked eye. The ISP can algorithmically optimize the noise, brightness and color of the image, and can also optimize parameters such as exposure and color temperature of the shooting scene. In some embodiments, the ISP may be disposed in the camera 693.

The camera 693 is used to capture still images or videos. An optical image of an object is generated through the lens and projected onto the photosensitive element, which may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and passes it to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing, and the DSP converts it into an image signal in a standard format such as red green blue (RGB) or YUV. In some embodiments, the electronic device 600 may include 1 or N cameras 693, where N is a positive integer greater than 1.

Exemplarily, in the embodiments of the present application, the voice detection method can be executed in the processor 610.

The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 600 selects a frequency point, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.

The video codec is used to compress or decompress digital video. The electronic device 600 can support one or more video codecs, so that it can play or record videos in multiple encoding formats, for example moving picture experts group (MPEG) 1, MPEG2, MPEG3 and MPEG4.

The external memory interface 620 can be used to connect an external memory card, such as a secure digital (SD) card, to expand the storage capacity of the electronic device 600. The external memory card communicates with the processor 610 through the external memory interface 620 to implement a data storage function, for example saving files such as music and videos in the external memory card.

The internal memory 621 can be used to store computer-executable program code, where the executable program code includes instructions. The internal memory 621 may include a program storage area and a data storage area.

The electronic device 600 can implement audio functions, such as music playback and recording, through the audio module 670, the speaker 670A, the receiver 670B, the microphone 670C, the headset jack 670D, the application processor, and the like.

The speaker 670A, also called a loudspeaker, converts audio electrical signals into sound signals. The electronic device 600 can play music or make hands-free calls through the speaker 670A. The receiver 670B, also called an earpiece, converts audio electrical signals into sound signals.

The fingerprint sensor 680H is used to collect fingerprints. The electronic device 600 can use the collected fingerprint characteristics to implement functions such as unlocking, accessing an application lock, taking photos and answering incoming calls.

The touch sensor 680K is also called a touch device. The touch sensor 680K may be disposed on the display 694; the touch sensor 680K and the display 694 form a touch screen, also called a touch panel. The touch sensor 680K detects touch operations acting on or near it and can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through the display 694. In other embodiments, the touch sensor 680K may also be disposed on the surface of the electronic device 600 at a position different from that of the display 694.
The hardware system of the electronic device 600 has been described in detail above; the software system of the electronic device 600 is introduced below. The software system may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture or a cloud architecture; the embodiments of the present application take the layered architecture as an example to exemplarily describe the software system of the electronic device 600.

As shown in FIG. 13, the software system adopting the layered architecture is divided into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the software system can be divided into four layers: from top to bottom, the application layer, the application framework layer, Android Runtime and system libraries, and the kernel layer.

The application layer may include applications such as Phone, Navigation, Recorder and Voice Assistant.

Exemplarily, the voice detection method provided by the embodiments of the present application can be applied to the phone application; for example: running the phone application, acquiring audio data collected by the first microphone and the second microphone in the same environment, performing VAD detection on the audio data to determine and screen out voice signals, and performing wind noise detection on the voice signals detected by the VAD to determine and screen out voice signals.

Exemplarily, the voice detection method provided by the embodiments of the present application can likewise be applied to the recorder application: running the recorder application, acquiring the audio data, performing VAD detection on it to determine and screen out voice signals, and performing wind noise detection on the VAD-detected voice signals to determine and screen out voice signals.

Exemplarily, the voice detection method provided by the embodiments of the present application can be applied to the navigation assistant application in the same manner: running the navigation assistant application, acquiring the audio data, and performing VAD detection followed by wind noise detection as above.

Exemplarily, the voice detection method provided by the embodiments of the present application can be applied to the voice assistant application in the same manner: running the voice assistant application, acquiring the audio data, and performing VAD detection followed by wind noise detection as above.

The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer, and may include some predefined functions.

For example, the application framework layer includes a window manager, a content provider, a view system, a telephony manager, a resource manager and a notification manager.

The window manager is used to manage window programs. The window manager can obtain the size of the display and determine whether there is a status bar, a lock screen or a screenshot.

The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, and a phone book.

The view system includes visual controls, such as controls for displaying text and controls for displaying pictures, and can be used to build applications. A display interface may be composed of one or more views; for example, a display interface including an SMS notification icon may include a view for displaying text and a view for displaying pictures.

The telephony manager provides the communication functions of the electronic device, such as management of the call state (connected or hung up).

The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files and video files.

The notification manager enables applications to display notification information in the status bar; it can convey notification-type messages that disappear automatically after a short stay without user interaction.

Android Runtime includes core libraries and a virtual machine, and is responsible for the scheduling and management of the Android system.

The core libraries consist of two parts: the function functions that the Java language needs to call, and the core libraries of Android.

The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files, and performs functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.

The system libraries may include multiple functional modules, for example: a surface manager, media libraries, a three-dimensional graphics processing library (for example, the open graphics library for embedded systems (OpenGL ES)) and a 2D graphics engine (for example, the skia graphics library (SGL)).

The surface manager manages the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.

The media libraries support playback and recording of multiple audio formats, playback and recording of multiple video formats, and still image files. The media libraries can support multiple audio and video encoding formats, for example MPEG4, H.264, moving picture experts group audio layer III (MP3), advanced audio coding (AAC), adaptive multi-rate (AMR), joint photographic experts group (JPG) and portable network graphics (PNG).

The three-dimensional graphics processing library can be used to implement three-dimensional graphics drawing, image rendering, compositing and layer processing.

The 2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software, and may include driver modules such as an audio driver and a display driver.
FIG. 14 is a schematic structural diagram of a voice detection apparatus provided by an embodiment of the present application. The voice detection apparatus 700 includes an acquisition unit 710 and a processing unit 720.

The acquisition unit 710 is configured to acquire audio data, where the audio data is data collected by the first microphone and the second microphone in the same environment.

The processing unit 720 is configured to perform VAD detection on the audio data to determine and screen out voice signals, and to perform wind noise detection on the voice signals detected by the VAD to determine and screen out voice signals.
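A minimal sketch of how apparatus 700 could be organized in code, assuming the voice_detection pipeline above; the constructor argument and the microphones' read() interface are hypothetical:

```python
class VoiceDetectionApparatus:
    """Apparatus 700 as an acquisition unit paired with a processing unit."""
    def __init__(self, thresholds):
        self.thresholds = thresholds

    def acquire(self, mic1, mic2):            # acquisition unit 710
        return mic1.read(), mic2.read()

    def process(self, stream1, stream2):      # processing unit 720
        return voice_detection(stream1, stream2, self.thresholds)
```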
It should be noted that the above voice detection apparatus 700 is embodied in the form of functional units. The term "unit" here can be implemented in the form of software and/or hardware, which is not specifically limited.

For example, a "unit" may be a software program, a hardware circuit, or a combination of the two that implements the above functions. The hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor or a group processor) and memory for executing one or more software or firmware programs, a merged logic circuit, and/or other suitable components that support the described functions.

Therefore, the units of the examples described in the embodiments of the present application can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.

FIG. 15 shows a schematic structural diagram of an electronic device provided by the present application. The dashed lines in FIG. 15 indicate that the corresponding unit or module is optional. The electronic device 800 can be used to implement the voice detection method described in the above method embodiments.

The electronic device 800 includes one or more processors 801, which can support the electronic device 800 in implementing the methods in the method embodiments. The processor 801 may be a general-purpose processor or a special-purpose processor. For example, the processor 801 may be a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device such as a discrete gate, a transistor logic device or a discrete hardware component.

The processor 801 can be used to control the electronic device 800, execute software programs, and process the data of the software programs. The electronic device 800 may further include a communication unit 805 to implement input (reception) and output (transmission) of signals.

For example, the electronic device 800 may be a chip, in which case the communication unit 805 may be an input and/or output circuit of the chip, or a communication interface of the chip, and the chip may be a component of a terminal device or another electronic device.

For another example, the electronic device 800 may be a terminal device, in which case the communication unit 805 may be a transceiver of the terminal device, or a transceiver circuit of the terminal device.

The electronic device 800 may include one or more memories 802, on which a program 804 is stored. The program 804 can be run by the processor 801 to generate instructions 803, so that the processor 801 executes the voice detection method described in the above method embodiments according to the instructions 803.

Optionally, data may also be stored in the memory 802, and the processor 801 may read this data; the data may be stored at the same memory address as the program 804, or at a different memory address.

The processor 801 and the memory 802 may be provided separately or integrated together, for example integrated on a system on chip (SOC) of a terminal device.
Exemplarily, the memory 802 can be used to store the program 804 related to the voice detection method provided in the embodiments of the present application, and the processor 801 can be used to call the program 804 stored in the memory 802 to execute the voice detection method of the embodiments of the present application, for example: acquiring audio data, where the audio data is data collected by the first microphone and the second microphone in the same environment; performing VAD detection on the audio data to determine and screen out voice signals; and performing wind noise detection on the voice signals detected by the VAD to determine and screen out voice signals.
The present application further provides a computer program product which, when executed by the processor 801, implements the voice detection method described in any method embodiment of the present application.

The computer program product may be stored in the memory 802, for example as the program 804, which is finally converted, through preprocessing, compilation, assembly, linking and other processes, into an executable object file that can be executed by the processor 801.

The present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a computer, the voice detection method described in any method embodiment of the present application is implemented. The computer program may be a high-level language program or an executable object program.

Optionally, the computer-readable storage medium is, for example, the memory 802. The memory 802 may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM) and direct rambus RAM (DR RAM).

Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.

Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.

In the several embodiments provided by the present application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the embodiments of the electronic device described above are only schematic; the division into modules is only a logical function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may physically exist alone, or two or more units may be integrated into one unit.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. The character "/" herein generally indicates an "or" relationship between the associated objects before and after it.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in the present application, and these should all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims; any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (18)

  1. A voice detection method, applied to an electronic device comprising a first microphone and a second microphone, wherein the method comprises:
    acquiring audio data, wherein the audio data is data collected by the first microphone and the second microphone in the same environment;
    performing VAD detection on the audio data to determine and screen out a voice signal; and
    performing wind noise detection on the voice signal detected by the VAD to determine and screen out a voice signal.
  2. The voice detection method according to claim 1, wherein when the audio data is data in the time domain, the method further comprises:
    preprocessing the audio data, wherein the preprocessing comprises at least framing and time-frequency transformation.
  3. The voice detection method according to claim 2, wherein the audio data comprises a first signal stream to be detected collected by the first microphone and a second signal stream to be detected collected by the second microphone;
    preprocessing the audio data comprises:
    performing the framing on the first signal stream to be detected to obtain multiple frames of first time domain signals;
    performing the time-frequency transformation on the multiple frames of first time domain signals to obtain multiple frames of first frequency domain signals;
    performing the framing on the second signal stream to be detected to obtain multiple frames of second time domain signals; and
    performing the time-frequency transformation on the multiple frames of second time domain signals to obtain multiple frames of second frequency domain signals;
    wherein the multiple frames of first time domain signals correspond one-to-one to the multiple frames of first frequency domain signals, and the multiple frames of second time domain signals correspond one-to-one to the multiple frames of second frequency domain signals.
  4. The voice detection method according to claim 3, wherein performing VAD detection on the audio data to determine and screen out a voice signal comprises:
    for the first time domain signal, determining, according to the first time domain signal and the first frequency domain signal corresponding to the first time domain signal, first data corresponding to the first time domain signal, wherein the first data comprises at least a zero-crossing rate, a spectral entropy and a flatness; and
    performing VAD detection on the first time domain signal based on the first data to determine and screen out a voice signal.
  5. The voice detection method according to claim 4, wherein performing VAD detection on the first time domain signal based on the first data to determine and screen out a voice signal comprises:
    when the first data meets a first condition, determining that a tentative state of the first time domain signal is a voice signal;
    when the first data does not meet the first condition, determining that the tentative state of the first time domain signal is another signal, wherein the other signal indicates a signal other than a voice signal and a wind noise signal;
    for the first time domain signal, determining whether the tentative state is the same as a current state;
    when they differ and the tentative state is a voice signal, incrementing a value of a first frame number flag bit by 1, and determining whether the value of the first frame number flag bit is greater than a first preset frame number threshold;
    when the value of the first frame number flag bit is greater than the first preset frame number threshold, modifying the current state: when the current state is a voice signal, modifying it to another signal, and when the current state is another signal, modifying it to a voice signal;
    when they differ and the tentative state is another signal, incrementing a value of a second frame number flag bit by 1, and determining whether the value of the second frame number flag bit is greater than a second preset frame number threshold;
    when the value of the second frame number flag bit is greater than the second preset frame number threshold, modifying the current state; and
    determining and screening out the first time domain signal whose modified current state is a voice signal.
  6. The voice detection method according to claim 5, wherein the method further comprises:
    when they are the same, determining and screening out the first time domain signal whose current state is a voice signal; or
    when they differ and the value of the first frame number flag bit is less than or equal to the first preset frame number threshold, determining and screening out the first time domain signal whose current state is a voice signal; or
    when they differ and the value of the second frame number flag bit is less than or equal to the second preset frame number threshold, determining and screening out the first time domain signal whose current state is a voice signal.
  7. The voice detection method according to claim 5 or 6, wherein before the first data meets the first condition, the method further comprises: performing first initialization processing, wherein the first initialization processing comprises at least zeroing the value of the first frame number flag bit and the value of the second frame number flag bit.
  8. The voice detection method according to any one of claims 5 to 7, wherein when the first data comprises the zero-crossing rate, the spectral entropy and the flatness, the first condition comprises:
    the zero-crossing rate is greater than a zero-crossing rate threshold, the spectral entropy is less than a spectral entropy threshold, and the flatness is less than a flatness threshold.
  9. The voice detection method according to any one of claims 1 to 8, wherein performing wind noise detection on the voice signal detected by the VAD to determine and screen out a voice signal comprises:
    for the first time domain signal detected by the VAD as a voice signal, determining, according to the first time domain signal, the first frequency domain signal corresponding to the first time domain signal, and the second frequency domain signal of the same order as the first frequency domain signal, second data corresponding to the first time domain signal, wherein the second data comprises at least a spectral centroid, a low-frequency energy and a correlation; and
    performing wind noise detection on the first time domain signal based on the second data to determine and screen out a voice signal.
  10. The voice detection method according to claim 9, wherein performing wind noise detection on the first time domain signal based on the second data to determine and screen out a voice signal comprises:
    when the second data meets a second condition, determining that the tentative state of the first time domain signal is a wind noise signal;
    when the second data does not meet the second condition, determining that the tentative state of the first time domain signal is a voice signal;
    for the first time domain signal, determining whether the tentative state is the same as a current state;
    when they differ and the tentative state is a wind noise signal, incrementing a value of a third frame number flag bit by 1, and determining whether the value of the third frame number flag bit is greater than a third preset frame number threshold;
    when the value of the third frame number flag bit is greater than the third preset frame number threshold, modifying the current state: when the current state is a voice signal, modifying it to a wind noise signal, and when the current state is a wind noise signal, modifying it to a voice signal;
    when they differ and the tentative state is a voice signal, incrementing the value of the first frame number flag bit by 1, and determining whether the value of the first frame number flag bit is greater than a fourth preset frame number threshold;
    when the value of the first frame number flag bit is greater than the fourth preset frame number threshold, modifying the current state; and
    determining and screening out the first time domain signal whose modified current state is a voice signal.
  11. The voice detection method according to claim 10, wherein the method further comprises:
    when they are the same, determining and screening out the first time domain signal whose current state is a voice signal; or
    when they differ and the value of the third frame number flag bit is less than or equal to the third preset frame number threshold, determining and screening out the first time domain signal whose current state is a voice signal; or
    when they differ and the value of the first frame number flag bit is less than or equal to the fourth preset frame number threshold, determining and screening out the first time domain signal whose current state is a voice signal.
  12. The voice detection method according to claim 10 or 11, wherein before the second data meets the second condition, the method further comprises: performing second initialization processing, wherein the second initialization processing comprises at least zeroing the value of the first frame number flag bit and the value of the third frame number flag bit.
  13. The voice detection method according to any one of claims 10 to 12, wherein when the second data comprises the spectral centroid, the low-frequency energy and the correlation, the second condition comprises:
    the spectral centroid is less than a spectral centroid threshold, the low-frequency energy is greater than a low-frequency energy threshold, and the correlation is less than a correlation threshold.
  14. The voice detection method according to any one of claims 1 to 13, wherein the first microphone comprises one or more first microphones, and/or the second microphone comprises one or more second microphones.
  15. The voice detection method according to claim 1 or 14, wherein the first microphone is a microphone disposed at the bottom of the electronic device, and the second microphone is a microphone disposed at the top or on the back of the electronic device.
  16. An electronic device, comprising a processor and a memory;
    wherein the memory is configured to store a computer program executable on the processor; and
    the processor is configured to execute the voice detection method according to any one of claims 1 to 15.
  17. A chip system, applied to an electronic device, wherein the chip system comprises one or more processors, and the processor is configured to invoke computer instructions to cause the electronic device to execute the voice detection method according to any one of claims 1 to 15.
  18. A computer-readable storage medium, storing a computer program, wherein the computer program comprises program instructions which, when executed by a processor, cause the processor to execute the voice detection method according to any one of claims 1 to 15.
PCT/CN2023/114481 2022-10-31 2023-08-23 语音检测方法及其相关设备 WO2024093460A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211350590.1A CN117995225A (zh) 2022-10-31 2022-10-31 语音检测方法及其相关设备
CN202211350590.1 2022-10-31

Publications (2)

Publication Number Publication Date
WO2024093460A1 WO2024093460A1 (zh) 2024-05-10
WO2024093460A9 true WO2024093460A9 (zh) 2024-06-27

Family

ID=90900079

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/114481 WO2024093460A1 (zh) 2022-10-31 2023-08-23 语音检测方法及其相关设备

Country Status (2)

Country Link
CN (1) CN117995225A (zh)
WO (1) WO2024093460A1 (zh)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593522B (zh) * 2009-07-08 2011-09-14 清华大学 一种全频域数字助听方法和设备
WO2019112468A1 (en) * 2017-12-08 2019-06-13 Huawei Technologies Co., Ltd. Multi-microphone noise reduction method, apparatus and terminal device
CN109920451A (zh) * 2019-03-18 2019-06-21 恒玄科技(上海)有限公司 语音活动检测方法、噪声抑制方法和噪声抑制系统
US11114109B2 (en) * 2019-09-09 2021-09-07 Apple Inc. Mitigating noise in audio signals
CN111741401B (zh) * 2020-08-26 2021-01-01 恒玄科技(北京)有限公司 用于无线耳机组件的无线通信方法以及无线耳机组件
CN113270106B (zh) * 2021-05-07 2024-03-15 深圳市友杰智新科技有限公司 双麦克风的风噪声抑制方法、装置、设备及存储介质
CN114627899A (zh) * 2022-03-22 2022-06-14 展讯通信(上海)有限公司 声音信号检测方法及装置、计算机可读存储介质、终端

Also Published As

Publication number Publication date
WO2024093460A1 (zh) 2024-05-10
CN117995225A (zh) 2024-05-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23884397

Country of ref document: EP

Kind code of ref document: A1