WO2021093808A1 - Method, apparatus and device for detecting an effective voice signal - Google Patents

Method, apparatus and device for detecting an effective voice signal

Info

Publication number
WO2021093808A1
WO2021093808A1 (PCT/CN2020/128374)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
audio
wavelet
audio intensity
value
Prior art date
Application number
PCT/CN2020/128374
Other languages
English (en)
French (fr)
Inventor
张超鹏 (ZHANG Chaopeng)
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司 (Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. (腾讯音乐娱乐科技(深圳)有限公司)
Publication of WO2021093808A1
Priority to US17/728,198 (published as US20220246170A1)

Classifications

    • All classifications fall under G (PHYSICS) > G10 (MUSICAL INSTRUMENTS; ACOUSTICS) > G10L (SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING):
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L19/0216: Speech or audio signal analysis-synthesis for redundancy reduction using spectral analysis with orthogonal transformation, using wavelet decomposition
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L2025/783 and G10L2025/786: Detection of presence or absence of voice signals based on threshold decision; adaptive threshold
    • G10L21/0272: Voice signal separating (speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L25/18: Extracted parameters being spectral information of each sub-band
    • G10L25/21: Extracted parameters being power information

Definitions

  • This application relates to the field of audio technology, in particular to a method, device and equipment for detecting effective voice signals.
  • Voice is a means of human-computer interaction, but noise interference always exists in the working environment and degrades voice applications. It is therefore necessary to detect the effective voice signal and distinguish it from noise interference for further processing.
  • The difference between a speech signal and a noise signal can be reflected in their respective energies.
  • The signal-to-noise ratio can be understood as the ratio of the speech signal energy to the noise signal energy.
  • When the signal-to-noise ratio is high, the energy of the speech portion is generally much greater than that of the noise portion.
  • When the signal-to-noise ratio is low, the energy of the noise signal is relatively large, almost the same as the energy of the speech signal.
  • In the prior art, a method based on signal energy is used to analyze the speech signal.
  • The detection method distinguishes the voice signal from the noise signal according to the short-term energy of the input signal: the energy of the input signal within a period of time is calculated and compared with the energy of the input signal in an adjacent period of time to judge whether the signal in the current period is a voice signal or a noise signal.
  • With this prior-art scheme, because noise occurs frequently, noise is present both in the signal of the current period and in the signal of the adjacent period.
  • The energy of the current period is the sum of the energy of the noise signal and the speech signal, and the energy of the adjacent period is also the sum of the energy of the noise signal and the speech signal, so the comparison cannot reveal the presence of noise.
  • The frequent appearance of noise increases the energy of the signal, interferes with the detection, and causes noise to be misdetected as an effective voice signal. Therefore, the accuracy of effective voice signal detection in the prior art is not high enough.
  • This application provides an effective voice signal detection method, device and equipment. By collecting the energy information of all sample points in a wavelet signal sequence, the effective voice signal is judged and detected, and the accuracy of effective voice signal detection is improved.
  • the present application provides a method for detecting effective speech signals, the method including:
  • each wavelet decomposition signal includes a plurality of sample points and an audio intensity value of each sample point;
  • the wavelet decomposition signals corresponding to each audio frame signal are spliced to obtain a wavelet signal sequence; the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence are obtained; and the first audio intensity threshold is determined according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence;
  • the determining the first audio intensity threshold according to the maximum and minimum audio intensity values of all sample points in the wavelet signal sequence includes:
  • the first audio intensity threshold and the second audio intensity threshold are determined according to the maximum and minimum of the audio intensity values of all sample points in the wavelet signal sequence, wherein the first audio intensity threshold is smaller than the second audio intensity threshold;
  • the determining that the signals of the sample points in the first audio signal corresponding to sample points having an audio intensity value greater than the first audio intensity threshold are valid voice signals includes:
  • determining that the signals in the first audio signal corresponding to the sample points from the first sample point to the sample point immediately before the second sample point in the wavelet signal sequence are an effective speech segment of the effective speech signal.
  • At least a first preset number of consecutive sample points are included between the second sample point and the first sample point.
  • the method further includes:
  • the average value of the first reference audio intensity values of the second preset number of consecutive samples including the target sample in the wavelet signal sequence is used as the audio intensity value of the target sample.
  • before the average value of the first reference audio intensity values of a second preset number of consecutive sample points including the target sample point in the wavelet signal sequence is used as the audio intensity value of the target sample point, the method includes:
  • using the value obtained by adding the second reference audio intensity value and the third reference audio intensity value as the fourth reference audio intensity value of the target sample point; and using the smallest value among the fourth reference audio intensity values of the target sample point and all sample points in the wavelet signal sequence sorted before the target sample point as the first reference audio intensity value of the target sample point.
  • the obtaining the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence includes:
  • the value obtained by processing the reference maximum value and the reference minimum value in all the wavelet decomposition signals in the wavelet signal sequence is used as the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence.
  • the acquiring the first audio signal of the preset duration includes:
  • the performing wavelet decomposition for each audio frame signal includes:
  • the first reference audio intensity threshold T_L = min(μ1·(Sc_max - Sc_min) + Sc_min, μ2·Sc_min), where Sc_max and Sc_min are the maximum and minimum audio intensity values of all sample points in the wavelet signal sequence, μ1 is the second preset threshold, and μ2 is the third preset threshold.
  • the determining the first audio intensity threshold and the second audio intensity threshold according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence includes:
  • the first reference audio intensity threshold T_L = min(μ1·(Sc_max - Sc_min) + Sc_min, μ2·Sc_min) is determined according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence, where Sc_max and Sc_min are the maximum and minimum audio intensity values of all sample points in the wavelet signal sequence, μ1 is the second preset threshold, and μ2 is the third preset threshold;
  • the second audio intensity threshold T_U = λ·T_L, where λ is the fourth preset threshold and λ is greater than 1.
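  • For illustration only, the following is a minimal Python sketch of the threshold computation just described; the parameter names mu1, mu2 and lam and their numeric default values are assumptions of this example, not values specified by this application.

```python
def audio_intensity_thresholds(sc_max, sc_min, mu1=0.05, mu2=3.0, lam=2.0):
    """Compute the lower and upper audio intensity thresholds from the maximum
    and minimum intensity of the wavelet signal sequence. The coefficient
    values are illustrative assumptions only."""
    t_l = min(mu1 * (sc_max - sc_min) + sc_min, mu2 * sc_min)  # first (lower) threshold
    t_u = lam * t_l                                            # second (upper) threshold, lam > 1
    return t_l, t_u
```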
  • the present application provides a voice signal detection device, including:
  • An acquiring module configured to acquire a first audio signal of a preset duration, where the first audio signal includes at least one audio frame signal;
  • the decomposition module is used to perform wavelet decomposition for each audio frame signal to obtain multiple wavelet decomposition signals corresponding to each audio frame signal.
  • Each wavelet decomposition signal contains multiple sample points and the audio intensity value of each sample point;
  • a splicing module configured to splice the wavelet decomposition signal corresponding to each audio frame signal to obtain a wavelet signal sequence according to the framing sequence of the audio frame signal in the first audio signal;
  • the determining module is used to obtain the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence, and to determine the first audio intensity threshold according to the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence;
  • the determining module is further configured to obtain the sample points in the wavelet signal sequence whose audio intensity value is greater than the first audio intensity threshold, and to determine that the signals in the first audio signal corresponding to those sample points are effective speech signals.
  • the determining module is further configured to determine the first audio intensity threshold and the second audio intensity threshold according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence, wherein the first audio intensity threshold is less than the second audio intensity threshold;
  • the acquiring module is further configured to acquire the first sample point in the wavelet signal sequence, wherein the audio intensity value of the sample point immediately before the first sample point is less than the second audio intensity threshold, and the audio intensity value of the first sample point is greater than the second audio intensity threshold;
  • the acquisition module is further configured to acquire a second sample point in the wavelet signal sequence, where the second sample point is the first sample point appearing after the first sample point in the wavelet signal sequence whose audio intensity value is less than the first audio intensity threshold;
  • the determining module is further configured to determine that the signals in the first audio signal corresponding to the sample points from the first sample point to the sample point immediately before the second sample point in the wavelet signal sequence form an effective speech segment of the effective speech signal;
  • a first preset number of consecutive sample points are included between the second sample point and the first sample point.
  • the determining module is further configured to use the average value of the first reference audio intensity values of the second preset number of consecutive sample points including the target sample point in the wavelet signal sequence as the audio intensity value of the target sample point.
  • the apparatus for detecting a voice signal further includes a calculation module; before the determining module uses the average value of the first reference audio intensity values of the second preset number of consecutive sample points including the target sample point in the wavelet signal sequence as the audio intensity value of the target sample point:
  • the calculation module is configured to multiply the audio intensity value of the sample point immediately before the target sample point in the wavelet signal sequence by a smoothing coefficient to obtain the second reference audio intensity value of the target sample point;
  • the calculation module is further configured to multiply the average value of the audio intensity values of the target sample point and all continuous sample points in the wavelet signal sequence sorted before the target sample point by the remaining smoothing coefficient to obtain the third reference audio intensity value of the target sample point;
  • the calculation module is further configured to use the value obtained by adding the second reference audio intensity value and the third reference audio intensity value as the fourth reference audio intensity value of the target sample point;
  • the determining module is further configured to determine the minimum value among the fourth reference audio intensity values of the target sample point and all the sample points in the wavelet signal sequence whose order is before the target sample point, as the first reference audio intensity value of the target sample point.
  • the acquiring module is further configured to acquire the first reference maximum value and the first reference minimum value among the audio intensity values of all sample points of the first wavelet decomposition signal in the wavelet signal sequence;
  • the determining module is further configured to use the value obtained by processing the reference maximum values and the reference minimum values of all wavelet decomposition signals in the wavelet signal sequence as the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence.
  • the device 14 for detecting a valid voice signal further includes a compensation module.
  • the compensation module is configured to compensate the high-frequency component of the first preset threshold in the original audio signal of the preset duration, so as to obtain the first audio signal.
  • the decomposition module is further configured to perform wavelet packet decomposition for each audio frame signal, and use the signal obtained after the wavelet packet decomposition as the wavelet decomposition signal.
  • the second audio intensity threshold T_U = λ·T_L, where λ is the fourth preset threshold and λ is greater than 1.
  • the present application provides a detection device for valid voice signals, the device includes a transceiver, a processor, and a memory, wherein:
  • the transceiver is connected to the processor and the memory, and the processor is also connected to the memory;
  • the transceiver is configured to obtain a first audio signal of a preset duration, where the first audio signal includes at least one audio frame signal;
  • the processor is configured to perform wavelet decomposition for each audio frame signal to obtain multiple wavelet decomposition signals corresponding to each audio frame signal, and each wavelet decomposition signal contains multiple sample points and the audio intensity value of each sample point;
  • the processor is further configured to splice the wavelet decomposition signals corresponding to each audio frame signal according to the framing order of the audio frame signal in the first audio signal to obtain a wavelet signal sequence;
  • the processor is further configured to obtain the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence, according to the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence Determine the first audio intensity threshold;
  • the processor is further configured to obtain the sample points in the wavelet signal sequence whose audio intensity value is greater than the first audio intensity threshold, and to determine that the signals in the first audio signal corresponding to those sample points are effective speech signals.
  • the memory is used to store a computer program, and the computer program is called by the processor.
  • the processor is further configured to:
  • the first audio intensity threshold and the second audio intensity threshold are determined according to the maximum and minimum of the audio intensity values of all sample points in the wavelet signal sequence, wherein the first audio intensity threshold is smaller than the second audio intensity threshold;
  • determining that the signals in the first audio signal corresponding to the sample points from the first sample point to the sample point immediately before the second sample point in the wavelet signal sequence are an effective speech segment of the effective speech signal.
  • a first preset number of consecutive sample points are included between the second sample point and the first sample point.
  • the processor is further configured to:
  • the average value of the first reference audio intensity values of the second preset number of consecutive samples including the target sample in the wavelet signal sequence is used as the audio intensity value of the target sample.
  • the processor is further configured to:
  • using the value obtained by adding the second reference audio intensity value and the third reference audio intensity value as the fourth reference audio intensity value of the target sample point; and using the smallest value among the fourth reference audio intensity values of the target sample point and all sample points in the wavelet signal sequence sorted before the target sample point as the first reference audio intensity value of the target sample point.
  • the processor is further configured to:
  • the value obtained by processing the reference maximum value and the reference minimum value in all the wavelet decomposition signals in the wavelet signal sequence is used as the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence.
  • the processor is further configured to:
  • the processor is further configured to:
  • the second audio intensity threshold T_U = λ·T_L, where λ is the fourth preset threshold and λ is greater than 1.
  • the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, implement the steps of the methods described in the above aspects.
  • the effective speech signal is judged and detected according to the energy distribution of the wavelet signal sequence, and the accuracy of the effective speech signal detection is improved.
  • FIG. 1 is a schematic flowchart of a method for detecting effective voice signals according to an embodiment of this application
  • FIG. 2 is a schematic diagram of a structure of wavelet decomposition provided by an embodiment of the application.
  • Fig. 3 is an amplitude-frequency characteristic curve of a high- and low-pass filter provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of a wavelet decomposition processing process provided by an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of a wavelet packet decomposition provided by an embodiment of this application.
  • FIG. 6 is a schematic diagram of a wavelet packet decomposition processing process provided by an embodiment of this application.
  • FIG. 7 is a schematic flowchart of another effective voice signal detection method provided by an embodiment of this application.
  • FIG. 8 is a schematic diagram of a voice signal provided by an embodiment of this application.
  • FIG. 9 is a schematic flowchart of another effective voice signal detection method provided by an embodiment of this application.
  • FIG. 10 is a schematic diagram of a process for tracking a voice signal according to an embodiment of this application.
  • FIG. 11a is a schematic diagram of another voice signal provided by an embodiment of this application.
  • FIG. 11b is a schematic diagram of another voice signal provided by an embodiment of this application.
  • FIG. 12 is a schematic diagram of another process for tracking a voice signal according to an embodiment of this application.
  • FIG. 13a to 13e are respectively schematic diagrams of detection effects of an effective voice signal provided by an embodiment of this application.
  • FIG. 14 is a structural block diagram of an effective voice signal detection device provided by an embodiment of this application.
  • FIG. 15 is a structural block diagram of an effective voice signal detection device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of an effective voice signal detection method provided by an embodiment of the application. As shown in Figure 1, the specific execution steps of this embodiment are as follows:
  • The effective voice signal detection device acquires the first audio signal of a preset duration. Since oral muscle movement is slow relative to the voice frequency, the voice signal is relatively stable within a short time range, that is, the voice signal has short-term stability. According to the short-term stability of the speech signal, the speech signal can be divided into segments for detection, that is, the first audio signal of the preset duration is divided into frames to obtain at least one audio frame signal. Optionally, there is no overlap between the audio frame signals.
  • The size of the frame shift equals the size of the frame length; the overlap between the previous frame and the next frame is the frame length minus the frame shift, so in this case adjacent frames do not overlap.
  • the effective voice signal detection device performs voice signal sampling at a frequency of 16 kHz, that is, 16k samples are collected in 1 second, and a first audio signal with a preset duration of 5 seconds is acquired. Taking 10ms as the frame shift and 10ms as the frame length, the first audio signal is divided into frames, and each audio frame signal includes 160 samples, and the audio intensity values corresponding to the 160 samples are obtained.
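  • As an illustration only, the framing step can be sketched in Python as follows (the function name and the use of NumPy are assumptions of this example, not part of the application); with 16 kHz sampling, a 10 ms frame length and a 10 ms frame shift, each frame holds 160 samples and frames do not overlap.

```python
import numpy as np

def frame_signal(x, sample_rate=16000, frame_ms=10, shift_ms=10):
    """Split a 1-D signal into frames of frame_ms length with a hop of shift_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 160 samples at 16 kHz / 10 ms
    hop = int(sample_rate * shift_ms / 1000)         # equal to frame_len, so no overlap
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

# A 5-second signal yields 500 frames of 160 samples each.
frames = frame_signal(np.random.randn(5 * 16000))
assert frames.shape == (500, 160)
```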
  • each wavelet decomposition signal contains multiple samples and an audio intensity value of each sample point .
  • the first audio signal is acquired, the first audio signal is divided into frames to obtain an audio frame signal, and each audio frame signal is subjected to wavelet decomposition.
  • the wavelet decomposition is described in detail below.
  • For the wavelet decomposition, refer to Figures 2 to 4; first, refer to Figure 2.
  • Figure 2 is a schematic diagram of a wavelet decomposition structure provided by an embodiment of the application.
  • The audio frame signals obtained after the first audio signal is divided into frames are subjected to wavelet decomposition.
  • the first audio frame signal is used as an example for description.
  • the process of wavelet decomposition can be considered as a process of high- and low-pass filtering.
  • FIG. 3 is the amplitude-frequency characteristic curve of a high- and low-pass filter provided by an embodiment of the application.
  • the characteristics of the high- and low-pass filtering are different according to the selected filter model.
  • a 16-tap Daubechies8 wavelet may be selected.
  • the first-level wavelet decomposition signal is obtained through the high- and low-pass filter shown in Figure 3.
  • The first-level wavelet decomposition signal includes low-frequency information L1 and high-frequency information H1.
  • High- and low-pass filtering is then performed on the low-frequency information L1 in the first-level wavelet decomposition signal to obtain the low-frequency information L2 and high-frequency information H2 of the second-level wavelet decomposition signal, and high- and low-pass filtering is performed on the low-frequency information L2 in the second-level wavelet decomposition signal to obtain the low-frequency information L3 and high-frequency information H3 of the third-level wavelet decomposition signal.
  • Multi-level wavelet decomposition can be performed on the input signal in the same way; three levels are used here only as an example.
  • L3 and H3 contain all the information of L2, L2 and H2 contain all the information of L1, and L1 and H1 contain all the information of the first audio frame signal, so the sub-wavelet signal sequence obtained by splicing L3, H3, H2 and H1 can represent the first audio frame signal.
  • The sub-wavelet signal sequences of the multiple audio frame signals are spliced according to the framing order of the first audio signal to form a wavelet signal sequence representing the first audio signal. It can be seen that after wavelet decomposition the low-frequency components of the first audio frame signal are analyzed more finely, the resolution is improved, and the low-frequency band has a relatively wide analysis window and good local microscopic characteristics.
  • Fig. 4 is a schematic diagram of a wavelet decomposition processing process provided by an embodiment of this application.
  • wavelet decomposition is performed on the first audio frame signal in a possible implementation manner.
  • The signals after high-pass filtering and low-pass filtering can be down-sampled. With 16 kHz as the sampling frequency of the first audio signal, 10 ms as the frame shift and 10 ms as the frame length, the first audio signal is divided into frames and each audio frame signal includes 160 samples.
  • Wavelet decomposition is performed for each audio frame signal. The number of samples after the first high-pass filter is 160, and the number of samples after the first low-pass filter is also 160, forming the first-level wavelet decomposition signal. The first low-pass filtered signal is down-sampled; its sampling frequency after down-sampling is half the sampling frequency of the first audio frame signal, so the number of samples after the first low-pass filtering and down-sampling is 80. In the same way, the number of samples after the first high-pass filtering and down-sampling is 80, so the number of samples in the first-level wavelet decomposition signal, the sum of the samples after the first low-pass and first high-pass filtering and down-sampling, is 160.
  • The number of samples included in the sub-wavelet signal sequence obtained after wavelet decomposition of the first audio frame signal is therefore the number of samples included in the first audio frame signal.
  • It is understandable that, according to the Nyquist sampling theorem, the sampling frequency is twice the highest frequency of the voice signal, so a voice signal collected at a sampling frequency of 16 kHz corresponds to a highest frequency of 8 kHz.
  • The first-level wavelet decomposition of the first audio frame signal yields a first-level wavelet decomposition signal, which includes a signal obtained after the first high-pass filtering and down-sampling and a signal obtained after the first low-pass filtering and down-sampling.
  • The frequency band of the signal obtained after the first low-pass filtering and down-sampling is 0 to 4 kHz, and the frequency band corresponding to the wavelet signal H1 obtained after the first high-pass filtering and down-sampling is 4 kHz to 8 kHz. The second-level wavelet decomposition of the first-level wavelet decomposition signal yields the second-level wavelet decomposition signal.
  • Specifically, the signal obtained after the first low-pass filtering and down-sampling is subjected to the second high-pass filter and the second low-pass filter; the wavelet signal H2 obtained after the second high-pass filtering and down-sampling corresponds to the frequency band 2 kHz to 4 kHz, and the signal obtained after the second low-pass filtering and down-sampling corresponds to the frequency band 0 to 2 kHz.
  • The second-level wavelet decomposition signal is subjected to the third-level wavelet decomposition to obtain the third-level wavelet decomposition signal. Specifically, the third high-pass filter and the third low-pass filter are applied to the signal obtained after the second low-pass filtering and down-sampling; the wavelet signal H3 obtained after the third high-pass filtering and down-sampling corresponds to the frequency band 1 kHz to 2 kHz, and the wavelet signal L3 obtained after the third low-pass filtering and down-sampling corresponds to the frequency band 0 to 1 kHz.
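  • As a hedged sketch of the three-level decomposition just described, the following Python fragment uses PyWavelets, an assumed implementation choice; the application itself only specifies a 16-tap Daubechies8 wavelet applied as repeated high- and low-pass filtering with down-sampling.

```python
# Three-level wavelet decomposition of one 160-sample frame with a db8 wavelet.
import numpy as np
import pywt

frame = np.random.randn(160)                  # one 10 ms audio frame at 16 kHz
L3, H3, H2, H1 = pywt.wavedec(frame, 'db8', level=3)

# Splice the decomposition of this frame into a sub-wavelet signal sequence.
sub_sequence = np.concatenate([L3, H3, H2, H1])
# Note: with PyWavelets' default boundary extension the coefficient counts are
# slightly larger than the ideal 20/20/40/80 split described in the text.
```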
  • the performing wavelet decomposition for each audio frame signal includes: performing wavelet packet decomposition for each audio frame signal, and using the signal obtained after the wavelet packet decomposition as the wavelet decomposition signal.
  • FIG. 5 is a schematic diagram of a wavelet packet decomposition structure provided by an embodiment of the application, as shown in Figure 5.
  • the audio frame signal obtained after framing the first audio signal is subjected to wavelet packet decomposition.
  • the first audio frame signal is used as an example for description.
  • the process of wavelet packet decomposition can also be considered as high- and low-pass filtering.
  • the specific high- and low-pass filtering characteristics can also be seen in Figure 3 above.
  • the filter type can be a 16-tap Daubechies8 wavelet.
  • The difference between wavelet packet decomposition and wavelet decomposition is that wavelet packet decomposition decomposes both the low-frequency part and the high-frequency part of the signal. Therefore, for signals containing a large amount of intermediate- and high-frequency information, wavelet packet decomposition can perform better time-frequency localization analysis.
  • the first-level wavelet decomposition signal is obtained through the high-low-pass filter.
  • The first-level wavelet decomposition signal includes low-frequency information lp1 and high-frequency information hp1. High- and low-pass filtering is then performed on the low-frequency information lp1 in the first-level wavelet decomposition signal to obtain low-frequency information lp2 and high-frequency information hp2.
  • Unlike wavelet decomposition, wavelet packet decomposition also performs high- and low-pass filtering on the decomposed high-frequency information, so the high-frequency information hp1 in the first-level wavelet decomposition signal is high- and low-pass filtered to obtain lp3 and hp3. The low-frequency information in the second-level wavelet decomposition signal thus includes lp2 and lp3, and the high-frequency information includes hp2 and hp3. The low-frequency information lp2 and lp3 and the high-frequency information hp2 and hp3 are each subjected to high- and low-pass filtering to obtain the third-level wavelet decomposition signal.
  • The third-level wavelet decomposition signal includes low-frequency information lp4, lp5, lp6, and lp7, and high-frequency information hp4, hp5, hp6, and hp7.
  • Multi-level wavelet packet decomposition can be performed on the input signal in the same way; three levels are used here only as an example.
  • lp4 and hp4 contain all the information of lp2, lp5 and hp5 contain all the information of hp2, and lp2 and hp2 contain all the information of lp1, so lp4, hp4, lp5 and hp5 contain all the information of lp1; lp6 and hp6 contain all the information of lp3, lp7 and hp7 contain all the information of hp3, and lp3 and hp3 contain all the information of hp1.
  • The sub-wavelet signal sequence composed of lp4, hp4, lp5, hp5, lp6, hp6, lp7, and hp7 can therefore represent the first audio frame signal. The sub-wavelet signal sequences of all audio frame signals are spliced according to the framing order of the audio frames in the first audio signal to obtain the wavelet signal sequence representing the first audio signal. It can be seen that after wavelet packet decomposition of the first audio frame signal, the resolution of both the high frequency band and the low frequency band is improved.
  • FIG. 6 is a schematic diagram of a wavelet packet decomposition processing process provided by an embodiment of the application.
  • the wavelet packet decomposition is performed on the first audio frame signal.
  • The signals after high-pass filtering and low-pass filtering can be down-sampled. With 16 kHz as the sampling frequency of the first audio signal, 10 ms as the frame shift and 10 ms as the frame length, the first audio signal is divided into frames and each audio frame signal includes 160 samples.
  • The wavelet packet decomposition is performed for each audio frame signal.
  • The number of samples after the first high-pass filter is 160, and the number of samples after the first low-pass filter is also 160.
  • The first high-pass filtered and first low-pass filtered signals form the first-level wavelet packet decomposition signal.
  • The first low-pass filtered signal is down-sampled; its sampling frequency after down-sampling is half the sampling frequency of the first audio frame signal, so the number of samples after the first low-pass filtering and down-sampling is 80. In the same way, the number of samples after the first high-pass filtering and down-sampling is 80.
  • The number of samples in the first-level wavelet decomposition signal, the sum of the samples after the first low-pass and first high-pass filtering and down-sampling, is therefore 160, which is consistent with the number of samples of an audio frame signal, and so on for the following levels.
  • The signal after the first low-pass filtering and down-sampling is subjected to second high-pass filtering and second low-pass filtering with down-sampling, and the sum of the sample points obtained is the number of points after the first low-pass filtering and down-sampling;
  • the signal after the first high-pass filtering and down-sampling is subjected to third high-pass filtering and third low-pass filtering with down-sampling, and the sum of the sample points obtained is the number of points after the first high-pass filtering and down-sampling;
  • the signal after the second low-pass filtering and down-sampling is subjected to fourth high-pass filtering and fourth low-pass filtering with down-sampling, and the sum of the sample points obtained is the number of points after the second low-pass filtering and down-sampling;
  • the signal after the second high-pass filtering and down-sampling is subjected to fifth high-pass filtering and fifth low-pass filtering with down-sampling, and the sum of the sample points obtained is the number of points after the second high-pass filtering and down-sampling.
  • The number of samples included in the sub-wavelet signal sequence obtained after the wavelet packet decomposition of the first audio frame signal equals the number of samples of the first audio frame. It is understandable that, according to the Nyquist sampling theorem, the sampling frequency is twice the highest frequency of the voice signal, so a voice signal collected at a sampling frequency of 16 kHz corresponds to a highest frequency of 8 kHz.
  • The first-level wavelet packet decomposition of the first audio frame signal yields a first-level wavelet decomposition signal, which includes a signal down-sampled by the first high-pass filter and a signal down-sampled by the first low-pass filter.
  • The signal obtained after the first low-pass filtering and down-sampling corresponds to the frequency band 0 to 4 kHz, and the signal obtained after the first high-pass filtering and down-sampling corresponds to the frequency band 4 kHz to 8 kHz. The first-level wavelet decomposition signal is subjected to the second-level wavelet packet decomposition to obtain the second-level wavelet decomposition signal.
  • The second-level wavelet decomposition signal includes the signal down-sampled by the second low-pass filter, the signal down-sampled by the second high-pass filter, the signal down-sampled by the third low-pass filter, and the signal down-sampled by the third high-pass filter.
  • Specifically, the signal obtained after the first low-pass filtering and down-sampling is subjected to the second high-pass filter and the second low-pass filter; the signal obtained after the second high-pass filtering and down-sampling corresponds to the frequency band 2 kHz to 4 kHz, and the signal obtained after the second low-pass filtering and down-sampling corresponds to the frequency band 0 to 2 kHz.
  • The signal obtained after the first high-pass filtering and down-sampling is subjected to the third high-pass filtering and the third low-pass filtering, so the signal obtained after the third high-pass filtering and down-sampling corresponds to the frequency band 6 kHz to 8 kHz, and the signal obtained after the third low-pass filtering and down-sampling corresponds to the frequency band 4 kHz to 6 kHz.
  • The second-level wavelet decomposition signal is subjected to the third-level wavelet packet decomposition to obtain the third-level wavelet decomposition signal, which includes the signal down-sampled by the fourth low-pass filter, the signal down-sampled by the fourth high-pass filter, the signal down-sampled by the fifth low-pass filter, and the other third-level down-sampled signals.
  • Specifically, the fourth low-pass filter and the fourth high-pass filter are applied to the signal obtained after the second low-pass filtering and down-sampling; the wavelet packet signal lp4 obtained after the fourth low-pass filtering and down-sampling corresponds to the frequency band 0 to 1 kHz, and the wavelet packet signal hp4 obtained after the fourth high-pass filtering and down-sampling corresponds to the frequency band 1 kHz to 2 kHz.
  • This embodiment takes the three-level wavelet packet decomposition as an example. Unlike wavelet decomposition, wavelet packet decomposition continues to analyze the high-frequency signal obtained by high-pass filtering at each level, subjecting it to high- and low-pass filtering again.
  • the wavelet packet signals lp4, hp4, lp5, hp5, lp6, hp6, lp7, and hp7 in the third-level wavelet decomposition signal can be spliced into a sub-wavelet signal sequence as the wavelet decomposition signal of the first audio frame signal.
  • the first-level wavelet decomposition signal, the second-level wavelet decomposition signal, and the third-level wavelet decomposition signal may all be obtained by high- and low-pass filtering of the same filter type.
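  • A hedged Python sketch of the wavelet packet decomposition follows; PyWavelets' WaveletPacket is an assumed implementation of the repeated high- and low-pass filtering with down-sampling described above, and the node ordering shown is only illustrative.

```python
# Three-level wavelet packet decomposition of one frame using a db8 wavelet.
import numpy as np
import pywt

frame = np.random.randn(160)
wp = pywt.WaveletPacket(data=frame, wavelet='db8', mode='symmetric', maxlevel=3)

# The eight level-3 nodes play the role of lp4/hp4 ... lp7/hp7 in the text.
level3 = wp.get_level(3, order='natural')
sub_sequence = np.concatenate([node.data for node in level3])
```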
  • According to the framing sequence of the audio frame signals in the first audio signal, the wavelet decomposition signals corresponding to each audio frame signal are spliced to obtain a wavelet signal sequence.
  • The wavelet decomposition signal of the first audio frame is obtained according to step 101, and the wavelet decomposition signals of all audio frames in the first audio signal are obtained in the same way.
  • The wavelet decomposition signals of all audio frames are spliced head to tail according to the audio framing sequence determined in step 100 to obtain a wavelet signal sequence representing the information of the first audio signal.
  • the sample point values of all sample points in the wavelet signal sequence represent the voltage amplitude of the sample point.
  • the audio intensity value may be the voltage amplitude of the sample point;
  • the audio intensity value may be the energy value of a sample point, and the voltage amplitude of the sample point is squared to obtain the energy value of the sample point.
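  • The splicing and intensity computation can be sketched as follows; this is a minimal Python example that reuses per-frame wavelet decomposition as in the earlier sketch, with illustrative helper names, and takes the squared amplitude mentioned above as the energy value.

```python
# Build the wavelet signal sequence by splicing per-frame sub-sequences in
# framing order, then take the squared amplitude of each sample point as its
# audio intensity (energy) value.
import numpy as np
import pywt

def frame_sub_sequence(frame, wavelet='db8', level=3):
    return np.concatenate(pywt.wavedec(frame, wavelet, level=level))

def wavelet_signal_sequence(frames):
    return np.concatenate([frame_sub_sequence(f) for f in frames])

frames = np.random.randn(500, 160)            # 5 s of 10 ms frames at 16 kHz
sequence = wavelet_signal_sequence(frames)    # spliced head to tail
intensity = sequence ** 2                     # energy value per sample point
```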
  • the first audio intensity threshold is determined according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence, as a basis for judging valid speech signals.
  • The acquiring of the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence includes: acquiring the first reference maximum value and the first reference minimum value among the audio intensity values of all the sample points of the first wavelet decomposition signal in the wavelet signal sequence; and using the value obtained by processing the reference maximum values and the reference minimum values of all the wavelet decomposition signals in the wavelet signal sequence as the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence.
  • the wavelet signal sequence includes multiple wavelet decomposition signals, and the maximum and minimum values of all sample points in each wavelet decomposition signal are obtained.
  • the maximum and minimum values in each wavelet decomposition signal are averaged.
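  • A minimal sketch of this step, assuming the processing is the averaging of the per-decomposition maxima and minima mentioned above; the function name is illustrative only.

```python
# Take the maximum and minimum intensity within each wavelet decomposition
# signal, then average them across all decomposition signals to obtain
# Sc_max and Sc_min for the whole wavelet signal sequence.
import numpy as np

def sequence_max_min(decomposition_intensities):
    """decomposition_intensities: list of 1-D arrays, one per wavelet
    decomposition signal, holding the audio intensity values of its samples."""
    ref_max = [seg.max() for seg in decomposition_intensities]
    ref_min = [seg.min() for seg in decomposition_intensities]
    return float(np.mean(ref_max)), float(np.mean(ref_min))  # Sc_max, Sc_min
```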
  • Before the acquiring of the first audio signal of the preset duration, the method includes: compensating the high-frequency component of the first preset threshold in the original audio signal of the preset duration, so as to obtain the first audio signal.
  • The signal is greatly damaged during transmission; in order to obtain a better signal waveform at the receiving terminal, the damaged signal needs to be compensated.
  • Exemplarily, the pre-emphasis may take the form y(n) = x(n) - a·x(n-1), where x(n) is the audio intensity value of the sample point and a is a pre-emphasis coefficient; exemplarily, a is greater than 0.9 and less than 1, which can be understood as the first preset threshold.
  • y(n) is a signal that has undergone pre-emphasis processing. It can be understood that the pre-emphasis processing can be regarded as passing the first audio signal through a high-pass filter to compensate for high-frequency components, and reduce high-frequency loss caused by the process of lip pronunciation or microphone recording.
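  • A minimal sketch of this pre-emphasis step, assuming the standard first-order form y(n) = x(n) - a·x(n-1) with 0.9 < a < 1; the exact formula is not reproduced in this text, so this is an assumption.

```python
# First-order pre-emphasis: boosts high frequencies before framing.
import numpy as np

def pre_emphasis(x, a=0.97):
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                   # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]    # y(n) = x(n) - a * x(n-1)
    return y
```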
  • The energy information of all sample points in the wavelet signal sequence is collected, the audio intensity threshold is determined according to the energy distribution of the wavelet signal sequence, and the effective speech signal is judged and detected according to the audio intensity threshold, improving the accuracy of effective speech signal detection.
  • FIG. 7 is a schematic flowchart of another effective voice signal detection method provided by an embodiment of the application. As shown in Figure 7, the specific execution steps of this embodiment are as follows:
  • Wavelet decomposition is performed for each audio frame signal to obtain multiple wavelet decomposition signals corresponding to each audio frame signal, and each wavelet decomposition signal includes multiple sample points and the audio intensity value of each sample point;
  • step 700, step 701, and step 702 are the process of framing and wavelet decomposition of the first audio signal to obtain a wavelet signal sequence.
  • The first audio intensity threshold and the second audio intensity threshold are determined according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence. Optionally:
  • The acquiring of the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence includes: acquiring the first reference maximum value and the first reference minimum value among the audio intensity values of all the sample points of the first wavelet decomposition signal in the wavelet signal sequence; and using the value obtained by processing the reference maximum values and the reference minimum values of all the wavelet decomposition signals in the wavelet signal sequence as the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence.
  • the wavelet signal sequence includes multiple wavelet decomposition signals, and the maximum and minimum values of all sample points in each wavelet decomposition signal are obtained.
  • the maximum and minimum values in each wavelet decomposition signal are averaged.
  • The first sample point is the starting point of the effective voice signal, and it is predefined that the effective voice segment is entered from the first sample point.
  • Step 705: Obtain a second sample point in the wavelet signal sequence, where the second sample point is the first sample point appearing after the first sample point whose audio intensity value is smaller than the first audio intensity threshold.
  • The first sample point is predefined as the starting end point of the effective voice segment, from which the effective voice segment is entered. When, after the first sample point, the audio intensity value of a sample point falls below the first audio intensity threshold for the first time, that sample point is the second sample point, and it is considered that the second sample point has exited the effective speech segment where the first sample point is located.
  • The signals in the first audio signal corresponding to the sample points from the first sample point to the sample point immediately before the second sample point in the wavelet signal sequence are an effective voice segment of the effective voice signal.
  • Since the second sample point has exited the effective speech segment where the first sample point is located, it can be determined that the signals in the first audio signal corresponding to the sample points from the first sample point to the sample point immediately before the second sample point form an effective speech segment.
  • At least a first preset number of consecutive sample points are included between the second sample point and the first sample point; for example, the first preset number is 20.
  • If the first sample point and the second sample point are too close, the occurrence of a sample point greater than the second audio intensity threshold is caused by jitter from transient noise, not by valid speech.
  • Before the acquiring of the first audio signal of the preset duration, the method includes: compensating the high-frequency component of the first preset threshold in the original audio signal of the preset duration, so as to obtain the first audio signal.
  • The signal is greatly damaged during transmission; in order to obtain a better signal waveform at the receiving terminal, the damaged signal needs to be compensated.
  • Exemplarily, the pre-emphasis may take the form y(n) = x(n) - a·x(n-1), where x(n) is the audio intensity value of the sample point and a is a pre-emphasis coefficient; exemplarily, a is greater than 0.9 and less than 1, which can be understood as the first preset threshold.
  • y(n) is a signal that has undergone pre-emphasis processing. It can be understood that the pre-emphasis processing can be regarded as passing the first audio signal through a high-pass filter to compensate for high-frequency components, and reduce high-frequency loss caused by the process of lip pronunciation or microphone recording.
  • FIG. 8 is a schematic diagram of a voice signal provided by an embodiment of this application.
  • the step 104 described above in conjunction with FIG. 1 determines the effective voice signal according to the first audio intensity threshold. Further, in this embodiment, the effective voice segment is determined according to the first audio intensity threshold and the second audio intensity threshold.
  • The transient noise shown in FIG. 8 is eliminated from the effective voice signal to avoid the situation in which transient noise is misdetected as a valid voice signal, further improving the accuracy of effective signal detection.
  • FIG. 9 is a schematic diagram of another effective voice signal detection process provided by an embodiment of this application. As shown in FIG. 9, the specific execution steps are as follows:
  • the sample point index i is an independent variable, representing the i-th sample point
  • The starting point index is is a recording variable that records the starting sample point of the effective signal segment; in order to traverse all the sample points in the wavelet signal sequence, the independent variable i keeps changing, so the variable is is needed to record the first sample point.
  • The index idx of the effective voice segment is also a recording variable; it records the idx-th effective voice segment, and idx can be defined to count the number of valid speech segments included in the first audio signal.
  • The second audio intensity threshold may be regarded as the upper threshold of the effective voice signal, and the audio intensity value of each sample point is compared with the second audio intensity threshold.
  • If the audio intensity value Sc(i) of the i-th sample point is greater than the second audio intensity threshold, the i-th sample point is the first sample point, that is, is records the first sample point, and step 902 is performed. If the audio intensity value Sc(i) of the i-th sample point is not greater than the second audio intensity threshold, step 903 is performed.
  • The audio intensity value Sc(i) of the i-th sample point is compared with the first audio intensity threshold to realize the acquisition of the second sample point in the wavelet signal sequence in step 705 of the embodiment described in conjunction with FIG. 7.
  • The second sample point is the first sample point in the wavelet signal sequence after the first sample point whose audio intensity value is less than the first audio intensity threshold.
  • In order to ensure that the second sample point is after the first sample point, the starting index is must be checked. If is is not equal to 0, the first sample point has already appeared, and the sample point whose audio intensity value first falls below the first audio intensity threshold is determined as the second sample point. It is understandable that the second sample point has exited the effective voice segment where the first sample point is located, and the sample point immediately before the second sample point is the end point of the effective voice segment.
  • The first preset number of consecutive sample points may be represented by a period of time Tmin.
  • The frame length of the audio frame signal is 10 ms and includes 160 samples.
  • The sample point interval in the wavelet signal sequence is 0.5 ms; if the first preset number is 20, then Tmin is 20 times 0.5 ms, that is, Tmin equals 10 ms.
  • The beginning of the detected segment may be noise jitter: the energy of transient noise rises quickly and then falls quickly, so the audio intensity value of a sample point exceeds the second audio intensity threshold but does not stay high long enough, dropping below the first audio intensity threshold within a short period, which is not consistent with the short-term stability of a speech signal. Such a segment is discarded, and step 906 is executed.
  • In step 706, it is determined that the signals in the first audio signal corresponding to the sample points from the first sample point to the sample point immediately before the second sample point are a valid speech segment of the valid speech signal.
  • the interval representation form of the effective speech segment is [is, i-1], the is records the first sample point, the i is the second sample point, and i-1 is the second sample point The same point before the point.
  • idx idx+1, and record the number of effective signal segments included in the wavelet signal sequence. Then step 906 is executed.
  • i i+1. Specifically, the sample points of the wavelet signal sequence are traversed continuously, and the sample points are traversed from the front to the back by adding 1 operation.
  • step 909 is performed.
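The scanning procedure summarized in the bullets above (steps 900 to 909 of Fig. 9) can be restated compactly. The following Python sketch is an illustrative reconstruction, not code from the patent: the array sc, the thresholds t_lower/t_upper and the parameter min_gap (Tmin expressed in sample points) are assumed inputs, and the start index is is renamed i_s because is is a Python keyword.

```python
def detect_segments(sc, t_lower, t_upper, min_gap):
    """Scan audio intensity values and return (start, end) index pairs of
    candidate effective speech segments, mirroring steps 900-909."""
    segments = []
    i_s = 0                  # start-point index `is`; 0 doubles as "not inside a segment"
    for i in range(len(sc)):
        if sc[i] > t_upper and i_s == 0:
            i_s = i                            # step 902: tentatively enter a segment
        elif sc[i] < t_lower and i_s != 0:
            if i > i_s + min_gap:              # step 904: segment lasted long enough
                segments.append((i_s, i - 1))  # step 905: effective segment [is, i-1]
            i_s = 0                            # step 906: reset and keep scanning
    if i_s != 0:                               # step 909: sequence ended inside a segment
        segments.append((i_s, len(sc) - 1))
    return segments
```

With the example values above (0.5 ms sample spacing and a first preset number of 20), min_gap would simply be 20.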
  • Fig. 10 is a schematic diagram of a process for tracking a voice signal according to an embodiment of the application. As shown in Fig. 10, the specific tracking steps are as follows:
  • the samples in the wavelet sequence are time-domain amplitude smoothed, so that a smooth transition can be made between the sample points before and after the voice signal, and the influence of glitches on the voice signal is reduced.
  • S(i) represents the audio intensity value of the target sample point
  • S(i-1) represents the audio intensity value of the sample point preceding the target sample point
  • α_s represents the smoothing coefficient.
  • the audio intensity value S(i-1) of the sample point preceding the target sample point is multiplied by the smoothing coefficient α_s to obtain the second reference audio intensity value of the target sample point.
  • the second reference audio intensity value is α_s × S(i-1).
  • the second reference audio intensity value is a part of the time-domain smoothing result
  • the average of the audio intensity values of all consecutive sample points in the wavelet signal sequence that include the target sample point and are ordered before the target sample point is multiplied by the remaining smoothing coefficient as the other part of the time-domain smoothing result.
  • taking 3-level wavelet packet decomposition of the first audio signal as an example, the wavelet signal sequence includes 8 wavelet packet decomposition signals, and the average value M(i) of the audio intensity values of all consecutive sample points before the target sample point is given by formula 1.
  • i in formula 1 is the i-th sample point in the wavelet signal sequence
  • l represents the l-th wavelet decomposition signal. It can be understood that i is less than the total number of all sample points in the wavelet signal sequence.
  • the value obtained by adding the second reference audio intensity value and the third reference audio intensity value is used as the fourth reference audio intensity value of the target sample point.
  • the second reference audio intensity value is α_s × S(i-1)
  • the third reference audio intensity value is M(i) × (1-α_s)
  • the second reference audio intensity value and the third reference audio intensity value are added to obtain the fourth reference audio intensity value α_s × S(i-1) + M(i) × (1-α_s)
  • the fourth reference audio intensity value may be regarded as the smoothed audio intensity value of the target sample point, that is, S(i) = α_s × S(i-1) + M(i) × (1-α_s), and the fourth reference audio intensity value is used as the audio intensity value of the target sample point.
  • the minimum value among the fourth reference audio intensity values of the target sample point and of all sample points ordered before it in the wavelet signal sequence is used as the first reference audio intensity value of the target sample point. Specifically, the time length of the signal to be tracked is preset, and the fourth reference audio intensity values of all the sample points ordered before the target sample point in the wavelet signal sequence are divided into tracking segments of the preset time length.
  • the minimum value of the fourth reference audio intensity values of all sample points in the first segment is recorded and passed to the tracking segment of the next preset duration; the minimum value passed from the previous segment is compared with the audio intensity value of the first sample point of the current segment, the smaller of the two is recorded, and that smaller value is then compared with the audio intensity value of the next sample point of the segment, and so on, so that each time the smaller of the two is recorded and compared with the audio intensity value of the next sample point, thereby obtaining the minimum value of the fourth reference audio intensity values of all sample points within the segment and determining the first reference audio intensity value of the target sample point.
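As a concrete reading of steps 1000 to 1003, the sketch below applies the smoothing S(i) = α_s·S(i-1) + M(i)·(1-α_s) followed by a running minimum over the fourth reference values. It is a simplified, assumption-laden sketch: M(i) is taken here as the mean of the smoothed intensities of all sample points before i on a single flattened sequence (formula 1 defines it per wavelet packet sub-band), and α_s = 0.7 follows the example value given later in the description.

```python
import numpy as np

def smooth_and_track(sc, alpha_s=0.7):
    """Time-domain smoothing (steps 1000-1002) and running minimum (step 1003)."""
    n = len(sc)
    s = np.empty(n)      # fourth reference audio intensity values S(i)
    s_m = np.empty(n)    # first reference audio intensity values S_m(i)
    s[0] = sc[0]
    s_m[0] = s[0]
    running_sum = s[0]
    for i in range(1, n):
        m_i = running_sum / i                              # mean of smoothed samples 0..i-1
        s[i] = alpha_s * s[i - 1] + (1.0 - alpha_s) * m_i  # steps 1000-1002
        s_m[i] = min(s_m[i - 1], s[i])                     # step 1003: running minimum
        running_sum += s[i]
    return s, s_m
```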
  • FIG. 11a is a schematic diagram of another voice signal provided by an embodiment of this application.
  • the previous embodiments described in conjunction with Figures 1 to 9 can obtain accurate first and second audio intensity thresholds by counting all the sample points of the wavelet signal sequence, so that transient noise is excluded from the range of the effective speech segment, achieving the effect shown in FIG. 11a; implementing this embodiment, the effect achieved is shown in FIG. 11b.
  • FIG. 11b is a schematic diagram of yet another voice signal provided by an embodiment of this application.
  • by tracking the distribution of the audio intensity of all sample points over the preset duration, the audio intensity values of the sample points in the wavelet signal sequence are weakened, the energy of transient noise is greatly attenuated, and the interference of transient noise with effective voice signal detection is reduced.
  • the effective voice signal is detected by tracking the processed first audio intensity threshold and the second audio intensity threshold, thereby improving the accuracy of effective voice signal detection.
  • the short-term mean smoothing is performed on the target sample, and the value after the short-term mean smoothing is used as the audio intensity value of the target sample.
  • in one possible implementation, the audio intensity value S_C(i) of the i-th sample point is given by formula 2.
  • 2M in Formula 2 is the second preset number of consecutive samples
  • S m (i) is the first reference audio intensity value of the target sample
  • S_m(i-m) represents the first reference audio intensity value of the sample point m positions before or after the i-th sample point.
  • for example, M = 80
  • the second preset number of consecutive sample points is 160
  • this means that the first reference audio intensity values of 80 sample points before and 80 sample points after the i-th sample point are summed, giving the sum of the audio intensity values of the target sample point i and the M sample points on either side of it.
  • the summation result is then averaged: the sum of the audio intensity values is divided by the number of sample points, and the result is used as the short-term mean-smoothed audio intensity value S_C(i) of the i-th sample point.
  • in formula 2, m is the running variable; to avoid negative sample indices, i must be greater than M, so taking M equal to 80 as an example, mean smoothing starts from the 81st sample point.
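A minimal sketch of the short-term mean smoothing of formula 2, assuming a centred window of 2M+1 sample points (M on each side plus the target) whose sum is divided by the window length; the exact divisor is not reproduced in the text above, so this is an assumption. Only indices with a full window are smoothed, matching the remark that with M = 80 smoothing starts from the 81st sample point.

```python
import numpy as np

def short_term_mean(s_m, m_half=80):
    """Centred moving average of the first reference audio intensity values."""
    src = np.asarray(s_m, dtype=float)
    s_c = src.copy()
    for i in range(m_half, len(src) - m_half):
        window = src[i - m_half: i + m_half + 1]   # 2*m_half + 1 samples around i
        s_c[i] = window.mean()
    return s_c
```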
  • the effective voice signal detection device tracks the voice signal, and uses the tracking result to affect the audio intensity value of the signal, which can be combined with any of the foregoing embodiments described in conjunction with FIGS. 1 to 9 that use sample audio intensity values.
  • the effective speech signal detection device acquires a first audio signal of a preset duration, where the first audio signal includes at least one audio frame signal, acquires multiple samples in each audio frame signal, and The audio intensity value of each sample point;
  • each wavelet decomposition signal includes a plurality of samples and an audio intensity value of each sample
  • the minimum value of the fourth reference audio intensity values of all the samples in the wavelet signal sequence that includes the target sample and whose sorting order is before the target sample is taken as the value of the target sample
  • the effective speech signal detection device acquires a first audio signal of a preset duration, where the first audio signal includes at least one audio frame signal;
  • each wavelet decomposition signal includes a plurality of samples and an audio intensity value of each sample
  • the minimum value of the fourth reference audio intensity values of all the samples in the wavelet signal sequence that includes the target sample and whose sorting order is before the target sample is taken as the value of the target sample
  • a signal of a corresponding sample point in the first audio signal of the first sample point and the previous sample point of the second sample point in the wavelet signal sequence is an effective speech segment in the effective speech signal.
  • at least a first preset number of consecutive sample points are included between the second sample point and the first sample point.
  • the value obtained by processing the reference maximum value and the reference minimum value in all the wavelet decomposition signals in the wavelet signal sequence is used as the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence.
  • the high-frequency component of the first preset threshold in the original audio signal of the preset duration is compensated, so as to obtain the first audio signal.
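The high-frequency compensation described above corresponds to the pre-emphasis y(n) = x(n) - a·x(n-1) with 0.9 < a < 1 given in the description; the sketch below uses a = 0.97 purely as an assumed example value.

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """Pre-emphasis: boost the high-frequency content of the raw audio before framing."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y
```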
  • the performing wavelet decomposition for each audio frame signal includes: performing wavelet packet decomposition for each audio frame signal, and using a signal obtained after the wavelet packet decomposition as the wavelet decomposition signal.
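One way to realize the frame-wise wavelet packet decomposition and splicing is sketched below with PyWavelets. This is an assumption-laden illustration, not the patent's implementation: the 16-tap Daubechies-8 filter bank mentioned in the description is approximated by the 'db8' wavelet, three decomposition levels yield 8 frequency-ordered sub-bands per frame, and periodization keeps the per-frame coefficient count equal to the frame length.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_signal_sequence(audio, frame_len=160, level=3, wavelet="db8"):
    """Split audio into non-overlapping frames, decompose each frame into
    2**level wavelet packet sub-bands, and splice the results in frame order."""
    n_frames = len(audio) // frame_len
    pieces = []
    for f in range(n_frames):
        frame = np.asarray(audio[f * frame_len:(f + 1) * frame_len], dtype=float)
        wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                                mode="periodization", maxlevel=level)
        nodes = wp.get_level(level, order="freq")      # sub-bands from low to high frequency
        pieces.append(np.concatenate([node.data for node in nodes]))
    return np.concatenate(pieces)                      # the spliced wavelet signal sequence
```

The audio intensity value of each sample point of this sequence can then be taken as the amplitude of the coefficient or, alternatively, its square (the energy value).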
  • FIG. 12 is a schematic diagram of another process for tracking a voice signal according to an embodiment of the present application. As shown in Figure 12, the specific execution steps are as follows:
  • the sample point cumulative index is used to control the preset duration. When the value of the sample point cumulative index i mod reaches a certain value, data is updated to complete signal tracking for a preset duration.
  • the audio intensity value of the i-th sample point is S(i) = α_s × S(i-1) + M(i) × (1-α_s).
  • while traversing sample points 0 to 9, step 1204 is performed, and when the 10th sample point is traversed, step 1205 is performed.
  • S_min = S(i)
  • S_mact = S(i)
  • when step 1203 traverses to the 10th sample point, S_min = min(S_min, S(10)).
  • in the step before traversing to the 10th sample point, S_min records the value of S(9).
  • S_m(i) = S_min.
  • i_mod = i_mod + 1.
  • determine whether i_mod is equal to V_min.
  • step 1210: determine whether i is equal to V_min. Specifically, when i is equal to V_min, proceed to step 1211 to initialize the matrix data; when i is not equal to V_min, proceed to step 1212.
  • initialize the matrix SW. Specifically, an N_win-row, one-column matrix SW is defined (for example N_win = 2) to hold the minimum values of adjacent tracking segments.
  • S_m(i) is the first reference audio intensity value or the audio intensity value of the i-th sample point. Specifically, it is known from step 1212 and step 1206 that S_m(i) records the minimum value of all sample points starting from the sample point before the V_min-th one. In a possible implementation, S_m(i) is the first reference audio intensity value of the i-th sample point, which realizes the process described in step 1003 of the embodiment of FIG. 10 for obtaining the first reference audio intensity value of the target sample point, and thereby the audio intensity value of the target sample point.
  • the minimum value S_min of the audio intensity values of all sample points of the previous tracking duration is transferred to the current tracking duration through the matrix, and S_min is compared with the audio intensity value of the target sample point, S_min = min(S_min, S(i)), to obtain the first reference audio intensity value S_m(i) of the target sample point.
  • the smaller of the two is then compared with the fourth reference audio intensity value of the sample point following the target sample point, and that smaller value is used as the first reference audio intensity value S_m(i+1) of the following sample point, and so on, to obtain the minimum audio intensity value of all the sample points of the tracking duration.
  • the relatively small audio intensity value between the previous tracking duration and the current tracking duration is transferred to the next tracking duration through the matrix.
  • the sample point sequence formed by S_m(i) can describe the distribution of the audio intensity of the signal over the tracked durations.
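The Fig. 12 steps above can be collected into a single loop. The sketch below is an interpretation rather than the patent's code: M(i) is again approximated by the running mean of the smoothed values, the accumulation length V_win and the tracking-segment length V_min are assumed to be the same value (the text uses both names with the example value 10), and the matrix SW is modelled as a two-element buffer holding the minima of the previous and current tracking segments.

```python
import numpy as np

def track_minimum(sc, alpha_s=0.7, v_win=10):
    """Segment-wise minimum tracking of the smoothed audio intensity (Fig. 12)."""
    n = len(sc)
    s = np.empty(n)           # smoothed intensities S(i)
    s_m = np.empty(n)         # first reference intensities S_m(i)
    s[0] = sc[0]
    s_m[0] = s[0]
    running_sum = s[0]
    s_min = s_mact = s[0]
    i_mod = 0
    sw = None                 # two-element buffer SW, created at i == v_win (step 1211)
    for i in range(1, n):
        m_i = running_sum / i
        s[i] = alpha_s * s[i - 1] + (1.0 - alpha_s) * m_i   # step 1202
        running_sum += s[i]
        if i < v_win:                                       # step 1204: warm-up samples
            s_min = s_mact = s[i]
        else:                                               # step 1205
            s_min = min(s_min, s[i])
            s_mact = min(s_mact, s[i])
        s_m[i] = s_min                                      # step 1206
        i_mod += 1                                          # step 1207
        if i_mod == v_win:                                  # steps 1208-1209: segment boundary
            i_mod = 0
            if i == v_win:
                sw = [s_min, s_mact]                        # step 1211: initialize SW
            else:
                sw = [sw[1], s_mact]                        # step 1212: update SW
                s_min = min(sw)                             # keep the smaller of the two minima
                s_mact = s[i]                               # restart the current-segment minimum
    return s_m
```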
  • the implementation of this embodiment can further improve the accuracy of effective speech signal detection by tracking the audio intensity value of the signal of stable duration, and further avoid erroneous detection of transient noise as an effective speech signal or effective speech signal segment.
  • Figures 13a to 13e are schematic diagrams of the detection effects of an effective voice signal provided by an embodiment of the application.
  • the effective speech signal detection device obtains a segment of original speech signal including transient noise.
  • the original waveform diagram of the speech signal is shown in FIG. 13a. It can be seen that the transient noise is distributed in a time period of 0-6 s.
  • the effective signal detection device performs on the original speech signal the wavelet decomposition or wavelet packet decomposition described above in conjunction with Figures 1 to 9 to obtain the audio intensity values of all sample points of the wavelet signal sequence of the original signal, and further applies the voice signal tracking described in conjunction with Figure 10 and Figure 12 above to obtain the steady-state amplitude after tracking.
  • the sample point energy distribution of the two methods is shown in Figure 13b.
  • the embodiment of Figure 10 uses the minimum value of the reference audio intensity values of the target sample point and of all sample points ordered before it in the wavelet signal sequence as the reference audio intensity value of the target sample point, so the amplitude of the speech signal after steady-state amplitude tracking is weaker than the amplitude of the wavelet signal sequence of the original signal; the weakened part is the transient noise part of the signal, while the speech part hardly changes.
  • the original signal amplitude and the audio intensity values of all samples after the steady-state amplitude tracking are smoothed, and the smoothed result is shown in Fig. 13c.
  • from FIG. 13b and FIG. 13c it can be seen that implementing the foregoing embodiment described in conjunction with FIG. 10, that is, the short-term mean smoothing of the audio intensity values of the sample points, can significantly reduce signal glitches and make the signal smooth overall.
  • VAD (Voice Activity Detection).
  • the VAD detection result of the original signal energy is shown in Fig. 13d.
  • the VAD detection result obtained by sequence tracking is shown in Fig. 13e.
  • Fig. 13d shows the VAD detection result obtained from the smoothed original signal amplitude in Fig. 13c by implementing the foregoing embodiments described in conjunction with Fig. 1 to Fig. 9.
  • the detection result of the embodiments described in conjunction with Fig. 1 to Fig. 9 is relatively accurate, but the energy of the original signal still contains transient noise.
  • implementing the foregoing embodiments in conjunction with FIGS. 10 to 12 to further track the energy of the original signal, and then implementing the foregoing embodiments in conjunction with FIGS. 1 to 9, can further improve the accuracy of effective speech detection.
  • the detection result in Fig. 13e has a low probability of misjudging transient noise as an effective voice signal, which greatly improves the accuracy of effective voice signal detection.
  • FIG. 14 is a structural block diagram of an effective voice signal detection device provided by an embodiment of the application. As shown in FIG. 14, an effective voice signal detection device 14 includes:
  • the acquiring module 1401 is configured to acquire a first audio signal of a preset duration, where the first audio signal includes at least one audio frame information;
  • the decomposition module 1402 is configured to perform wavelet decomposition for each audio frame signal to obtain multiple wavelet decomposition signals corresponding to each audio frame signal, where each wavelet decomposition signal includes multiple sample points and the audio intensity value of each sample point;
  • the splicing module 1403 is configured to splice the wavelet decomposition signals corresponding to each audio frame signal according to the framing sequence of the audio frame signal in the first audio signal to obtain a wavelet signal sequence;
  • the determining module 1404 is configured to obtain the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence, and determine the first audio intensity threshold according to the maximum value and the minimum value of the audio intensity values of all the sample points in the wavelet signal sequence;
  • the determining module 1404 is further configured to obtain the sample points in the wavelet signal sequence whose audio intensity value is greater than the first audio intensity threshold, and to determine the signal of the corresponding sample points in the first audio signal as an effective speech signal.
  • the determining module 1404 is further configured to determine the first audio intensity threshold and the second audio intensity threshold according to the maximum value and the minimum value among the audio intensity values of all sample points in the wavelet signal sequence, wherein the first audio intensity threshold is less than the second audio intensity threshold;
  • the acquiring module 1401 is also configured to acquire the first sample point in the wavelet signal sequence, wherein the audio intensity value of the sample point preceding the first sample point is less than the second audio intensity threshold, and the audio intensity value of the first sample point is greater than the second audio intensity threshold;
  • the acquiring module 1401 is further configured to acquire a second sample point in the wavelet signal sequence, where the second sample point is the first sample point after the first sample point whose audio intensity value is smaller than the first audio intensity threshold;
  • the determining module 1404 is further configured to determine that the signal of the sample points in the first audio signal corresponding to the first sample point and the sample point preceding the second sample point in the wavelet signal sequence is a valid speech segment in the valid speech signal;
  • a first preset number of consecutive sample points are included between the second sample point and the first sample point.
  • the determining module 1404 is further configured to use the average value of the first reference audio intensity values of the second preset number of consecutive sample points, including the target sample point, in the wavelet signal sequence as the audio intensity value of the target sample point.
  • in a possible implementation, the device 14 for detecting a voice signal further includes a calculation module 1405; before the determining module 1404 uses the average value of the first reference audio intensity values of the second preset number of consecutive sample points, including the target sample point, in the wavelet signal sequence as the audio intensity value of the target sample point:
  • the calculation module 1405 is configured to multiply the audio intensity value of the sample point preceding the target sample point in the wavelet signal sequence by the smoothing coefficient to obtain the second reference audio intensity value of the target sample point;
  • the calculation module 1405 is further configured to multiply the average of the audio intensity values of all consecutive sample points in the wavelet signal sequence that include the target sample point and are ordered before the target sample point by the remaining smoothing coefficient to obtain the third reference audio intensity value of the target sample point;
  • the calculation module 1405 is further configured to use the value obtained by adding the second reference audio intensity value and the third reference audio intensity value as the fourth reference audio intensity value of the target sample point; the determining module 1404 is further configured to use the minimum value of the fourth reference audio intensity values of all the sample points in the wavelet signal sequence that include the target sample point and are ordered before the target sample point as the first reference audio intensity value of the target sample point.
  • the obtaining module 1401 is further configured to obtain the first reference maximum value and the first reference minimum value among the audio intensity values of all sample points of the first wavelet decomposition signal in the wavelet signal sequence.
  • the determining module 1404 is further configured to use the values obtained by processing the reference maximum values and reference minimum values of all the wavelet decomposition signals in the wavelet signal sequence as the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence.
  • the device 14 for detecting a valid voice signal further includes a compensation module 1406.
  • before the acquiring module 1401 acquires the first audio signal of the preset duration, the compensation module 1406 is configured to compensate the high-frequency component of the first preset threshold in the original audio signal of the preset duration, so as to obtain the first audio signal.
  • the decomposition module 1402 is further configured to perform wavelet packet decomposition for each audio frame signal, and use the signal obtained after the wavelet packet decomposition as the wavelet decomposition signal.
  • the determining module 1404 is further configured to determine, according to the maximum value and the minimum value among the audio intensity values of all sample points in the wavelet signal sequence, the first reference audio intensity threshold T_L = min(λ1·(Sc_max − Sc_min) + Sc_min, λ2·Sc_min), where Sc_max and Sc_min are respectively the maximum and minimum audio intensity values of all sample points in the wavelet signal sequence, λ1 is the second preset threshold, and λ2 is the third preset threshold;
  • the second audio intensity threshold T_U = α·T_L, where α is the fourth preset threshold, and α is greater than 1.
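As a small worked example of the threshold computation, using the example values λ1 = 0.04 and λ2 = 50 given elsewhere in the description; the value of α below is only an assumed illustration, since the patent only requires α > 1.

```python
def intensity_thresholds(sc, lam1=0.04, lam2=50.0, alpha=2.0):
    """Lower threshold T_L and upper threshold T_U from the wavelet signal sequence."""
    sc_max, sc_min = max(sc), min(sc)
    t_lower = min(lam1 * (sc_max - sc_min) + sc_min, lam2 * sc_min)   # T_L
    t_upper = alpha * t_lower                                         # T_U = α·T_L, α > 1
    return t_lower, t_upper
```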
  • the effective speech signal is judged and detected according to the energy distribution of all the sample points in the wavelet signal sequence, and the accuracy of the effective speech detection is improved.
  • FIG. 15 is a structural block diagram of an effective voice signal detection device provided by an embodiment of the present application.
  • an effective voice signal detection device 15 includes: a transceiver 1500, a processor 1501, and a memory 1502, where:
  • the transceiver 1500 is connected to the processor 1501 and the memory 1502, and the processor 1501 is also connected to the memory 1502,
  • the transceiver 1500 is configured to obtain a first audio signal of a preset duration, where the first audio signal includes at least one audio frame signal;
  • the processor 1501 is configured to perform wavelet decomposition for each audio frame signal to obtain multiple wavelet decomposition signals corresponding to each audio frame signal, and each wavelet decomposition signal includes multiple samples and each The audio intensity value of the sample point;
  • the processor 1501 is further configured to splice the wavelet decomposition signals corresponding to each audio frame signal according to the framing order of the audio frame signal in the first audio signal to obtain a wavelet signal sequence;
  • the processor 1501 is further configured to obtain the maximum and minimum audio intensity values of all the sample points in the wavelet signal sequence, and according to the maximum and minimum audio intensity values of all the sample points in the wavelet signal sequence Value determines the first audio intensity threshold;
  • the processor 1501 is further configured to obtain the sample points in the wavelet signal sequence whose audio intensity value is greater than the first audio intensity threshold, and to determine the signal of the corresponding sample points in the first audio signal as an effective speech signal.
  • the memory 1502 is configured to store a computer program, and the computer program is called by the processor 1501.
  • the processor 1501 is further configured to:
  • the first audio intensity threshold and the second audio intensity threshold are determined according to the maximum and minimum of the audio intensity values of all sample points in the wavelet signal sequence, wherein the first audio intensity threshold is smaller than the second audio intensity threshold;
  • a signal of a corresponding sample point in the first audio signal of the first sample point and the previous sample point of the second sample point in the wavelet signal sequence is an effective speech segment in the effective speech signal.
  • a first preset number of consecutive sample points are included between the second sample point and the first sample point.
  • the processor 1501 is further configured to:
  • the average value of the first reference audio intensity values of the second preset number of consecutive samples including the target sample in the wavelet signal sequence is used as the audio intensity value of the target sample.
  • the processor 1501 is further configured to:
  • the value obtained by adding the second reference audio intensity value and the third reference audio intensity value as the fourth reference audio intensity value of the target sample point will include the target sample point and The smallest value among the fourth reference audio intensity values of all samples in the wavelet signal sequence that are sorted before the target sample point is used as the first reference audio intensity value of the target sample point.
  • the processor 1501 is further configured to:
  • the value obtained by processing the reference maximum value and the reference minimum value in all the wavelet decomposition signals in the wavelet signal sequence is used as the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence.
  • the processor 1501 is further configured to:
  • the processor 1501 is further configured to:
  • the second audio intensity threshold T_U = α·T_L, where α is the fourth preset threshold, and α is greater than 1.
  • the effective signal detection device 15 can execute the implementation manners provided in the steps in Figures 1 to 13e through its built-in functional modules. For details, please refer to the steps in Figures 1 to 13e. The implementation method provided will not be repeated here.
  • the present application also provides a readable storage medium in which instructions are stored, and the instructions are executed by a processor in a device for detecting valid voice signals, so as to implement the above-mentioned FIGS. 1 to 13e The steps of the methods described in various aspects.
  • the embodiment of the present application can collect the energy information of all sample points in the wavelet signal sequence, and judge and detect the effective voice signal according to the energy distribution of the wavelet signal sequence, thereby improving the accuracy of effective voice detection, and can also detect the wavelet signal
  • the audio intensity values of all sample points in the sequence are smoothed and the energy distribution information of all sample points in the wavelet signal sequence is tracked, which further improves the accuracy of effective speech signal detection.
  • the disclosed method, device, and system can be implemented in other ways.
  • the above-described embodiments are only illustrative.
  • the division of the units is only a logical function division.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms. of.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units; Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the functional modules in the embodiments of the present application can be all integrated into one processing unit, or each module can be individually used as a unit, or two or more modules can be integrated into one unit; the above-mentioned integration
  • the unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • a person of ordinary skill in the art can understand that all or part of the steps in the above method embodiments can be implemented by a program instructing relevant hardware.
  • the foregoing program can be stored in a computer readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed; and the foregoing storage medium includes: removable storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks or optical disks, etc.
  • if the above-mentioned integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for A computer device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method for detecting an effective speech signal, a detection apparatus (14) and a device (15). The method comprises: acquiring a first audio signal of a preset duration, the first audio signal comprising at least one audio frame signal (100); performing wavelet decomposition on each audio frame signal to obtain multiple wavelet decomposition signals respectively corresponding to each audio frame signal, each wavelet decomposition signal containing multiple sample points and an audio intensity value of each sample point (101); splicing the wavelet decomposition signals corresponding to the audio frame signals according to the framing order of the audio frame signals in the first audio signal to obtain a wavelet signal sequence (102); acquiring the maximum value and the minimum value among the audio intensity values of all sample points in the wavelet signal sequence, and determining a first audio intensity threshold according to the maximum value and the minimum value of the audio intensity values of all sample points in the wavelet signal sequence (103); and acquiring the sample points in the wavelet signal sequence whose audio intensity value is greater than the first audio intensity threshold, and determining the signals of the corresponding sample points in the first audio signal as an effective speech signal (104). By collecting the energy information of all sample points in the wavelet signal sequence, the effective speech signal is judged and detected, improving the accuracy of effective speech detection.

Description

一种有效语音信号的检测方法、装置及设备
本申请要求于2019年11月13日提交中国专利局、申请号为201911109218.X、申请名称为“一种有效语音信号的检测方法、装置及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及音频技术领域,尤其是一种有效语音信号的检测方法、装置及设备。
背景技术
语音作为一种人机交互的手段,但噪声干扰时时刻刻存在工作环境中,这些噪声会影响语音的应用效果,所以需要对有效语音信号进行检测,将有效语音信号与噪声干扰信号进行区分,以便进一步的处理。
语音信号和噪声信号的区别可以体现在各自的能量上,在高信噪比的情况下,信噪比可以理解为语音信号与噪声信号的比值,语音信号部分的能量一般要比噪声信号部分的能量大得多。但是,在低信噪比的情况下,输入的音频段频繁出现噪声时,噪声信号的能量较大,与语音信号的能量相差无几,在现有技术中,采用基于信号能量的方法对语音信号进行检测的方法,根据输入信号的短时能量将语音信号和噪声信号进行区分,计算一段时间内输入信号的能量,通过与相邻一段时间内输入信号的能量进行对比,判断当前时间内的信号为语音信号还是噪声信号。采用现有技术的方案,通过对一段时间内信号的能量进行计算与对比,由于噪声的频繁出现,在当前时间内的信号中存在噪声,在相邻时间段内的信号中也存在噪声,当前时间段的能量为噪声信号和语音信号的能量之和,而相邻时间段的能量也为噪声信号和语音信号的能量之和,从而无法对比出噪声的存在。噪声的频繁出现,使得信号的能量增加,干扰对信号的检测,会出现将噪声误检为有效语音信号的情况,所以现有技术对有效语音信号的检测的准确性不够高。
发明内容
基于上面所述的问题,本申请提供了一种有效语音信号的检测方法、装置及设备,通过采集小波信号序列中所有样点的能量信息,对有效语音信号进行判断检测,提高有效语音信号检测的准确性。
第一方面,本申请提供了一种有效语音信号的检测方法,所述方法包括:
获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
在一种可能的实施例中,所述根据所述小波信号序列中所有样点音频强度值的最大值和最小值确定第一音频强度阈值包括:
根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
所述将音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应的样点信号确定为有效语音信号包括:
获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后,首个出现音频强度值小于所述第一音频强度阈值的样点;
确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
可选的,所述第二样点与所述第一样点之间至少包括第一预设数量个连续样点。
在一种可能的实施例中,所述方法还包括:
将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。
在一种可能的实施例中,所述将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值之前包括:
将所述小波信号序列中所述目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度;
将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
将所述第二参考音频强度值和所述第三参考音频强度值相加得到的数值,作为所述目标样点的第四参考音频强度值,将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。
在一种可能的实现方式中,所述获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值包括:
获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;
将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
可选的,所述获取预设时长的第一音频信号之前包括:
将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
在一种可能的实施例中,所述针对所述每个音频帧信号进行小波分解包括:
针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
在一种可能的实现方式中,所述根据所述小波信号序列中所有样点音频强度值的最大值 和所述最小值确定第一参考音频强度阈值T L=min(λ 1×(Sc max-Sc min)+Sc min2×Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值。
在一种可能的实现方式中,所述根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值包括:
根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值;
所述第二音频强度阈值T U=αT L,其中α为第四预设阈值,α取值大于1。
第二方面,本申请提供了一种语音信号检测的装置,包括:
获取模块,用于获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信息;
分解模块,用于针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解中包含多个样点以及每个样点的音频强度值;
拼接模块,用于按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
确定模块,用于获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
所述确定模块,还用于获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
在一种可能的实施例中,所述确定模块,还用于根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
所述获取模块,还用于获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
所述获取模块,还用于获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后,首个出现音频强度值小于所述第一音频强度阈值的样点;
所述确定模块,还用于确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
可选的,所述第二样点与所述第一样点之间包括第一预设数量个连续样点。
在一种可能的实施例中,所述确定模块,还用于将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。
在一种可能的实现方式中,所述一种语音信号检测的装置还包括计算模块,在所述确定模块将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强 度值的平均值作为所述目标样点的音频强度值之前:
所述计算模块,用于将所述小波信号序列中所述目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度;
所述计算模块,还用于将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
所述计算模块,还用于将所述第二参考音频强度值和所述第三参考音频强值相加得到的数值,作为所述目标样点的第四参考音频强度值;
所述确定模块,还用于将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。
可选的,所述获取模块,还用于获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;
所述确定模块,还用于将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
在一种可能的实施例中,所述一种有效语音信号的检测装置14还包括补偿模块,在所述获取模块获取预设时长的第一音频信号之前,所述补偿模块,用于将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
在一种可能的实现方式中,所述分解模块,还用于针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
在一种可能的实现方式中,所述确定模块,还用于根据所述小波信号序列中所有样点音频强度值的最大值和所述最小值确定第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值。
在另一种可能的实现方式中,所述确定模块,还用于根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值;
所述第二音频强度阈值T U=αT L,其中α为第四预设阈值,α取值大于1。
第三方面,本申请提供了一种有效语音信号的检测设备,所述设备包括收发器、处理器和存储器,其中:
所述收发器与所述处理器以及所述存储器连接,所述处理器还与所述存储器连接;
所述收发器,用于获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
所述处理器,用于针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
所述处理器,还用于按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
所述处理器,还用于获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
所述处理器,还用于获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
所述存储器,用于存储计算机程序,所述计算机程序被所述处理器调用。
在一种可能的实施例中,所述处理器还用于:
根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后,首个出现音频强度值小于所述第一音频强度阈值的样点;
确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
可选的,所述第二样点与所述第一样点之间包括第一预设数量个连续样点。
在一种可能的实施例中,所述处理器还用于:
将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。
在一种可能的实施例中,所述处理器还用于:
将所述小波信号序列中所述目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度;
将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
将所述第二参考音频强度值和所述第三参考音频强度值相加得到的数值,作为所述目标样点的第四参考音频强度值,将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。
在一种可能的实现方式中,所述处理器还用于:
获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;
将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
可选的,所述处理器还用于:
将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
在一种可能的实施例中,所述处理器还用于:
针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分 解信号。
在一种可能的实现方式中,所述处理器根据所述小波信号序列中所有样点音频强度值的最大值和所述最小值确定第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值。
在一种可能的实现方式中,所述处理器根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值;
所述第二音频强度阈值T U=αT L,其中α为第四预设阈值,α取值大于1。
第四方面,本申请提供了一种计算机可读存储介质,所述可读存储介质中存储有指令,当其在计算机上运行时,可实现上面所述各方面所述方法的步骤。
实施本申请,通过采集小波信号序列中所有样点的能量信息,根据小波信号序列的能量分布情况,对有效语音信号进行判断检测,提高有效语音信号检测的准确性。
附图说明
图1为本申请实施例提供的一种有效语音信号的检测方法的流程示意图;
图2为本申请实施例提供的一种小波分解的结构示意图;
图3为本申请实施例提供的一种高低通滤波器的幅频特性曲线;
图4为本申请实施例提供的一种小波分解处理过程的示意图;
图5为本申请实施例提供的一种小波包分解的结构示意图;
图6为本申请实施例提供的一种小波包分解处理过程的示意图;
图7为本申请实施例提供的另一种有效语音信号的检测方法的流程示意图;
图8为本申请实施例提供的一种语音信号示意图;
图9为本申请实施例提供的又一种有效语音信号的检测方法的流程示意图;
图10为本申请实施例提供的一种跟踪语音信号的流程示意图;
图11a为本申请实施例提供的另一种语音信号示意图;
图11b为本申请实施例提供的又一种语音信号示意图;
图12为本申请实施例提供的另一种跟踪语音信号的流程示意图;
图13a至图13e分别为本申请实施例提供的一种有效语音信号的检测效果示意图;
图14为本申请实施例提供的一种有效语音信号的检测装置的结构框图;
图15为本申请实施例提供的一种有效语音信号的检测设备的结构框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
下面结合附图来对本申请的技术方案的实施作进一步的详细描述。
首先,下面对本申请提供的一种有效语音信号的检测方法进行介绍,参见图1至图6。
首先参见图1,图1为本申请实施例提供的一种有效语音信号检测方法的流程示意图。如图1所示,本实施例的具体执行步骤如下:
100、获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号。具体的,有效语音信号的检测装置获取预设时长的第一音频信号,由于口腔肌肉运动相对于语音频率来说比较缓慢,语音信号在一个短时间范围内相对稳定,所以语音信号具有短时稳定性,可以根据语音信号的短时稳定性,将语音信号分成一段一段的来进行检测,即将所述预设时长的第一音频信号进行分帧,得到至少一个音频帧信号,可选的,所述音频帧信号之间没有重叠,帧移的大小即为帧长的大小,可以理解的是,帧移为上一帧与下一帧的交叠部分,当帧长等于帧移时,音频帧之间无交叠。在一种可能的实施例中,所述有效语音信号检测装置以16kHz的频率进行语音信号的采样,即1秒钟采集16k个样点,获取一段预设时长为5秒的第一音频信号,以10ms为帧移,10ms为帧长,对所述第一音频信号进行分帧,每一个音频帧信号包括160个样点,获取160个样点对应的音频强度值。
101、针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值。具体的,由步骤100获取所述第一音频信号,将所述第一音频信号进行分帧,得到音频帧信号,对所述每个音频帧信号进行小波分解。
下面对小波分解进行详细介绍,小波分解可以参见图2至图4,首先参见图2,图2为本申请实施例提供的一种小波分解的结构示意图,如图2所示,将第一音频信号分帧后得到的音频帧信号进行小波分解,本实施例以第一音频帧信号进行示例性说明。可以理解的是,小波分解的过程可以认为是高低通滤波的过程,具体的高低通滤波特性可以参见图3,图3为本申请实施例提供的一种高低通滤波器的幅频特性曲线,可以理解的是,所述高低通滤波特性根据选用的滤波器型号不同而不同,示例性的,可以选用16抽头Daubechies8小波。通过如图3所示的高低通滤波器得到第1级小波分解信号,第1级小波分解信号中包括低频信息L1和高频信息H1,继续对第1级小波分解信号中的低频信息L1进行高低通滤波得到第2级小波分解信号中的低频信息L2和高频信息H2,对第2级小波分解信号中的低频信息L2进行高低通滤波得到第3级小波分解信号中的低频信息L3和高频信息H3,以此类推,可以对输入信号进行多级小波分解,此处只是作示例性说明。可以理解的是,L3和H3包含着L2的全部信息,L2和H2包含着L1的全部信息,L1和H1包含着所述第一音频帧信号的全部信息,所以L3、H3、H2和H1通过拼接组成的子小波信号序列可以代表着所述第一音频帧信号,将多个音频帧信号的子小波信号序列,按照所述第一音频信号的分帧顺序进行拼接,形成小波信号序列代表着所述第一音频信号。由此可见,所述第一音频帧信号中的低频成分经过小波分解后得到细化分析,分辨率有所提高,在低频段具有比较宽的分析窗口,具有优良的局部显微特性。
下面对本实施例中小波分解的具体处理过程进行详细介绍,示例性的,本实施例以一个音频帧信号进行小波分解进行示例性说明。具体的,参见图4,图4为本申请实施例提供的一种小波分解处理过程的示意图,如图4所示,针对所述第一音频帧信号进行小波分解,在一种可能的实现方式中,为了使小波分解后的样点个数可以保持和原来音频帧信号的样点个数一致,可以对进行高通滤波和低通滤波之后的信号进行降采样,以16kHz为所述第一音频信号的采样频率,以10ms为帧移,10ms为帧长,对所述第一音频信号进行分帧,每一个音 频帧信号包括160个样点,针对每个音频帧信号进行小波分解,第一高通滤波后的样点个数为160个,第一低通滤波后的样点个数也为160个,组成第1级小波分解信号,将所述第一低通滤波后的信号进行降采样,第一低通滤波后的采样频率为所述第一音频帧信号采样频率的一半,则所述第一低通滤波降采样后的样点个数为80;同理的,所述第一高通滤波降采样后的样点个数为80,则第1级小波分解信号中的样点个数为第一低通滤波降采样和第一高通滤波降采样后样点个数加起来的160,与一个音频帧信号的样点个数一致,依此类推,将所述第一低通滤波降采样后的信号进行第二高通滤波和第二低通滤波,并降采样,得到的样点的个数之和为所述第一低通滤波降采样后的点数;将所述第二低通滤波降采样后的信号进行第三高通滤波和第三低通滤波,并降采样,得到的样点个数之和为所述第二低通滤波降采样后的点数,由此可知,第一音频帧信号进行小波分解后得到的子小波信号序列包括的样点数目为所述第一音频帧信号的样点数目。可以理解的是,根据两倍采样定理,采样频率为语音信号最高频率的两倍,则以16kHz的采样频率采集到的语音信号,对应的最高频率为8kHz,将所述第一音频帧信号进行第1级小波分解得到第1级小波分解信号,所述第1级小波分解信号包括第一高通滤波降采样后得到的信号和第一低通滤波降采样后得到的信号,所述第一低通滤波降采样后得到的信号对应频段为0至4kHz,所述第一高通滤波降采样后得到的小波信号H1对应的频段为4kHz至8kHz;将所述第1级小波分解信号进行第2级小波分解得到第2级小波分解信号,具体的,将所述第一低通滤波降采样后得到的信号进行第二高通滤波和第二低通滤波,所述第二高通滤波降采样后得到的小波信号H2对应的频段为2kHz至4kHz,所述第二低通滤波降采样后得到的信号对应频段为0至2kHz;将第2级小波分解信号进行第3级小波分解得到第3级小波分解信号,具体的,将所述第二低通滤波降采样后得到的信号进行第三高通滤波和第三低通滤波,所述第三高通滤波降采样后得到的小波信号H3对应的频段为1kHz至2kHz,所述第三低通滤波降采样后得到的小波信号L3对应的频段为0至1kHz,以此类推,本实施例对3级小波分解进行示例性说明,在一种可能的实现方式中,所述第1级小波分解信号、所述第2级小波分解信号和所述第3级小波分解信号均可以由同一种滤波器类型进行高低通滤波得到的。小波信号H1、H2、H3和L3可以拼接成子小波信号序列,作为所述第一音频帧信号的小波分解信号。
在一种可能的实施例中,所述针对所述每个音频帧信号进行小波分解包括:针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
下面对小波包分解进行详细介绍,小波包分解可以参见图5至图6,首先参见图5,图5为本申请实施例提供的一种小波包分解的结构示意图,如图5所示,将第一音频信号分帧后得到的音频帧信号进行小波包分解,本实施例以第一音频帧信号进行示例性说明,可以理解的是,小波包分解的过程也可以认为是高低通滤波的过程,具体的高低通滤波特性也可以参见前文的图3,可选的,滤波器类型可以选用16抽头Daubechies8小波。小波包分解与小波分解不一样的是,小波包分解既可以对低频部分信号进行分解,也可以对高频部分进行分解,所以对包含大量中频和高频信息的信号,小波包分解信号能够进行更好的时频局部化分析。通过高低通滤波器得到第1级小波分解信号,所述第1级小波分解信号中包括低频信息lp1和高频信息hp1,继续对第1级小波分解信号中的低频信息lp1进行高低通滤波得到低频信息lp2和高频信息hp2,与小波分解不同的是,小波包分解还会对分解后的高频信息进行高低通滤波,所以对所述第1级小波分解信号中高频信息hp1进行高低通滤波,得到低频信息lp3和hp3,第2级小波分解信号中的低频信息包括lp2和lp3,高频信息包括hp2和hp3;对所 述第2级小波分解信号中的低频信息lp2和lp3以及高频信息hp2和hp3分别进行高低通滤波,得到第3级小波分解信号,第3级小波分解信号中包括低频信息lp4、lp5、lp6和lp7,以及高频信息hp4、hp5、hp6和hp7,以此类推,可以对输入信号进行多级小波分解,此处作示例性说明。如图5所示,lp4和hp4包含着lp2的全部信息,lp5和hp5包含着hp2的全部信息,而lp2和hp2包含着lp1的全部信息,可以理解的是,lp4、hp4、lp5和hp5包含着lp1的全部信息;lp6和hp6包含着lp3的全部信息,lp7和lp7包含着hp3的全部信息,而lp3和hp3包含着hp1的全部信息,可以理解的是,lp6、hp6、lp7和hp7包含着hp1的全部信息;由于lp1和hp1包含所述第一音频帧信号的全部信息,所以lp4、hp4、lp5、hp5、lp6、hp6、lp7和hp7拼接起来组成的子小波信号序列可以代表所述第一音频帧信号,将所有音频帧信号的子小波信号序列,按照音频帧在所述第一音频信号中的分帧顺序进行拼接,得到代表着所述第一音频信号的小波信号序列,由此可见,第一音频帧信号经过小波包分解后,无论是高频段还是低频段的分辨率都有所提高。
下面对本实施例中小波包分解的具体处理过程进行详细介绍,示例性的,本实施例以一个音频帧信号进行小波包分解进行示例性说明。具体的,参见图6,图6为本申请实施例提供的一种小波包分解处理过程的示意图,如图6所示,针对所述第一音频帧信号进行小波包分解,在一种可能的实现方式中,为了使小波包分解后的样点个数可以保持和原来音频帧信号一致,可以对进行高通滤波和低通滤波之后的信号进行降采样,以16kHz为所述第一音频信号的采样频率,以10ms为帧移,10ms为帧长,对所述第一音频信号进行分帧,每一个音频帧信号包括160个样点,针对每个音频帧信号进行小波包分解,第一高通滤波后的样点个数为160个,第一低通滤波后的样点个数也为160个,所述第一高通滤波和所述第一低通滤波后的信号组成小波包分解的第1级小波分解信号,将所述第一低通滤波后的信号进行降采样,所述第一低通滤波后的采样频率为所述第一音频帧信号采样频率的一半,则所述第一低通滤波降采样后的样点个数为80;同理的,所述第一高通滤波降采样后的样点个数为80,则所述第1级小波分解信号中的样点个数为所述第一低通滤波降采样和所述第一高通滤波降采样后样点个数加起来的160,与一个音频帧信号的样点个数一致,依此类推,将所述第一低通滤波降采样后的信号进行第二高通滤波和第二低通滤波,并降采样,得到的样点个数之和为所述第一低通滤波降采样后的点数;将所述第一高通滤波降采样后的信号进行第三高通滤波和第三低通滤波,并降采样,得到的样点个数之和为所述第一高通滤波降采样后的点数;将所述第二低通滤波降采样后的信号进行第四高通滤波和第四低通滤波,并降采样,得到的样点个数之和为所述第二低通滤波降采样后的点数;将所述第二高通滤波降采样后的信号进行第五高通滤波和第五低通滤波,并降采样,得到的样点个数之和为所述第二高通滤波降采样后的点数;将所述第三低通滤波降采样后的信号进行第六高通滤波和第六低通滤波,并降采样,得到的样点个数之和为所述第三低通滤波降采样后的点数;将所述第三高通滤波降采样后的信号进行第七高通滤波和第七低通滤波,并降采样,得到的样点个数之和为所述第三高通滤波降采样后的点数,由此可知,第一音频帧信号进行小波分解后得到的子小波信号序列包括的样点数目为所述第一音频帧的样点数目。可以理解的是,根据两倍采样定理,采样频率为语音信号最高频率的两倍,则以16kHz的采样频率采集到的语音信号,对应的最高频率为8kHz,将所述第一音频帧信号进行第1级小波包分解得到第1级小波分解信号,所述第1级小波分解信号包括第一高通滤波降采样后的信号和第一低通滤波降采样后的信号,所述第一低通滤波降采样后得到的信号对应频段为0至4kHz,所述第一高通滤波降采样后得到的 信号对应频段为4kHz至8kHz;将所述第1级小波分解信号进行第2级小波包分解得到第2级小波分解信号,所述第2级小波分解信号包括第二低通滤波降采样后的信号、第二高通滤波降采样后的信号、第三低通滤波降采样后的信号以及第三高通降采样后的信号,具体的,将所述一低通滤波降采样后得到的信号进行第二高通滤波和第二低通滤波,所述第二高通滤波降采样后得到的信号对应频段为2kHz至4kHz,所述第二低通滤波降采样后得到的信号对应频段为0至2kHz,将所述第一高通滤波降采样后得到的信号进行第三高通滤波和第三低通滤波,所述第三高通滤波降采样后得到的信号对应频段为6kHz至8kHz,所述第三低通滤波降采样后得到的信号对应频段为4kHz至6kHz;将所述第2级小波分解信号进行第3级小波包分解得到第3级小波分解信号,所述第3级小波分解信号包括第四低通滤波降采样后的信号、第四高通滤波降采样后的信号、第五低通滤波降采样后的信号、第五高通滤波降采样后的信号、第六低通滤波降采样后的信号、第六高通滤波降采样后的信号、第七低通滤波降采样后的信号以及第七高通滤波降采样后的信号,具体的,将所述第二低通滤波降采样后得到的信号进行第四低通滤波和第四高通滤波,所述第四低通滤波降采样后得到的小波包信号lp4对应频段为0至1kHz,所述第四高通滤波降采样后得到的小波包信号hp4对应频段为1kHz至2kHz,将所述第二高通滤波降采样后得到的小波包信号进行第五低通滤波和第五高通滤波,所述第五低通滤波降采样后得到的小波包信号lp5对应频段为2kHz至3kHz,所述第五高通滤波降采样后得到的小波包信号hp5对应频段为3kHz至4kHz,同理的,将所述第三低通滤波降采样后得到的信号进行第六低通滤波和第六高通滤波,所述第六低通滤波降采样后得到的小波包信号lp6对应频段为4kHz至5kHz,所述第六高通滤波降采样后得到的小波包信号hp6对应频段为5kHz至6kHz,将所述第三高通滤波降采样后得到的信号进行第七低通滤波和第七高通滤波,所述第七低通滤波降采样后得到的小波包信号lp7对应频段为6kHz至7kHz,所述第七高通滤波降采样后得到的小波包信号hp7对应频段为7kHz至8kHz,以此类推,本实施例对3级小波包分解进行示例性说明,与小波分解不同的是,小波包分解会继续对高通滤波得到的每一级信号中的高频信号再次进行高低通滤波。第3级小波分解信号中的小波包信号lp4、hp4、lp5、hp5、lp6、hp6、lp7和hp7可以拼接成子小波信号序列,作为所述第一音频帧信号的小波分解信号。在一种可能的实现方式中,所述第1级小波分解信号、所述第2级小波分解信号和所述第3级小波分解信号均可以由同一种滤波器类型进行高低通滤波得到的。
102、按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列。具体的,根据步骤101得到所述第一音频帧的小波分解信号,获取所述第一音频信号中所有音频帧的小波分解信号,将所有音频帧的小波分解信号按照步骤100中所述第一音频的分帧顺序进行首尾拼接,得到代表着所述第一音频信号的信息的小波信号序列。
103、获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值。具体的,所述小波信号序列中的所有样点的样点值代表着该样点的电压幅值,在一种可能实现方式中,所述音频强度值可以为样点的电压幅值;在另一种可能的实现方式中,所述音频强度值可以为样点的能量值,将样点的电压幅值进行平方,得到该样点的能量值。根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值,作为有效语音信号的判断依据。在一种可能的的实现方式中,有效语音信号检测装置根据所述小波信号序列中所 有样点音频强度值的最大值和所述最小值确定第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值,示例性的,λ 1为0.04,λ 2为50。
在一种可能的实施例中,所述获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值包括:获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。具体的,所述小波信号序列中包括多个小波分解信号,获取各个小波分解信号中所有样点的最大值和最小值,可选的,将各个小波分解信号中最大值和最小值进行取平均值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。实施本实施例,对所述小波信号序列中的最大值和最小值进行优化,可以对小波信号序列中样点进一步分析,优化有效语音信号检测的效果。
104、获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
在一种可能的实施例中,所述获取预设时长的第一音频信号之前包括:将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。具体的,由于口唇发音或麦克风录音的过程中,语音信号损失了高频成分,并且随着信号速率的增加,信号在传输过程中受损很大,为了在接收终端能得到比较好的信号波形,就需要对受损的信号进行补偿,在一种可能的实现方式中,对所述预设时长的原始音频信号进行预加重,使用y(n)=x(n)-ax(n-1)对第一音频信号进行处理,其中x(n)为第n时刻所述第一音频信号的样点的音频强度值,x(n-1)为第n-1时刻所述第一音频信号的样点的音频强度值,a为预加重系数,示例性的,a大于0.9小于1,可以理解为所述第一预设阈值,y(n)为经过预加重处理的信号。可以理解为,所述预加重处理可以认为将所述第一音频信号通过一个高通滤波器,对高频成分进行补偿,减少了口唇发音或麦克风录音的过程带来的高频损失。
实施本实施例,通过采集小波信号序列中所有样点的能量信息,根据小波信号序列的能量分布情况,确定音频强度阈值,根据音频强度阈值实现对有效语音信号的判断检测,提高有效语音信号检测的准确性。
下面对本申请提供的另一种有效语音信号的检测方法进行介绍,参见图7至图9。
首先参见图7,图7为本申请实施例提供的另一种有效语音信号检测方法的流程示意图。如图7所示,本实施例的具体执行步骤如下:
700、获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
701、针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
702、按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
可以理解的是,步骤700、步骤701和步骤702为将所述第一音频信号进行分帧、小波分解后拼接得到小波信号序列的过程,具体实现可以参考前文结合图1至图6所描述的实施 例,此处不作赘述。
703、获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值。具体的,根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定第一音频强度值和第二音频强度值,可选的,根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T L=min(λ 1.(Sc max-Sc min)+Sc min2.Sc min),其中Sc max和Sc min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ 1为第二预设阈值,λ 2为第三预设阈值;所述第二音频强度阈值T U=αT L,其中α为第四预设阈值,α取值大于1。
在一种可能的实施例中,所述获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值包括:获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。具体的,所述小波信号序列中包括多个小波分解信号,获取各个小波分解信号中所有样点的最大值和最小值,可选的,将各个小波分解信号中最大值和最小值进行取平均值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。实施本实施例,对所述小波信号序列中的最大值和最小值进行优化,可以对小波信号序列中样点进一步分析,优化有效语音信号检测的效果。
704、获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值。具体的,当所述小波信号序列中第一样点的音频强度值大于所述第二音频强度阈值,而所述第一样点的前一样点的音频强度小于所述第二音频强度阈值,所述第一样点为有效语音信号的起始点,预定义从所述第一样点开始进入有效语音段。
705、获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后首个出现音频强度值小于所述第一音频强度阈值的样点。具体的,由步骤704预定义所述第一样点为有效语音段的起始端点,进入有效语音段,当所述第一样点之后,所述第二样点首次出现音频强度值小于所述第一音频强度阈值时,认为所述第二样点已经退出了所述第一样点所在的有效语音段。
706、确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。具体的,由步骤705确定所述第二样点已经退出所述第一样点所在的有效语音段,可以确定所述第一样点与所述第二样点的前一样点在所述第一音频信号中对应样点的信号为有效语音段。更进一步的,所述第二样点与所述第一样点之间至少包括第一预设数量个连续样点。若所述第一样点和所述第二样点之间靠得太近,示例性的,所述第一预设数量为20。所述第一样点与所述第二样点之间少于所述第一预设数量个连续样点,可以认为所述第一样点出现大于所述第二音频强度阈值的情况是由瞬态噪声内的抖动引起的,而不是由有效语音引起的。
在一种可能的实施例中,所述获取预设时长的第一音频信号之前包括:将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。具体的,由于口唇发音或麦克风录音的过程中,语音信号损失了高频成分,并且随着信号速率的增加,信号在传输过程中受损很大,为了在接收终端能得到比较好的信号波形,就需要对受 损的信号进行补偿,在一种可能的实现方式中,对所述预设时长的原始音频信号进行预加重,使用y(n)=x(n)-ax(n-1)对第一音频信号进行处理,其中x(n)为第n时刻所述第一音频信号的样点的音频强度值,x(n-1)为第n-1时刻所述第一音频信号的样点的音频强度值,a为预加重系数,示例性的,a大于0.9小于1,可以理解为所述第一预设阈值,y(n)为经过预加重处理的信号。可以理解为,所述预加重处理可以认为将所述第一音频信号通过一个高通滤波器,对高频成分进行补偿,减少了口唇发音或麦克风录音的过程带来的高频损失。
实施本实施例的效果可以参见图8,图8为本申请实施例提供的一种语音信号示意图。前文结合图1所描述的步骤104根据所述第一音频强度阈值确定有效语音信号,进一步的,本实施例根据第一音频强度阈值和第二音频强度阈值确定有效语音段,可以实现将如图8所示的瞬态噪声在有效语音信号中剔除,避免将瞬态噪声误检为有效语音信号的情况发生,更进一步的提高有效信号检测的准确性。
下面结合附图对本实施例的具体实现方式进行详细介绍,参见图9,图9为本申请实施例提供的又一种有效语音信号检测的流程示意图,如图9所示,具体执行步骤如下:
900、所述有效信号的检测装置初始定义样点索引i=0,有效语音信号起始点索引is=0,有效语音信号时段的索引idx=0。具体的,样点索引i为自变量,代表着第i个样点,起始点索引is为记录变量,记录着有效信号段的起始样点;为了遍历小波信号序列中的所有样点,自变量i会发生变化,所以需要定义变量is来记录所述第一样点,可选的,有效语音信号时段的索引idx也是记录变量,记录着第idx个有效语音段,可以定义idx来记录所述第一音频信号中包括的有效语音段的数量。
901、判断第i个样点的音频强度值Sc(i)是否大于所述第二音频强度阈值,且所述起始点索引is等于0。具体的,所述第二音频强度阈值可以认为有效语音信号的上限阈值,将样点的音频强度值与所述第二音频强度阈值进行比较,
902、记录进入有效语音段的样点i,is=i。具体的,当第i个样点的音频强度值Sc(i)大于所述第二音频强度阈值时,且起始点索引is为初始定义的0时,通过is记录符合步骤901的第i样点的样点位置,预定义所述第i样点进入有效语音段,然后进行下一个样点音频强度的判断,进入步骤907,i=i+1,将下一样点作为当前样点,以此不断进行检测判断。可以理解的是,所述第i样点的前一样点的音频强度值小于所述第二音频强度阈值,而所述第一样点的音频强度值大于所述第二音频强度阈值,前文结合图7所描述的实施例中的步骤704中,获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值。此时的第i样点为所述第一样点,即is代表着所述第一样点,进行步骤902。若所述第i个样点的音频强度值Sc(i)不大于所述第二音频强度阈值,进行步骤903。
903、判断第i个样点的音频强度值Sc(i)是否小于第二音频强度阈值,且起始点索引is不等于0。具体的,若所述第i个样点的音频强度值Sc(i)小于或等于第二音频强度阈值,或所述起始点索引is不等于0时,再将所述第i个样点的音频强度值Sc(i)与所述第一音频强度阈值进行比较,实现前文结合图7所描述的实施例步骤705中的获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后首个出现音频强度值小于所述第一音频强度阈值的样点。为了保证所述第二样点在所述第一样点之后,所以还需要对所述起始索引is进行判断,若所述其实索引点is不等于0,代表着所述第一样点已经出现并且确定,当所述小波信号序列中样点的音频强度值时,首次出现小于所述第一音频 强度阈值的样点时,确定为所述第二样点。可以理解的是,所述第二样点已经退出所述第一样点所在的有效语音段,所述第二样点的前一样点为所述有效语音段的结束端点。若所述第i个样点的音频强度值Sc(i)不小于所述第一音频强度阈值,代表着所述第i样点还处于有效信号段中,又或者,所述起始点索引is等于0,代表着所述第i样点还没进入预定义的有效信号段中,出现上述两种情况,均执行步骤907,i=i+1,将下一样点作为当前样点,重新开始有效语音段的检测流程。
进一步的,可以将进入有效语音信号段的起始样点与有效语音信号的结束端点之间的时间间隔来进行比较,判断所述第一样点与所述第二样点之间是否至少包括第一预设数量个连续样点,步骤如下:
904、判断i与is之间的时间间隔,i>is+T min,具体的,由于采样时间可以由采样频率来确定,所述第一样点与所述第二样点之间至少包括第一预设数量个连续样点中的所述第一预设数量个连续样点可以用一段时间Tmin来表示,示例性的,以第一音频帧信号的16kHz采样频率为例,所述第一音频帧信号的帧长为10ms,包括160个样点,经过3级小波分解或小波包分解的降采样后,小波信号序列中的样点间隔为0.5ms,若所述第一预设数量为20,则Tmin为20乘以0.5ms,即Tmin等于10ms。若所述第一样点与所述第二样点之间至少包括第一预设数量个连续样点,即i>is+Tmin时,执行步骤905。若所述第一样点与所述第二样点之间的样点数目少于所述第一预设数量个连续样点,此时i为所述第二样点,is经过步骤901为所述第一样点,即i>is+Tmin不成立时,认为第i个样点的前一样点第i-1个样点不是有效语音段的结束端点,前面步骤902中is=i记录的有效语音段的开始端点可能是噪声抖动,因为瞬态噪声的能量是快速上升后快速下降的,所以样点的音频强度值大于第二音频强度阈值,但持续的时间不够长,在小于Tmin的时间内就下降到小于所述第一音频强度阈值,这与语音信号的短时稳定性不符合,舍弃掉这一段信号,执行步骤906。
905、idx=idx+1,有效语音段为[is,i-1]。具体的,当所述第一样点与所述第二样点之间至少包括第一预设数量个连续样点,即i>is+Tmin时,实现前文结合图7所描述的实施例中的步骤706,确定所述第一样点与所述第二样点的前一样点在所述所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。所述有效语音段的区间表示形式为[is,i-1],所述is记录着所述第一样点,所述i为所述第二样点,i-1为所述第二样点的前一样点。可选的,idx=idx+1,记录所述小波信号序列中包括的有效信号段的数量。然后执行步骤906。
906、重置is=0。具体的,is记录的所述第一样点已经用区间记录下来了,可以释放is的值,重置is=0,执行步骤907,i=i+1,将下一样点作为当前样点,重新开始有效语音段的检测流程。
907、i=i+1,具体的,不停地对小波信号序列的样点进行遍历,通过加1的操作,从前面开始往后面进行样点遍历。
908、判断i是否大于或等于样点总数。具体的,在执行步骤907,i=i+1后,重新开始有效语音段的检测流程之前,需要对样点的位置进行判断,判断第i个样点中的i是否大于或等于小波信号序列中样点的总数,因为i一直在加1,在不停的向后移动进行样点的遍历,若i小于小波信号序列中样点总数,则继续进行上述与所述第二音频强度阈值以及所述第一音频强度阈值的比较流程,若第i个样点已经遍历至所有样点中的最后一个,即i等于或大于样点总数时,进行步骤909。
909、确定有效语音段为[is,i-1],实现前文结合图7所描述的实施例中的步骤706,确 定所述第一样点与所述第二样点的前一样点在所述所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
前文结合图1至图9所描述的实施例基于语音信号的音频强度值来判断有效语音信号以及有效语音信号出现的时间段,进一步的,可以对语音信号进行跟踪,并用跟踪的结果影响信号的音频强度值,进一步地提高有效语音信号检测的准确性。下面结合附图对跟踪语音信号进行详细的介绍。参见图10至图12。
首先参见图10,图10为本申请实施例提供的一种跟踪语音信号的流程示意图,如图10所示,具体的跟踪步骤如下:
1000、将小波信号序列中目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度。具体的,将所述小波序列中的样点进行时域幅度平滑,以使语音信号的前后样点之间可以圆滑过渡,减少毛刺对语音信号的影响。示例性的,以S(i)表示所述目标样点的音频强度值,S(i-1)表示所述目标样点前一样点的音频强度值,α s代表所述平滑系数,将所述小波信号序列中所述目标样点前一样点的音频强度值S(i-1)乘以平滑系数α s得到所述目标样点的第二参考音频强度值,所述目标样点的第二参考音频强度值为α s×S(i-1)。
1001、将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值。具体的,所述第二参考音频强度值为时域平滑结果的一部分,将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列帧中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数作为时域平滑结果的另一部分。示例性的,以将所述第一音频信号进行3级小波包分解为例进行说明,所述小波信号序列中包括8个小波包分解信号,排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值M(i)为:
Figure PCTCN2020128374-appb-000001
其中,公式1中i为小波信号序列中的第i个样点,l代表着第l个小波分解信号,可以理解的是,i小于小波信号序列中所有样点的数目总数。将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值M(i)乘以剩余的平滑系数1-α s,得到所述目标样点的第三参考音频强度值,所述第三参考音频强度值为M(i)×(1-α s)。
1002、将所述第二参考音频强度值和所述第三参考音频强值相加得到的数值,作为所述目标样点的第四参考音频强度值。具体的,由步骤1000和步骤1001可知所述第二参考音频强度值为α s×S(i-1),所述第三参考音频强度值为M(i)×(1-α s),将所述第二参考音频强度值和所述第三参考音频强值相加得到所述第四参考音频强度值为α s×S(i-1)+M(i)×(1-α s),在一种可能的实现方式中,所述第四参考音频强度值可以表示为经过平滑后的所述目标样点的音频强度值,将所述第四参考音频强度值作为所述目标样点的音频强度值,公式表示为S(i)=α s×S(i-1)+M(i)×(1-α s)。
1003、将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。具体的,预设需要跟踪信号的时长,将在所述小波信号序列中排序在所述目标样点之前的所 有样点的第四参考音频强度值划分为一段一段预设时长的跟踪信号,记录第一段时长中所有样点的第四参考音频强度值的最小值,并传递至下一段预设时长的跟踪信号,将上一段预设时长传递下来的所有样点的最小值和该段中的第一样点的音频强度值进行比较,记录下两者中的相对较小值,并将两者中相对较小值与该时段的下一样点的音频强度值进行比较,以此类推,每次都是记录两者中的相对较小值与下一个样点的音频强度值进行比较,由此获得该段预设时长中所有样点的第四参考音频强度值的最小值,从而确定所述目标样点的第一参考音频强度值。
实施本实施例,将所述小波信号序列中的所有样点进行预设时长划分,跟踪预设时长的所有样点的音频强度的分布,可以弱化瞬态噪声的能量。实施本实施例的效果可以参见图11a和图11b,图11a为本申请实施例提供的另一种语音信号示意图。如图11a所示,前文结合图1至图9所描述的实施例,通过对小波信号序列的所有样点进行统计,可以获取准确的第一音频强度阈值和第二音频强度值,从而将瞬态噪声排除在有效语音段的范围外,实现如图11a所示的效果;而实施本实施例,实现的效果参见图11b,图11b为本申请实施例提供的又一种语音信号示意图。如图11b所示,通过跟踪预设时长的所有样点的音频强度的分布,对小波信号序列中的样点的音频强度值进行弱化,瞬态噪声的能量极大被削弱,降低了瞬态噪声对有效语音信号检测的干扰,通过跟踪处理后的第一音频强度阈值和第二音频强度阈值对有效语音信号进行检测,从而提高有效语音信号检测的准确性。
下文将结合附图对如何跟踪语音信号以及跟踪语音信号达到的效果进行详细介绍。
在一种可能的实施例中,为了进一步的降低小波信号序列中可能出现的毛刺影响,确定所述目标样点的第一参考音频强度值之后还可以继续进行以下步骤:
1004、将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。具体的,在小波信号序列中,对目标样点进行短时均值平滑,将短时均值平滑后的数值作为所述目标样点的音频强度值,在一种可能的实现方式中,第i个样点的音频强度值S C(i)为:
Figure PCTCN2020128374-appb-000002
其中,公式2中的2M为所述第二预设数量个连续样点,S m(i)为所述目标样点的第一参考音频强度值,S m(i-m)表示在所述第i个样点的前面或后面m个样点。示例性的,M=80,所述第二预设数量个连续样点为160,则
Figure PCTCN2020128374-appb-000003
表示在所述第i个样点的前面和后面各取80个样点的第一音频参考强度值来进行求和运算,得到包括目标样点i以及所述目标样点i前后的M个样点的音频强度值的总和;将求和运算得到的结果进行求平均值的运算,将音频强度值的总和除以所有样点的个数,作为所述第i个样点幅度短时均值平滑后的音频强度值S C(i),此时公式2中的m为自变量,为了避免出现负数个样点,所以i大于M,以M等于80为例,从第81个样点开始进行均值平滑。
有效语音信号检测装置对语音信号进行跟踪,并用跟踪的结果影响信号的音频强度值,可以与前文结合图1至图9所描述的任意一个使用到样点音频强度值的实施例进行结合。
在一种可能的实施例中,有效语音信号检测装置获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号,获取每个音频帧信号中的多个样点以及每个样点的音频强度值;
针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分 解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
将所述小波信号序列中目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度值;
将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
将所述第二参考音频强度值和所述第三参考音频强值相加得到的数值,作为所述目标样点的第四参考音频强度值;
将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值;
将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值;
获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
本实施例的具体实现过程可以参见前文结合图1至图10所描述的实施例,此处不作赘述。本实施例可以在前文结合图1至图9所描述的实施例的效果上进一步提高有效信号检测的准确性,下文将结合附图来进行详细说明。实施本实施例,通过小波分解的优良局部显微特性,跟踪小波信号序列中稳定时长的能量分布信息,基于跟踪到的能量分布信息确定音频强度阈值的上限,从而实现对有效语音信号的检测。
在另一种可能的实施例中,有效语音信号检测装置获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
将所述小波信号序列中目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度值;
将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
将所述第二参考音频强度值和所述第三参考音频强值相加得到的数值,作为所述目标样点的第四参考音频强度值;
将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值;
将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值;
获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后首个出现音频强度值小于所述第一音频强度阈值的样点;
确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。可选的,所述第二样点与所述第一样点之间至少包括第一预设数量个连续样点。
本实施例的具体实现过程可以参见前文结合图1至图10所描述的实施例,此处不作赘述。本实施例可以在前文结合图1至图9所描述的实施例的效果上进一步提高有效信号检测的准确性,下文将结合附图来进行详细说明。实施本实施例,通过小波分解的优良局部显微特性,跟踪小波信号序列中稳定时长的能量分布信息,基于跟踪到的能量分布信息确定音频强度阈值的上限和下限,从而实现对有效语音信号中有效语音段的检测。
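对于本实施例中利用第一、第二音频强度阈值确定有效语音段的过程,下面给出一个示意性的Python草图(非本申请的原始实现;函数名与min_len参数为举例假设,min_len对应第一预设数量个连续样点):

```python
def valid_segments(s, t_l, t_u, min_len=1):
    """s 为小波信号序列各样点的音频强度值,t_l、t_u 分别为第一、第二音频
    强度阈值(t_l < t_u)。返回各有效语音段在序列中的(起点,终点)索引。"""
    segments = []
    i, n = 1, len(s)
    while i < n:
        # 第一样点:前一样点低于 t_u 且当前样点高于 t_u
        if s[i - 1] < t_u and s[i] > t_u:
            j = i + 1
            # 第二样点:排序在第一样点之后首个低于 t_l 的样点
            while j < n and s[j] >= t_l:
                j += 1
            if j - i >= min_len:
                segments.append((i, j - 1))   # 第二样点的前一样点为段尾
            i = j + 1
        else:
            i += 1
    return segments
```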
在一种可能的实现方式中,获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;
将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
在一种可能的实现方式中,将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
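高频补偿的一种常见做法是一阶预加重滤波,以下Python草图仅作示意(预加重并非本申请限定的补偿方式,系数beta与第一预设阈值的对应关系为举例假设):

```python
import numpy as np

def pre_emphasis(x, beta=0.97):
    """对原始音频信号 x 做一阶预加重 y[n] = x[n] - beta * x[n-1],
    相对提升高频成分,得到第一音频信号的一种可能形式。"""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - beta * x[:-1]
    return y
```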
可以理解的是,所述针对所述每个音频帧信号进行小波分解包括:针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
下文将结合附图对如何跟踪语音信号进行示例性说明,参见图12,图12为本申请实施例提供的另一种跟踪语音信号的流程示意图。如图12所示,具体执行步骤如下:
1201、有效语音信号检测装置初始定义小波信号序列的样点索引i=0,初始化音频强度值S(0)=M(0),定义样点累积索引i_mod=0。具体的,i=0,S(0)=M(0),i_mod=0,可以理解成所述有效语音信号检测装置的初始状态,定义需要遍历的样点初始值,以及对应的音频强度值,样点累积索引用于控制预设时长,当样点累积索引i_mod的值达到一定值时,进行数据更新,完成一段预设时长的信号跟踪。
1202、i=i+1,第i个样点的音频强度值S(i)=α_s×S(i-1)+M(i)×(1-α_s)。具体的,开始进行样点的音频强度值跟踪,也可以理解为能量分布情况的跟踪,i=i+1,对每一个遍历过的样点进行幅度平滑,平滑后的第i个样点的音频强度值为S(i)=α_s×S(i-1)+M(i)×(1-α_s),实现前文结合图10所描述的实施例中的步骤1000、1001和1002的方法,所述第四参考音频强度值为S(i)=α_s×S(i-1)+M(i)×(1-α_s),可选的α_s=0.7。
1203、判断i是否小于累加样点数目V_win。具体的,本实施例是对一段时长的语音信号进行跟踪,所以需要对样点进行累加,预先定义累加的样点数目V_win,可选的V_win=10,当在第0至9个样点时,进行步骤1204,当遍历至第10个样点时,进行步骤1205。
1204、若i小于累加样点数目V_win,定义S_min=S(i),S_mact=S(i)。具体的,当i从小波信号序列中的第1个样点开始遍历,进行样点的音频强度平滑,若i小于累加样点数目V_win时,将S(i)的值赋予S_min和S_mact,即S_min=S(i),S_mact=S(i),进行步骤1206对S_min的数据进行记录,以及步骤1207开始进行样点累加。示例性的,i=i+1,可以理解成有效语音信号检测装置一直在对样点的音频强度值进行跟踪,i小于累加数目V_win的情况是所述第一音频信号的前V_win个样点,例如V_win=10,当遍历至第9个样点时S_min=S(9),S_mact=S(9),S_min和S_mact记录着第9个样点的音频强度值。
1205、若i大于或等于累加样点数目V_win,获取第V_win个样点到第i个样点的音频强度值最小值,S_min=min(S_min,S(i)),S_mact=min(S_mact,S(i))。具体的,若i大于或等于累加样点数目V_win,当遍历至第V_win个样点时,以V_win=10为例进行说明,示例性的,当步骤1203遍历至第10个样点时,获取第9个样点与第10个样点之间的较小值赋给S_min,S_min=min(S_min,S(10)),遍历至第10个样点的前一步骤中S_min记录着S(9)的值。
1206、定义S_m(i)=S_min。具体的,实现前文结合图10所描述的实施例中的步骤1003,将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。可以理解的是,在i小于累加样点数目V_win时,S_m(i)记录的并不是相邻样点间的较小值;在语音开始的几个样点,需要做一部分必要的初始化设置,例如将矩阵SW进行初始化,所以刚开始的部分语音信号可以不加考虑。以V_win=10为例,S_m(i)记录的是从第9个样点开始获取的音频强度值的最小值。
1207、i_mod=i_mod+1。具体的,在样点i遍历的过程中,样点累积索引i_mod也在不断地累加,i_mod=i_mod+1,i_mod控制着矩阵SW是否进行数据更新,将所述小波信号序列划分为预设时长的语音信号来进行跟踪。可以理解的是,i代表着所述小波信号序列中的样点位置与顺序,而i_mod代表着第i个样点在所述预设时长中的位置与顺序,在达到预设时长时,i_mod会被重置,重新开始记录小波信号序列中的样点在下一段预设时长中的位置。
1208、判断i_mod是否等于V_win。具体的,对i_mod与V_win进行比较,判断对样点的跟踪是否达到了预设时长。示例性的,以16kHz为所述第一音频信号的采样频率,进行3级小波包分解并降采样,则小波信号序列中每隔0.5ms有一个样点,样点累加数目V_win=10,跟踪时长为V_win×0.5ms=5ms。若i_mod等于V_win,代表达到了跟踪预设时长,进行步骤1209;若i_mod不等于V_win(即i_mod小于V_win),进行步骤1213。
1209、i_mod=0。具体的,在i_mod每次达到累加样点数目V_win时,释放i_mod,重置i_mod=0,以进行下一次样点累加。
1210、判断i是否等于V_win。具体的,当i等于V_win时,进行步骤1211,初始化矩阵数据;当i不等于V_win时,进行步骤1212。
1211、初始化矩阵SW。具体的,定义SW:
SW = [S_min  S_mact]^T  (N_win行、1列的列向量)
当i等于V_win时,定义N_win行、1列的矩阵SW,可选的,N_win=2。可以理解的是,该步骤是在一段语音的开始部分执行的,i一直在累加,V_win是一个预设的固定值,在i遍历至第V_win个样点时,对矩阵SW进行初始化设置,以提供矩阵来存储本实施例的数据。
1212、进行矩阵SW中的数据更新,并记录矩阵中的最小值S_min=min{SW},重置S_mact=S(i)。具体的,SW为:
SW = [S_min  S_mact]^T
当i不等于V_win时,并且i_mod累加达到预设时长时,更新矩阵SW的值,将当前时段所有样点的最小值与上一时段的最小值放在矩阵SW中,获取两者中的较小值,记录在S_min中,S_min=min{SW}。可以理解的是,S_min记录着从上一跟踪时段开始的所有样点的最小值,释放S_mact,重置S_mact=S(i)。示例性的,以跟踪时长为5ms为例进行说明,S_mact记录着最新的5ms中所有样点的第四参考音频强度值的最小值,S_min记录着前一段5ms中所有样点的第四参考音频强度值的最小值,将相邻5ms的最小值放在长度为2的矩阵SW中,获取两者中的较小值,记录在S_min中,S_min=min{SW};在步骤1206中将记录着跟踪时长最小值的S_min赋值给S_m(i),即S_m(i)=S_min。
1213、判断i是否大于或等于样点总数。具体的,在执行步骤1202,i=i+1后,重新开始跟踪预设时段的信号之前,需要对小波信号序列的样点位置进行判断,判断第i个样点中的i是否大于或等于小波信号序列中样点的总数,因为i一直在加1,在不停地向后移动进行样点的遍历;若i小于小波信号序列中样点总数,则继续进行信号跟踪;若第i个样点已经遍历至所有样点中的最后一个,即i等于或大于样点总数时,进行步骤1214。
1214、确定S_m(i)为第i个样点的第一参考音频强度值或音频强度值。具体的,由步骤1212和步骤1206可知,S_m(i)记录着相邻跟踪时段内所有样点的最小值,在一种可能的实现方式中,S_m(i)为第i个样点的第一参考音频强度值,实现前文结合图10所描述的实施例步骤1003所描述的实现过程,得到所述目标样点的第一参考音频强度值,从而获取所述目标样点的音频强度值。
本实施例通过矩阵将上一跟踪时长的所有样点的音频强度值的最小值S_min传递到当前跟踪时长中,S_min与目标样点的音频强度值进行比较,获取包括目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,即S_min=min(S_min,S(i)),作为所述目标样点的第一参考音频强度值S_m(i);再将两者中的较小值与所述目标样点的后一样点的第四参考音频强度值进行比较,获取两者中的较小值,作为所述目标样点的后一样点的第一参考音频强度值S_m(i+1),以此类推,获取该跟踪时长的所有样点的音频强度的最小值,并通过矩阵将上一跟踪时长与当前跟踪时长中相对较小的音频强度值传递到下一跟踪时长中。S_m(i)构成的样点序列可以描述语音信号的音频强度值的分布情况,也可以理解为语音信号的能量分布趋势。
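下面给出图12流程的一个示意性Python草图(非本申请的原始实现;SW的初始化方式与边界处理为简化假设,V_win=10、N_win=2、α_s=0.7为文中给出的可选取值),按每V_win个样点为一段跟踪最小值,并用长度为N_win的记录SW在相邻跟踪时段之间传递最小值:

```python
import numpy as np

def track_min(m, alpha_s=0.7, v_win=10, n_win=2):
    """m 为小波信号序列各样点的音频强度值 M(i),
    返回各样点的第一参考音频强度值 S_m(i)。"""
    m = np.asarray(m, dtype=float)
    n = len(m)
    s = np.empty(n); s_m = np.empty(n)
    s[0] = m[0]; s_m[0] = s[0]                 # 步骤1201: S(0)=M(0)
    s_min = s_mact = s[0]
    sw = [s_min] * n_win                       # 简化: 预先建立记录 SW
    i_mod = 0
    for i in range(1, n):
        s[i] = alpha_s * s[i - 1] + m[i] * (1 - alpha_s)   # 步骤1202: 一阶平滑
        if i < v_win:                                       # 步骤1204
            s_min = s_mact = s[i]
        else:                                               # 步骤1205
            s_min = min(s_min, s[i])
            s_mact = min(s_mact, s[i])
        s_m[i] = s_min                                      # 步骤1206: S_m(i)=S_min
        i_mod += 1
        if i_mod == v_win:                                  # 步骤1208~1212
            i_mod = 0
            sw = sw[1:] + [s_mact]             # 将本时段最小值写入 SW
            s_min = min(sw)                    # S_min = min{SW}
            s_mact = s[i]                      # 重置 S_mact
    return s_m
```

得到的S_m(i)可按前文公式2做短时均值平滑后,再用于阈值计算与有效语音判决。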
实施本实施例,通过对稳定时长的信号的音频强度值进行跟踪,可以更进一步的提高有效语音信号检测的准确性,进一步避免将瞬态噪声误检测为有效语音信号或有效语音信号段。
下面可以结合附图对实施本实施例的效果进行示例性说明,参见图13a至图13e,图13a至图13e分别为本申请实施例提供的一种有效语音信号的检测效果示意图。有效语音信号检测装置获取一段包括瞬态噪声的原始语音信号,所述语音信号的原始波形图如图13a所示,可见所述瞬态噪声分布在0~6s的时间段内。
有效信号检测装置对该原始语音信号进行前文结合图1至图9所述的小波分解或小波包分解后,得到的原始信号幅度的小波信号序列的所有样点的音频强度值,进一步的,采用前文结合图10和图12所述的语音信号跟踪,得到语音信号跟踪后的稳态幅度跟踪,两种方式的样点能量分布如图13b所示,可以理解的是,前文结合图10将包括目标样点在内,在所述小波信号序列中排序在所述目标样点之前的所有样点音频参考强度值的最小值作为所述目标样点的音频参考强度值,所以经过稳态幅度跟踪后的语音信号的幅度值相对原始信号的小波信号序列的幅度值有所弱化,而弱化的部分是瞬态噪声部分的信号,语音部分的信号几乎没有变化。
为进一步的减少信号毛刺的影响,对所述原始信号幅度和所述稳态幅度跟踪后的所有样点的音频强度值进行平滑,平滑后的结果如图13c所示。在一种可能的实现方式中,可以参考前文结合图10的实施例,以及公式2进行样点的音频强度值的平滑,结合图13b和图13c,可以看出实施前文结合图10的实施例中的样点音频强度值的短时均值平滑可以明显的减少信号的毛刺,让信号整体趋于平滑。
对图13c中的信号进行有效语音信号检测,即VAD(Voice Activity Detection,语音活动检测),在本申请中即有效语音信号检测。对原始信号能量的VAD检测结果如图13d所示,对平稳信号序列跟踪得到的VAD检测结果如图13e所示。对图13c中平滑后的原始信号幅度实施前文结合图1至图9所述的实施例,检测结果已经比较准确;但若先对原始信号的能量实施前文结合图10至图12的实施例,对其进行进一步的跟踪,再实施前文结合图1至图9所述的实施例,则可以更进一步地提高有效语音检测的准确性。如图13e所示,相对于图13d,图13e中的检测结果将瞬态噪声误判为有效语音信号的概率更低,大大提高了有效语音信号检测的准确性。
下面对本申请实施例提供的一种有效信号的检测装置进行说明,参见图14,图14为本申请实施例提供的一种有效语音信号的检测装置的结构框图,如图14所示,一种语音信号检测的装置14包括:
获取模块1401,用于获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
分解模块1402,用于针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
拼接模块1403,用于按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
确定模块1404,用于获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
所述确定模块1404,还用于获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
在一种可能的实施例中,所述确定模块1404还用于根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
所述获取模块1401还用于获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
所述获取模块1401还用于获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后首个出现音频强度值小于所述第一音频强度阈值的样点;
所述确定模块1404还用于确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
可选的,所述第二样点与所述第一样点之间包括第一预设数量个连续样点。
在一种可能的实施例中,所述确定模块1404,还用于将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。
在一种可能的实现方式中,所述一种语音信号检测的装置14还包括计算模块1405,在所述确定模块1404将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值之前,所述计算模块1405,用于将所述小波信号序列中所述目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度值;所述计算模块1405,还用于将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;所述计算模块1405,还用于将所述第二参考音频强度值和所述第三参考音频强度值相加得到的数值,作为所述目标样点的第四参考音频强度值;所述确定模块1404,还用于将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。
可选的,所述获取模块1401,还用于获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;所述确定模块1404,还用于将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
在一种可能的实施例中,所述一种有效语音信号的检测装置14还包括补偿模块1406,在所述获取模块1401获取预设时长的第一音频信号之前,所述补偿模块1406,用于将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
在一种可能的实现方式中,所述分解模块1402,还用于针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
在一种可能的实现方式中,所述确定模块1404,还用于根据所述小波信号序列中所有样点音频强度值的最大值和所述最小值确定第一参考音频强度阈值T_L=min(λ_1×(Sc_max-Sc_min)+Sc_min, λ_2×Sc_min),其中Sc_max和Sc_min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ_1为第二预设阈值,λ_2为第三预设阈值。
在另一种可能的实现方式中,所述确定模块1404,还用于根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,具体的,根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T_L=min(λ_1×(Sc_max-Sc_min)+Sc_min, λ_2×Sc_min),其中Sc_max和Sc_min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ_1为第二预设阈值,λ_2为第三预设阈值;
所述第二音频强度阈值T_U=α×T_L,其中α为第四预设阈值,α取值大于1。
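下面给出上述阈值计算的一个示意性Python片段(非本申请的原始实现;λ_1、λ_2、α的取值仅为举例假设):

```python
def intensity_thresholds(sc_max, sc_min, lam1=0.1, lam2=2.0, alpha=2.0):
    """根据小波信号序列中所有样点音频强度值的最大值 sc_max 与最小值 sc_min,
    计算第一音频强度阈值 T_L 与第二音频强度阈值 T_U (alpha > 1)。"""
    t_l = min(lam1 * (sc_max - sc_min) + sc_min, lam2 * sc_min)
    t_u = alpha * t_l
    return t_l, t_u
```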
可以理解的是,本实施例中对有效语音信号检测的具体实现过程可以参考前文结合图1至图13e所描述的实施例,此处不作赘述。
实施本实施例,通过采集小波信号序列中所有样点的能量信息,根据小波信号序列所有样点的能量分布情况,对有效语音信号进行判断检测,提高有效语音检测的准确性。
下面对本申请实施例提供的一种有效信号的检测设备进行说明,参见图15,图15为本申请实施例提供的一种有效语音信号的检测设备的结构框图,如图15所示,一种语音信号检测的设备15包括:收发器1500、处理器1501和存储器1502,其中:
所述收发器1500与所述处理器1501以及所述存储器1502连接,所述处理器1501还与所述存储器1502连接,
所述收发器1500,用于获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
所述处理器1501,用于针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
所述处理器1501,还用于按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
所述处理器1501,还用于获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
所述处理器1501,还用于获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
所述存储器1502,用于存储计算机程序,所述计算机程序被所述处理器1501调用。
在一种可能的实施例中,所述处理器1501还用于:
根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后首个出现音频强度值小于所述第一音频强度阈值的样点;
确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
可选的,所述第二样点与所述第一样点之间包括第一预设数量个连续样点。
在一种可能的实施例中,所述处理器1501还用于:
将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。
在一种可能的实施例中,所述处理器1501还用于:
将所述小波信号序列中所述目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度值;
将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
将所述第二参考音频强度值和所述第三参考音频强度值相加得到的数值,作为所述目标样点的第四参考音频强度值,将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。
在一种可能的实现方式中,所述处理器1501还用于:
获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;
将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
可选的,所述处理器1501还用于:
将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
在一种可能的实施例中,所述处理器1501还用于:
针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
在一种可能的实现方式中,所述处理器1501根据所述小波信号序列中所有样点音频强度值的最大值和所述最小值确定第一参考音频强度阈值T_L=min(λ_1×(Sc_max-Sc_min)+Sc_min, λ_2×Sc_min),其中Sc_max和Sc_min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ_1为第二预设阈值,λ_2为第三预设阈值。
在一种可能的实现方式中,所述处理器1501根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T_L=min(λ_1×(Sc_max-Sc_min)+Sc_min, λ_2×Sc_min),其中Sc_max和Sc_min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ_1为第二预设阈值,λ_2为第三预设阈值;
所述第二音频强度阈值T_U=α×T_L,其中α为第四预设阈值,α取值大于1。
可以理解的是,所述有效信号检测设备15可通过其内置的各个功能模块执行如上述图1至图13e中各个步骤所提供的实现方式,具体可参见上述图1至图13e中各个步骤所提供的实现方式,在此不再赘述。
实施本实施例,可以在有效语音信号的检测设备检测到有效语音信号时,唤醒所述设备的其他工作模块,减少所述设备的功率消耗。
本申请还提供了一种可读存储介质,所述可读存储介质中存储有指令,所述指令被有效语音信号的检测设备中的处理器执行,以实现上面所述图1至图13e中各方面所述方法的步骤。
需要说明的是,上述术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性。
本申请的实施例可以通过采集小波信号序列中所有样点的能量信息,根据小波信号序列的能量分布情况,对有效语音信号进行判断检测,提高有效语音检测的准确性,并且还可以对小波信号序列中所有样点的音频强度值进行平滑处理以及跟踪所述小波信号序列中所有样点的能量分布信息,更进一步的提高有效语音信号检测的准确性。
在本申请所提供的几个实施例中,应该理解到,所揭露的方法、装置以及系统,可以通过其它的方式实现。以上所描述的实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元,即可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能模块可以全部集成在一个处理单元中,也可以是各模块分别单独作为一个单元,也可以两个或两个以上模块集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本申请上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本申请各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (13)

  1. 一种有效语音信号的检测方法,其特征在于,所述方法包括:
    获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
    针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
    按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
    获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述小波信号序列中所有样点音频强度值的最大值和最小值确定第一音频强度阈值包括:
    根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值,其中所述第一音频强度阈值小于所述第二音频强度阈值;
    所述将音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应的样点信号确定为有效语音信号包括:
    获取所述小波信号序列中第一样点,其中所述第一样点的前一样点的音频强度值小于所述第二音频强度阈值,以及所述第一样点的音频强度值大于所述第二音频强度阈值;
    获取所述小波信号序列中的第二样点,所述第二样点为在所述小波信号序列中排序在所述第一样点之后,首个出现音频强度值小于所述第一音频强度阈值的样点;
    确定所述小波信号序列中的所述第一样点和所述第二样点的前一样点在所述第一音频信号中对应样点的信号为所述有效语音信号中的有效语音段。
  3. 根据权利要求2所述的方法,其特征在于,所述第二样点与所述第一样点之间至少包括第一预设数量个连续样点。
  4. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值。
  5. 根据权利要求4所述的方法,其特征在于,所述将所述小波信号序列中包括目标样点在内的第二预设数量个连续样点的第一参考音频强度值的平均值作为所述目标样点的音频强度值之前包括:
    将所述小波信号序列中所述目标样点前一样点的音频强度值乘以平滑系数,得到所述目标样点的第二参考音频强度值;
    将所述小波信号序列中包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有连续样点的音频强度值的平均值乘以剩余的平滑系数,得到所述目标样点的第三参考音频强度值;
    将所述第二参考音频强度值和所述第三参考音频强度值相加得到的数值,作为所述目标样点的第四参考音频强度值;将包括所述目标样点在内,且在所述小波信号序列中排序顺序在所述目标样点之前的所有样点的第四参考音频强度值中的最小值,作为所述目标样点的第一参考音频强度值。
  6. 根据权利要求1所述的方法,其特征在于,所述获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值包括:
    获取所述小波信号序列中的第一小波分解信号所有样点音频强度值中的第一参考最大值和第一参考最小值;
    将所述小波信号序列中的所有小波分解信号中的参考最大值和参考最小值进行处理得到的数值,作为所述小波信号序列中所有样点音频强度值的最大值和最小值。
  7. 根据权利要求1所述的方法,其特征在于,所述获取预设时长的第一音频信号之前包括:
    将所述预设时长的原始音频信号中的第一预设阈值的高频成分进行补偿,从而得到所述第一音频信号。
  8. 根据权利要求1所述的方法,其特征在于,所述针对所述每个音频帧信号进行小波分解包括:
    针对所述每个音频帧信号进行小波包分解,将小波包分解后得到的信号作为所述小波分解信号。
  9. 根据权利要求1所述的方法,其特征在于,所述根据所述小波信号序列中所有样点音频强度值的最大值和所述最小值确定第一参考音频强度阈值T_L=min(λ_1×(Sc_max-Sc_min)+Sc_min, λ_2×Sc_min),其中Sc_max和Sc_min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ_1为第二预设阈值,λ_2为第三预设阈值。
  10. 根据权利要求2所述的方法,其特征在于,所述根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一音频强度阈值和第二音频强度阈值包括:
    根据所述小波信号序列中所有样点的音频强度值中的最大值和最小值确定所述第一参考音频强度阈值T_L=min(λ_1×(Sc_max-Sc_min)+Sc_min, λ_2×Sc_min),其中Sc_max和Sc_min分别为所述小波信号序列中所有样点音频强度值的最大值和最小值,λ_1为第二预设阈值,λ_2为第三预设阈值;
    所述第二音频强度阈值T_U=α×T_L,其中α为第四预设阈值,α取值大于1。
  11. 一种有效语音信号的检测装置,其特征在于,包括:
    获取模块,用于获取预设时长的第一音频信号,所述第一音频信号包括至少一个音频帧信号;
    分解模块,用于针对所述每个音频帧信号进行小波分解,得到分别与每个音频帧信号对应的多个小波分解信号,每个小波分解信号中包含多个样点以及每个样点的音频强度值;
    拼接模块,用于按照所述音频帧信号在所述第一音频信号中的分帧顺序,将各个音频帧信号对应的小波分解信号进行拼接得到小波信号序列;
    确定模块,用于获取所述小波信号序列中所有样点的音频强度值中的最大值和最小值,根据所述小波信号序列中所有样点的音频强度值的最大值和最小值确定第一音频强度阈值;
    所述确定模块,还用于获取所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点,将所述小波信号序列中音频强度值大于所述第一音频强度阈值的样点在所述第一音频信号中对应样点的信号确定为有效语音信号。
  12. 一种有效语音信号的检测设备,其特征在于,所述设备包括收发器、处理器和存储器,其中所述处理器用于执行所述存储器中存储的计算机程序,实现如权利要求1至10中任意一项所述方法的步骤。
  13. 一种计算机可读存储介质,其特征在于,所述可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行如权利要求1至10中任意一项所述方法的步骤。
PCT/CN2020/128374 2019-11-13 2020-11-12 一种有效语音信号的检测方法、装置及设备 WO2021093808A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/728,198 US20220246170A1 (en) 2019-11-13 2022-04-25 Method and apparatus for detecting valid voice signal and non-transitory computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911109218.XA CN110827852B (zh) 2019-11-13 2019-11-13 一种有效语音信号的检测方法、装置及设备
CN201911109218.X 2019-11-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/728,198 Continuation US20220246170A1 (en) 2019-11-13 2022-04-25 Method and apparatus for detecting valid voice signal and non-transitory computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021093808A1 true WO2021093808A1 (zh) 2021-05-20

Family

ID=69554882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128374 WO2021093808A1 (zh) 2019-11-13 2020-11-12 一种有效语音信号的检测方法、装置及设备

Country Status (3)

Country Link
US (1) US20220246170A1 (zh)
CN (1) CN110827852B (zh)
WO (1) WO2021093808A1 (zh)

Also Published As

Publication number Publication date
CN110827852B (zh) 2022-03-04
US20220246170A1 (en) 2022-08-04
CN110827852A (zh) 2020-02-21

