WO2017181772A1 - Voice detection method, apparatus and storage medium - Google Patents

Voice detection method, apparatus and storage medium

Info

Publication number
WO2017181772A1
WO2017181772A1 · PCT/CN2017/074798 · CN2017074798W
Authority
WO
WIPO (PCT)
Prior art keywords
audio
segment
audio segment
current audio
current
Prior art date
Application number
PCT/CN2017/074798
Other languages
English (en)
French (fr)
Inventor
范海金
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority to JP2018516116A (JP6705892B2)
Priority to EP17785258.9A (EP3447769B1)
Priority to KR1020187012848A (KR102037195B1)
Publication of WO2017181772A1
Priority to US15/968,526 (US10872620B2)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26: Speech to text systems
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0224: Noise filtering; Processing in the time domain
    • G10L21/0232: Noise filtering; Processing in the frequency domain
    • G10L25/06: Extracted parameters being correlation coefficients
    • G10L25/09: Extracted parameters being zero crossing rates
    • G10L25/15: Extracted parameters being formant information
    • G10L25/18: Extracted parameters being spectral information of each sub-band
    • G10L25/21: Extracted parameters being power information
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise

Definitions

  • the embodiments of the present invention relate to the field of computers, and in particular, to a voice detection method, apparatus, and storage medium.
  • voice signals are used to implement control in many fields.
  • a voice signal is input as a voice password.
  • the voice detection method applied to such voice signals usually extracts only a single feature from the input signal; this single feature is often sensitive to noise and cannot accurately distinguish interfering sounds from the voice signal, which reduces the accuracy of voice detection.
  • the embodiments of the present invention provide a voice detection method, apparatus, and storage medium to solve at least the technical problem of low voice detection accuracy in related voice detection methods.
  • a voice detection method includes: dividing an audio signal to be detected into a plurality of audio segments; extracting audio features in each of the audio segments, wherein the audio features include at least a time domain feature and a frequency domain feature of the audio segment; and detecting a target speech segment from the audio segments according to the audio features of the audio segments.
  • a voice detecting apparatus includes: a dividing unit, configured to divide the audio signal to be detected into a plurality of audio segments; an extracting unit, configured to extract an audio feature in each of the audio segments, wherein the audio feature includes at least a time domain feature and a frequency domain feature of the audio segment; and a detecting unit, configured to detect a target speech segment from the audio segments based on the audio features of the audio segments.
  • a storage medium configured to store program code for performing the following steps: dividing an audio signal to be detected into a plurality of audio segments; extracting an audio feature in each audio segment, wherein the audio feature includes at least a time domain feature and a frequency domain feature of the audio segment; and detecting the target speech segment from the audio segments based on the audio features of the audio segments.
  • the audio signal to be detected is divided into a plurality of audio segments, and audio features are extracted in each audio segment, where the audio features include at least a time domain feature and a frequency domain feature of the audio segment. Fusing multiple features of the audio segment across different domains makes it possible to accurately detect the target speech segment among the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process and improving detection accuracy. This overcomes the low detection accuracy of related-art approaches that detect voice using only a single feature.
  • the human-machine interaction device can also quickly determine, in real time, the start time and the end time of the speech segment composed of the target speech segments, so that it can respond to the detected voice accurately and in real time, achieving natural human-machine interaction.
  • by accurately detecting the start time and the end time of the speech segment formed by the target speech segments, the human-computer interaction device also improves the efficiency of human-computer interaction, overcoming the low interaction efficiency of the related art, in which the interaction process must be triggered by the user pressing a control button.
  • FIG. 1 is a schematic diagram of an application environment of an optional voice detection method according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of an application environment of another optional voice detection method according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart diagram of an optional voice detection method according to an embodiment of the present invention.
  • FIG. 4 is a waveform diagram of an optional voice detection method according to an embodiment of the present invention.
  • FIG. 5 is a waveform diagram of another alternative voice detection method according to an embodiment of the present invention.
  • FIG. 6 is a waveform diagram of still another alternative voice detection method according to an embodiment of the present invention.
  • FIG. 7 is a waveform diagram of still another alternative speech detection method according to an embodiment of the present invention.
  • FIG. 8 is a waveform diagram of still another alternative speech detection method according to an embodiment of the present invention.
  • FIG. 9 is a schematic flowchart diagram of another optional voice detection method according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of an optional voice detecting apparatus according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of an optional voice detection device in accordance with an embodiment of the present invention.
  • the voice detection method may be, but is not limited to, applied to an application environment as shown in FIG. 1.
  • the audio signal to be detected is acquired by the terminal 102, and the audio signal to be detected is sent to the server 106 through the network 104.
  • the server 106 divides the audio signal to be detected into a plurality of audio segments; and extracts audio features in each audio segment.
  • the audio feature includes at least a time domain feature and a frequency domain feature of the audio segment; and the target voice segment is detected from the audio segment according to the audio feature of the audio segment.
  • the foregoing voice detection method may also be applied to an application environment as shown in FIG. 2, but is not limited to. That is, after the terminal 102 acquires the audio signal to be detected, the terminal 102 performs the detection process of the audio segment in the voice detection method.
  • the specific process may be as above, and details are not described herein again.
  • the terminal shown in FIG. 1-2 is only an example.
  • the foregoing terminal may include, but is not limited to, at least one of the following: a mobile phone, a tablet computer, a notebook computer, a desktop PC, a digital television, and other human-machine interaction devices.
  • the foregoing network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network. The above is only an example, and the embodiment does not limit this.
  • a voice detection method As shown in FIG. 3, the method includes:
  • the target speech segment is detected from the audio segment according to the audio feature of the audio segment.
  • the foregoing voice detection method may be applied to, but is not limited to, at least one of the following scenarios: an intelligent robot chat system, an automatic question answering system, human-machine chat software, and the like. That is, when the voice detection method provided in this embodiment is applied to the human-computer interaction process, audio features including at least the time domain feature and the frequency domain feature of each audio segment are extracted to accurately detect the target speech segments among the plurality of audio segments divided from the audio signal to be detected. The human-computer interaction device can thus learn the start time and the end time of the speech segment composed of the target speech segments, and can make an accurate reply after obtaining the complete voice message the user intends to express.
  • the voice segment may include, but is not limited to, one target voice segment or a plurality of consecutive target voice segments. Each target speech segment includes a start time and a stop time of the target speech segment. This embodiment does not limit this.
  • the human-machine interaction device divides the audio signal to be detected into multiple audio segments and extracts audio features in each audio segment, where the audio features include at least the time domain feature and the frequency domain feature of the audio segment. Fusing multiple features of the audio segment across different domains allows the target speech segment to be accurately detected from the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process, improving the accuracy of voice detection, and overcoming the low detection accuracy of related-art approaches that detect voice using only a single feature.
  • the human-machine interaction device can also quickly determine, in real time, the start time and the end time of the speech segment composed of the target speech segments, so that it reacts to the detected voice information accurately and in real time, achieving natural human-machine interaction.
  • by accurately detecting the start time and the end time of the speech segment formed by the target speech segments, the human-computer interaction device likewise improves the efficiency of human-computer interaction, overcoming the low interaction efficiency of the related art, where interaction must be triggered by the user pressing a control button.
  • the foregoing audio features may include, but are not limited to, at least one of: the signal zero-crossing rate in the time domain, the short-time energy in the time domain, the spectral flatness in the frequency domain, the signal information entropy in the time domain, autocorrelation coefficients, the wavelet-transformed signal, signal complexity, and so on.
  • the signal zero-crossing rate may be used, but is not limited to, to remove the interference of some impulse noise;
  • the short-time energy may be used, but is not limited to, to measure the amplitude of the audio signal and, compared against a threshold, to remove low-energy noise;
  • the spectral flatness may be used, but is not limited to, to characterize the frequency distribution of the signal in the frequency domain and to determine, according to its magnitude, whether the audio signal is background Gaussian white noise;
  • the signal time-domain information entropy may be used, but is not limited to, to measure the distribution characteristics of the audio signal in the time domain, distinguishing speech signals from general noise.
  • combining these features resists the interference of impulse noise and background noise and enhances robustness, so that the target speech segment is accurately detected among the audio segments divided from the audio signal to be detected, and the start time and the end time of the speech segment formed by the target speech segments are accurately obtained, realizing natural human-machine interaction.
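The four features named above can be sketched in plain Python. The function names and the histogram-based entropy definition are our assumptions (the patent does not give formulas at this point), and a real implementation would use an FFT library rather than a naive DFT:

```python
import cmath
import math

def zero_crossing_rate(seg):
    # Fraction of adjacent sample pairs whose signs differ
    return sum(1 for a, b in zip(seg, seg[1:]) if a * b < 0) / (len(seg) - 1)

def short_time_energy(seg):
    # Sum of squared amplitudes over the segment
    return sum(x * x for x in seg)

def spectral_flatness(seg):
    # Geometric mean / arithmetic mean of the power spectrum (naive DFT).
    # Close to 1 for white noise (flat spectrum), close to 0 for tonal/speech.
    n = len(seg)
    power = []
    for k in range(1, n // 2):  # skip the DC bin
        s = sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2 + 1e-12)
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    return geo / (sum(power) / len(power))

def time_domain_entropy(seg, bins=16):
    # Shannon entropy of the normalized amplitude histogram: a concentrated
    # (speech-like) distribution gives low entropy, scattered noise high entropy
    lo, hi = min(seg), max(seg)
    if hi == lo:
        return 0.0
    hist = [0] * bins
    for x in seg:
        hist[min(int((x - lo) / (hi - lo) * bins), bins - 1)] += 1
    return -sum(c / len(seg) * math.log2(c / len(seg)) for c in hist if c)
```

As the surrounding text predicts, a pure tone has low spectral flatness while an impulse (whose spectrum is flat) scores near 1.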
  • the manner of detecting the target voice segment from the plurality of audio segments of the audio signal according to the audio feature of the audio segment may include, but is not limited to, determining whether the audio feature of the audio segment meets a predetermined threshold condition; When the audio feature of the audio segment satisfies a predetermined threshold condition, the audio segment is detected as the target speech segment.
  • the current audio segment to be judged may be acquired from the plurality of audio segments in at least one of the following orders: 1) the input order of the audio signal; 2) a predetermined order. The predetermined order may be a random order, or an order arranged according to a predetermined principle, for example by the size of the audio segments. The above is only an example, and this embodiment does not limit it.
  • the predetermined threshold condition described above may be, but is not limited to, adaptively updated and adjusted as the scene changes. By continually updating the predetermined threshold conditions that the audio features are compared against, target speech segments can be accurately detected from the multiple audio segments under the different scenes encountered during detection. Further, for the multiple features of the audio segment in multiple domains, checking whether each feature meets its corresponding predetermined threshold condition subjects the audio segment to multiple rounds of judgment and screening, ensuring that the target segment is detected accurately.
  • Detecting the target speech segment in the audio segment includes repeating the following steps until the current audio segment is the last audio segment of the plurality of audio segments, wherein the current audio segment is initialized to the first audio segment of the plurality of audio segments :
  • S4 Determine whether the current audio segment is the last audio segment of the plurality of audio segments, and if not, use the next audio segment of the current audio segment as the current audio segment.
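The traversal described above (initialize the current segment to the first, advance until the last) can be sketched as follows; `is_target` is a stand-in for the per-segment threshold judgment described later:

```python
def detect_segments(segments, is_target):
    """Traverse the plurality of audio segments in input order, starting
    from the first and stopping after the last, collecting the indices of
    segments accepted by the per-segment check."""
    targets = []
    for i, segment in enumerate(segments):  # current segment advances until last
        if is_target(segment):
            targets.append(i)
    return targets
```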
  • the predetermined threshold condition may be, but is not limited to, being updated based on at least an audio feature of the current audio segment to obtain an updated predetermined threshold condition. That is to say, when the above predetermined threshold condition is updated, the predetermined threshold condition required for the next audio segment is determined based on the audio characteristics of the current audio segment (historical audio segment), thereby making the detection process for the audio segment more accurate.
  • the method further includes:
  • multiple audio segments are subjected to noise suppression processing to avoid interference of noise on the voice signal.
  • minimum mean-square error log-spectral amplitude (MMSE-LSA) estimation is used to eliminate the background noise of the audio signal.
  • the foregoing first N audio segments may be, but are not limited to, audio segments containing no voice input. That is, before the human-computer interaction process starts, an initialization operation is performed: the noise suppression model is constructed from audio segments without voice input, and the initial predetermined threshold conditions used for judging the audio features are obtained.
  • the above initial predetermined threshold conditions may be, but are not limited to, determined according to the average values of the audio features of the first N audio segments.
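A minimal sketch of deriving initial thresholds from the mean features of the first N segments. The feature names and scaling factors are illustrative assumptions, not values from the patent:

```python
def initial_thresholds(first_n_features, scale=None):
    """Derive initial predetermined threshold conditions from the average
    audio features of the first N audio segments (assumed to contain no
    voice input). `first_n_features` is a list of per-segment feature
    dicts; the scaling factors are illustrative."""
    scale = scale or {"zcr": 1.5, "energy": 2.0, "flatness": 0.8, "entropy": 0.9}
    n = len(first_n_features)
    thresholds = {}
    for name, factor in scale.items():
        mean = sum(f[name] for f in first_n_features) / n
        thresholds[name] = factor * mean  # threshold proportional to noise floor
    return thresholds
```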
  • before extracting the audio feature in each audio segment, the method further includes: performing a second quantization on the collected audio signal to be detected, where the quantization level of the second quantization is smaller than the quantization level of the first quantization.
  • the first quantization may be performed, but not limited to, when the audio signal is acquired; the second quantization may be performed, but not limited to, after the noise suppression processing is performed.
  • the larger the quantization level, the more sensitive quantization is to interference: when the quantization level is large, the quantization interval is small, so even a small noise signal is quantized, and the quantized result contains both the speech signal and noise signals, which greatly interferes with speech signal detection.
  • the second quantization is implemented by adjusting the quantization level, that is, the quantization level of the second quantization is smaller than the quantization level of the first quantization, thereby performing secondary filtering on the noise signal to achieve the effect of reducing interference.
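The two-stage quantization can be illustrated with a uniform quantizer; the level counts below are illustrative, not from the patent:

```python
def quantize(samples, levels, lo=-1.0, hi=1.0):
    # Uniform quantization of samples in [lo, hi] to `levels` levels.
    # Fewer levels -> a larger quantization interval, so small noise
    # fluctuations collapse onto the same quantized value.
    step = (hi - lo) / (levels - 1)
    return [lo + round((min(max(x, lo), hi) - lo) / step) * step
            for x in samples]

# First quantization at acquisition: a fine level keeps detail (and noise);
# second quantization: a coarser level filters residual low-level noise.
fine = quantize([0.03, -0.02, 0.9], 201)   # small noise values survive
coarse = quantize([0.03, -0.02, 0.9], 3)   # small noise values collapse to 0.0
```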
  • dividing the audio signal to be detected into multiple audio segments may include, but is not limited to, sampling the collected audio signal with a fixed-length window. The fixed-length window is kept small, for example a window of 256 samples. Dividing the audio signal with a small window allows processing results to be returned in real time, completing real-time detection of the speech signal.
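The fixed-window division can be sketched as follows (window length 256 as in the text; this sketch simply drops a trailing remainder shorter than one window):

```python
def frame_signal(signal, window_len=256):
    """Divide the collected audio signal into audio segments using a
    fixed-length window; each segment can then be processed and its
    result returned in real time."""
    return [signal[i:i + window_len]
            for i in range(0, len(signal) - window_len + 1, window_len)]
```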
  • the target speech segment is thus accurately detected, reducing the interference of noise signals in the audio segments with the voice detection process, improving the accuracy of voice detection, and overcoming the low detection accuracy caused by related-art approaches that detect voice using only a single feature.
  • detecting the target speech segment from the audio segment according to the audio characteristics of the audio segment includes:
  • the audio features of the current audio segment x(i) in the N audio segments can be obtained by the following formula:
  • where h[i] is the window function used.
  • Figure 4 shows the original audio signal with impulse noise: the middle band (the signal between 50,000 and 150,000 on the horizontal axis) contains some impulse noise, and the speech signal is the last segment (after 230,000 on the horizontal axis).
  • Figure 5 shows the signal zero-crossing rate extracted separately from the original audio signal. Based on the zero-crossing-rate feature, the impulse noise in the middle band (the signal between 50,000 and 150,000 on the horizontal axis) can be filtered out directly, but low-energy non-impulse noise (the signal between 210,000 and 220,000 on the horizontal axis) cannot be distinguished;
  • Figure 6 shows the short-time energy extracted separately from the original audio signal. Based on the short-time energy feature, the low-energy non-impulse noise (the signal between 210,000 and 220,000 on the horizontal axis) can be filtered out, but the impulse noise in the middle band (the signal between 50,000 and 150,000 on the horizontal axis, whose pulse amplitude is also relatively large) cannot be distinguished;
  • Figure 7 shows the spectral flatness and the signal information entropy extracted from the original audio signal; each of these two features can detect both the speech signal and the impulse noise, and can retain all types of speech signals to the greatest extent.
  • by combining the above multiple features in the time domain and the frequency domain in the voice detection process, the interference of impulse noise and background noise is resisted and robustness is enhanced, so that the target speech segment is accurately detected among the plurality of audio segments divided from the audio signal to be detected; the start time and the end time of the speech signal corresponding to the target speech segments are thereby accurately obtained, realizing natural human-machine interaction.
  • detecting the target speech segment from the audio segment according to the audio characteristics of the audio segment includes:
  • the foregoing predetermined threshold condition may be, but is not limited to, adaptively updated and adjusted as the scene changes.
  • the audio segments are obtained from the plurality of audio segments in the input order of the audio signal, to determine whether the audio feature of each audio segment satisfies the predetermined threshold condition.
  • the predetermined threshold condition described above may be, but is not limited to, updated based at least on the audio characteristics of the current audio segment. That is, when it is necessary to update the predetermined threshold condition, the next updated predetermined threshold condition is acquired based on the current audio segment (history audio segment).
  • the above-mentioned judging process will be repeatedly performed for each audio segment until the plurality of audio segments divided by the audio signal to be detected are traversed. That is, until the current audio segment is the last audio segment of the plurality of audio segments.
  • the predetermined threshold conditions compared against the audio features are continuously updated, ensuring that target speech segments are accurately detected from the plurality of audio segments under the different scenes encountered during detection. Further, for the multiple features of the audio segment in multiple domains, checking whether each corresponding predetermined threshold condition is met subjects the audio segment to multiple rounds of judgment and screening, ensuring that the target segment is detected accurately.
  • determining whether the audio features of the current audio segment meet the predetermined threshold conditions includes: S11, determining whether the signal zero-crossing rate of the current audio segment in the time domain is greater than a first threshold; when the signal zero-crossing rate of the current audio segment is greater than the first threshold, determining whether the short-time energy of the current audio segment in the time domain is greater than a second threshold; when the short-time energy of the current audio segment is greater than the second threshold, determining whether the spectral flatness of the current audio segment in the frequency domain is less than a third threshold; and when the spectral flatness of the current audio segment is less than the third threshold, determining whether the signal information entropy of the current audio segment in the time domain is less than a fourth threshold;
  • detecting that the current audio segment is the target speech segment includes: S21, when it is determined that the signal information entropy of the current audio segment is less than the fourth threshold, detecting that the current audio segment is the target speech segment.
  • the foregoing process of detecting a target voice segment according to multiple features of the current audio segment in the time domain and the frequency domain may be performed, but not limited to, after performing the second quantization on the audio signal. This embodiment does not limit this.
  • Signal zero-crossing rate: obtain the signal zero-crossing rate of the current audio segment in the time domain. The zero-crossing rate represents the number of times the waveform of the audio signal crosses the zero axis; in general, the zero-crossing rate of a speech signal is larger than that of a non-speech signal;
  • Short-time energy: obtain the short-time energy of the current audio segment from its amplitude in the time domain. The short-time energy is used to distinguish non-speech signals from speech signals by signal energy; in general, the short-time energy of a speech signal is greater than that of a non-speech signal;
  • Spectral flatness: perform a Fourier transform on the current audio segment and calculate its spectral flatness. The frequency distribution of a speech signal is concentrated, so its spectral flatness is small; the frequency distribution of a Gaussian white noise signal is relatively scattered, so its spectral flatness is larger;
  • Signal information entropy: normalize the current audio segment and then calculate its signal information entropy. The distribution of a speech signal is concentrated, so its information entropy is small; a non-speech signal, especially Gaussian white noise, is relatively scattered, so its information entropy is relatively large.
• S904: determine whether the signal zero-crossing rate of the current audio segment is greater than the first threshold. If it is, perform the next determination; if the signal zero-crossing rate of the current audio segment is less than or equal to the first threshold, the current audio segment is directly determined to be a non-target speech segment;
• S906: determine whether the short-time energy of the current audio segment is greater than the second threshold. If it is, perform the next determination; if the short-time energy of the current audio segment is less than or equal to the second threshold, the current audio segment is directly determined to be a non-target speech segment, and the second threshold is updated according to the short-time energy of the current audio segment;
• S908: determine whether the spectral flatness of the current audio segment is less than the third threshold. If it is, perform the next determination; if the spectral flatness of the current audio segment is greater than or equal to the third threshold, the current audio segment is directly determined to be a non-target speech segment, and the third threshold is updated according to the spectral flatness of the current audio segment;
• S910: determine whether the signal information entropy of the current audio segment is less than the fourth threshold; if it is greater than or equal to the fourth threshold, the current audio segment is directly determined to be a non-target speech segment, and the fourth threshold is updated according to the signal information entropy of the current audio segment;
• when it is determined in step S910 that all four of the above features satisfy their corresponding predetermined threshold conditions, the current audio segment is determined to be the target voice segment.
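The decision cascade of steps S904-S910 can be sketched as a fail-fast sequence of comparisons. The threshold values below are made-up placeholders for illustration, not values from the patent:

```python
def is_target_speech(features, thresholds):
    # features/thresholds: dicts holding the four per-segment measurements.
    # Each check mirrors one step of the cascade: fail fast on the first
    # feature that misses its threshold.
    if features["zcr"] <= thresholds["zcr_min"]:            # S904
        return False
    if features["energy"] <= thresholds["energy_min"]:      # S906
        return False
    if features["flatness"] >= thresholds["flatness_max"]:  # S908
        return False
    if features["entropy"] >= thresholds["entropy_max"]:    # S910
        return False
    return True  # all four predetermined threshold conditions satisfied

# Placeholder thresholds and two example segments.
thresholds = {"zcr_min": 10, "energy_min": 0.5, "flatness_max": 0.3, "entropy_max": 3.0}
speechy = {"zcr": 25, "energy": 4.2, "flatness": 0.1, "entropy": 1.8}
noisy   = {"zcr": 25, "energy": 4.2, "flatness": 0.7, "entropy": 5.1}
```

Here `speechy` passes all four checks while `noisy` is rejected at the spectral-flatness step, so later checks are never evaluated.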
  • the target speech segment is accurately detected from the plurality of audio segments by combining multiple features of the audio segment in different domains, so as to reduce interference of the noise signal in the audio segment to the speech detection process, thereby improving The purpose of the accuracy of voice detection.
• updating the predetermined threshold condition according to at least the audio feature of the current audio segment includes: updating the second threshold according to at least the short-time energy of the current audio segment; updating the third threshold according to at least the spectral flatness of the current audio segment; and updating the fourth threshold according to at least the signal information entropy of the current audio segment.
• where, in each of the threshold update formulas, a represents the attenuation coefficient;
• for the second threshold: B represents the short-time energy of the current audio segment, A' represents the second threshold before updating, and A represents the updated second threshold;
• for the third threshold: B represents the spectral flatness of the current audio segment, A' represents the third threshold before updating, and A represents the updated third threshold;
• for the fourth threshold: B represents the signal information entropy of the current audio segment, A' represents the fourth threshold before updating, and A represents the updated fourth threshold.
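The update formulas themselves appear only as figures in the source. A form consistent with the variable definitions above (an attenuation coefficient a blending the old threshold A' with the current feature B) is exponential smoothing, A = a·A' + (1−a)·B; treat this exact form as an assumption:

```python
def update_threshold(a, prev_threshold, feature_value):
    # Assumed smoothing form: the new threshold A blends the previous
    # threshold A' with the current segment's feature value B, weighted
    # by the attenuation coefficient a. The patent's exact formula is
    # given only as a figure, so this is an illustrative reconstruction.
    return a * prev_threshold + (1 - a) * feature_value

# e.g. adapting the second (short-time energy) threshold after a
# low-energy non-target segment:
new_energy_threshold = update_threshold(0.9, 1.0, 0.2)  # 0.9*1.0 + 0.1*0.2
```

With a close to 1 the threshold tracks its history and adapts slowly; with a = 1 it never changes.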
• the predetermined threshold condition required for the next audio segment is determined according to the audio features of the current audio segment (a historical audio segment), thereby making the target voice detection process more accurate.
  • the predetermined threshold conditions for comparison with the audio features are continuously updated to ensure that the target speech segments are accurately detected from the plurality of audio segments according to different scenes during the detection process.
  • the method further includes:
  • S1 Determine a start time and a stop time of the continuous voice segment formed by the target voice segment according to the position of the target voice segment in the plurality of audio segments.
  • the foregoing voice segment may include, but is not limited to, one target voice segment, or multiple consecutive target voice segments.
  • Each target speech segment includes a start time of the target speech segment and a termination time of the target speech segment.
• the timestamps of the target voice segments, such as the start time and the termination time of each target voice segment, may be used to obtain the start time and end time of the speech segment composed of the target speech segments.
  • determining a start time and a termination time of the continuous voice segment formed by the target voice segment according to the location of the target voice segment in the multiple audio segments includes:
  • the foregoing K is an integer greater than or equal to 1.
  • the M may be set to different values according to different scenarios, which is not limited in this embodiment.
  • the target speech segments detected from multiple (for example, 20) audio segments are: P1-P5, P7-P8, P10, P17-P20 . Further, assume that M is 5.
• the first five target speech segments are continuous; P5 and P7 enclose one non-target speech segment (P6), P8 and P10 enclose one non-target speech segment (P9), and P10 and P17 enclose six non-target speech segments (i.e., P11-P16).
• according to the first K (that is, the first 5) consecutive target speech segments, a speech segment A containing a speech signal is detected from the audio signal to be detected, wherein the start time of speech segment A is the start time of the first of these target speech segments, i.e., the start time of P1.
• the voice segment A is not terminated at the single non-target speech segments P6 and P9, since fewer than M consecutive non-target speech segments occur there.
• the speech segment A is terminated at the start time of the first non-target speech segment (i.e., the start time of P11) among the continuous non-target speech segments P11-P16, and the start time of P11 is used as the termination time of speech segment A. That is to say, the start time of speech segment A is the start time 0 of P1, and its termination time is the start time 10T of P11.
• the above-described continuous target speech segments P17-P20 will be used in the detection of the next speech segment B.
  • the detection process can be performed by referring to the above process, and details are not described herein again in this embodiment.
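The worked example above (K consecutive target segments open a speech segment, M consecutive non-target segments close it) can be sketched as follows. Segments are 0-indexed here, so P1 is index 0; K = M = 5 as in the example:

```python
def speech_boundaries(labels, k, m):
    # labels: per-segment booleans, True = target speech segment.
    # A speech segment starts at the first of k consecutive target
    # segments; it ends at the first of m consecutive non-target
    # segments (the returned end index is that first non-target one).
    boundaries = []
    start = None
    run_true = run_false = 0
    for i, is_target in enumerate(labels):
        if is_target:
            run_true, run_false = run_true + 1, 0
            if start is None and run_true >= k:
                start = i - k + 1
        else:
            run_false, run_true = run_false + 1, 0
            if start is not None and run_false >= m:
                boundaries.append((start, i - m + 1))
                start = None
    if start is not None:  # segment still open at end of input
        boundaries.append((start, len(labels)))
    return boundaries

# The text's example: targets P1-P5, P7-P8, P10, P17-P20 among 20
# segments (0-based indices: 0-4, 6-7, 9, 16-19).
labels = [True] * 5 + [False] + [True] * 2 + [False] + [True] + [False] * 6 + [True] * 4
```

Running `speech_boundaries(labels, 5, 5)` reproduces the example: segment A spans from P1's start (index 0) to P11's start (index 10); the isolated non-targets P6 and P9 do not terminate it, and P17-P20 (only four consecutive targets) have not yet opened segment B.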
• the audio signal to be detected may be acquired in real time so as to detect whether each audio segment in the audio signal is a target speech segment, and thereby accurately detect the start time and the termination time of the speech segment formed by the target speech segments; the human-computer interaction device can then respond accurately according to the voice information expressed in the complete voice segment, realizing human-computer interaction.
  • the above detection step may be repeatedly performed for the voice detection. This description is not repeated here.
• when the target speech segment is accurately detected, the human-machine interaction device can quickly determine, in real time, the start time and the end time of the speech segment formed by the target speech segments, so that the device responds accurately and in real time to the detected voice information, achieving the effect of natural human-machine interaction.
• the human-machine interaction device can also improve the efficiency of human-computer interaction by accurately detecting the start time and the end time of the voice signal corresponding to the target voice segment, thereby overcoming the problem in the related art of low human-computer interaction efficiency caused by requiring the interacting person to press a control button to trigger the human-computer interaction process.
  • the method further includes:
• a noise suppression model is constructed based on the first N audio segments in the following manner: assuming that the audio signal consists of a pure speech signal plus independent Gaussian white noise, a Fourier transform is performed on the background noise of the first N audio segments to obtain the frequency-domain information of the signal; according to the frequency-domain information of the background noise, the frequency-domain logarithmic features of the noise are estimated to construct the noise suppression model. Further, for the (N+1)th and subsequent audio segments, noise estimation based on the above noise suppression model may be used to implement noise cancellation processing on the audio signal.
  • the noise suppression model is constructed by the audio segment without voice input, and an initial predetermined threshold condition for determining the audio feature is obtained.
  • the above initial predetermined threshold condition may be, but is not limited to, determined according to an average value of audio features of the first N audio segments.
  • the first N audio segments of the plurality of audio segments are used to implement the initialization operation of the human-computer interaction, such as constructing a noise suppression model to perform noise suppression processing on multiple audio segments to avoid noise-to-speech signals. Interference. For example, an initial predetermined threshold condition for determining an audio feature is obtained to facilitate voice detection of a plurality of audio segments.
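A minimal sketch of this initialization, assuming the first N segments are speech-free: average their magnitude spectra as a noise profile, then suppress later segments by simple spectral subtraction. Spectral subtraction here is a stand-in for the minimum mean-square-error log-spectral-amplitude estimator the text mentions, chosen for brevity:

```python
import numpy as np

def build_noise_profile(noise_frames):
    # Average magnitude spectrum of the first N (speech-free) frames.
    spectra = [np.abs(np.fft.rfft(f)) for f in noise_frames]
    return np.mean(spectra, axis=0)

def suppress_noise(frame, noise_profile, floor=0.05):
    # Subtract the estimated noise magnitude from the frame's spectrum,
    # keep a small spectral floor, and resynthesize with the original phase.
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    clean_mag = np.maximum(mag - noise_profile, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=len(frame))

rng = np.random.default_rng(1)
# First N = 20 segments: background noise only.
noise_frames = [0.1 * rng.standard_normal(256) for _ in range(20)]
profile = build_noise_profile(noise_frames)
# An (N+1)th segment: a tone buried in the same noise.
noisy = np.sin(2 * np.pi * np.arange(256) * 0.05) + 0.1 * rng.standard_normal(256)
cleaned = suppress_noise(noisy, profile)
```

Because every spectral magnitude is reduced (or floored), the suppressed frame has lower energy than the noisy input while the tonal component is largely preserved.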
  • the first quantization may be performed, but not limited to, when the audio signal is acquired; the second quantization may be performed, but not limited to, after the noise suppression processing is performed.
• the larger the quantization level, the more sensitive the quantization is to interference; that is, even small noise signals are retained and can interfere with the voice signal. By adjusting the quantization level, the noise is filtered a second time, achieving the effect of secondary filtering of the interference.
• for example, 16 bits are used in the first quantization, and 8 bits, i.e., the range [-128, 127], are used in the second quantization, so that speech signals and noise are accurately distinguished by this second filtering.
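As an illustrative sketch of the two-stage quantization (the text fixes only the bit widths, so the mapping rule here is an assumption): a 16-bit sample is mapped to 8 bits by dividing by the step size 256, so amplitudes below one 8-bit step collapse to zero, which is the secondary filtering of low-level noise:

```python
def requantize_16_to_8(samples_16bit):
    # Second quantization: 16-bit samples in [-32768, 32767] are mapped
    # to the 8-bit range [-128, 127]. One 8-bit step spans 256 16-bit
    # levels, so small noise amplitudes truncate to 0.
    return [max(-128, min(127, int(s / 256))) for s in samples_16bit]

# Weak noise samples (|s| < 256) disappear; strong samples survive coarsely.
coarse = requantize_16_to_8([200, -200, 12800, 32767, -32768])
```

Here the ±200 noise samples quantize to 0 while the strong sample 12800 keeps a coarse value of 50.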
• the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, but in many cases the former is the better implementation. The technical solution of the present invention, in essence or in the part contributing to the related art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or CD-ROM), including a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to perform the methods described in the various embodiments of the present invention.
  • a voice detecting apparatus for implementing the voice detecting method is further provided. As shown in FIG. 10, the apparatus includes:
  • a dividing unit 1002 configured to divide the audio signal to be detected into a plurality of audio segments
  • an extracting unit 1004 configured to extract an audio feature in each audio segment, wherein the audio feature includes at least a time domain feature and a frequency domain feature of the audio segment;
  • the detecting unit 1006 is arranged to detect the target speech segment from the audio segment based on the audio characteristics of the audio segment.
• the voice detecting device may be, but is not limited to, applied in at least one of the following scenarios: an intelligent robot chat system, an automatic question answering system, human-machine chat software, and the like. That is to say, the voice detecting device provided in this embodiment is applied in the human-computer interaction process, and audio features of each audio segment, including the time-domain features and frequency-domain features of the audio segment, are extracted to accurately detect the target speech segment among the plurality of audio segments divided from the audio signal to be detected, so that the human-computer interaction device can learn the start time and end time of the voice segment composed of target speech segments, and can make an accurate reply after obtaining the complete voice message to be expressed.
• the voice segment may include, but is not limited to, one target voice segment or multiple consecutive target voice segments. Each target speech segment includes a start time and a termination time. This embodiment does not limit this.
• the human-machine interaction device divides the audio signal to be detected into multiple audio segments and extracts audio features from each audio segment, where the audio features include at least the time-domain features and frequency-domain features of the audio segment; by fusing multiple features of the audio segment in different domains, the target speech segment is accurately detected from the plurality of audio segments, reducing the interference of noise signals with the speech detection process and improving the accuracy of speech detection, thereby overcoming the problem of low detection accuracy caused by detecting speech with only a single feature in the related art.
• the human-machine interaction device can also quickly determine, in real time, the start time and end time of the speech segment composed of target speech segments, so that it reacts accurately and in real time to the detected voice information, achieving natural human-machine interaction.
• the human-computer interaction device can also improve the efficiency of human-computer interaction by accurately detecting the start time and end time of the speech segment formed by the target speech segments, thereby overcoming the problem in the related art of low human-computer interaction efficiency caused by requiring the interacting person to press a control button to trigger the human-computer interaction process.
• the foregoing audio features may include, but are not limited to, at least one of: a signal zero-crossing rate in the time domain, short-time energy in the time domain, spectral flatness in the frequency domain, signal information entropy in the time domain, autocorrelation coefficients, wavelet-transformed signals, signal complexity, and the like.
• 1) the signal zero-crossing rate may be, but is not limited to, used to remove the interference of some impulse noise;
• 2) the short-time energy may be, but is not limited to, used to measure the amplitude of the audio signal and, compared against a threshold, to remove the interference of low-energy noise;
• 3) the spectral flatness may be, but is not limited to, used to calculate the frequency-distribution characteristics of the signal in the frequency domain and to determine, according to its magnitude, whether the audio signal is background Gaussian white noise;
• 4) the signal time-domain information entropy may be, but is not limited to, used to measure the distribution characteristics of the audio signal in the time domain, which distinguishes speech signals from general noise.
• by combining the above multiple features in the time domain and the frequency domain in the voice detection process, the interference of impulse and background noise is resisted and robustness is enhanced, so that the target voice segment is accurately detected among the plurality of audio segments divided from the audio signal to be detected, and the start time and end time of the voice segment formed by the target voice segments are accurately obtained, realizing natural human-machine interaction.
  • the manner of detecting the target voice segment from the plurality of audio segments of the audio signal according to the audio feature of the audio segment may include, but is not limited to, determining whether the audio feature of the audio segment meets a predetermined threshold condition; When the audio feature of the audio segment satisfies a predetermined threshold condition, the audio segment is detected as the target speech segment.
• the current audio segment used for making the determination may be acquired from the plurality of audio segments in at least one of the following orders: 1) the input order of the audio signal; 2) a predetermined order.
  • the predetermined order may be a random order, or may be an order arranged according to a predetermined principle, for example, according to the size order of the audio segments. The above is only an example, and is not limited in this embodiment.
  • the predetermined threshold condition described above may be, but is not limited to, adaptive update adjustment based on the changed scene.
  • predetermined threshold conditions for comparison with audio features, it is ensured that target speech segments are accurately detected from a plurality of audio segments according to different scenes during the detection process. Further, for the plurality of features of the audio segment in the plurality of domains, determining whether the corresponding predetermined threshold condition is met, respectively, to perform multiple determination and screening on the audio segment, thereby ensuring accurate detection of the target segment.
  • Detecting the target speech segment in the audio segment includes repeating the following steps until the current audio segment is the last audio segment of the plurality of audio segments, wherein the current audio segment is initialized to the first audio segment of the plurality of audio segments :
• when the audio feature of the current audio segment satisfies the predetermined threshold condition, detecting that the current audio segment is the target speech segment;
  • S4 Determine whether the current audio segment is the last audio segment of the plurality of audio segments, and if not, use the next audio segment of the current audio segment as the current audio segment.
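The repeated steps above amount to a single pass over the segments with an adaptive threshold. A generic sketch follows; the single-feature toy classifier and smoothing update rule are illustrative stand-ins, not the patent's own functions, and (as in steps S906/S908) the threshold is updated when a segment fails its check:

```python
def detect_target_segments(segments, classify, update, thresholds):
    # One pass, segment by segment (steps S1-S4): classify the current
    # segment with the current thresholds; on failure, adapt the
    # thresholds from that segment's features; then move to the next.
    target_indices = []
    for i, features in enumerate(segments):
        if classify(features, thresholds):
            target_indices.append(i)
        else:
            thresholds = update(thresholds, features)
    return target_indices

# Toy example: one feature (energy) against one adaptive threshold.
segs = [0.1, 0.2, 5.0, 6.0, 0.1]
result = detect_target_segments(
    segs,
    classify=lambda energy, th: energy > th,
    update=lambda th, energy: 0.9 * th + 0.1 * energy,
    thresholds=1.0,
)
```

In this toy run the two high-energy segments (indices 2 and 3) are detected as targets while the low-energy ones adapt the threshold downward.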
  • the predetermined threshold condition may be, but is not limited to, being updated based on at least an audio feature of the current audio segment to obtain an updated predetermined threshold condition. That is to say, when the above predetermined threshold condition is updated, the predetermined threshold condition required for the next audio segment is determined based on the audio characteristics of the current audio segment (historical audio segment), thereby making the detection process for the audio segment more accurate.
  • the foregoing apparatus further includes:
  • the first obtaining unit is configured to acquire the first N audio segments of the plurality of audio segments after dividing the audio signal to be detected into a plurality of audio segments, where N is an integer greater than one;
  • a building unit configured to construct a noise suppression model according to the first N audio segments, wherein the noise suppression model is configured to perform noise suppression processing on the N+1th audio segment and the audio segment after the plurality of audio segments;
  • the second obtaining unit is configured to acquire an initial predetermined threshold condition according to the first N audio segments.
  • multiple audio segments are subjected to noise suppression processing to avoid interference of noise on the voice signal.
  • the minimum mean square error log spectral amplitude estimation is used to eliminate the background noise of the audio signal.
  • the foregoing first N audio segments may be, but are not limited to, audio segments that are not voice input. That is to say, before the human-computer interaction process is started, an initialization operation is performed, the noise suppression model is constructed by the audio segment without voice input, and an initial predetermined threshold condition for determining the audio feature is obtained.
  • the above initial predetermined threshold condition may be, but is not limited to, determined according to an average value of audio features of the first N audio segments.
• before extracting the audio feature in each audio segment, the method further includes: performing a second quantization on the collected audio signal to be detected, where the quantization level of the second quantization is smaller than the quantization level of the first quantization.
  • the first quantization may be performed, but not limited to, when the audio signal is acquired; the second quantization may be performed, but not limited to, after the noise suppression processing is performed.
• the larger the quantization level, the more sensitive the quantization is to interference; that is, when the quantization level is large, the quantization interval is small, so even a small noise signal is quantized, and the quantized result includes both the speech signal and the noise signal, which greatly interferes with speech signal detection.
  • the second quantization is implemented by adjusting the quantization level, that is, the quantization level of the second quantization is smaller than the quantization level of the first quantization, thereby performing secondary filtering on the noise signal to achieve the effect of reducing interference.
• dividing the audio signal to be detected into multiple audio segments may include, but is not limited to, sampling the collected audio signal with a fixed-length window.
• the length of the fixed-length window is small; for example, the window length used is 256 (samples). That is, the audio signal is divided with a small window, so that processing results can be returned in real time and real-time detection of the speech signal is achieved.
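Dividing the signal with a fixed-length window of 256 samples can be sketched as follows (how a trailing partial window is handled is not specified in the text; here it is simply dropped):

```python
def split_into_segments(signal, window_len=256):
    # Divide the sampled signal into consecutive fixed-length windows.
    # A short window lets each segment be processed and its result
    # returned in near real time; any incomplete tail is discarded.
    return [signal[i:i + window_len]
            for i in range(0, len(signal) - window_len + 1, window_len)]

segments = split_into_segments(list(range(1000)))
```

For a 1000-sample input this yields three full 256-sample segments; the 232-sample remainder is dropped under the stated assumption.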
  • the audio signal to be detected is divided into a plurality of audio segments, and audio features in each audio segment are extracted, wherein the audio features include at least a time domain feature and a frequency domain feature of the audio segment.
  • the detecting unit 1006 includes:
• the determining module is configured to determine whether the audio feature of the current audio segment meets a predetermined threshold condition, wherein the audio features of the current audio segment include: the signal zero-crossing rate of the current audio segment in the time domain, the short-time energy of the current audio segment in the time domain, the spectral flatness of the current audio segment in the frequency domain, and the signal information entropy of the current audio segment in the time domain;
  • the detecting module is configured to detect that the current audio segment is the target voice segment when the audio feature of the current audio segment satisfies a predetermined threshold condition.
• the audio features of the current audio segment x(i) among the N audio segments can be obtained by the corresponding formulas, in which h[i] denotes the window function used.
• Figure 4 shows the original audio signal with impulse noise: the middle band (the signal between 50,000 and 150,000 on the horizontal axis) contains some impulse noise, and the speech signal is the last segment (around 230,000 on the horizontal axis).
• Figure 5 shows the signal zero-crossing rate extracted separately from the original audio signal. According to the signal zero-crossing rate characteristic, impulse noise such as that in the middle band (the signal between 50,000 and 150,000 on the horizontal axis) can be directly filtered out, but low-energy non-impulse noise (the signal between 210,000 and 220,000 on the horizontal axis) cannot be distinguished;
• Figure 6 shows the short-time energy extracted separately from the original audio signal. It can be seen that low-energy non-impulse noise (the signal between 210,000 and 220,000 on the horizontal axis) can be filtered out according to the short-time energy characteristic, but the impulse noise in the middle band (the signal between 50,000 and 150,000 on the horizontal axis) cannot be distinguished, since the pulse signal also has relatively large energy;
• Figure 7 shows the spectral flatness and the signal information entropy extracted from the original audio signal; both of these features can detect the speech signal as well as the impulse noise, and can retain all types of speech signals to the greatest extent.
• combining the above four features, for example the short-time energy and signal zero-crossing rate characteristics, makes it possible to distinguish the interference of impulse noise and other low-energy noise and to detect the actual speech signal. It can be seen from the signals shown in the above figures that the audio features extracted in this embodiment are more conducive to accurately detecting the target speech segment.
• in the voice detection process, combining the above multiple features in the time domain and the frequency domain resists the interference of impulse and background noise and enhances robustness, so that the target speech segment is accurately detected among the plurality of audio segments divided from the audio signal to be detected; the start time and end time of the speech signal corresponding to the target speech segment are thereby accurately obtained, realizing natural human-machine interaction.
  • the detecting unit 1006 includes:
  • the judging module is arranged to repeat the following steps until the current audio segment is the last audio segment of the plurality of audio segments, wherein the current audio segment is initialized to the first audio segment of the plurality of audio segments:
  • S4 Determine whether the current audio segment is the last audio segment of the plurality of audio segments, and if not, use the next audio segment of the current audio segment as the current audio segment.
  • the foregoing predetermined threshold condition may be, but is not limited to, adaptive update adjustment according to the changed scenario.
• in the case that audio segments are acquired from the plurality of audio segments in the input order of the audio signal in order to determine whether the audio feature of each audio segment satisfies a predetermined threshold condition, the predetermined threshold condition may be, but is not limited to, updated based on at least the audio features of the current audio segment. That is, when the predetermined threshold condition needs to be updated, the next updated predetermined threshold condition is acquired based on the current audio segment (a historical audio segment).
  • the above-mentioned judging process will be repeatedly performed for each audio segment until the plurality of audio segments divided by the audio signal to be detected are traversed. That is, until the current audio segment is the last audio segment of the plurality of audio segments.
  • the predetermined threshold conditions for comparison with the audio features are continuously updated to ensure that the target speech segments are accurately detected from the plurality of audio segments according to different scenes during the detection process. Further, for multiple features of the audio segment in multiple domains, it is determined whether the audio segment is subjected to multiple determination and screening by determining whether the corresponding predetermined threshold condition is met, thereby ensuring that an accurate target segment is detected.
  • the determining module comprises: (1) a determining sub-module, configured to determine whether a signal zero-crossing rate of the current audio segment in the time domain is greater than a first threshold; and when a signal zero-crossing rate of the current audio segment is greater than a first threshold, determining the current Whether the short-term energy of the audio segment in the time domain is greater than a second threshold; when the short-term energy of the current audio segment is greater than the second threshold, determining whether the spectral flatness of the current audio segment in the frequency domain is less than a third threshold; When the spectral flatness in the frequency domain is less than the third threshold, determining whether the signal entropy of the current audio segment in the time domain is less than a fourth threshold;
  • the detection module comprises: (1) a detection sub-module, configured to detect that the current audio segment is the target speech segment when it is determined that the signal information entropy of the current audio segment is less than a fourth threshold.
  • the foregoing process of detecting a target voice segment according to multiple features of the current audio segment in the time domain and the frequency domain may be performed, but not limited to, after performing the second quantization on the audio signal. This embodiment does not limit this.
• Signal zero-crossing rate: obtain the signal zero-crossing rate of the current audio segment in the time domain. The zero-crossing rate indicates the number of times the waveform crosses the zero axis in an audio signal. In general, the zero-crossing rate of a speech signal is larger than that of a non-speech signal;
• Short-time energy: obtain the time-domain energy of the current audio segment from its time-domain amplitude. Short-time energy is used to distinguish non-speech signals from speech signals by signal energy; in general, the short-time energy of a speech signal is greater than that of a non-speech signal;
• Spectral flatness: perform a Fourier transform on the current audio segment and calculate its spectral flatness. The frequency distribution of a speech signal is concentrated, so its spectral flatness is small; the frequency distribution of a Gaussian white noise signal is relatively scattered, so its spectral flatness is larger;
• Signal information entropy: calculate the signal information entropy after normalizing the current audio segment. The distribution of a speech signal is concentrated, so its signal information entropy is small; a non-speech signal, especially Gaussian white noise, is relatively scattered, so its signal information entropy is larger.
• S904: determine whether the signal zero-crossing rate of the current audio segment is greater than the first threshold. If it is, perform the next determination; if the signal zero-crossing rate of the current audio segment is less than or equal to the first threshold, the current audio segment is directly determined to be a non-target speech segment;
• S906: determine whether the short-time energy of the current audio segment is greater than the second threshold. If it is, perform the next determination; if the short-time energy of the current audio segment is less than or equal to the second threshold, the current audio segment is directly determined to be a non-target speech segment, and the second threshold is updated according to the short-time energy of the current audio segment;
• S908: determine whether the spectral flatness of the current audio segment is less than the third threshold. If it is, perform the next determination; if the spectral flatness of the current audio segment is greater than or equal to the third threshold, the current audio segment is directly determined to be a non-target speech segment, and the third threshold is updated according to the spectral flatness of the current audio segment;
• S910: determine whether the signal information entropy of the current audio segment is less than the fourth threshold. If it is, perform the next determination; if the signal information entropy of the current audio segment is greater than or equal to the fourth threshold, the current audio segment is directly determined to be a non-target speech segment, and the fourth threshold is updated according to the signal information entropy of the current audio segment.
  • step S910 when it is determined that the above four features all satisfy the corresponding predetermined threshold condition, it is determined that the current audio segment is the target voice segment.
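The cascade of determinations in S904-S910 can be sketched as follows. This is an illustrative sketch only; the function name, the dictionary keys, and the passing of thresholds as a dict are our own conventions, not the patent's.

```python
# Illustrative sketch of the four-feature cascade (S904-S910): a segment is a
# target speech segment only if it passes all four threshold checks in order.

def is_target_segment(zcr, energy, flatness, entropy, thresholds):
    """Return True if an audio segment passes all four feature checks.

    thresholds: dict with keys 'zcr', 'energy', 'flatness', 'entropy'
    (the first through fourth thresholds of S904-S910).
    """
    if zcr <= thresholds['zcr']:            # S904: zero-crossing rate too low
        return False
    if energy <= thresholds['energy']:      # S906: short-term energy too low
        return False
    if flatness >= thresholds['flatness']:  # S908: spectrum too flat (noise-like)
        return False
    if entropy >= thresholds['entropy']:    # S910: distribution too scattered
        return False
    return True                             # all four conditions satisfied
```

In the full procedure, a failing check at S906, S908, or S910 would additionally trigger the corresponding threshold update described below.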
  • By fusing multiple features of the audio segments in different domains, the target speech segment is accurately detected from the plurality of audio segments, which reduces the interference of noise signals in the audio segments with the voice detection process and thereby improves the accuracy of voice detection.
  • the determining module updates the predetermined threshold condition according to at least the audio features of the current audio segment by: updating the second threshold according to at least the short-term energy of the current audio segment when the short-term energy is less than or equal to the second threshold; or updating the third threshold according to at least the spectral flatness of the current audio segment when the spectral flatness is greater than or equal to the third threshold; or updating the fourth threshold according to at least the signal information entropy of the current audio segment when the signal information entropy is greater than or equal to the fourth threshold.
  • the determining module updates the predetermined threshold condition according to at least the audio features of the current audio segment by using the formula A = a × A' + (1 − a) × B, where a represents the attenuation coefficient; when B represents the short-term energy of the current audio segment, A' represents the second threshold and A represents the updated second threshold; when B represents the spectral flatness of the current audio segment, A' represents the third threshold and A represents the updated third threshold; and when B represents the signal information entropy of the current audio segment, A' represents the fourth threshold and A represents the updated fourth threshold.
  • That is, the predetermined threshold condition required for the next audio segment is determined based on the audio features of the current audio segment (a historical audio segment), which makes the target speech detection process more accurate.
  • the predetermined threshold conditions used for comparison with the audio features are continuously updated, ensuring that target speech segments are accurately detected from the plurality of audio segments under the different scenes encountered during detection.
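A minimal sketch of this adaptive update, A = a × A' + (1 − a) × B, follows; the function name and the example attenuation value used in the usage note are ours, not the patent's.

```python
# Sketch of the threshold update A = a * A' + (1 - a) * B: the old threshold
# decays toward the current segment's feature value, so thresholds track the
# changing acoustic scene.

def update_threshold(old_threshold, feature_value, attenuation=0.95):
    """Return the updated threshold A from the old threshold A' and feature B."""
    return attenuation * old_threshold + (1 - attenuation) * feature_value
```

A large attenuation coefficient (close to 1) makes the threshold change slowly; the default 0.95 here is only an assumed example value.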
  • a determining unit configured to, after the target speech segment is detected from the audio segments according to their audio features, determine the start time and the end time of a continuous speech segment formed by target speech segments according to the positions of the target speech segments among the plurality of audio segments.
  • the foregoing voice segment may include, but is not limited to, one target voice segment, or multiple consecutive target voice segments.
  • Each target speech segment includes a start time of the target speech segment and a termination time of the target speech segment.
  • while the target speech segments are detected from the plurality of audio segments, the start time and the end time of the speech segment formed by the target speech segments can be obtained from the time labels of the target speech segments, such as the start time and the end time of each target speech segment.
  • the determining unit includes:
  • the first obtaining module is configured to obtain the start time of the first target speech segment among K consecutive target speech segments as the start time of the continuous speech segment;
  • the second obtaining module is configured to, after the start time of the continuous speech segment is confirmed, obtain the start time of the first non-target speech segment among M consecutive non-target speech segments following the K-th target speech segment, as the end time of the continuous speech segment.
  • the foregoing K is an integer greater than or equal to 1.
  • the M may be set to different values according to different scenarios, which is not limited in this embodiment.
  • For example, assume that the target speech segments detected from a plurality of (for example, 20) audio segments, each of duration T, are P1-P5, P7-P8, P10, and P17-P20, and further assume that M is 5.
  • Based on this assumption, the first five target speech segments are consecutive; there is one non-target speech segment (P6) between P5 and P7, one non-target speech segment (P9) between P8 and P10, and six non-target speech segments (P11-P16) between P10 and P17.
  • From the first K (i.e., the first 5) consecutive target speech segments it can be confirmed that a speech segment A containing a speech signal is detected from the audio signal to be detected, where the start time of speech segment A is the start time of the first of these five target speech segments, i.e., the start time of P1.
  • Because the number of non-target speech segments between P5 and P7 is 1, which is less than M (M = 5), and the number between P8 and P10 is also 1, speech segment A is not terminated at P6 or P9. Because the number of non-target speech segments between P10 and P17 is 6, which is greater than M (M = 5), speech segment A is terminated at the start time of the first non-target speech segment of the consecutive run P11-P16 (i.e., the start time of P11), which is taken as the end time of speech segment A. That is, the start time of speech segment A is the start time 0 of P1, and its end time is the start time 10T of P11.
  • the above consecutive target speech segments P17-P20 will be used in the detection process for the next speech segment B; that detection process can be performed with reference to the above procedure and is not described again in this embodiment.
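The K/M rule illustrated by the P1-P20 example can be sketched as follows. This is a hedged sketch: the function name, the boolean-flag representation, and the open-ended fallback at the end of the signal are our assumptions, not the patent's.

```python
# Sketch: K consecutive target segments open a continuous speech segment; the
# first of M consecutive non-target segments that follow closes it.

def find_speech_span(flags, K, M, segment_len):
    """flags[i] is True if audio segment i is a target speech segment.

    Returns (start_time, end_time) of the first continuous speech segment,
    or None if no run of K consecutive target segments is found.
    """
    n = len(flags)
    start_idx = None
    run = 0
    for i, f in enumerate(flags):
        run = run + 1 if f else 0
        if run == K:                      # K-th consecutive target segment
            start_idx = i - K + 1
            break
    if start_idx is None:
        return None
    gap = 0
    for i in range(start_idx + K, n):     # scan for M consecutive non-targets
        gap = gap + 1 if not flags[i] else 0
        if gap == M:
            end_idx = i - M + 1           # first non-target segment of the run
            return (start_idx * segment_len, end_idx * segment_len)
    return (start_idx * segment_len, n * segment_len)  # still open at signal end
```

With the example above (P1-P5, P7-P8, P10, P17-P20 as targets among 20 segments, K = M = 5), this returns a span starting at 0 and ending at 10T, matching the start of P11.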
  • the audio signal to be detected may be acquired in real time so as to detect whether each audio segment of the audio signal is a target speech segment, thereby accurately detecting the start time and the end time of the speech segment formed by target speech segments. The human-computer interaction device can then respond accurately according to the voice information expressed by the complete speech segment, realizing human-computer interaction.
  • in the process of acquiring the audio signal to be detected in real time, the above detection steps may be performed repeatedly for voice detection; details are not repeated here.
  • while the target speech segment is accurately detected, the human-machine interaction device can quickly determine, in real time, the start time and the end time of the speech segment formed by target speech segments, so that the device responds accurately and in real time to the acquired voice information, achieving natural human-machine interaction.
  • by accurately detecting the start time and the end time of the voice signal corresponding to the target speech segments, the human-machine interaction device also improves the efficiency of human-computer interaction, thereby overcoming the low efficiency caused in the related art by requiring an interacting person to press a control button to trigger the human-computer interaction process.
  • the first obtaining unit is configured to acquire the first N audio segments of the plurality of audio segments after dividing the audio signal to be detected into a plurality of audio segments, where N is an integer greater than one;
  • a building unit configured to construct a noise suppression model according to the first N audio segments, wherein the noise suppression model is configured to perform noise suppression processing on the (N+1)-th and subsequent audio segments of the plurality of audio segments;
  • the second obtaining unit is configured to acquire an initial predetermined threshold condition according to the first N audio segments.
  • a noise suppression model is constructed based on the first N audio segments in the following manner. Assuming that the audio signal consists of a pure speech signal and independent Gaussian white noise, the noise can be suppressed as follows: a Fourier transform is performed on the background noise of the first N audio segments to obtain the frequency-domain information of the signal, and the frequency-domain logarithmic-spectrum features of the noise are estimated from that information to construct the noise suppression model. Further, for the (N+1)-th and subsequent audio segments, noise cancellation may be applied to the audio signal based on the above noise suppression model using a maximum likelihood estimation method.
  • the noise suppression model is constructed by the audio segment without voice input, and an initial predetermined threshold condition for determining the audio feature is obtained.
  • the above initial predetermined threshold condition may be, but is not limited to, determined according to an average value of audio features of the first N audio segments.
  • the first N audio segments of the plurality of audio segments are used for the initialization of the human-computer interaction, for example, constructing a noise suppression model to perform noise suppression on the plurality of audio segments and avoid interference of noise with the voice signal, and obtaining an initial predetermined threshold condition for judging the audio features so as to facilitate voice detection of the plurality of audio segments.
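The initialization step can be sketched in simplified form below. The patent's embodiment estimates frequency-domain log-spectrum features of the noise and applies minimum mean-square error log-spectral amplitude / maximum likelihood estimation; plain per-bin averaging and spectral subtraction are shown here only as an illustrative stand-in with the same structure, and both function names are ours.

```python
# Simplified stand-in for the noise-model initialization: average the magnitude
# spectra of the first N (assumed noise-only) segments to form a per-bin noise
# estimate, then suppress noise in later segments by spectral subtraction.

def build_noise_model(noise_spectra):
    """noise_spectra: one magnitude spectrum (list of floats) per segment,
    for the first N voice-free segments. Returns the per-bin averages."""
    n = len(noise_spectra)
    bins = len(noise_spectra[0])
    return [sum(s[k] for s in noise_spectra) / n for k in range(bins)]

def suppress(spectrum, noise_model, floor=0.0):
    """Subtract the estimated noise per bin, clamping at a spectral floor."""
    return [max(x - m, floor) for x, m in zip(spectrum, noise_model)]
```

The magnitude spectra are assumed to have been obtained beforehand by a Fourier transform of each segment.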
  • an acquisition unit configured to acquire the audio signal to be detected before the audio features are extracted from each audio segment, wherein the audio signal is quantized a first time when it is acquired;
  • a quantization unit configured to perform a second quantization on the acquired audio signal, wherein the quantization level of the second quantization is smaller than the quantization level of the first quantization.
  • the first quantization may be performed, but not limited to, when the audio signal is acquired; the second quantization may be performed, but not limited to, after the noise suppression processing is performed.
  • the larger the quantization level, the more sensitive it is to interference; that is, smaller interfering signals are more likely to interfere with the voice signal. Performing the second quantization with an adjusted (smaller) quantization level therefore filters the interference a second time.
  • For example, 16 bits are used in the first quantization and 8 bits, i.e., the range [-128, 127], in the second quantization, so that speech signals and noise are accurately distinguished by filtering again.
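The 16-bit to 8-bit requantization can be sketched as follows; the arithmetic right shift used to map the sample ranges is our assumption about the mapping, since the patent only specifies the bit widths.

```python
# Sketch of the second quantization: 16-bit samples captured at acquisition are
# requantized to 8 bits (range [-128, 127]), so small noise samples collapse
# toward zero and are filtered a second time.

def requantize_16_to_8(samples_16bit):
    """Map 16-bit samples (-32768..32767) onto the 8-bit range [-128, 127]."""
    return [max(-128, min(127, s >> 8)) for s in samples_16bit]
```

A low-level noise sample such as 255 maps to 0 after this step, while full-scale speech samples keep their sign and relative magnitude.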
  • a voice detecting device for implementing the above voice detecting method is further provided. As shown in FIG. 11, the device includes:
  • a communication interface 1102 configured to acquire an audio signal to be detected
  • the processor 1104 is connected to the communication interface 1102 and configured to divide the audio signal to be detected into a plurality of audio segments; it is further configured to extract audio features from each audio segment, wherein the audio features include at least time-domain features and frequency-domain features of the audio segment, and further configured to detect a target speech segment from the audio segments based on the audio features of the audio segments;
  • the memory 1106 is connected to the communication interface 1102 and the processor 1104, and is configured to store a plurality of audio segments and a target speech segment in the audio signal.
  • Embodiments of the present invention also provide a storage medium.
  • The storage medium is configured to store program code for performing the following steps: dividing the audio signal to be detected into a plurality of audio segments; extracting audio features from each audio segment, wherein the audio features include at least time-domain features and frequency-domain features of the audio segment; and detecting target speech from the audio segments according to the audio features of the audio segments.
  • the storage medium is further configured to store program code for performing the following steps: determining whether the audio features of the current audio segment satisfy a predetermined threshold condition, wherein the audio features of the current audio segment include: the signal zero-crossing rate of the current audio segment in the time domain, the short-term energy of the current audio segment in the time domain, the spectral flatness of the current audio segment in the frequency domain, and the signal information entropy of the current audio segment in the time domain; and, when the audio features of the current audio segment satisfy the predetermined threshold condition, detecting that the current audio segment is the target speech segment.
  • the storage medium is further configured to store program code for performing the following steps: detecting the target speech segment from the audio segments according to the audio features of the audio segments includes repeating the following steps until the current audio segment is the last of the plurality of audio segments, wherein the current audio segment is initialized to the first of the plurality of audio segments: determining whether the audio features of the current audio segment satisfy the predetermined threshold condition; when they do, detecting that the current audio segment is the target speech segment; when they do not, updating the predetermined threshold condition according to at least the audio features of the current audio segment to obtain an updated predetermined threshold condition; and determining whether the current audio segment is the last audio segment of the plurality of audio segments, and if not, taking the next audio segment after the current audio segment as the current audio segment.
  • the storage medium is further configured to store program code for performing the following steps: determining whether the audio features of the current audio segment satisfy the predetermined threshold condition includes: determining whether the signal zero-crossing rate of the current audio segment in the time domain is greater than a first threshold; when the signal zero-crossing rate of the current audio segment is greater than the first threshold, determining whether the short-term energy of the current audio segment in the time domain is greater than a second threshold; when the short-term energy of the current audio segment is greater than the second threshold, determining whether the spectral flatness of the current audio segment in the frequency domain is less than a third threshold; and when the spectral flatness of the current audio segment in the frequency domain is less than the third threshold, determining whether the signal information entropy of the current audio segment in the time domain is less than a fourth threshold.
  • detecting that the current audio segment is the target voice segment includes: when it is determined that the signal information entropy of the current audio segment is less than the fourth threshold, detecting that the current audio segment is the target voice segment.
  • the storage medium is further configured to store program code for performing the following steps: when the short-term energy of the current audio segment is less than or equal to the second threshold, at least according to the short-term energy of the current audio segment Updating the second threshold; or updating the third threshold according to at least the spectral flatness of the current audio segment when the spectral flatness of the current audio segment is greater than or equal to the third threshold; or when the signal information entropy of the current audio segment is greater than or equal to the fourth threshold And updating the fourth threshold according to at least the signal information entropy of the current audio segment.
  • the storage medium is further arranged to store program code for updating the predetermined threshold condition according to the formula A = a × A' + (1 − a) × B, where a represents the attenuation coefficient; when B represents the short-term energy of the current audio segment, A' represents the second threshold and A represents the updated second threshold; when B represents the spectral flatness of the current audio segment, A' represents the third threshold and A represents the updated third threshold; and when B represents the signal information entropy of the current audio segment, A' represents the fourth threshold and A represents the updated fourth threshold.
  • the storage medium is further configured to store program code for performing the following steps: after detecting the target speech segment from the audio segment according to the audio feature of the audio segment, according to the target speech segment The position in the audio segment determines the start time and the end time of the continuous speech segment formed by the target speech segment.
  • the storage medium is further configured to store program code for performing the following steps: acquiring a starting moment of the first target speech segment of the consecutive K target speech segments as a continuous speech segment The starting time of the continuous speech segment; after confirming the starting time of the continuous speech segment, acquiring the first non-target speech segment of the consecutive M non-target speech segments after the Kth target speech segment The starting moment is the ending moment of the continuous speech segment.
  • the storage medium is further configured to store program code for performing the following steps: after the audio signal to be detected is divided into a plurality of audio segments, acquiring the first N audio segments of the plurality of audio segments, wherein N is an integer greater than 1; constructing a noise suppression model according to the first N audio segments, wherein the noise suppression model is used to perform noise suppression processing on the (N+1)-th and subsequent audio segments of the plurality of audio segments; and obtaining an initial predetermined threshold condition based on the first N audio segments.
  • the storage medium is further configured to store program code for performing the following steps: acquiring the audio signal to be detected before extracting the audio features from each audio segment, wherein the audio signal is quantized a first time when it is acquired; and performing a second quantization on the acquired audio signal, wherein the quantization level of the second quantization is smaller than the quantization level of the first quantization.
  • the storage medium is further configured to store program code for performing the following step: performing noise suppression processing on the acquired audio signal before performing the second quantization on it.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, or another medium that can store program code.
  • the integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in the above-described computer-readable storage medium.
  • Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computing devices (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention.
  • the disclosed client may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the audio signal to be detected is divided into a plurality of audio segments, and audio features are extracted from each audio segment, wherein the audio features include at least time-domain features and frequency-domain features of the audio segment. Multiple features of the audio segments in different domains are thereby fused to accurately detect the target speech segment from the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process, improving the accuracy of voice detection, and overcoming the low detection accuracy caused in the related art by detecting voice with only a single feature.


Abstract

A voice detection method, apparatus, and storage medium. The method includes: dividing an audio signal to be detected into a plurality of audio segments (S302); extracting audio features from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment (S304); and detecting a target speech segment from the audio segments according to the audio features of the audio segments (S306). The method solves the technical problem of the low voice-detection accuracy of other voice detection methods.

Description

Voice detection method, apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 2016102572447, entitled "Voice detection method and apparatus" and filed with the Chinese Patent Office on April 22, 2016, which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of the present invention relate to the field of computers, and in particular to a voice detection method, apparatus, and storage medium.
Background
At present, to simplify operations and improve user experience, voice signals are being used for control in many fields, for example, using a voice signal as a spoken input password. In the related art, however, the voice detection approach applied to the voice signal usually extracts a single feature from the input signal. Such a single feature is often sensitive to noise and cannot accurately distinguish interfering sounds from the voice signal, which lowers the accuracy of voice detection.
No effective solution to the above problem has yet been proposed.
Summary
The embodiments of the present invention provide a voice detection method, apparatus, and storage medium, to solve at least the technical problem of low voice-detection accuracy caused by the related voice detection methods.
According to one aspect of the embodiments of the present invention, a voice detection method is provided, including: dividing an audio signal to be detected into a plurality of audio segments; extracting audio features from each of the audio segments, where the audio features include at least time-domain features and frequency-domain features of the audio segment; and detecting a target speech segment from the audio segments according to the audio features of the audio segments.
According to another aspect of the embodiments of the present invention, a voice detection apparatus is further provided, including: a dividing unit configured to divide an audio signal to be detected into a plurality of audio segments; an extraction unit configured to extract audio features from each of the audio segments, where the audio features include at least time-domain features and frequency-domain features of the audio segment; and a detection unit configured to detect a target speech segment from the audio segments according to the audio features of the audio segments.
According to yet another aspect of the embodiments of the present invention, a storage medium is further provided, the storage medium being configured to store program code for performing the following steps: dividing an audio signal to be detected into a plurality of audio segments; extracting audio features from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment; and detecting target speech from the audio segments according to the audio features of the audio segments.
In the embodiments of the present invention, the audio signal to be detected is divided into a plurality of audio segments, and audio features are extracted from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment. Multiple features of the audio segments in different domains are thereby fused to accurately detect the target speech segment from the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process, improving the accuracy of voice detection, and overcoming the low detection accuracy caused in the related art by detecting voice with only a single feature.
Further, while the target speech segment is accurately detected, the human-computer interaction device can also quickly determine, in real time, the start time and the end time of the speech segment formed by target speech segments, so that the device responds accurately and in real time to the detected voice and natural human-machine interaction is achieved. In addition, by accurately detecting the start time and the end time of the speech segment formed by target speech segments, the human-computer interaction device also improves the efficiency of human-computer interaction, overcoming the low efficiency caused in the related art by requiring an interacting person to press a control button to trigger the human-computer interaction process.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and constitute part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation on it. In the drawings:
FIG. 1 is a schematic diagram of the application environment of an optional voice detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the application environment of another optional voice detection method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of an optional voice detection method according to an embodiment of the present invention;
FIG. 4 is a schematic waveform diagram of an optional voice detection method according to an embodiment of the present invention;
FIG. 5 is a schematic waveform diagram of another optional voice detection method according to an embodiment of the present invention;
FIG. 6 is a schematic waveform diagram of yet another optional voice detection method according to an embodiment of the present invention;
FIG. 7 is a schematic waveform diagram of yet another optional voice detection method according to an embodiment of the present invention;
FIG. 8 is a schematic waveform diagram of yet another optional voice detection method according to an embodiment of the present invention;
FIG. 9 is a schematic flowchart of another optional voice detection method according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an optional voice detection apparatus according to an embodiment of the present invention; and
FIG. 11 is a schematic diagram of an optional voice detection device according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to the embodiments of the present invention, an embodiment of the above voice detection method is provided. Optionally, in this embodiment, the voice detection method may be, but is not limited to being, applied in the application environment shown in FIG. 1. The audio signal to be detected is acquired through the terminal 102 and sent over the network 104 to the server 106. The server 106 divides the audio signal to be detected into a plurality of audio segments; extracts audio features from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment; and detects a target speech segment from the audio segments according to the audio features of the audio segments. By fusing multiple time-domain and frequency-domain features of the audio segments and exploiting the complementarity of the features, the target speech segment is accurately detected from the plurality of audio segments of the audio signal, ensuring the accuracy with which the speech segment formed by target speech segments is detected.
Optionally, in this embodiment, the above voice detection method may also be, but is not limited to being, applied in the application environment shown in FIG. 2. That is, after the terminal 102 acquires the audio signal to be detected, the terminal 102 performs the detection of the audio segments in the above voice detection method; the specific process may be as described above and is not repeated here.
It should be noted that in this embodiment the terminals shown in FIGS. 1-2 are only examples. Optionally, in this embodiment, the terminal may include, but is not limited to, at least one of the following: a mobile phone, a tablet computer, a notebook computer, a desktop PC, a digital television, or another human-computer interaction device. The above is only an example, and this embodiment does not impose any limitation on it. Optionally, in this embodiment, the network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, or a local area network. The above is only an example, and this embodiment does not impose any limitation on it.
According to an embodiment of the present invention, a voice detection method is provided. As shown in FIG. 3, the method includes:
S302: Divide an audio signal to be detected into a plurality of audio segments;
S304: Extract audio features from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment;
S306: Detect a target speech segment from the audio segments according to the audio features of the audio segments.
Optionally, in this embodiment, the above voice detection method may be, but is not limited to being, applied in at least one of the following scenarios: an intelligent robot chat system, an automatic question-answering system, human-machine chat software, and the like. That is, when the voice detection method provided in this embodiment is applied to human-computer interaction, audio features including at least time-domain features and frequency-domain features of the audio segments are extracted to accurately detect the target speech segments among the plurality of audio segments into which the audio signal to be detected is divided. The device used for human-computer interaction can thus learn the start time and the end time of the speech segment formed by target speech segments, so that the device replies accurately after obtaining the complete voice information to be expressed. Here, in this embodiment, the speech segment may include, but is not limited to, one target speech segment or a plurality of consecutive target speech segments, where each target speech segment has a start time and an end time. This embodiment does not impose any limitation on this.
It should be noted that in this embodiment the human-computer interaction device divides the audio signal to be detected into a plurality of audio segments and extracts audio features from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment. Multiple features of the audio segments in different domains are thereby fused to accurately detect the target speech segment from the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process, improving the accuracy of voice detection, and overcoming the low detection accuracy caused in the related art by detecting voice with only a single feature.
Further, while the target speech segment is accurately detected, the human-computer interaction device can also quickly determine, in real time, the start time and the end time of the speech segment formed by target speech segments, so that the device responds accurately and in real time to the detected voice information and natural human-machine interaction is achieved. In addition, by accurately detecting the start time and the end time of the speech segment formed by target speech segments, the human-computer interaction device also improves the efficiency of human-computer interaction, overcoming the low efficiency caused in the related art by requiring an interacting person to press a control button to trigger the human-computer interaction process.
Optionally, in this embodiment, the audio features may include, but are not limited to, at least one of the following: the signal zero-crossing rate in the time domain, the short-term energy in the time domain, the spectral flatness in the frequency domain, the signal information entropy in the time domain, the autocorrelation coefficient, the wavelet-transformed signal, the signal complexity, and the like.
It should be noted that: 1) the signal zero-crossing rate may be, but is not limited to being, used to remove the interference of some impulse noise; 2) the short-term energy may be, but is not limited to being, used to measure the amplitude of the audio signal and, together with a threshold, to remove the interference of the speech of irrelevant people; 3) the spectral flatness may be, but is not limited to being, used to calculate the frequency distribution characteristics of the signal in the frequency domain and, according to the magnitude of this feature, to judge whether the audio signal is background Gaussian white noise; 4) the time-domain signal information entropy may be, but is not limited to being, used to measure the distribution characteristics of the audio signal in the time domain and to distinguish speech signals from general noise. In this embodiment, the above multiple time-domain and frequency-domain features are fused in the voice detection process to resist the interference of impulses and background noise and to enhance robustness, so that the target speech segment is accurately detected from the plurality of audio segments into which the audio signal to be detected is divided, and the start time and end time of the speech segment formed by the target speech segments are accurately obtained, achieving natural human-machine interaction.
Optionally, in this embodiment, the manner of detecting the target speech segment from the plurality of audio segments of the audio signal according to the audio features of the audio segments may include, but is not limited to: judging whether the audio features of an audio segment satisfy a predetermined threshold condition, and when they do, detecting that the audio segment is the target speech segment.
It should be noted that in this embodiment, when judging whether the audio features of an audio segment satisfy the predetermined threshold condition, the current audio segment used for the judgment may be obtained from the plurality of audio segments in at least one of the following orders: 1) the input order of the audio signal; 2) a predetermined order, where the predetermined order may be a random order or an order arranged according to a predetermined principle, for example the order of the sizes of the audio segments. The above is only an example, and this embodiment does not impose any limitation on it.
In addition, in this embodiment, the predetermined threshold condition may be, but is not limited to being, adaptively updated and adjusted according to the changing scene. The predetermined threshold conditions used for comparison with the audio features are continuously updated, ensuring that the target speech segment is accurately detected from the plurality of audio segments under the different scenes encountered during detection. Further, for the multiple features of an audio segment in multiple domains, whether each satisfies its corresponding predetermined threshold condition is judged separately, so that the audio segment is judged and screened multiple times and the target speech segment is detected accurately.
Optionally, in this embodiment, when the audio segments are obtained from the plurality of audio segments in the input order of the audio signal to judge whether their audio features satisfy the predetermined threshold condition, detecting the target speech segment from the audio segments according to the audio features of the audio segments includes repeating the following steps until the current audio segment is the last of the plurality of audio segments, where the current audio segment is initialized to the first of the plurality of audio segments:
S1: Judge whether the audio features of the current audio segment satisfy the predetermined threshold condition;
S2: When the audio features of the current audio segment satisfy the predetermined threshold condition, detect that the current audio segment is the target speech segment;
S3: When the audio features of the current audio segment do not satisfy the predetermined threshold condition, update the predetermined threshold condition according to at least the audio features of the current audio segment to obtain an updated predetermined threshold condition;
S4: Judge whether the current audio segment is the last of the plurality of audio segments, and if not, take the next audio segment after the current audio segment as the current audio segment.
It should be noted that in this embodiment the predetermined threshold condition may be, but is not limited to being, updated according to at least the audio features of the current audio segment to obtain an updated predetermined threshold condition. That is, when the predetermined threshold condition is updated, the predetermined threshold condition required for the next audio segment is determined according to the audio features of the current audio segment (a historical audio segment), which makes the detection of the audio segments more accurate.
Optionally, in this embodiment, after the audio signal to be detected is divided into a plurality of audio segments, the method further includes:
S1: Obtain the first N audio segments of the plurality of audio segments, where N is an integer greater than 1;
S2: Construct a noise suppression model according to the first N audio segments, where the noise suppression model is used to perform noise suppression processing on the (N+1)-th and subsequent audio segments of the plurality of audio segments;
S3: Obtain an initial predetermined threshold condition according to the first N audio segments.
It should be noted that, to ensure the accuracy of the voice detection process, noise suppression processing is performed on the plurality of audio segments in this embodiment to avoid interference of noise with the voice signal; for example, a minimum mean-square error log-spectral amplitude estimation method is used to remove the background noise of the audio signal.
Optionally, in this embodiment, the first N audio segments may be, but are not limited to, audio segments without voice input. That is, before the human-computer interaction process starts, an initialization operation is performed: the noise suppression model is constructed from the audio segments without voice input, and an initial predetermined threshold condition for judging the audio features is obtained, where the initial predetermined threshold condition may be, but is not limited to being, determined according to the averages of the audio features of the first N audio segments.
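Taking the initial thresholds as per-feature averages over the first N (assumed speech-free) segments can be sketched as follows; the function name and the tuple layout of the features are our assumptions.

```python
# Sketch: initial thresholds taken as the mean of each audio feature over the
# first N voice-free segments, as the text above describes.

def initial_thresholds(feature_rows):
    """feature_rows: list of (zcr, energy, flatness, entropy) tuples, one per
    segment. Returns the per-feature means as the initial thresholds."""
    n = len(feature_rows)
    return tuple(sum(col) / n for col in zip(*feature_rows))
```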
Optionally, in this embodiment, before the audio features are extracted from each audio segment, the method further includes: performing a second quantization on the acquired audio signal to be detected, where the quantization level of the second quantization is smaller than that of the first quantization.
It should be noted that in this embodiment the first quantization may be, but is not limited to being, performed when the audio signal is acquired, and the second quantization may be, but is not limited to being, performed after the noise suppression processing. In addition, in this embodiment, the larger the quantization level, the more sensitive it is to interference. That is, when the quantization level is large, the quantization interval is small, so small noise signals are also quantized; the quantized result then includes both the voice signal and the noise signal, which greatly interferes with voice signal detection. In this embodiment, a second quantization is performed with an adjusted quantization level, i.e., the quantization level of the second quantization is smaller than that of the first quantization, so that the noise signal is filtered a second time and the interference is reduced.
Optionally, in this embodiment, dividing the audio signal to be detected into a plurality of audio segments may include, but is not limited to: sampling the audio signal acquired by the device with a fixed-length window. In this embodiment, the length of the fixed-length window is small; for example, a window of length 256 (number of samples) is used. That is, the audio signal is divided with a small window so that processing results are returned in real time and real-time detection of the voice signal is completed.
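The division with a fixed-length 256-sample window can be sketched as follows; dropping a trailing partial window is our simplification, since the patent does not specify how a remainder shorter than the window is handled.

```python
# Sketch of dividing the signal into consecutive fixed-length segments with a
# window of 256 samples; any trailing partial window is dropped in this sketch.

def split_segments(signal, window=256):
    """Split a list of samples into consecutive fixed-length segments."""
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, window)]
```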
Through the embodiments provided in this application, the audio signal to be detected is divided into a plurality of audio segments, and audio features are extracted from each audio segment, where the audio features include at least time-domain features and frequency-domain features of the audio segment. Multiple features of the audio segments in different domains are thereby fused to accurately detect the target speech segment from the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process, improving the accuracy of voice detection, and overcoming the low detection accuracy caused in the related art by detecting voice with only a single feature.
As an optional solution, detecting the target speech segment from the audio segments according to the audio features of the audio segments includes:
S1: Judge whether the audio features of the current audio segment satisfy the predetermined threshold condition, where the audio features of the current audio segment include: the signal zero-crossing rate of the current audio segment in the time domain, the short-term energy of the current audio segment in the time domain, the spectral flatness of the current audio segment in the frequency domain, and the signal information entropy of the current audio segment in the time domain;
S2: When the audio features of the current audio segment satisfy the predetermined threshold condition, detect that the current audio segment is the target speech segment.
Optionally, in this embodiment, the audio features of the current audio segment x(i), i = 0, 1, 2, …, N−1, may be obtained by the following formulas:
1) Calculation of the signal zero-crossing rate in the time domain (i.e., the short-term zero-crossing rate):
Z = (1/2) × Σ_{i=1}^{N−1} |sgn[x(i)] − sgn[x(i−1)]|  (1)
where sgn[] is the sign function:
sgn[x] = 1 if x ≥ 0; sgn[x] = −1 if x < 0  (2)
2) Calculation of the short-term energy in the time domain:
E = Σ_{i=0}^{N−1} x(i)² × h(i)  (3)
where h[i] is a window function, for example the rectangular window:
h(i) = 1 for 0 ≤ i ≤ N−1, and h(i) = 0 otherwise  (4)
3) Calculation of the spectral flatness in the frequency domain:
First, a Fourier transform is performed on the audio segment x(i), i = 0, 1, 2, …, N−1, to obtain the frequency-domain magnitudes f(i), i = 0, 1, 2, …, N−1;
The spectral flatness is then calculated according to the following formula, i.e., the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum:
SF = (∏_{i=0}^{N−1} f(i))^{1/N} / ((1/N) × Σ_{i=0}^{N−1} f(i))  (5)
4) Calculation of the signal information entropy in the time domain:
First, the relative probabilities after normalizing the absolute values of the signal are calculated:
p(i) = |x(i)| / Σ_{j=0}^{N−1} |x(j)|  (6)
The signal information entropy is then calculated according to the following formula:
H = −Σ_{i=0}^{N−1} p(i) × log p(i)  (7)
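The four feature computations above can be sketched in Python as follows. Assumptions not fixed by the text: a rectangular window is used for the short-term energy, the magnitude spectrum f is precomputed by an FFT and strictly positive, and the natural logarithm is used for the entropy (the base is not specified).

```python
import math

def zero_crossing_rate(x):
    """Z = (1/2) * sum |sgn(x[i]) - sgn(x[i-1])|, with sgn(v) = 1 if v >= 0 else -1."""
    sgn = lambda v: 1 if v >= 0 else -1
    return 0.5 * sum(abs(sgn(x[i]) - sgn(x[i - 1])) for i in range(1, len(x)))

def short_term_energy(x):
    """E = sum x[i]^2 (rectangular window h[i] = 1)."""
    return sum(v * v for v in x)

def spectral_flatness(f):
    """Geometric mean over arithmetic mean of the magnitude spectrum f (f[i] > 0)."""
    n = len(f)
    geo = math.exp(sum(math.log(v) for v in f) / n)
    return geo / (sum(f) / n)

def signal_entropy(x):
    """H = -sum p[i] log p[i] with p[i] = |x[i]| / sum |x[j]|."""
    total = sum(abs(v) for v in x)
    ps = [abs(v) / total for v in x if v != 0]
    return -sum(p * math.log(p) for p in ps)
```

As the text describes, speech tends to give a high zero-crossing rate and energy but a low flatness and entropy relative to Gaussian white noise.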
This is described in detail with reference to the following example. FIG. 4 shows an original audio signal with impulse noise: the middle band (the signal between 50000 and 150000 on the horizontal axis) contains some impulse noise, and the speech signal is the last section (the signal between 230000 and 240000 on the horizontal axis). FIG. 5 shows the audio signal when only the signal zero-crossing rate is extracted from the original audio signal; the zero-crossing-rate feature distinguishes impulse noise well, so the impulse noise of the middle band (between 50000 and 150000 on the horizontal axis) can be directly filtered out, but low-energy non-impulse noise (between 210000 and 220000 on the horizontal axis) cannot be distinguished. FIG. 6 shows the audio signal when only the short-term energy is extracted; the short-term energy feature can filter the low-energy non-impulse noise (between 210000 and 220000 on the horizontal axis), but cannot distinguish the impulse noise of the middle band (between 50000 and 150000 on the horizontal axis), since impulse signals also have relatively large energy. FIG. 7 shows the spectral flatness and signal information entropy extracted from the original audio signal; these two features detect both the speech signal and the impulse noise and can retain all speech-like signals to the greatest extent. Further, FIG. 8 shows the approach provided in this embodiment: by combining the short-term energy and signal zero-crossing-rate features with the spectral flatness and the signal information entropy, the interference of impulse noise and other low-energy noise can be distinguished and the actual speech signal detected. As the signals shown in the above figures indicate, the audio signal extracted in this embodiment is more conducive to accurately detecting the target speech segment.
Through the embodiments provided in this application, the above multiple time-domain and frequency-domain features are fused in the voice detection process to resist the interference of impulses and background noise and to enhance robustness, so that the target speech segment is accurately detected from the plurality of audio segments into which the audio signal to be detected is divided, the start time and end time of the voice signal corresponding to the target speech segment are accurately obtained, and natural human-machine interaction is achieved.
As an optional solution, detecting the target speech segment from the audio segments according to the audio features of the audio segments includes:
S1: Repeat the following steps until the current audio segment is the last of the plurality of audio segments, where the current audio segment is initialized to the first of the plurality of audio segments:
S11: Judge whether the audio features of the current audio segment satisfy the predetermined threshold condition;
S12: When the audio features of the current audio segment satisfy the predetermined threshold condition, detect that the current audio segment is the target speech segment;
S13: When the audio features of the current audio segment do not satisfy the predetermined threshold condition, update the predetermined threshold condition according to at least the audio features of the current audio segment to obtain an updated predetermined threshold condition;
S14: Judge whether the current audio segment is the last of the plurality of audio segments, and if not, take the next audio segment after the current audio segment as the current audio segment.
Optionally, in this embodiment, the predetermined threshold condition may be, but is not limited to being, adaptively updated and adjusted according to the changing scene. In this embodiment, when the audio segments are obtained from the plurality of audio segments in the input order of the audio signal to judge whether their audio features satisfy the predetermined threshold condition, the predetermined threshold condition may be, but is not limited to being, updated according to at least the audio features of the current audio segment. That is, when the predetermined threshold condition needs to be updated, the next updated predetermined threshold condition is obtained based on the current audio segment (a historical audio segment).
It should be noted that the audio signal to be detected includes a plurality of audio segments, and the above judgment process is repeated for each audio segment until the plurality of audio segments into which the audio signal to be detected is divided have all been traversed, i.e., until the current audio segment is the last of the plurality of audio segments.
Through the embodiments provided in this application, the predetermined threshold conditions used for comparison with the audio features are continuously updated, ensuring that the target speech segment is accurately detected from the plurality of audio segments under the different scenes encountered during detection. Further, for the multiple features of an audio segment in multiple domains, whether each satisfies its corresponding predetermined threshold condition is judged separately, so that the audio segment is judged and screened multiple times and an accurate target speech segment is detected.
As an optional solution:
S1: Judging whether the audio features of the current audio segment satisfy the predetermined threshold condition includes: S11: judging whether the signal zero-crossing rate of the current audio segment in the time domain is greater than a first threshold; when the signal zero-crossing rate of the current audio segment is greater than the first threshold, judging whether the short-term energy of the current audio segment in the time domain is greater than a second threshold; when the short-term energy of the current audio segment is greater than the second threshold, judging whether the spectral flatness of the current audio segment in the frequency domain is less than a third threshold; and when the spectral flatness of the current audio segment in the frequency domain is less than the third threshold, judging whether the signal information entropy of the current audio segment in the time domain is less than a fourth threshold;
S2: When the audio features of the current audio segment satisfy the predetermined threshold condition, detecting that the current audio segment is the target speech segment includes: S21: when it is judged that the signal information entropy of the current audio segment is less than the fourth threshold, detecting that the current audio segment is the target speech segment.
Optionally, in this embodiment, the above process of detecting the target speech segment according to the multiple time-domain and frequency-domain features of the current audio segment may be, but is not limited to being, performed after the second quantization of the audio signal. This embodiment does not impose any limitation on this.
It should be noted that the roles of the above audio features in the voice detection process are as follows:
1) Signal zero-crossing rate: obtain the signal zero-crossing rate of the current audio segment in the time domain. The signal zero-crossing rate indicates the number of times the waveform of a section of audio signal crosses the zero axis; in general, the zero-crossing rate of a speech signal is larger than that of a non-speech signal.
2) Short-term energy: obtain the time-domain energy of the current audio segment in terms of its time-domain amplitude. The short-term energy is used to distinguish non-speech signals from speech signals by signal energy; in general, the short-term energy of a speech signal is larger than that of a non-speech signal.
3) Spectral flatness: perform a Fourier transform on the current audio segment and calculate its spectral flatness. The frequency distribution of a speech signal is relatively concentrated and its spectral flatness correspondingly small, whereas the frequency distribution of a Gaussian white noise signal is relatively scattered and its spectral flatness correspondingly large.
4) Signal information entropy: normalize the current audio segment and then calculate its signal information entropy. The distribution of a speech signal is relatively concentrated and its signal information entropy correspondingly small, whereas a non-speech signal, especially Gaussian white noise, has a relatively scattered distribution and a correspondingly large signal information entropy.
This is described with reference to the example shown in FIG. 9:
S902: Obtain the audio features of the current audio segment;
S904: Judge whether the signal zero-crossing rate of the current audio segment is greater than the first threshold. If it is, proceed to the next operation; if the signal zero-crossing rate of the current audio segment is less than or equal to the first threshold, the current audio segment is directly determined to be a non-target speech segment;
S906: Judge whether the short-term energy of the current audio segment is greater than the second threshold. If it is, proceed to the next judgment; if the short-term energy of the current audio segment is less than or equal to the second threshold, the current audio segment is directly determined to be a non-target speech segment, and the second threshold is updated according to the short-term energy of the current audio segment;
S908: Judge whether the spectral flatness of the current audio segment is less than the third threshold. If it is, proceed to the next judgment; if the spectral flatness of the current audio segment is greater than or equal to the third threshold, the current audio segment is directly determined to be a non-target speech segment, and the third threshold is updated according to the spectral flatness of the current audio segment;
S910: Judge whether the signal information entropy of the current audio segment is less than the fourth threshold. If it is, proceed to the next judgment; if the signal information entropy of the current audio segment is greater than or equal to the fourth threshold, the current audio segment is directly determined to be a non-target speech segment, and the fourth threshold is updated according to the signal information entropy of the current audio segment.
After step S910 is performed, when it is judged that all four of the above features satisfy their corresponding predetermined threshold conditions, the current audio segment is determined to be the target speech segment.
Through the embodiments provided in this application, multiple features of the audio segments in different domains are fused to accurately detect the target speech segment from the plurality of audio segments, reducing the interference of noise signals in the audio segments with the voice detection process and improving the accuracy of voice detection.
As an optional solution, updating the predetermined threshold condition according to at least the audio features of the current audio segment includes:
1) when the short-term energy of the current audio segment is less than or equal to the second threshold, updating the second threshold according to at least the short-term energy of the current audio segment; or
2) when the spectral flatness of the current audio segment is greater than or equal to the third threshold, updating the third threshold according to at least the spectral flatness of the current audio segment; or
3) when the signal information entropy of the current audio segment is greater than or equal to the fourth threshold, updating the fourth threshold according to at least the signal information entropy of the current audio segment.
Optionally, in this embodiment, updating the predetermined threshold condition according to at least the audio features of the current audio segment includes:
A = a × A' + (1 − a) × B  (8)
where a represents the attenuation coefficient; when B represents the short-term energy of the current audio segment, A' represents the second threshold and A represents the updated second threshold; when B represents the spectral flatness of the current audio segment, A' represents the third threshold and A represents the updated third threshold; and when B represents the signal information entropy of the current audio segment, A' represents the fourth threshold and A represents the updated fourth threshold.
That is, when the predetermined threshold condition is updated, the predetermined threshold condition required for the next audio segment is determined according to the audio features of the current audio segment (a historical audio segment), which makes the target speech detection process more accurate.
Through the embodiments provided in this application, the predetermined threshold conditions used for comparison with the audio features are continuously updated, ensuring that the target speech segment is accurately detected from the plurality of audio segments under the different scenes encountered during detection.
As an optional solution, after the target speech segment is detected from the audio segments according to the audio features of the audio segments, the method further includes:
S1: Determine the start time and the end time of the continuous speech segment formed by target speech segments according to the positions of the target speech segments among the plurality of audio segments.
Optionally, in this embodiment, the speech segment may include, but is not limited to, one target speech segment or a plurality of consecutive target speech segments, where each target speech segment has a start time and an end time.
It should be noted that in this embodiment, while the target speech segments are detected from the plurality of audio segments, the start time and the end time of the speech segment formed by the target speech segments can be obtained from the time labels of the target speech segments, such as the start time and the end time of each target speech segment.
Optionally, in this embodiment, determining the start time and the end time of the continuous speech segment formed by target speech segments according to the positions of the target speech segments among the plurality of audio segments includes:
S1: Obtain the start time of the first target speech segment among K consecutive target speech segments as the start time of the continuous speech segment;
S2: After the start time of the continuous speech segment is confirmed, obtain the start time of the first non-target speech segment among M consecutive non-target speech segments following the K-th target speech segment, as the end time of the continuous speech segment.
Optionally, in this embodiment, K is an integer greater than or equal to 1, and M may be set to different values for different scenarios; this embodiment does not impose any limitation on this.
具体结合以下示例进行说明,假设从多个(例如,20个)音频段(假设每段时长均为T)中检测出的目标语音段包括:P1-P5,P7-P8,P10,P17-P20。进一步,假设M为5。
基于上述假设可知,前5个目标语音段连续,P5与P7之间包括一个非目标语音段(即P6),P8与P10之间包括一个非目标语音段(即P9),P10与P17之间包括6个非目标语音段(即P11-P16)。
根据前K个(即前5个)连续目标语音段可以确认:从待检测的音频信号中检测出一个包含语音信号的语音段A,其中,该语音段A的起始时刻为前5个目标语音段中的第一个目标语音段的起始时刻(即P1的起始时刻)。进一步,由于P5与P7之间非目标语音段的数量为1,即小于M(M=5);由于P8与P10之间非目标语音段的数量为1,即小于M(M=5),则可以判定在非目标语音段P6及非目标语音段P9时,上述语音段A并未终止。而由于P10与P17之间非目标语音段的数量为6,即大于M(M=5),即连续非目标语音段(P11-P16)的数量已满足M个的预设阈值,则可以判定上述语音段A在连续非目标语音段(即P11-P16)中的第一个非目标语音段的起始时刻(即P11的起始时刻)终止,则将P11的起始时刻作为语音段A的终止时刻。也就是说,语音段A的起始时刻为P1的起始时刻0,终止时刻为P11的起始时刻10T。
这里,需要说明的是,在本示例中,上述连续目标语音段P17-P20将用于判定下一个语音段B的检测过程。检测过程可以参照上述过程执行,本实施例中在此不再赘述。
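上述由连续K个目标语音段确定起始时刻、由其后连续M个非目标语音段确定终止时刻的过程,可用如下代码示意(函数名与返回约定均为假设;labels中第i项表示第i个音频段是否为目标语音段,T为每段时长):

```python
def find_speech_segment(labels, K, M, T):
    """返回 (起始时刻, 终止时刻);未确定的时刻返回 None。"""
    start = end = None
    run_target = run_nontarget = 0
    for i, is_target in enumerate(labels):
        if start is None:
            # 累计连续目标语音段,满K个时取其中第一个的起始时刻
            run_target = run_target + 1 if is_target else 0
            if run_target == K:
                start = (i - K + 1) * T
        else:
            # 起始确定后,累计连续非目标语音段,满M个时取其中第一个的起始时刻
            if is_target:
                run_nontarget = 0
            else:
                run_nontarget += 1
                if run_nontarget == M:
                    end = (i - M + 1) * T
                    break
    return start, end
```

以上文示例(P1-P5、P7-P8、P10、P17-P20为目标语音段,K与M均取5)代入,可得到语音段A的起始时刻0与终止时刻10T。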
此外,在本实施例中,可以但不限于实时获取待检测的音频信号,以便于检测音频信号中的音频段是否为目标语音段,从而达到准确检测出目标语音段构成的语音段的起始时刻及语音段的终止时刻,进而实现人机交互设备可以根据完整的语音段所要表达的语音信息后再进行准确答复,实现人机交互。需要说明的是,在实时获取待检测的音频信号的过程中,对于语音检测可以但不限于重复执行上述检测步骤。本实施例中在此不再赘述。
通过本申请提供的实施例,在准确检测出目标语音段的同时,还可以使人机交互设备快速实时地判断出目标语音段构成的语音段的起始时刻及终止时刻,从而实现人机交互设备对检测获取到的语音信息进行准确实时地反应,达到人机自然交互的效果。此外,人机交互设备通过准确检测出目标语音段所对应的语音信号的起始时间及终止时间,还将实现提高人机交互效率的效果,进而克服相关技术中由交互人员通过按下控制按钮来触发启动人机交互过程所导致的人机交互效率较低的问题。
作为一种可选的方案,在将待检测的音频信号划分为多个音频段之后,还包括:
S1,获取多个音频段中前N个音频段,其中,N为大于1的整数;
S2,根据前N个音频段构建抑噪模型,其中,抑噪模型用于对多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;
S3,根据前N个音频段获取初始预定阈值条件。
例如,具体通过以下方式根据前N个音频段构建抑噪模型。假设音频信号包括纯净语音信号和独立的高斯白噪声,则可以通过以下方式来抑噪:对前N个音频段的背景噪声进行傅立叶变换,得到信号的频域信息;根据该背景噪声的频域信息,估计出噪声的频域对数谱特征,以构建抑噪模型。进一步,对第N+1个音频段及其之后的音频段,可以但不限于基于上述抑噪模型采用最大似然估计方法,实现对音频信号进行消除噪声处理。
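作为示意,下面用最简单的谱减法近似上述"由前N个音频段估计噪声、再对后续音频段抑噪"的思路。这并非专利所述最小均方误差对数谱幅度估计或最大似然估计的实现,窗长、均值估计方式等均为假设:

```python
import numpy as np

def build_noise_profile(noise_segments):
    """对前N个(无语音输入的)音频段做傅立叶变换,取幅度谱均值作为噪声谱估计。"""
    return np.mean([np.abs(np.fft.rfft(s)) for s in noise_segments], axis=0)

def denoise(segment, noise_mag):
    """对第N+1个及之后的音频段做谱减:幅度谱减去噪声谱估计,负值截断为0,保留相位后逆变换。"""
    spec = np.fft.rfft(segment)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(segment))
```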
又例如,在人机交互过程开启前,执行初始化操作,通过无语音输入的音频段来构建抑噪模型,并获取用于判断音频特征的初始预定阈值条件。其中,上述初始预定阈值条件可以但不限于根据前N个音频段的音频特征的平均值确定。
通过本申请提供的实施例,利用多个音频段中前N个音频段来实现人机交互的初始化操作,如构建抑噪模型,以对多个音频段进行抑噪处理,避免噪声对语音信号的干扰。如获取用于判断音频特征的初始预定阈值条件,以便于对多个音频段进行语音检测。
作为一种可选的方案,在提取每个音频段中的音频特征之前,还包括:
S1,采集待检测的音频信号,其中,在采集音频信号时对音频信号进行第一次量化;
S2,对采集到的音频信号进行第二次量化,其中,第二次量化的量化级小于第一次量化的量化级。
需要说明的是,在本实施例中,第一次量化可以但不限于在采集音频信号时进行;第二次量化可以但不限于在执行抑噪处理后进行。此外,在本实施例中,量化级越大,对干扰越敏感,也就是说,较小的干扰越容易对语音信号造成干扰,通过调整量化级进行二次量化,以实现对干扰的二次过滤的效果。
具体结合以下示例进行说明,例如,在第一次量化时,采用16比特,在第二次量化时,采用8比特,即[-128, 127]的范围;从而实现通过再次过滤,来准确区分语音信号与噪声。
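二次量化可示意如下。16比特到8比特的具体映射方式(右移8位并限幅)为假设,仅用于说明较小量化级会把低幅度噪声舍入到0附近、起到二次过滤作用:

```python
def requantize_16_to_8(samples):
    """将16比特量化值(-32768..32767)二次量化为8比特(-128..127)。

    幅度小于一个新量化间隔(256)的噪声样本在二次量化后趋于0,相当于被过滤。
    """
    return [max(-128, min(127, s >> 8)) for s in samples]
```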
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本发明并不受所描述的动作顺序的限制,因为依据本发明,某些步骤可以采用其他顺序 或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本发明所必须的。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到根据上述实施例的方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本发明的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,或者网络设备等)执行本发明各个实施例所述的方法。
实施例2
根据本发明实施例,还提供了一种用于实施上述语音检测方法的语音检测装置,如图10所示,该装置包括:
1)划分单元1002,设置为将待检测的音频信号划分为多个音频段;
2)提取单元1004,设置为提取每个音频段中的音频特征,其中,音频特征至少包括音频段的时域特征及频域特征;
3)检测单元1006,设置为根据音频段的音频特征从音频段中检测出目标语音段。
可选地,在本实施例中,上述语音检测装置可以但不限于应用于以下至少一种场景中:智能机器人聊天系统、自动问答系统、人机聊天软件等。也就是说,将本实施例中所提供的语音检测装置应用于人机交互过程中,通过提取音频段中至少包括音频段的时域特征及频域特征的音频特征,来准确检测出待检测的音频信号中所划分的多个音频段中的目标语音段,从而使用于人机交互的设备可以获知由目标语音段构成的语音段的起始时刻及终止时刻,以便于设备在获取所要表达的完整的语音信息后再进行准确答复。这里,在本实施例中,上述语音段可以包括但不限于:一个目标语音段或连续多个目标语音段。其中,每一个目标语音段包括该目标语音段的起始时刻及终止时刻。本实施例中对此不做任何限定。
需要说明的是,在本实施例中,人机交互设备通过将待检测的音频信号划分为多个音频段,并提取每个音频段中的音频特征,其中,音频特征至少包括音频段的时域特征及频域特征,从而实现融合音频段在不同域的多个特征来从上述多个音频段中准确检测出目标语音段,以降低音频段中的噪声信号对语音检测过程的干扰,达到提高语音检测的准确率的目的,进而克服相关技术中仅通过单个特征来检测语音的方式所导致的检测准确率较低的问题。
进一步,在准确检测出目标语音段的同时,还可以使人机交互设备快速实时地判断出由目标语音段构成的语音段的起始时刻及终止时刻,从而实现人机交互设备对检测获取到的语音信息进行准确实时地反应,达到人机自然交互的效果。此外,人机交互设备通过准确检测出目标语音段构成的语音段的起始时刻及终止时刻,还将实现提高人机交互效率的效果,进而克服相关技术中由交互人员通过按下控制按钮来触发启动人机交互过程所导致的人机交互效率较低的问题。
可选地,在本实施例中,上述音频特征可以包括但不限于以下至少之一:在时域的信号过零率、在时域的短时能量、在频域的谱平度、在时域的信号信息熵、自相关性系数、小波变换后信号、信号复杂度等。
需要说明的是,1)上述信号过零率可以但不限于用于去除一些脉冲噪声的干扰;2)上述短时能量可以但不限于用于衡量音频信号的幅度大小,配合一定的阈值以去除不相关人群说话语音的干扰;3)上述谱平度可以但不限于用于在频域内计算信号的频率分布特性,根据该特征的大小,以判断音频信号是否为背景高斯白噪声;4)上述信号时域信息熵可以但不限于用于度量音频信号在时域的分布特性,该特征用于区别语音信号和一般噪声。在本实施例中,通过在语音检测过程中融合上述在时域及频域的多个特征来抵抗脉冲和背景噪声的干扰,增强鲁棒性,以实现从待检测的音频信号所划分的多个音频段中准确检测出目标语音段,进而达到准确获取该目标语音段构成的语音段的起始时刻及终止时刻,以实现人机自然交互。
可选地,在本实施例中,根据音频段的音频特征从音频信号的多个音频段中检测出目标语音段的方式可以包括但不限于:判断音频段的音频特征是否满足预定阈值条件;在音频段的音频特征满足预定阈值条件时,则检测出该音频段为目标语音段。
需要说明的是,在本实施例中,在判断音频段的音频特征是否满足预定阈值条件时,可以按照以下至少一种顺序从多个音频段获取用于进行判断的当前音频段:1)按照音频信号的输入顺序;2)按照预定顺序。其中,上述预定顺序可以为随机顺序,也可以为按照预定原则排列的顺序,例如按照音频段的大小顺序。上述仅是一种示例,本实施例中对此不做任何限定。
此外,在本实施例中,上述预定阈值条件可以但不限于将根据变化的场景进行自适应更新调整。通过不断更新用于与音频特征进行比较的预定阈值条件,以保证在检测过程中根据不同场景准确从多个音频段中检测出目标语音段。进一步,对于音频段在多个域的多个特征,通过分别判断是否满足对应的预定阈值条件,以实现对音频段进行多次判断筛选,从而保证准确地检测出目标语音段。
可选地,在本实施例中,在按照音频信号的输入顺序从多个音频段中获取音频段,以判断音频段的音频特征是否满足预定阈值条件的情况下,根据音频段的音频特征从音频段中检测出目标语音段包括:重复执行以下步骤,直至当前音频段为多个音频段中的最后一个音频段,其中,当前音频段被初始化为多个音频段中的第一个音频段:
S1,判断当前音频段的音频特征是否满足预定阈值条件;
S2,在当前音频段的音频特征满足预定阈值条件时,则检测出当前音频段为目标语音段;
S3,在当前音频段的音频特征不满足预定阈值条件时,至少根据当前音频段的音频特征更新预定阈值条件,得到更新后的预定阈值条件;
S4,判断当前音频段是否为多个音频段中的最后一个音频段,若不是,则将当前音频段的下一个音频段作为当前音频段。
需要说明的是,在本实施例中,上述预定阈值条件可以但不限于至少根据当前音频段的音频特征更新,以得到更新后的预定阈值条件。也就是说,在更新上述预定阈值条件时,是根据当前音频段(历史音频段)的音频特征来确定下一个音频段所需的预定阈值条件,从而使对音频段的检测过程更加准确。
可选地,在本实施例中,上述装置还包括:
1)第一获取单元,设置为在将待检测的音频信号划分为多个音频段之后,获取多个音频段中前N个音频段,其中,N为大于1的整数;
2)构建单元,设置为根据前N个音频段构建抑噪模型,其中,抑噪模型用于对多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;
3)第二获取单元,设置为根据前N个音频段获取初始预定阈值条件。
需要说明的是,为了保证语音检测过程的准确率,在本实施例中将对多个音频段进行抑噪处理,以避免噪声对语音信号的干扰。例如,采用最小均方误差对数谱幅度估计方式来消除音频信号的背景噪声。
可选地,在本实施例中,上述前N个音频段可以但不限于为无语音输入的音频段。也就是说,在人机交互过程开启前,执行初始化操作,通过无语音输入的音频段来构建抑噪模型,并获取用于判断音频特征的初始预定阈值条件。其中,上述初始预定阈值条件可以但不限于根据前N个音频段的音频特征的平均值确定。
可选地,在本实施例中,在提取每个音频段中的音频特征之前,还包括:对采集到的待检测的音频信号进行二次量化,其中,第二次量化的量化级小于第一次量化的量化级。
需要说明的是,在本实施例中,第一次量化可以但不限于在采集音频信号时进行;第二次量化可以但不限于在执行抑噪处理后进行。此外,在本实施例中,量化级越大,对干扰越敏感,也就是说,在量化级较大时,由于量化间隔较小,因而较小的噪声信号也会被执行量化操作,这样量化后的结果既包括语音信号,也包括噪声信号,对语音信号检测造成了很大干扰。在本实施例中,通过调整量化级实现二次量化,即第二次量化的量化级小于第一次量化的量化级,从而实现对噪声信号进行二次过滤,以达到降低干扰的效果。
可选地,在本实施例中,将待检测的音频信号划分为多个音频段可以包括但不限于:通过定长窗口对设备采集到的音频信号进行采样。其中,在本实施例中,上述定长窗口的长度较小,例如,采用的窗口的长度为256(信号个数)。即,通过小窗口实现对音频信号的划分,从而实现实时地返回处理结果,以便完成语音信号的实时检测。
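按定长窗口划分采集到的音频信号可示意如下(窗口长度取256个采样点;对不足一个窗口的尾部直接丢弃,这一处理方式为假设):

```python
def split_into_segments(signal, win=256):
    """把音频信号按定长窗口划分为多个音频段,便于逐段实时返回处理结果。"""
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]
```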
通过本申请提供的实施例,通过将待检测的音频信号划分为多个音频段,并提取每个音频段中的音频特征,其中,音频特征至少包括音频段的时域特征及频域特征,从而实现融合音频段在不同域的多个特征来从上述多个音频段中准确检测出目标语音段,以降低音频段中的噪声信号对语音检测过程的干扰,达到提高语音检测的准确率的目的,进而克服相关技术中仅通过单个特征来检测语音的方式所导致的检测准确率较低的问题。
作为一种可选的方案,检测单元1006包括:
1)判断模块,设置为判断当前音频段的音频特征是否满足预定阈值条件,其中,当前音频段的音频特征包括:当前音频段在时域的信号过零率、当前音频段在时域的短时能量、当前音频段在频域的谱平度、当前音频段在时域的信号信息熵;
2)检测模块,设置为在当前音频段的音频特征满足预定阈值条件时,则检测出当前音频段为目标语音段。
可选地,在本实施例中,对N个音频段中的当前音频段x(i)的音频特征可以通过如下公式获取:
1)在时域的信号过零率(即短时过零率)计算:
Z = (1/2)∑_{i=1}^{N-1} |sgn[x(i)] - sgn[x(i-1)]|  (1)
其中sgn[]是符号函数:
sgn[x(i)] = 1(x(i)≥0);sgn[x(i)] = -1(x(i)<0)  (2)
2)在时域的短时能量计算:
E = ∑_{i=0}^{N-1} [x(i)·h(i)]^2  (3)
其中h(i)是窗口函数,当采用下列矩形窗函数时,即直接对信号平方求和:
h(i) = 1(0≤i≤N-1)  (4)
3)在频域的谱平度计算:
首先对音频段x(i),i=0,1,2…,N-1进行傅立叶变换得到频域幅度值大小f(i),i=0,1,2…,N-1;
根据以下公式计算谱平度,即幅度谱的几何平均值与算术平均值之比:
F = (∏_{i=0}^{N-1} f(i))^{1/N} / ((1/N)∑_{i=0}^{N-1} f(i))  (5)
4)在时域的信号信息熵计算:
首先计算信号绝对值归一化后的相对概率大小:
p(i) = |x(i)| / ∑_{j=0}^{N-1} |x(j)|  (6)
再根据以下公式计算信号信息熵:
H = -∑_{i=0}^{N-1} p(i)·log p(i)  (7)
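上述四个特征的计算可用如下代码示意(采用矩形窗;为避免对0取对数而加入的极小量是假设的数值处理,并非专利公式的一部分):

```python
import numpy as np

def audio_features(x):
    """返回当前音频段的(信号过零率, 短时能量, 谱平度, 信号信息熵)。"""
    x = np.asarray(x, dtype=float)
    sgn = np.where(x >= 0, 1, -1)                        # 符号函数
    zcr = 0.5 * np.sum(np.abs(sgn[1:] - sgn[:-1]))       # 短时过零率
    energy = float(np.sum(x ** 2))                       # 矩形窗短时能量
    f = np.abs(np.fft.fft(x)) + 1e-12                    # 频域幅度(加极小量避免对0取对数)
    flatness = float(np.exp(np.mean(np.log(f))) / np.mean(f))  # 几何均值/算术均值
    p = np.abs(x) / (np.sum(np.abs(x)) + 1e-12)          # 绝对值归一化后的相对概率
    entropy = float(-np.sum(p * np.log(p + 1e-12)))      # 信号信息熵
    return zcr, energy, flatness, entropy
```

语音信号频率分布集中,谱平度接近0;高斯白噪声频谱平坦,谱平度接近1,与上文的定性描述一致。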
具体结合以下示例进行说明,如图4示出了带有脉冲噪声的原始音频信号,中间带(横轴50000-150000之间的信号)有一些脉冲噪声,语音信号为最后一段(横轴230000-240000之间的信号);图5示出了对原始音频信号单独提取信号过零率的音频信号,可以看到根据信号过零率特征能够很好地区别出脉冲噪声,如中间带(横轴50000-150000之间的信号)的脉冲噪声可以被直接过滤,但是对于低能量的非脉冲噪声(横轴210000-220000之间的信号)却不能被区分出来;图6示出了对原始音频信号单独提取短时能量的音频信号,可以看到根据短时能量特征可以过滤低能量的非脉冲噪声(横轴210000-220000之间的信号),但是却无法区分中间带(横轴50000-150000之间的信号)的脉冲噪声(脉冲信号也有比较大的能量);图7示出了对原始音频信号提取谱平度和信号信息熵的音频信号,这两者能够把语音信号和脉冲噪声都检测出来,能够最大程度地保留所有类语音信号;此外,图8示出了采用本实施例中提供的方式,即在提取谱平度和信号信息熵的基础上结合短时能量及信号过零率(即上述四个特征)提取的音频信号,可以分辨脉冲噪声和其他低能量噪声的干扰,把实际的语音信号检测出来。由上述附图所示信号可知,本实施例中提取出的音频信号将更利于准确检测出目标语音段。
通过本申请提供的实施例,通过在语音检测过程中融合上述在时域及频域的多个特征来抵抗脉冲和背景噪声的干扰,增强鲁棒性,以实现从待检测的音频信号所划分的多个音频段中准确检测出目标语音段,进而达到准确获取该目标语音段对应的语音信号的起始时间及终止时间,实现人机之间的自然交互。
作为一种可选的方案,检测单元1006包括:
1)判断模块,设置为重复执行以下步骤,直至当前音频段为多个音频段中的最后一个音频段,其中,当前音频段被初始化为多个音频段中的第一个音频段:
S1,判断当前音频段的音频特征是否满足预定阈值条件;
S2,在当前音频段的音频特征满足预定阈值条件时,则检测出当前音频段为目标语音段;
S3,在当前音频段的音频特征不满足预定阈值条件时,至少根据当前音频段的音频特征更新预定阈值条件,得到更新后的预定阈值条件;
S4,判断当前音频段是否为多个音频段中的最后一个音频段,若不是,则将当前音频段的下一个音频段作为当前音频段。
可选地,在本实施例中,上述预定阈值条件可以但不限于将根据变化的场景进行自适应更新调整。其中,在本实施例中,在按照音频信号的输入顺序从多个音频段中获取音频段,以判断音频段的音频特征是否满足预定阈值条件的情况下,上述预定阈值条件可以但不限于至少根据当前音频段的音频特征更新。也就是说,在需要更新预定阈值条件时,基于当前音频段(历史音频段)获取下一个更新后的预定阈值条件。
需要说明的是,对于待检测的音频信号来说,包括多个音频段,上述判断过程将对各个音频段重复执行,直至遍历上述待检测音频信号所划分的多个音频段。即,直至当前音频段为多个音频段中的最后一个音频段。
通过本申请提供的实施例,通过不断更新用于与音频特征进行比较的预定阈值条件,以保证在检测过程中根据不同场景准确从多个音频段中检测出目标语音段。进一步,对于音频段在多个域的多个特征,通过分别判断是否满足对应的预定阈值条件,以实现对音频段进行多次判断筛选,从而保证检测出准确的目标语音段。
作为一种可选的方案,
1)判断模块包括:(1)判断子模块,设置为判断当前音频段在时域的信号过零率是否大于第一阈值;在当前音频段的信号过零率大于第一阈值时,判断当前音频段在时域的短时能量是否大于第二阈值;在当前音频段的短时能量大于第二阈值时,判断当前音频段在频域的谱平度是否小于第三阈值;在当前音频段在频域的谱平度小于第三阈值时,判断当前音频段在时域的信号信息熵是否小于第四阈值;
2)检测模块包括:(1)检测子模块,设置为在判断出当前音频段的信号信息熵小于第四阈值时,则检测出当前音频段为目标语音段。
可选地,在本实施例中,上述根据当前音频段在时域及频域的多个特征来检测目标语音段的过程可以但不限于在对音频信号进行第二次量化后执行。本实施例中对此不做任何限定。
需要说明的是,上述音频特征在语音检测过程的作用如下:
1)信号过零率:获取当前音频段在时域的信号过零率;该信号过零率表示一段音频信号中波形穿过零轴的次数,一般情况下,语音信号的过零率比非语音信号大;
2)短时能量:获取当前音频段在时域幅度上的时域能量;该短时能量信号用于在信号能量上区分非语音信号和语音信号;一般情况下,语音信号的短时能量大于非语音信号的短时能量;
3)谱平度:对当前音频段进行傅立叶变换并计算其谱平度;其中,语音信号的频率分布比较集中,对应的谱平度较小;高斯白噪声信号频率分布比较分散,对应的谱平度较大;
4)信号信息熵:对当前音频段进行归一化后计算信号信息熵;其中,语音信号分布比较集中,对应的信号信息熵小,非语音信号特别是高斯白噪声分布比较分散,对应的信号信息熵比较大。
具体结合图9所示示例进行说明:
S902,获取当前音频段的音频特征;
S904,判断当前音频段的信号过零率是否大于第一阈值,如果当前音频段的信号过零率大于第一阈值,则进行下一步操作;如果当前音频段的信号过零率小于等于第一阈值,那么当前音频段直接判定为非目标语音段;
S906,判断当前音频段的短时能量是否大于第二阈值,如果大于第二阈值,则进行下一步的判断;如果当前音频段的短时能量小于等于第二阈值,那么当前音频段直接判定为非目标语音段,并根据该当前音频段的短时能量更新第二阈值;
S908,判断当前音频段的谱平度是否小于第三阈值,如果小于第三阈值,则进行下一步的判断;如果当前音频段的谱平度大于等于第三阈值,那么当前音频段直接判定为非目标语音段,并根据该当前音频段的谱平度更新第三阈值;
S910,判断当前音频段的信号信息熵是否小于第四阈值,如果小于第四阈值,则进行下一步的判断;如果当前音频段的信号信息熵大于等于第四阈值,那么当前音频段直接判定为非目标语音段,并根据该当前音频段的信号信息熵更新第四阈值。
在执行完步骤S910后,在判断出上述四个特征均满足所对应的预定阈值条件时,则判定当前音频段为目标语音段。
通过本申请提供的实施例,通过融合音频段在不同域的多个特征来从上述多个音频段中准确检测出目标语音段,以降低音频段中的噪声信号对语音检测过程的干扰,达到提高语音检测的准确率的目的。
作为一种可选的方案,判断模块通过以下步骤实现至少根据当前音频段的音频特征更新预定阈值条件包括:
1)在当前音频段的短时能量小于等于第二阈值时,至少根据当前音频段的短时能量更新第二阈值;或者
2)在当前音频段的谱平度大于等于第三阈值时,至少根据当前音频段的谱平度更新第三阈值;或者
3)在当前音频段的信号信息熵大于等于第四阈值时,至少根据当前音频段的信号信息熵更新第四阈值。
可选地,在本实施例中,判断模块通过以下步骤实现至少根据当前音频段的音频特征更新预定阈值条件包括:
A=a×A'+(1-a)×B  (8)
其中,a表示衰减系数,在B表示当前音频段的短时能量时,A’表示第二阈值,A表示更新后的第二阈值;在B表示当前音频段的谱平度时,A’表示第三阈值,A表示更新后的第三阈值;在B表示当前音频段的信号信息熵时,A’表示第四阈值,A表示更新后的第四阈值。
也就是说,在更新上述预定阈值条件时,是根据当前音频段(历史音频段)的音频特征来确定下一个音频段所需的预定阈值条件,从而使对目标语音检测过程更加准确。
通过本申请提供的实施例,通过不断更新用于与音频特征进行比较的预定阈值条件,以保证在检测过程中根据不同场景准确从多个音频段中检测出目标语音段。
作为一种可选的方案,还包括:
1)确定单元,设置为在根据音频段的音频特征从音频段中检测出目标语音段之后,根据目标语音段在多个音频段中的位置确定目标语音段构成的连续语音段的起始时刻及终止时刻。
可选地,在本实施例中,上述语音段可以包括但不限于一个目标语音段,或连续多个目标语音段。其中,每一个目标语音段包括目标语音段的起始时刻,及目标语音段的终止时刻。
需要说明的是,在本实施例中,在从多个音频段检测出目标语音段的同时,即可根据目标语音段的时间标签,如目标语音段的起始时刻及目标语音段的终止时刻,来获取目标语音段构成的语音段的起始时刻及终止时刻。
可选地,在本实施例中,上述确定单元包括:
1)第一获取模块,设置为获取连续K个目标语音段中的第一个目标语音段的起始时刻,作为连续语音段的起始时刻;
2)第二获取模块,设置为在确认连续语音段的起始时刻后,获取在第K个目标语音段之后,连续M个非目标语音段中的第一个非目标语音段的起始时刻,作为连续语音段的终止时刻。
可选地,在本实施例中,上述K为大于等于1的整数,上述M可以根据不同场景设置为不同取值,本实施例中对此不做任何限定。
具体结合以下示例进行说明,假设从多个(例如,20个)音频段(假设每段时长均为T)中检测出的目标语音段包括:P1-P5,P7-P8,P10,P17-P20。进一步,假设M为5。
基于上述假设可知,前5个目标语音段连续,P5与P7之间包括一个非目标语音段(即P6),P8与P10之间包括一个非目标语音段(即P9),P10与P17之间包括6个非目标语音段(即P11-P16)。
根据前K个(即前5个)连续目标语音段可以确认:从待检测的音频信号中检测出一个包含语音信号的语音段A,其中,该语音段A的起始时刻为前5个目标语音段中的第一个目标语音段的起始时刻(即P1的起始时刻)。进一步,由于P5与P7之间非目标语音段的数量为1,即小于M(M=5);由于P8与P10之间非目标语音段的数量为1,即小于M(M=5),则可以判定在非目标语音段P6及非目标语音段P9时,上述语音段A并未终止。而由于P10与P17之间非目标语音段的数量为6,即大于M(M=5),即连续非目标语音段(P11-P16)的数量已满足M个的预设阈值,则可以判定上述语音段A在连续非目标语音段(即P11-P16)中的第一个非目标语音段的起始时刻(即P11的起始时刻)终止,则将P11的起始时刻作为语音段A的终止时刻。也就是说,语音段A的起始时刻为P1的起始时刻0,终止时刻为P11的起始时刻10T。
这里,需要说明的是,在本示例中,上述连续目标语音段P17-P20将用于判定下一个语音段B的检测过程。检测过程可以参照上述过程执行,本实施例中在此不再赘述。
此外,在本实施例中,可以但不限于实时获取待检测的音频信号,以便于检测音频信号中的音频段是否为目标语音段,从而达到准确检测出目标语音段构成的语音段的起始时刻及语音段的终止时刻,进而实现人机交互设备可以根据完整的语音段所要表达的语音信息后再进行准确答复,实现人机交互。需要说明的是,在实时获取待检测的音频信号的过程中,对于语音检测可以但不限于重复执行上述检测步骤。本实施例中在此不再赘述。
通过本申请提供的实施例,在准确检测出目标语音段的同时,还可以使人机交互设备快速实时地判断出目标语音段构成的语音段的起始时刻及终止时刻,从而实现人机交互设备对获取到的语音信息进行准确实时地反应,达到人机自然交互的效果。此外,人机交互设备通过准确检测出目标语音段所对应的语音信号的起始时间及终止时间,还将实现提高人机交互效率的效果,进而克服相关技术中由交互人员通过按下控制按钮来触发启动人机交互过程所导致的人机交互效率较低的问题。
作为一种可选的方案,还包括:
1)第一获取单元,设置为在将待检测的音频信号划分为多个音频段之后,获取多个音频段中前N个音频段,其中,N为大于1的整数;
2)构建单元,设置为根据前N个音频段构建抑噪模型,其中,抑噪模型用于对多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;
3)第二获取单元,设置为根据前N个音频段获取初始预定阈值条件。
例如,具体通过以下方式根据前N个音频段构建抑噪模型。假设音频信号包括纯净语音信号和独立的高斯白噪声,则可以通过以下方式来抑噪:对前N个音频段的背景噪声进行傅立叶变换,得到信号的频域信息;根据该背景噪声的频域信息,估计出噪声的频域对数谱特征,以构建抑噪模型。进一步,对第N+1个音频段及其之后的音频段,可以但不限于基于上述抑噪模型采用最大似然估计方法,实现对音频信号进行消除噪声处理。
又例如,在人机交互过程开启前,执行初始化操作,通过无语音输入的音频段来构建抑噪模型,并获取用于判断音频特征的初始预定阈值条件。其中,上述初始预定阈值条件可以但不限于根据前N个音频段的音频特征的平均值确定。
通过本申请提供的实施例,利用多个音频段中前N个音频段来实现人机交互的初始化操作,如构建抑噪模型,以对多个音频段进行抑噪处理,避免噪声对语音信号的干扰。如获取用于判断音频特征的初始预定阈值条件,以便于对多个音频段进行语音检测。
作为一种可选的方案,还包括:
1)采集单元,设置为在提取每个音频段中的音频特征之前,采集待检测的音频信号,其中,在采集音频信号时对音频信号进行第一次量化;
2)量化单元,设置为对采集到的音频信号进行第二次量化,其中,第二次量化的量化级小于第一次量化的量化级。
需要说明的是,在本实施例中,第一次量化可以但不限于在采集音频信号时进行;第二次量化可以但不限于在执行抑噪处理后进行。此外,在本实施例中,量化级越大,对干扰越敏感,也就是说,较小的干扰越容易对语音信号造成干扰,通过调整量化级进行二次量化,以实现对干扰的二次过滤的效果。
具体结合以下示例进行说明,例如,在第一次量化时,采用16比特,在第二次量化时,采用8比特,即[-128, 127]的范围;从而实现通过再次过滤,来准确区分语音信号与噪声。
实施例3
根据本发明实施例,还提供了一种用于实施上述语音检测方法的语音检测设备,如图11所示,该设备包括:
1)通讯接口1102,设置为获取待检测的音频信号;
2)处理器1104,与通讯接口1102连接,设置为将待检测的音频信号划分为多个音频段;还设置为提取每个音频段中的音频特征,其中,音频特征至少包括音频段的时域特征及频域特征;还设置为根据音频段的音频特征从音频段中检测出目标语音段;
3)存储器1106,与通讯接口1102及处理器1104连接,设置为存储音频信号中的多个音频段及目标语音段。
可选地,本实施例中的具体示例可以参考上述实施例1和实施例2中所描述的示例,本实施例在此不再赘述。
实施例4
本发明的实施例还提供了一种存储介质。可选地,在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
S1,将待检测的音频信号划分为多个音频段;
S2,提取每个音频段中的音频特征,其中,音频特征至少包括音频段的时域特征及频域特征;
S3,根据音频段的音频特征从音频段中检测出目标语音段。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:判断当前音频段的音频特征是否满足预定阈值条件,其中,当前音频段的音频特征包括:当前音频段在时域的信号过零率、当前音频段在时域的短时能量、当前音频段在频域的谱平度、当前音频段在时域的信号信息熵;在当前音频段的音频特征满足预定阈值条件时,则检测出当前音频段为目标语音段。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:根据音频段的音频特征从音频段中检测出目标语音段包括:重复执行以下步骤,直至当前音频段为多个音频段中的最后一个音频段,其中,当前音频段被初始化为多个音频段中的第一个音频段:判断当前音频段的音频特征是否满足预定阈值条件;在当前音频段的音频特征满足预定阈值条件时,则检测出当前音频段为目标语音段;在当前音频段的音频特征不满足预定阈值条件时,至少根据当前音频段的音频特征更新预定阈值条件,得到更新后的预定阈值条件;判断当前音频段是否为多个音频段中的最后一个音频段,若不是,则将当前音频段的下一个音频段作为当前音频段。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:判断当前音频段的音频特征是否满足预定阈值条件包括:判断当前音频段在时域的信号过零率是否大于第一阈值;在当前音频段的信号过零率大于第一阈值时,判断当前音频段在时域的短时能量是否大于第二阈值;在当前音频段的短时能量大于第二阈值时,判断当前音频段在频域的谱平度是否小于第三阈值;在当前音频段在频域的谱平度小于第三阈值时,判断当前音频段在时域的信号信息熵是否小于第四阈值;在当前音频段的音频特征满足预定阈值条件时,则检测出当前音频段为目标语音段包括:在判断出当前音频段的信号信息熵小于第四阈值时,则检测出当前音频段为目标语音段。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:在当前音频段的短时能量小于等于第二阈值时,至少根据当前音频段的短时能量更新第二阈值;或者在当前音频段的谱平度大于等于第三阈值时,至少根据当前音频段的谱平度更新第三阈值;或者在当前音频段的信号信息熵大于等于第四阈值时,至少根据当前音频段的信号信息熵更新第四阈值。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:
A=a×A'+(1-a)×B,
其中,a表示衰减系数,在B表示当前音频段的短时能量时,A’表示第二阈值,A表示更新后的第二阈值;在B表示当前音频段的谱平度时,A’表示第三阈值,A表示更新后的第三阈值;在B表示当前音频段的信号信息熵时,A’表示第四阈值,A表示更新后的第四阈值。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:在根据音频段的音频特征从音频段中检测出目标语音段之后,根据目标语音段在多个音频段中的位置确定目标语音段构成的连续语音段的起始时刻及终止时刻。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:获取连续K个目标语音段中的第一个目标语音段的起始时刻,作为连续语音段的起始时刻;在确认连续语音段的起始时刻后,获取在第K个目标语音段之后,连续M个非目标语音段中的第一个非目标语音段的起始时刻,作为连续语音段的终止时刻。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:在将待检测的音频信号划分为多个音频段之后,获取多个音频段中前N个音频段,其中,N为大于1的整数;根据前N个音频段构建抑噪模型,其中,抑噪模型用于对多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;根据前N个音频段获取初始预定阈值条件。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:在提取每个音频段中的音频特征之前,采集待检测的音频信号,其中,在采集音频信号时对音频信号进行第一次量化;对采集到的音频信号进行第二次量化,其中,第二次量化的量化级小于第一次量化的量化级。
可选地,在本实施例中,存储介质还被设置为存储用于执行以下步骤的程序代码:在对采集到的音频信号进行第二次量化之前,对采集到的音频信号进行抑噪处理。
可选地,在本实施例中,上述存储介质可以包括但不限于:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
可选地,本实施例中的具体示例可以参考上述实施例1和实施例2中所描述的示例,本实施例在此不再赘述。
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。
上述实施例中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在上述计算机可读取的存储介质中。基于这样的理解,本发明的技术方案本质上或者说对相关技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在存储介质中,包括若干指令用以使得一台或多台计算机设备(可为个人计算机、服务器或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。
在本发明的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的客户端,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
以上所述仅是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本发明的保护范围。
工业实用性
在本发明实施例中,通过将待检测的音频信号划分为多个音频段,并提取每个音频段中的音频特征,其中,音频特征至少包括音频段的时域特征及频域特征,从而实现融合音频段在不同域的多个特征来从上述多个音频段中准确检测出目标语音段,以降低音频段中的噪声信号对语音检测过程的干扰,达到提高检测语音准确率的目的,进而克服相关技术中仅通过单个特征来检测语音的方式所导致的检测准确率较低的问题。

Claims (33)

  1. 一种语音检测方法,包括:
    将待检测的音频信号划分为多个音频段;
    提取每个所述音频段中的音频特征,其中,所述音频特征至少包括所述音频段的时域特征及频域特征;
    根据所述音频段的所述音频特征从所述音频段中检测出目标语音段。
  2. 根据权利要求1所述的方法,其中,根据所述音频段的所述音频特征从所述音频段中检测出所述目标语音段包括:
    判断当前音频段的音频特征是否满足预定阈值条件,其中,所述当前音频段的音频特征包括:所述当前音频段在时域的信号过零率、所述当前音频段在时域的短时能量、所述当前音频段在频域的谱平度、所述当前音频段在时域的信号信息熵;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段。
  3. 根据权利要求1所述的方法,其中,根据所述音频段的所述音频特征从所述音频段中检测出所述目标语音段包括:重复执行以下步骤,直至当前音频段为所述多个音频段中的最后一个音频段,其中,所述当前音频段被初始化为所述多个音频段中的第一个音频段:
    判断所述当前音频段的音频特征是否满足预定阈值条件;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段;
    在所述当前音频段的音频特征不满足所述预定阈值条件时,至少根据所述当前音频段的音频特征更新所述预定阈值条件,得到更新后的所述预定阈值条件;
    判断所述当前音频段是否为所述多个音频段中的最后一个音频段,若不是,则将所述当前音频段的下一个音频段作为所述当前音频段。
  4. 根据权利要求2或3所述的方法,其中,
    判断所述当前音频段的音频特征是否满足所述预定阈值条件包括:判断所述当前音频段在时域的信号过零率是否大于第一阈值;在所述当前音频段的所述信号过零率大于所述第一阈值时,判断所述当前音频段在时域的短时能量是否大于第二阈值;在所述当前音频段的所述短时能量大于所述第二阈值时,判断所述当前音频段在频域的谱平度是否小于第三阈值;在所述当前音频段在频域的所述谱平度小于所述第三阈值时,判断所述当前音频段在时域的信号信息熵是否小于第四阈值;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段包括:在判断出所述当前音频段的所述信号信息熵小于所述第四阈值时,则检测出所述当前音频段为所述目标语音段。
  5. 根据权利要求4所述的方法,其中,至少根据所述当前音频段的音频特征更新所述预定阈值条件包括:
    在所述当前音频段的所述短时能量小于等于所述第二阈值时,至少根据所述当前音频段的所述短时能量更新所述第二阈值;或者
    在所述当前音频段的所述谱平度大于等于所述第三阈值时,至少根据所述当前音频段的所述谱平度更新所述第三阈值;或者
    在所述当前音频段的所述信号信息熵大于等于所述第四阈值时,至少根据所述当前音频段的所述信号信息熵更新所述第四阈值。
  6. 根据权利要求5所述的方法,其中,至少根据所述当前音频段的音频特征更新所述预定阈值条件包括:
    A=a×A'+(1-a)×B,
    其中,所述a表示衰减系数,在所述B表示所述当前音频段的所述短时能量时,所述A’表示所述第二阈值,所述A表示更新后的所述第二阈值;在所述B表示所述当前音频段的所述谱平度时,所述A’表示所述第三阈值,所述A表示更新后的所述第三阈值;在所述B表示所述当前音频段的所述信号信息熵时,所述A’表示所述第四阈值,所述A表示更新后的所述第四阈值。
  7. 根据权利要求1所述的方法,其中,在根据所述音频段的所述音频特征从所述音频段中检测出目标语音段之后,还包括:
    根据所述目标语音段在所述多个音频段中的位置确定所述目标语音段构成的连续语音段的起始时刻及终止时刻。
  8. 根据权利要求7所述的方法,其中,所述根据所述目标语音段在所述多个音频段中的位置确定所述目标语音段构成的连续语音段的起始时刻及终止时刻包括:
    获取连续K个所述目标语音段中的第一个目标语音段的起始时刻,作为所述连续语音段的所述起始时刻;
    在确认所述连续语音段的起始时刻后,获取在第K个目标语音段之后,连续M个非目标语音段中的第一个非目标语音段的起始时刻,作为所述连续语音段的所述终止时刻。
  9. 根据权利要求2或3所述的方法,其中,在将待检测的所述音频信号划分为所述多个音频段之后,还包括:
    获取所述多个音频段中前N个音频段,其中,所述N为大于1的整数;
    根据所述前N个音频段构建抑噪模型,其中,所述抑噪模型用于对所述多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;
    根据所述前N个音频段获取初始预定阈值条件。
  10. 根据权利要求1所述的方法,其中,在提取每个所述音频段中的音频特征之前,还包括:
    采集待检测的所述音频信号,其中,在采集所述音频信号时对所述音频信号进行第一次量化;
    对采集到的所述音频信号进行第二次量化,其中,所述第二次量化的量化级小于所述第一次量化的量化级。
  11. 根据权利要求10所述的方法,其中,在所述对采集到的所述音频信号进行第二次量化之前,还包括:
    对所述采集到的所述音频信号进行抑噪处理。
  12. 一种语音检测装置,包括:
    划分单元,设置为将待检测的音频信号划分为多个音频段;
    提取单元,设置为提取每个所述音频段中的音频特征,其中,所述音频特征至少包括所述音频段的时域特征及频域特征;
    检测单元,设置为根据所述音频段的所述音频特征从所述音频段中检测出目标语音段。
  13. 根据权利要求12所述的装置,其中,所述检测单元包括:
    判断模块,设置为判断当前音频段的音频特征是否满足预定阈值条件,其中,所述当前音频段的音频特征包括:所述当前音频段在时域的信号过零率、所述当前音频段在时域的短时能量、所述当前音频段在频域的谱平度、所述当前音频段在时域的信号信息熵;
    检测模块,设置为在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段。
  14. 根据权利要求12所述的装置,其中,所述检测单元包括:
    判断模块,设置为重复执行以下步骤,直至当前音频段为所述多个音频段中的最后一个音频段,其中,所述当前音频段被初始化为所述多个音频段中的第一个音频段:
    判断所述当前音频段的音频特征是否满足预定阈值条件;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段;
    在所述当前音频段的音频特征不满足所述预定阈值条件时,至少根据所述当前音频段的音频特征更新所述预定阈值条件,得到更新后的所述预定阈值条件;
    判断所述当前音频段是否为所述多个音频段中的最后一个音频段,若不是,则将所述当前音频段的下一个音频段作为所述当前音频段。
  15. 根据权利要求13或14所述的装置,其中,
    所述判断模块包括:判断子模块,设置为判断所述当前音频段在时域的信号过零率是否大于第一阈值;在所述当前音频段的所述信号过零率大于所述第一阈值时,判断所述当前音频段在时域的短时能量是否大于第二阈值;在所述当前音频段的所述短时能量大于所述第二阈值时,判断所述当前音频段在频域的谱平度是否小于第三阈值;在所述当前音频段在频域的所述谱平度小于所述第三阈值时,判断所述当前音频段在时域的信号信息熵是否小于第四阈值;
    所述检测模块包括:检测子模块,设置为在判断出所述当前音频段的所述信号信息熵小于所述第四阈值时,则检测出所述当前音频段为所述目标语音段。
  16. 根据权利要求15所述的装置,其中,所述判断模块通过以下步骤实现至少根据所述当前音频段的音频特征更新所述预定阈值条件:
    在所述当前音频段的所述短时能量小于等于所述第二阈值时,至少根据所述当前音频段的所述短时能量更新所述第二阈值;或者
    在所述当前音频段的所述谱平度大于等于所述第三阈值时,至少根据所述当前音频段的所述谱平度更新所述第三阈值;或者
    在所述当前音频段的所述信号信息熵大于等于所述第四阈值时,至少根据所述当前音频段的所述信号信息熵更新所述第四阈值。
  17. 根据权利要求16所述的装置,其中,所述判断模块通过以下步骤实现至少根据所述当前音频段的音频特征更新所述预定阈值条件:
    A=a×A'+(1-a)×B,
    其中,所述a表示衰减系数,在所述B表示所述当前音频段的所述短时能量时,所述A’表示所述第二阈值,所述A表示更新后的所述第二阈值;在所述B表示所述当前音频段的所述谱平度时,所述A’表示所述第三阈值,所述A表示更新后的所述第三阈值;在所述B表示所述当前音频段的所述信号信息熵时,所述A’表示所述第四阈值,所述A表示更新后的所述第四阈值。
  18. 根据权利要求12所述的装置,其中,还包括:
    确定单元,设置为在根据所述音频段的所述音频特征从所述音频段中检测出目标语音段之后,根据所述目标语音段在所述多个音频段中的位置确定所述目标语音段构成的连续语音段的起始时刻及终止时刻。
  19. 根据权利要求18所述的装置,其中,所述确定单元包括:
    第一获取模块,设置为获取连续K个所述目标语音段中的第一个目标语音段的起始时刻,作为所述连续语音段的所述起始时刻;
    第二获取模块,设置为在确认所述连续语音段的起始时刻后,获取在第K个目标语音段之后,连续M个非目标语音段中的第一个非目标语音段的起始时刻,作为所述连续语音段的所述终止时刻。
  20. 根据权利要求13或14所述的装置,其中,还包括:
    第一获取单元,设置为在将待检测的所述音频信号划分为所述多个音频段之后,获取所述多个音频段中前N个音频段,其中,所述N为大于1的整数;
    构建单元,设置为根据所述前N个音频段构建抑噪模型,其中,所述抑噪模型用于对所述多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;
    第二获取单元,设置为根据所述前N个音频段获取初始预定阈值条件。
  21. 根据权利要求12所述的装置,其中,还包括:
    采集单元,设置为在提取每个所述音频段中的音频特征之前,采集待检测的所述音频信号,其中,在采集所述音频信号时对所述音频信号进行第一次量化;
    量化单元,设置为对采集到的所述音频信号进行第二次量化,其中,所述第二次量化的量化级小于所述第一次量化的量化级。
  22. 根据权利要求21所述的装置,其中,还包括:
    抑噪单元,设置为在所述对采集到的所述音频信号进行第二次量化之前,对所述采集到的所述音频信号进行抑噪处理。
  23. 一种存储介质,所述存储介质被设置为存储用于执行以下步骤的程序代码:
    将待检测的音频信号划分为多个音频段;
    提取每个所述音频段中的音频特征,其中,所述音频特征至少包括所述音频段的时域特征及频域特征;
    根据所述音频段的所述音频特征从所述音频段中检测出目标语音段。
  24. 根据权利要求23所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    判断当前音频段的音频特征是否满足预定阈值条件,其中,所述当前音频段的音频特征包括:所述当前音频段在时域的信号过零率、所述当前音频段在时域的短时能量、所述当前音频段在频域的谱平度、所述当前音频段在时域的信号信息熵;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段。
  25. 根据权利要求23所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    根据所述音频段的所述音频特征从所述音频段中检测出所述目标语音段包括:重复执行以下步骤,直至当前音频段为所述多个音频段中的最后一个音频段,其中,所述当前音频段被初始化为所述多个音频段中的第一个音频段:
    判断所述当前音频段的音频特征是否满足预定阈值条件;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段;
    在所述当前音频段的音频特征不满足所述预定阈值条件时,至少根据所述当前音频段的音频特征更新所述预定阈值条件,得到更新后的所述预定阈值条件;
    判断所述当前音频段是否为所述多个音频段中的最后一个音频段,若不是,则将所述当前音频段的下一个音频段作为所述当前音频段。
  26. 根据权利要求24或25所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    判断所述当前音频段的音频特征是否满足所述预定阈值条件包括:判断所述当前音频段在时域的信号过零率是否大于第一阈值;在所述当前音频段的所述信号过零率大于所述第一阈值时,判断所述当前音频段在时域的短时能量是否大于第二阈值;在所述当前音频段的所述短时能量大于所述第二阈值时,判断所述当前音频段在频域的谱平度是否小于第三阈值;在所述当前音频段在频域的所述谱平度小于所述第三阈值时,判断所述当前音频段在时域的信号信息熵是否小于第四阈值;
    在所述当前音频段的音频特征满足所述预定阈值条件时,则检测出所述当前音频段为所述目标语音段包括:在判断出所述当前音频段的所述信号信息熵小于所述第四阈值时,则检测出所述当前音频段为所述目标语音段。
  27. 根据权利要求26所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    在所述当前音频段的所述短时能量小于等于所述第二阈值时,至少根据所述当前音频段的所述短时能量更新所述第二阈值;或者
    在所述当前音频段的所述谱平度大于等于所述第三阈值时,至少根据所述当前音频段的所述谱平度更新所述第三阈值;或者
    在所述当前音频段的所述信号信息熵大于等于所述第四阈值时,至少根据所述当前音频段的所述信号信息熵更新所述第四阈值。
  28. 根据权利要求27所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    A=a×A'+(1-a)×B,
    其中,所述a表示衰减系数,在所述B表示所述当前音频段的所述短时能量时,所述A’表示所述第二阈值,所述A表示更新后的所述第二阈值;在所述B表示所述当前音频段的所述谱平度时,所述A’表示所述第三阈值,所述A表示更新后的所述第三阈值;在所述B表示所述当前音频段的所述信号信息熵时,所述A’表示所述第四阈值,所述A表示更新后的所述第四阈值。
  29. 根据权利要求23所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    在根据所述音频段的所述音频特征从所述音频段中检测出目标语音段之后,根据所述目标语音段在所述多个音频段中的位置确定所述目标语音段构成的连续语音段的起始时刻及终止时刻。
  30. 根据权利要求29所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    获取连续K个所述目标语音段中的第一个目标语音段的起始时刻,作为所述连续语音段的所述起始时刻;
    在确认所述连续语音段的起始时刻后,获取在第K个目标语音段之后,连续M个非目标语音段中的第一个非目标语音段的起始时刻,作为所述连续语音段的所述终止时刻。
  31. 根据权利要求24或25所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    在将待检测的所述音频信号划分为所述多个音频段之后,获取所述多个音频段中前N个音频段,其中,所述N为大于1的整数;
    根据所述前N个音频段构建抑噪模型,其中,所述抑噪模型用于对所述多个音频段中第N+1个音频段及其之后的音频段进行抑噪处理;
    根据所述前N个音频段获取初始预定阈值条件。
  32. 根据权利要求23所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    在提取每个所述音频段中的音频特征之前,采集待检测的所述音频信号,其中,在采集所述音频信号时对所述音频信号进行第一次量化;
    对采集到的所述音频信号进行第二次量化,其中,所述第二次量化的量化级小于所述第一次量化的量化级。
  33. 根据权利要求32所述的存储介质,其中,所述存储介质还被设置为存储用于执行以下步骤的程序代码:
    在所述对采集到的所述音频信号进行第二次量化之前,对所述采集到的所述音频信号进行抑噪处理。
PCT/CN2017/074798 2016-04-22 2017-02-24 语音检测方法、装置及存储介质 WO2017181772A1 (zh)
