WO2021016925A1 - 音频处理方法及装置 (Audio processing method and device) - Google Patents

音频处理方法及装置 (Audio processing method and device)

Info

Publication number: WO2021016925A1
Authority: WIPO (PCT)
Prior art keywords: audio, processing, result, audio segment, segment
Application number: PCT/CN2019/098613
Other languages: English (en), French (fr)
Inventors: 吴俊峰, 周事成
Original Assignee: 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2019/098613
Priority to CN201980033584.3A
Publication of WO2021016925A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Description (translated)

  • This application relates to the field of audio technology, and in particular to an audio processing method and device.
  • In an audio processing process (for example, a speech recognition process), voice endpoint detection is usually required, that is, extracting the user's voice signal from the audio signal.
  • In the prior art, voice endpoint detection is usually performed through voice activity detection (VAD). For example, in a speech recognition process, voice activity detection first intercepts an audio segment of the audio signal that may include a voice signal; this segment is the result of voice endpoint detection, and speech recognition is then performed on it to obtain the speech recognition result.
  • However, audio segments intercepted based on voice activity detection usually include speech-like noise or high-energy non-speech noise, which leads to inaccurate audio processing results.
  • The embodiments of the present application provide an audio processing method and device, which are used to solve the prior-art problem that audio segments intercepted based on voice activity detection usually include speech-like noise or high-energy non-speech noise, resulting in inaccurate audio processing results.
  • In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from an audio signal based on a voice activity detection method; and using a sliding window method to perform target processing on the audio segment to obtain a processing result of the audio segment.
  • In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory. The memory is used to store program code; the processor calls the program code and, when the program code is executed, performs the following operations: intercepting an audio segment from an audio signal based on a voice activity detection method; and using a sliding window method to perform target processing on the audio segment to obtain a processing result of the audio segment.
  • In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program. The computer program includes at least one piece of code that can be executed by a computer to control the computer to execute the audio processing method according to any one of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer program that, when executed by a computer, implements the audio processing method according to any one of the first aspect.
  • The embodiments of the present application provide an audio processing method and device: an audio segment is intercepted from an audio signal by a voice activity detection method, and the segment is processed by a sliding window method to obtain its processing result. Since the sliding window method can exclude, in one or more windows, the noise included in the audio segment intercepted by voice activity detection, performing the target processing on the segment with the sliding window method avoids the influence of noise in the segment and thereby improves the accuracy of audio processing.
  • FIG. 1A is a first schematic diagram of an application scenario of an audio processing method provided by an embodiment of this application;
  • FIG. 1B is a second schematic diagram of an application scenario of an audio processing method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of this application;
  • FIG. 3A-3C are schematic diagrams of audio sub-segments excluding the noise of an audio segment, provided by an embodiment of this application;
  • FIG. 4 is a schematic flowchart of a voice activity detection method provided by an embodiment of this application;
  • FIG. 5 is a schematic flowchart of an audio processing method provided by another embodiment of this application;
  • FIG. 6 is a schematic flowchart of an audio processing method provided by yet another embodiment of this application;
  • FIG. 7A-7D are schematic diagrams of intercepting a current sub-segment of an audio segment, provided by an embodiment of this application;
  • FIG. 8 is a schematic flowchart of an audio processing method provided by yet another embodiment of this application;
  • FIG. 9 is a schematic structural diagram of an audio processing device provided by an embodiment of this application.
  • The audio processing method provided in the embodiments of the present application can be applied to any audio processing process that requires voice endpoint detection, and the method can specifically be executed by an audio processing device.
  • The audio processing device may be a device including an audio collection module (for example, a microphone). Accordingly, an application scenario of the audio processing method may be as shown in FIG. 1A: the audio collection module of the audio processing device collects the voice spoken by the user to obtain an audio signal, and the processor of the audio processing device processes the collected audio signal using the audio processing method provided in the embodiments of the present application.
  • It should be noted that FIG. 1A is only a schematic diagram and does not limit the structure of the audio processing device. For example, an amplifier may be connected between the microphone and the processor to amplify the audio signal collected by the microphone; for another example, a filter may be connected between them to filter that signal.
  • Alternatively, the audio processing device may be a device that does not include an audio collection module. Accordingly, the application scenario may be as shown in FIG. 1B: the communication interface of the audio processing device receives audio signals collected by other devices or equipment, and the processor of the audio processing device processes the received audio signals using the audio processing method provided in the embodiments of the present application.
  • It should be noted that FIG. 1B is only a schematic diagram and does not limit the structure of the audio processing device or the connection between the audio processing device and other devices or equipment; for example, the communication interface in the audio processing device may be replaced with a transceiver.
  • The type of equipment that includes the audio processing device is not limited in the embodiments of the present application; the equipment may be, for example, a smart speaker, a smart lighting device, a smart robot, a mobile phone, or a tablet computer.
  • The audio processing method provided by the embodiments of the present application combines voice activity detection with a sliding window method to improve the accuracy of the audio processing result. Specifically, after voice activity detection intercepts an audio segment, the segment is further processed by the sliding window method. Since the sliding window method can exclude, in one or more windows, the noise included in the segment, performing the target processing on the segment with the sliding window method avoids the influence of noise in the segment and thereby improves the accuracy of audio processing.
  • FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of this application. The execution subject of this embodiment may be an audio processing device, specifically the processor of the audio processing device. As shown in FIG. 2, the method of this embodiment may include:
  • Step 201: An audio segment is intercepted from an audio signal based on the voice activity detection method.
  • In this step, the specific algorithm used for the interception is not limited in this application; for example, one or more of an energy and zero-crossing-rate dual-threshold algorithm, a noise-speech classification model algorithm, a variance method, a spectral distance method, or a spectral entropy method may be used. Taking the dual-threshold algorithm as an example, as shown in FIG. 4, step 201 may include the following steps 401 to 404.
  • Step 401: First divide the audio signal into frames, then calculate the short-time average energy of each frame, frame by frame, to obtain the short-time energy envelope.
  • The audio signal can be divided into frames of a fixed duration. Taking a fixed duration of 1 second as an example, the 0th to the 1st second of the audio signal is divided into one frame, the 1st to the 2nd second into one frame, and so on, thus completing the framing of the audio signal.
  • The short-time average energy of a frame is the average energy of the frame, and connecting the per-frame energies yields the short-time energy envelope.
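  • A minimal sketch of step 401 in Python is given below, assuming NumPy, a mono signal stored as an array, and a frame length given in samples; the function name and the float64 accumulation are illustrative choices, not part of the patent.

```python
import numpy as np

def short_time_energy(signal: np.ndarray, frame_len: int) -> np.ndarray:
    """Step 401: split the signal into fixed-length frames and return the
    average energy of each frame, i.e. the short-time energy envelope."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames.astype(np.float64) ** 2, axis=1)
```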
  • Step 402: Select a higher threshold T1 and mark the first and last intersection points of T1 with the short-time energy envelope as C and D.
  • The part of the envelope above T1 is considered to have a high probability of being speech.
  • Step 403: Select a lower threshold T2 (with T2 less than T1) whose intersection point B with the short-time energy envelope lies to the left of C and whose intersection point E lies to the right of D.
  • Step 404: Calculate the short-time average zero-crossing rate of each frame and select a threshold T3 whose intersection point A with the zero-crossing-rate curve lies to the left of B and whose intersection point F lies to the right of E.
  • The short-time average zero-crossing rate of a frame is the average zero-crossing rate of the frame.
  • At this point, intersection points A and F are the two endpoints of the audio signal determined based on the voice activity detection method, and the audio segment from intersection point A to intersection point F is the audio segment intercepted from the audio signal by that method.
  • It should be noted that steps 401 to 404 take intercepting a single audio segment from an audio signal as an example; it is understandable that multiple audio segments may also be intercepted from one audio signal, and this application does not limit this.
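  • Building on short_time_energy above, the following hedged sketch shows one plausible reading of steps 402 to 404: expand outwards from the region above T1, first while the energy stays above T2 and then while the zero-crossing rate stays above T3. The scanning strategy and all names are assumptions for illustration; threshold values would be tuned in practice.

```python
def zero_crossing_rate(signal: np.ndarray, frame_len: int) -> np.ndarray:
    """Short-time average zero-crossing rate of each frame (step 404)."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def dual_threshold_vad(signal, frame_len, t1, t2, t3):
    """Return (A, F), the frame indices of the two segment endpoints in
    FIG. 4, or None if no frame energy exceeds T1."""
    energy = short_time_energy(signal, frame_len)
    zcr = zero_crossing_rate(signal, frame_len)
    above = np.where(energy > t1)[0]
    if above.size == 0:
        return None
    c, d = above[0], above[-1]              # C and D: first/last crossings of T1
    b, e = c, d
    while b > 0 and energy[b - 1] > t2:     # widen to B and E using the lower threshold T2
        b -= 1
    while e < len(energy) - 1 and energy[e + 1] > t2:
        e += 1
    a, f = b, e
    while a > 0 and zcr[a - 1] > t3:        # widen to A and F while the ZCR stays above T3
        a -= 1
    while f < len(zcr) - 1 and zcr[f + 1] > t3:
        f += 1
    return a, f                             # the intercepted segment spans frames A..F
```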
  • Step 202: Use a sliding window method to perform target processing on the audio segment to obtain a processing result of the audio segment.
  • In this step, using the sliding window method to perform target processing on an audio segment specifically refers to using a sliding window to intercept audio sub-segments of the audio segment and performing the target processing on those sub-segments.
  • The sliding window method can exclude, in one or more windows, the noise included in the audio segment intercepted by voice activity detection. For example, as shown in FIG. 3A, if the beginning of the audio segment includes noise, the audio sub-segment X1 excludes that noise; as shown in FIG. 3B, if the middle of the segment includes noise, the sub-segment X2 excludes it; and as shown in FIG. 3C, if the end of the segment includes noise, the sub-segment X3 excludes it. It should be noted that the cross-hatched parts in FIGS. 3A to 3C represent noise.
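  • A small sketch of how a sliding window enumerates candidate sub-segments such as X1 to X3 (window length and step are in samples; both the values and the names are illustrative):

```python
def sliding_subsegments(segment, win_len: int, step: int):
    """Yield successive windows of the audio segment; a window that falls
    outside a noisy region (FIGS. 3A-3C) can be processed cleanly even
    when the segment as a whole cannot."""
    for start in range(0, len(segment) - win_len + 1, step):
        yield segment[start:start + win_len]
```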
  • It can be understood that, in step 202, the processing result of an audio segment may specifically be the processing result obtained by performing the target processing on an audio sub-segment of that segment.
  • The target processing may specifically be any type of processing that can further be performed after voice endpoint detection based on the voice activity detection method. Optionally, it may include any one of the following: speech recognition processing, voice matching processing, or voice endpoint detection processing. Speech recognition processing may refer to recognizing the text corresponding to the voice signal in the audio segment; voice matching processing may refer to determining the target voice that matches the audio segment.
  • Optionally, for the audio segments intercepted in step 201, the processing result may always be obtained through step 202, or step 202 may be applied selectively, depending on actual needs. The actual needs may be, for example, saving computing resources, realizing particular functions, or simplifying the design.
  • For example, when the target processing is voice endpoint detection processing, in order to improve the accuracy of the endpoint detection result (a function-realization need), step 202 can be applied to every audio segment intercepted in step 201 to determine a more accurate voice endpoint detection result.
  • For another example, since the sliding window method consumes considerable computing resources, when the target processing is speech recognition processing or voice matching processing, the audio segments intercepted in step 201 may first undergo preset processing, and whether to perform step 202 is decided according to the result of that preset processing. The preset processing may be, for example, a duration determination (whether to perform step 202 is then decided according to the determined duration of the segment) or a feature extraction (the decision is made according to the extracted audio features).
  • For yet another example, when the target processing is speech recognition processing or voice matching processing, in order to simplify the design, and provided the functional requirements can be met and saving computing resources is not a concern, step 202 may likewise be applied to every audio segment intercepted in step 201.
  • Optionally, when step 202 uses the sliding window method to perform target processing on an audio segment, all audio sub-segments of the segment may be processed, or whether to process the next audio sub-segment may be decided according to the processing result of the sub-segment currently being processed (that is, the current sub-segment), depending on implementation requirements.
  • For example, take speech recognition and assume the requirement is to recognize the text in the audio segment. Suppose six audio sub-segments are intercepted from an audio segment by the sliding window method, and the speech recognition results of the first to sixth sub-segments are noise, "开", "开启", "启照", "照明" and noise, respectively (overlapping fragments of the phrase "开启照明", "turn on the lighting"); the speech recognition result of the audio segment can then be "开启照明". A recognition result of noise may indicate that the speech was not successfully recognized.
  • For another example, take speech recognition and assume the requirement is to match a preset keyword. Suppose six audio sub-segments are intercepted from an audio segment by the sliding window method. If the speech recognition result of the first sub-segment is noise, speech recognition is performed on the second sub-segment; further assuming its result is "请开" ("please turn on") and does not match the preset keyword, speech recognition is performed on the third sub-segment; further assuming its result is "开启" ("turn on") and matches the preset keyword, that result can be used as the recognition result of the audio segment, and the fourth to sixth sub-segments are not subjected to speech recognition processing, as sketched in the early-stopping loop below.
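  • The keyword-matching example above can be sketched as follows, reusing sliding_subsegments from the earlier sketch; recognize stands in for the pre-trained recognition model (assumed to return text, or None for noise), and the keyword set is illustrative:

```python
PRESET_KEYWORDS = {"开启"}   # illustrative keyword set

def recognize_with_early_stop(segment, win_len, step, recognize):
    """Recognize sub-segments one by one and stop at the first result that
    matches a preset keyword, so later sub-segments (the 4th to 6th in the
    example above) are never processed."""
    for sub in sliding_subsegments(segment, win_len, step):
        text = recognize(sub)
        if text is not None and text in PRESET_KEYWORDS:
            return text
    return None
```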
  • In this embodiment, an audio segment is intercepted from an audio signal based on a voice activity detection method, and a sliding window method is used to perform target processing on the segment to obtain its processing result. Since the sliding window method can exclude, in one or more windows, the noise included in the segment intercepted by voice activity detection, performing the target processing with the sliding window method avoids the influence of noise in the segment and thereby improves the accuracy of audio processing.
  • In addition, in the method provided by this embodiment, the voice activity detection method has already removed the sub-segments of the audio signal that contain no voice signal, and it consumes far fewer computing resources than the sliding window method; therefore, applying the sliding window method after voice activity detection consumes fewer computing resources than applying the sliding window method to the audio signal directly.
  • FIG. 5 is a schematic flowchart of an audio processing method provided by another embodiment of this application. On the basis of the embodiment shown in FIG. 2, this embodiment describes the audio processing method in detail, taking the case where the preset processing is the target processing as an example. As shown in FIG. 5, the method of this embodiment may include:
  • Step 501: An audio segment is intercepted from an audio signal based on the voice activity detection method.
  • It should be noted that step 501 is similar to step 201 and will not be repeated here.
  • Step 502: Taking the audio segment as the processing unit, perform the target processing on the audio segment to obtain the processing result of the audio segment.
  • In this step, taking the audio segment as the processing unit may refer to treating the entire audio segment as a single object to be processed.
  • The target processing may include extracting the audio features of the audio segment and decoding those features using a pre-trained model, where the decoding, for example Viterbi decoding, is used to obtain the processing result of the audio segment.
  • The specific type of the audio features is not limited in this application; optionally, the audio features may include one or more of Mel-frequency cepstrum coefficient (MFCC) features, linear prediction coefficient (LPC) features, and filter bank (Fbank) features.
  • The specific type of the model is likewise not limited; optionally, the model may include one or more of a Gaussian mixture model-hidden Markov model (GMM-HMM), a deep neural network (DNN) model, a long short-term memory (LSTM) model, and a convolutional neural network (CNN) model.
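  • As a hedged illustration of the feature-extraction half of step 502, the sketch below computes MFCC features with librosa; librosa is one common choice and is not named by the patent, and the subsequent decoding against a pre-trained model (for example, Viterbi decoding with a GMM-HMM) is omitted:

```python
import librosa
import numpy as np

def extract_mfcc(segment: np.ndarray, sample_rate: int) -> np.ndarray:
    """Return a (n_frames, 13) matrix of MFCC features for the segment,
    ready to be decoded by a pre-trained acoustic model."""
    mfcc = librosa.feature.mfcc(y=segment.astype(np.float32),
                                sr=sample_rate, n_mfcc=13)
    return mfcc.T
```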
  • Step 503: Determine whether the processing result of the audio segment satisfies the result condition.
  • In this step, the role of the result condition relates to the purpose of the target processing performed on the audio segment.
  • Optionally, take the case where the target processing includes speech recognition processing and its purpose is to determine an audio segment that matches a keyword. The result condition can then be used to judge whether the audio segment matches the keyword: a processing result that satisfies the result condition indicates that the segment matches the keyword, and a processing result that does not satisfy it indicates that the segment does not.
  • It can be understood that when the target processing is speech recognition processing, the processing result is a speech recognition result, and the result condition can be implemented according to the characteristics of that result. For example, if the model is designed to recognize the text in an audio segment (recognizing text where possible and reporting noise otherwise), the result condition may be inclusion in a keyword set: a processing result included in the keyword set is considered to satisfy the condition, and one not included is considered not to. For another example, if the model is designed to recognize keywords directly (reporting noise when no keyword is recognized), the result condition may be that the processing result is not noise: a result that is not noise satisfies the condition, and a result that is noise does not.
  • Optionally, take the case where the target processing includes speech recognition processing and its purpose is to determine an audio segment that conforms to a specific sentence pattern (for example, a subject-predicate-object pattern). The result condition can then be used to judge whether the segment conforms to that pattern: a processing result that satisfies the condition indicates that the segment conforms to it, and one that does not satisfy it indicates that the segment does not. Similarly, the result condition can be implemented according to the characteristics of the speech recognition result; for example, it may be a specific-sentence-pattern condition when the model recognizes text, or the condition that the processing result is not noise when the model directly recognizes segments conforming to the pattern.
  • Alternatively, take the case where the target processing includes voice matching processing and its purpose is to determine an audio segment that matches a target voice. The result condition is then used to judge whether the audio segment matches the target voice: a processing result that satisfies the condition indicates that the segment matches the target voice, and one that does not satisfy it indicates that the segment does not.
  • It can be understood that when the target processing is voice matching processing, the processing result is a voice matching result, and the result condition can be implemented according to its characteristics. For example, if the model outputs the matching degree between the audio segment and the target voice, the result condition may be that the matching degree is greater than or equal to a matching degree threshold: a result at or above the threshold satisfies the condition, and one below it does not. For another example, if the model directly outputs whether the segment matches the target voice, that is, the processing result is either yes or no, the result condition may be that the processing result is yes.
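  • One way to encode the result conditions discussed above is sketched below; which branch applies depends on what the pre-trained model outputs, and the parameter names are illustrative:

```python
def satisfies_result_condition(result, keyword_set=None, match_threshold=None):
    """Check a processing result against the result condition: keyword-set
    membership for recognition results, a matching-degree threshold for
    matching scores, or a plain yes/no matching result."""
    if keyword_set is not None:          # speech recognition case
        return result is not None and result in keyword_set
    if match_threshold is not None:      # matching-degree case
        return result >= match_threshold
    return result is True                # boolean yes/no case
```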
  • Step 504: If the processing result of the audio segment does not satisfy the result condition, use a sliding window method to perform the target processing on the audio segment.
  • In this step, the target processing can be understood as the same processing as in step 502, for example speech recognition processing in both cases, and it may likewise include extracting the audio features of the audio segment and decoding them with a pre-trained model. To avoid inconsistencies, caused by factors other than noise in the segment, between the result obtained with the whole audio segment as the processing unit and the result obtained with the sliding window method, the same model is used for both.
  • It should be noted that the noise in an audio segment and the noise reported as a processing result have different meanings in this application: the former may specifically refer to energy in the audio signal other than the voice signal to be collected, while the meaning of the latter is as described above.
  • When the processing result of the audio segment does not satisfy the result condition in step 503, there are two possibilities: either the segment includes no voice signal that satisfies the condition, or the segment does include such a signal but also includes noise that caused the result of step 503 to fail the condition. To avoid inaccurate processing results caused by that noise, the sliding window method can further be used to perform the target processing on the segment.
  • It should be noted that when the processing result of the audio segment satisfies the result condition, that result is the final processing result of the target processing for the segment. When it does not satisfy the condition, that result may or may not be the final processing result, and the sliding window method is needed for further determination.
  • Optionally, to reduce the amount of computation, step 504 may specifically be: if the processing result of the audio segment does not satisfy the result condition and the audio segment satisfies a duration condition, use the sliding window method to perform the target processing on the segment. The duration condition describes the possible duration of an audio segment that satisfies the result condition; optionally, considering that a person cannot speak within an extremely short time, it may include the duration being greater than or equal to a duration threshold.
  • To further reduce computation, the duration threshold may optionally be positively correlated with the shortest audio length of an audio segment that satisfies the result condition. For example, the threshold may equal that shortest length, or equal the sum of that shortest length and an offset; the threshold may be, for example, 0.3 seconds.
  • It is understandable that other restrictions may also be placed on the duration condition according to design requirements, for example that the duration of the segment must not be too long. This can further reduce the computing resources required with little or no effect on the final processing result.
  • It should be noted that this application does not limit the order between judging whether the processing result satisfies the result condition and judging whether the segment satisfies the duration condition. Taking the result condition first as an example: after it is determined that the processing result of the audio segment does not satisfy the result condition, it can further be judged whether the segment satisfies the duration condition, and the sliding window method is applied to the segment only after the duration condition is met, as sketched below.
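  • The optional gate of step 504 can be sketched as follows, assuming the 0.3-second threshold mentioned above is used as the duration condition:

```python
DURATION_THRESHOLD_S = 0.3   # example duration threshold from the text

def should_fall_back_to_sliding_window(result_ok: bool,
                                       duration_s: float) -> bool:
    """Use the sliding window only when the whole-segment result fails the
    result condition AND the segment is long enough to possibly contain a
    valid utterance."""
    return (not result_ok) and duration_s >= DURATION_THRESHOLD_S
```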
  • In this embodiment, the audio segment is first processed as a whole, taking the segment as the processing unit, to obtain its processing result; only when that result does not satisfy the result condition is the sliding window method, which occupies more computing resources, further used to perform the target processing on the segment. On the basis of improving the accuracy of audio processing, this further reduces the consumption of computing resources.
  • It should be noted that the range of the target processing in the embodiment shown in FIG. 5 may be smaller than that in the embodiment shown in FIG. 2; illustratively, the target processing in the embodiment of FIG. 2 may include voice endpoint detection processing, while the target processing in the embodiment of FIG. 5 may not.
  • The embodiment shown in FIG. 5 is described for the case where step 501 intercepts a single audio segment from the audio signal; optionally, multiple audio segments may also be intercepted. In that case, step 502 may specifically be: select one audio segment from the multiple audio segments and, taking it as the processing unit, perform the target processing on it to obtain its processing result; and step 504, after performing the target processing with the sliding window method, further includes returning to step 502 until a completion condition is satisfied.
  • The order in which one audio segment is selected from the multiple segments is not limited in this application. Optionally, the segments may be selected one by one according to their timing in the audio signal; or one segment may be selected in descending order of duration; or in descending order of average energy; or one segment may be selected at random.
  • The completion condition can be designed flexibly according to requirements and is not limited in this application. Optionally, it includes any one of the following: a target number of processing results satisfying the result condition has been obtained, the target processing has been performed a preset number of times, the target processing has been performed on a preset number of audio segments, or the target processing has been performed on all audio segments.
  • FIG. 6 is a schematic flowchart of an audio processing method provided by yet another embodiment of this application. On the basis of the embodiments shown in FIG. 2 and FIG. 5, this embodiment mainly describes an optional implementation of using the sliding window method to perform target processing on the audio segment. As shown in FIG. 6, the method of this embodiment may include:
  • Step 601: Intercept the current sub-segment of the audio segment according to the length and position of the window.
  • In this step, as shown in FIG. 7A, the window can initially be positioned at the beginning of the audio segment, and the current sub-segment is then sub-segment F1 in FIG. 7A. After the window is slid by one step from the position shown in FIG. 7A, its position can be as shown in FIG. 7B, and the current sub-segment is then F2; after the window is slid by another step from the position shown in FIG. 7B, its position can be as shown in FIG. 7C, and the current sub-segment is then F3; and so on.
  • The way the length of the window is determined is not limited in this application; for example, it may be determined by user input or may be a preset value. Illustratively, the length of the window may be related to the result condition; optionally, it may be positively correlated with the longest audio length of an audio segment that satisfies the result condition.
  • Here, positive correlation only expresses the trend that the longer the longest audio length of a segment satisfying the result condition, the larger the window length; the specific formula relating the two can be designed flexibly. For example, the window length may equal the longest audio length of a segment satisfying the result condition, and may be, for example, 0.75 seconds, 0.8 seconds, or 1 second.
  • It should be noted that when the target processing is not voice endpoint detection processing (for example, when it is speech recognition processing), the specific description of the result condition can be found in the embodiment shown in FIG. 5. When the target processing is voice endpoint detection processing, the result condition may specifically be a condition that can be used to determine a voice endpoint in the audio segment more accurately than voice activity detection can; for example, the total sound energy of the window is greater than the product of the background noise energy and the signal-to-noise ratio of the speech start point.
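  • That endpoint-detection result condition can be sketched as a single check; the background noise energy and speech-start SNR are assumed to be estimated elsewhere:

```python
import numpy as np

def window_marks_speech_start(window: np.ndarray,
                              noise_energy: float,
                              start_snr: float) -> bool:
    """True when the window's total sound energy exceeds the background
    noise energy times the speech-start signal-to-noise ratio."""
    total_energy = float(np.sum(window.astype(np.float64) ** 2))
    return total_energy > noise_energy * start_snr
```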
  • Step 602: Perform the target processing on the current sub-segment.
  • Step 603: If the processing result of the current sub-segment satisfies the result condition, use the processing result of the current sub-segment as the processing result of the audio segment.
  • Optionally, the procedure can end directly after step 603 is performed; alternatively, as after step 604, it can return to step 601 until the end condition is met, in which case one audio segment can have multiple processing results.
  • Step 604: If the processing result of the current sub-segment does not satisfy the result condition, slide the window by one step.
  • After step 604 is performed, the method returns to step 601 and repeats until the end condition is met, thereby completing the target processing of the audio segment.
  • Illustratively, the end condition includes: the window has moved to the end of the audio segment, and/or the number of window slides has reached the maximum number of slides. Taking the former as an example, the procedure can end when the window moves to the position shown in FIG. 7D; taking the latter with a maximum of 2 slides as an example, it can end when the window moves to the position shown in FIG. 7C.
  • The way the maximum number of slides and the step size are determined is not limited in this application; for example, they may be determined by user input or may be preset values, and the two may be determined differently. Optionally, the maximum number of slides and/or the step size can be determined according to the desired processing accuracy: illustratively, the maximum number of slides is positively correlated with the desired processing accuracy, and/or the step size is negatively correlated with it.
  • Here, positive correlation only expresses the trend that the higher the desired processing accuracy, the larger the maximum number of slides; the specific formula can be designed flexibly, and the maximum number of slides may, for example, equal the product of 100 and the desired processing accuracy (for example, 0.9). Likewise, negative correlation only expresses the trend that the higher the desired processing accuracy, the smaller the step size, and the specific formula can again be designed flexibly.
  • The step size may be, for example, 0.01 seconds, 0.03 seconds, or 0.05 seconds, and the maximum number of slides may be, for example, 10, 15, or 30. The desired processing accuracy may refer to the accuracy with which the correct processing result can be obtained when the sliding window method is used to perform the target processing on the audio segment.
  • It should be noted that FIGS. 7A to 7D take the window sliding from the beginning of the audio segment as an example; it is understandable that the window may also start sliding from another position of the segment, for example from its end, and this application is not limited in this respect.
  • In this embodiment, the current sub-segment of the audio segment is intercepted according to the length and position of the window and subjected to the target processing. If the processing result of the current sub-segment satisfies the result condition, it is used as the processing result of the audio segment; if it does not, the window is slid by one step and the method returns to the step of intercepting the current sub-segment according to the length and position of the window, until the end condition is met. This implements target processing of the audio segment with the sliding window method, as summarized in the sketch below.
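  • Steps 601 to 604 can be summarized in one loop; process and satisfies stand in for the target processing and the result condition, and the end condition combines both variants described above:

```python
def sliding_window_target_processing(segment, win_len, step, max_slides,
                                     process, satisfies):
    """Slide the window over the segment, process each current sub-segment,
    and return the first result satisfying the result condition; stop when
    the window reaches the end of the segment or the maximum number of
    slides is reached."""
    pos, slides = 0, 0
    while pos + win_len <= len(segment):
        result = process(segment[pos:pos + win_len])   # steps 601-602
        if satisfies(result):                          # step 603
            return result
        if slides >= max_slides:                       # end condition
            break
        pos += step                                    # step 604
        slides += 1
    return None
```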
  • On the basis of the foregoing embodiments, once a processing result satisfying the result condition is obtained, further processing can be performed based on that result. Specifically, the embodiments of the present application may further include the following step: when the processing result of the audio segment satisfies the result condition, perform subsequent processing. The specific content of the subsequent processing is not limited in this application; it may be any processing that can further be performed after the processing result of the target processing is obtained.
  • Illustratively, a processing result of the audio segment that satisfies the result condition indicates that the segment matches a keyword; performing the subsequent processing when the result condition is satisfied then includes executing the subsequent processing corresponding to the matched keyword. Optionally, the keyword can be understood as an instruction word, and correspondingly the subsequent processing can be understood as the response to that instruction word.
  • In the embodiments of the present application, subsequent processing is performed when the processing result of the audio segment satisfies the result condition. Since the sliding window method improves the accuracy of the processing result of the audio segment, the accuracy of the subsequent processing can be improved as well.
  • Taking the target processing as speech recognition processing and the subsequent processing as human-computer interaction processing as an example, as shown in FIG. 8, the audio processing method provided by yet another embodiment of the present application may include the following steps:
  • Step 801: An audio segment is intercepted from an audio signal based on the voice activity detection method.
  • Step 802: Extract the audio features of the audio segment, decode them using the pre-trained model, and output the speech recognition result of the audio segment.
  • Specifically, the speech recognition result of the audio segment can be obtained by inputting the audio features of the segment into the model.
  • Step 803: Determine whether the speech recognition result of the audio segment is an instruction word.
  • In this step, if the speech recognition result of the audio segment is an instruction word, step 804 is executed; if it is not, step 805 is executed.
  • If the speech recognition result of the audio segment is an instruction word, the voice in the segment intercepted from the audio signal has been successfully recognized, and subsequent processing can be performed according to the recognition result; in this embodiment, the subsequent processing is specifically human-computer interaction processing. If the result is not an instruction word (for example, the result is noise), the voice in the intercepted segment has not been successfully recognized, and the sliding window method can further be used for recognition.
  • Step 804: Perform human-computer interaction processing.
  • In this step, the specific content of the human-computer interaction processing is not limited in this application. For example, for smart lighting equipment, when the instruction word is "turn on the light", the lighting device of the equipment is turned on; for a smart robot, when the instruction word is "forward", the robot is controlled to move forward; and so on.
  • Step 805: Starting from the beginning of the audio segment, intercept audio sub-segment 1 within the window, extract the audio features of audio sub-segment 1, decode them using the model, and output the speech recognition result of audio sub-segment 1.
  • In this step, the model used can be the same as in step 802; specifically, the speech recognition result of audio sub-segment 1 can be obtained by inputting its audio features into the model.
  • Step 806: Determine whether the speech recognition result of audio sub-segment 1 is an instruction word.
  • In this step, if the speech recognition result of audio sub-segment 1 is an instruction word, step 804 is executed; if it is not, step 807 is executed.
  • If the speech recognition result of audio sub-segment 1 is an instruction word, the voice in the audio segment has been successfully recognized by performing speech recognition on sub-segment 1, and subsequent processing can be performed according to the recognition result. If it is not an instruction word, the voice has not been successfully recognized through sub-segment 1, and the next audio sub-segment is recognized in order to recognize the voice in the segment.
  • Step 807: Move the window one step to the right, intercept audio sub-segment 2 within the window at this position, extract the audio features of audio sub-segment 2, decode them using the model, and output the speech recognition result of audio sub-segment 2.
  • In this step, the model used can again be the same as in step 802, and the speech recognition result of audio sub-segment 2 is obtained by inputting its audio features into the model.
  • Step 808: Determine whether the speech recognition result of audio sub-segment 2 is an instruction word.
  • In this step, if the speech recognition result of audio sub-segment 2 is an instruction word, step 804 is executed; if it is not, step 809 is executed.
  • As before, if the result is an instruction word, the voice in the audio segment has been successfully recognized through sub-segment 2 and subsequent processing can be performed according to the recognition result; otherwise, the next audio sub-segment is recognized in order to recognize the voice in the segment.
  • Step 809: Move the window one step to the right, intercept audio sub-segment 3 within the window at this position, extract the audio features of audio sub-segment 3, decode them using the model, and output the speech recognition result of audio sub-segment 3; the model used can again be the same as in step 802.
  • In this way, speech recognition is performed on the audio sub-segments within the sliding window. If a recognition result is an instruction word at any point, the processing of the audio segment with the sliding window method ends and human-computer interaction processing begins. If no instruction word has been recognized by the time the end condition is met, the audio segment is considered not to contain the voice of an instruction word, and the processing of the segment with the sliding window method ends.
  • In this embodiment, an audio segment is intercepted from an audio signal based on a voice activity detection method, the audio features of the segment are extracted and decoded with a pre-trained model, and the speech recognition result of the segment is output. If that result is an instruction word, human-computer interaction processing is performed; if it is not, the sliding window method is used to perform speech recognition on the segment. In this way, only when the voice in the segment cannot be recognized from the audio features of the whole segment is it recognized, via the sliding window method, from the audio features of the segment's sub-segments.
  • Because recognizing speech from the audio features of the whole segment is much faster than recognizing it with the sliding window method from the features of the sub-segments, this approach both avoids the reduction in recognition accuracy caused by noise and shortens the processing time.
  • It can be understood that the audio processing method of the embodiment shown in FIG. 8 has a higher accuracy rate than the following prior-art method 1 and a faster processing speed than the following prior-art method 2.
  • Method 1: Intercept audio segments from the audio signal based on the voice activity detection method, perform speech recognition on the segments directly, and perform human-computer interaction processing according to the recognition results.
  • Method 2: Use a sliding window method to perform speech recognition on the audio signal, and perform human-computer interaction processing according to the recognition result.
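  • Putting the FIG. 8 flow together, a hedged end-to-end sketch is given below, reusing sliding_window_target_processing from the earlier sketch; recognize, act_on, and the instruction-word set are illustrative stand-ins for the pre-trained model and the human-computer interaction response:

```python
COMMAND_WORDS = {"开灯", "前进"}   # illustrative instruction words

def handle_audio_segment(segment, win_len, step, max_slides,
                         recognize, act_on):
    """Whole-segment recognition first (steps 802-803); fall back to
    sliding-window recognition when no instruction word is found (steps
    805-809); perform human-computer interaction only on success (804)."""
    text = recognize(segment)
    if text in COMMAND_WORDS:
        act_on(text)                                   # step 804
        return text
    text = sliding_window_target_processing(
        segment, win_len, step, max_slides,
        process=recognize,
        satisfies=lambda t: t in COMMAND_WORDS)
    if text is not None:
        act_on(text)                                   # step 804 via fallback
    return text
```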
  • FIG. 9 is a schematic structural diagram of an audio processing device provided by an embodiment of this application. As shown in FIG. 9, the device 900 may include a processor 901 and a memory 902.
  • The memory 902 is used to store program code; the processor 901 calls the program code and, when the code is executed, performs the following operations: intercepting an audio segment from an audio signal based on a voice activity detection method, and using a sliding window method to perform target processing on the audio segment to obtain a processing result of the audio segment.
  • The audio processing device provided in this embodiment can be used to implement the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar to those of the method embodiments and will not be repeated here.
  • A person of ordinary skill in the art can understand that all or part of the steps in the foregoing method embodiments can be implemented by a program instructing relevant hardware. The aforementioned program can be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An audio processing method and device. The method includes: intercepting an audio segment from an audio signal based on a voice activity detection method; and using a sliding window method to perform target processing on the audio segment to obtain a processing result of the audio segment. Since the sliding window method can exclude, in one or more windows, the noise included in the audio segment intercepted by voice activity detection, performing the target processing on the audio segment with the sliding window method avoids the influence of noise in the segment and thereby improves the accuracy of audio processing.

Description

音频处理方法及装置 技术领域
本申请涉及音频技术领域,尤其涉及一种音频处理方法及装置。
背景技术
在音频处理过程(例如,语音识别过程)中,通常需要进行语音端点检测,即从音频信号中将用户的语音信号抽出。
现有技术中,通常可以通过语音活性检测(Voice Activity Detection,VAD)进行语音端点的检测。例如,在语音识别过程中,可以先通过语音活性检测截取音频信号中可能包括语音信号的音频片段,该可能包括语音信号的音频片段即为语音端点检测的结果,然后对该音频片段进行语音识别以获得语音识别结果。
然而,现有技术中,基于语音活性检测所截取的音频片段中通常会包括语音类噪声或者能量较大的非语音类噪声,并由此导致存在音频处理结果不准确的问题。
发明内容
本申请实施例提供一种音频处理方法及装置,用以解决现有技术中基于语音活性检测所截取的音频片段中通常会包括语音类噪声或者能量较大的非语音类噪声,并由此导致存在音频处理结果不准确的问题。
第一方面,本申请实施例提供一种音频处理方法,包括:基于语音活性检测方法从音频信号中截取音频片段;采用滑动窗口方法对所述音频片段进行目标处理,得到所述音频片段的处理结果。
第二方面,本申请实施例提供一种音频处理装置,包括:处理器和存储 器;所述存储器,用于存储程序代码;所述处理器,调用所述程序代码,当程序代码被执行时,用于执行以下操作:
基于语音活性检测方法从音频信号中截取音频片段;采用滑动窗口方法对所述音频片段进行目标处理,得到所述音频片段的处理结果。
第三方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序包含至少一段代码,所述至少一段代码可由计算机执行,以控制所述计算机执行上述第一方面任一项所述的音频处理方法。
第四方面,本申请实施例提供一种计算机程序,当所述计算机程序被计算机执行时,用于实现上述第一方面任一项所述的音频处理方法。
本申请实施例提供一种音频处理方法及装置,通过基于语音活性检测方法从音频信号中截取音频片段,并采用滑动窗口方法对音频片段进行目标处理,得到音频片段的处理结果,由于滑动窗口方法可以在一个或多个窗口中将语音活性检测所截取出的音频片段中包括的噪声排除在外,因此采用滑动窗口方法对音频片段进行目标处理,可以避免音频片段中噪声的影响,从而可以提高音频处理的准确性。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1A为本申请实施例提供的音频处理方法的应用场景示意图一;
图1B为本申请实施例提供的音频处理方法的应用场景示意图二;
图2为本申请一实施例提供的音频处理方法的流程示意图;
图3A-图3C为本申请一实施例提供的音频子片段将音频片段的噪声排除在外的示意图;
图4为本申请实施例提供的语音活性检测方法的流程示意图;
图5为本申请另一实施例提供的音频处理方法的流程示意图;
图6为本申请又一实施例提供的音频处理方法的流程示意图;
图7A-图7D为本申请一实施例提供的截取音频片段的当前子片段的示意图;
图8为本申请又一实施例提供的音频处理方法的流程示意图;
图9为本申请一实施例提供的音频处理装置的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供的音频处理方法可以应用于任何需要进行语音端点检测的音频处理过程中,该音频处理方法具体可以由音频处理装置执行。该音频处理装置可以为包括音频采集模块(例如,麦克风)的装置,相应的,本申请实施例提供的音频处理方法的应用场景示意图可以如图1A所示,具体的,该音频处理装置的音频采集模块可以采集用户讲话的语音获得音频信号,音频处理装置的处理器可以对音频采集模块采集的音频信号采用本申请实施例提供的音频处理方法进行处理。需要说明的是,图1A仅为示意图,并不对音频处理装置的结构作限定,例如麦克风与处理器之间还可以连接有放大器,用于对麦克风采集到的音频信号进行放大。又例如,麦克风与处理器之间还可以连接有滤波器,用于对麦克风采集到的音频信号进行滤波。
或者,该音频处理装置也可以为不包括音频采集模块的装置,相应的,本申请实施例提供的音频处理方法的应用场景示意图可以如图1B所示,具体的,该音频处理装置的通信接口可以接收其他装置或设备采集的音频信号,音频处理装置的处理器可以对接收到的音频信号采用本申请实施例提供的音频处理方法进行处理。需要说明的是,图1B仅为示意图,并不对音频处理装置的结构以及音频处理装置与其他装置或设备之间的连接方式作限定,例如音频处理装置中通信接口可以替换为收发器。
需要说明的是,对于包括该音频处理装置的设备的类型,本申请实施例可以不做限定,该设备例如可以为智能音响、智能照明设备、智能机器人、 手机、平板电脑等。
本申请实施例提供的音频处理方法通过将语音活性检测与滑动窗口方法结合,可以提高音频处理结果的准确性。具体的,在语音活性检测截取音频片段之后,采用滑动窗口方法对该音频片段进行进一步处理,由于滑动窗口方法可以在一个或多个窗口中将音频片段所包括的噪声排除在外,因此采用滑动窗口方法对音频片段进行目标处理,可以避免音频片段中噪声的影响,从而可以提高音频处理的准确性。
下面结合附图,对本申请的一些实施方式作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
图2为本申请一实施例提供的音频处理方法的流程示意图,本实施例的执行主体可以为音频处理装置,具体可以为音频处理装置的处理器。如图2所示,本实施例的方法可以包括:
步骤201,基于语音活性检测方法从音频信号中截取音频片段。
本步骤中,对于基于语音活性检测方法从音频信号中截取音频片段所采用的具体算法,本申请可以不做限定,例如可以采用能量与过零率双门限算法、噪声-语音分类模型算法、方差法、谱距离法、谱熵法中的一种或多种。以采用能量与过零率双门限算法为例,如图4所示,步骤201例如可以包括如下步骤401至步骤404。
步骤401,首先对音频信号进行分帧,然后逐帧计算各帧的短时平均能量,得到短时能量包络。
其中,可以以一个固定时长对音频信号进行分帧,以固定时长为1秒为例,则可以将音频信号第0秒至第1秒分为一帧,第1秒至第2秒分为一帧,第3秒至第4秒分为一帧,……,从而完成对音频信号的分帧。一帧的短时平均能量即为该帧的平均能量,对能量包络进行连线即可得到短时能量包络。
步骤402,选择一个较高的阈值T1,标记T1与短时能量包络的首尾交点为C和D。
其中,高于阈值T1的部分被认为是语音的概率较高。
步骤403,选取一个较低的阈值T2,其与短时能量包络的交点B位于C的左边,交点E位于D的右边。
其中,T2小于T1。
步骤404,计算各帧的短时平均过零率,选取阈值T3,其与短时平均过零 率线的交点A位于B的左边,交点F位于E的右边。
其中,一帧的短时平均过零率即为该帧的平均过零率。
至此,交点A和F即为基于语音活性检测方法所确定的该音频信号的两个端点,交点A至交点F的音频片段即为基于语音活性检测方法从音频信号中所截取音频片段。
需要说明的是,步骤401至步骤404以从一段音频信号中只截取出一个音频片段为例,可以理解的是,从一段音频信号中也可以截取出多个音频片段,本申请对此可以不做限定。
步骤202,采用滑动窗口方法对所述音频片段进行目标处理,得到所述音频片段的处理结果。
本步骤中,采用滑动窗口方法对一个音频片段进行目标处理,具体可以是指采用滑动窗口方法截取该音频片段中的音频子片段,并对音频子片段进行目标处理。由于滑动窗口方法可以在一个或多个窗口中将语音活性检测所截取出的音频片段中包括的噪声排除在外。例如,如图3A所示,假设音频片段的开始部分包括噪声,则音频子片段X1可以将该噪声排除在外。再例如,如图3B所示,假设音频片段的中间部分包括噪声,则音频子片段X2可以将该噪声排除在外。又例如,如图3C所示,假设音频片段的结束部分包括噪声,则音频子片段X3可以将该噪声排除在外。需要说明的是,图3A至图3C中网线填充的部分用于表示噪声。
可以理解的是,在步骤202中一个音频片段的处理结果具体可以为对于该音频片段中的一个音频子片段进行目标处理所得到的处理结果。
目标处理具体可以为在基于语音活性检测方法进行语音端点检测之后进一步可以进行的任意类型的处理。可选的,目标处理可以包括下述中的任意一种:语音识别处理、语音匹配处理或语音端点检测处理。其中,语音识别处理可以是指识别出音频片段中语音信号对应的文字;语音匹配处理可以是指确定与音频片段匹配的目标语音。
可选的,对步骤201所截取的音频片段,可以根据实际需求,均通过步骤202的方式得到音频片段的处理结果,或者有选择的通过步骤202的方式得到音频片段的处理结果。其中,该实际需求例如可以为节省运算资源需求、功能实现需求、简化设计需求等。
例如,当目标处理为语音端点检测处理时,为了提高语音端点检测结果 准确性(即为功能实现需求),对于步骤201所截取的音频片段,可以均通过步骤202的方式确定音频片段更准确的语音端点检测结果。
又例如,由于滑动窗口方法对于运算资源消耗较大,当目标处理为语音识别处理或语音匹配处理时,为了节省运算资源(即为节省运算资源需求),对于步骤201所截取的音频片段可以进行预设处理,并根据预设处理得到的结果,判断是否执行步骤202。其中,预设处理例如可以为时长确定处理,进一步的可以根据所确定的音频片段的时长判断是否执行步骤202;或者,预设处理例如可以为特征提取处理,进一步的可以根据所提取的音频特征判断是否执行步骤202。
又例如,当目标处理为语音识别处理或语音匹配处理时,为了简化设计(即为简化实现需求)在能满足功能需求且不考虑节省运算资源的情况下,对于步骤201所截取的音频片段,可以均通过步骤202的方式确定音频片段更准确的语音端点检测结果。
可选的,步骤202在采用滑动窗口方法对音频片段进行目标处理时,可以根据实现需求对该音频片段的所有音频子片段均进行目标处理,或者根据当前正在处理的音频子片段(即,当前子片段)的处理结果决定是否对当前子片段的下一个音频子片段进行目标处理。
例如,以语音识别为例,假设实现需求为识别出音频片段中的文字,一个音频片段通过滑动窗口方法可以截取6个音频子片段,且第1个音频子片段至第6个音频子片段分别的语音识别结果分别为噪声、“开”、“开启”、“启照”、“照明”和噪声,则该音频片段的语音识别结果可以为“开启照明”。其中,语音识别结果为噪声可以表示未成功识别出语音。
又例如,以语音识别为例,假设实现需求为匹配预设关键词,一个音频片段通过滑动窗口方法可以截取6个音频子片段,且第1个音频子片段的语音识别结果为噪声则对第2个音频子片段进行语音识别处理,进一步假设第2个音频子片段的语音识别结果为“请开”且与预设关键词不匹配,则进一步对第3个音频子片段进行语音识别处理,进一步假设第3个音频子片段的语音识别结果为“开启”且与预设关键词匹配,则可以将第3个音频子片段的语音识别结果作为该音频片段的识别结果,且不再对第4-6个音频子片段进行语音识别处理。其中,语音识别结果为噪声可以表示未成功识别出语音。
本实施例中,通过基于语音活性检测方法从音频信号中截取音频片段, 并采用滑动窗口方法对音频片段进行目标处理,得到音频片段的处理结果,由于滑动窗口方法可以在一个或多个窗口中将语音活性检测所截取出的音频片段中包括的噪声排除在外,因此采用滑动窗口方法对音频片段进行目标处理,可以避免音频片段中噪声的影响,从而可以提高音频处理的准确性
另外,本实施例提供的方法,由于通过语音活性检测方法已经去掉音频信号中不包括语音信号的子片段,且语音活性检测方法较滑动窗口方法对于运算资源的消耗要小很多,因此在语音活性检测方法之后进一步基于滑动窗口方法进行处理,与直接采用滑动窗口方法对音频信号进行处理相比,可以减小运算资源的消耗。
图5为本申请另一实施例提供的音频处理方法的流程示意图。本实施例在图2所示实施例的基础上,以预设处理为所述目标处理为例,对音频处理方法进行具体描述。如图5所示,本实施例的方法可以包括:
步骤501,基于语音活性检测方法从音频信号中截取音频片段。
需要说明的是,步骤501与步骤201类似,在此不再赘述。
步骤502,以所述音频片段为处理单元,对所述音频片段进行目标处理,得到所述音频片段的处理结果。
本步骤中,以音频片段为处理单元可以是指将音频片段整个作为一个待处理对象进行目标处理。所述目标处理可以包括提取所述音频片段的音频特征,并对所述语音特征利用预先训练的模型进行解码,其中,解码用于得到音频片段的处理结果,例如可以为维特比(Viterbi)解码。
需要说明的是,对于所述音频特征的具体类型,本申请可以不做限定,可选的,音频特征可以包括梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征、线性预测系数(Linear Prediction Coefficients,LPC)LPC特征、滤波器组(Filter bank,Fbank)特征中的一种或多种。
对于所述模型的具体类型,本申请可以不做限定,可选的,所述模型包括高斯混合模型-隐马尔科夫模型(Gaussian Mixed Model-Hidden Markov Model,GMM-HMM模型)、深度神经网络(Deep Neural Networks,DNN)模型、长短期记忆模型(long-short term memory,LSTM)模型、卷积神经网络(Convolutional Neural Networks,CNN)模型中的一种或多种。
步骤503,判断所述音频片段的处理结果是否满足结果条件。
本步骤中,结果条件的作用可以与对音频片段进行的目标处理的目的相 关。可选的,以目标处理包括语音识别处理,且目标处理的目的是为了确定出与关键词匹配的音频片段为例,则所述结果条件可以用于判断音频片段是否与关键词匹配,音频片段的处理结果满足结果条件表征所述音频片段与关键词匹配,音频片段的处理结果不满足结果条件表征所述音频片段不与关键词匹配。可以理解的是,在目标处理是语音识别处理时,相应的,处理结果可以为语音识别结果,具体可以根据语音识别结果的特点实现结果条件。例如,假设上述模型是用于识别出音频片段中的文字,即如果音频片段中存在文字则尽量识别出文字,如果识别不出文字则认为是噪声,则结果条件例如可以为包括在关键词集合中,在处理结果包括在关键词集合中时可以认为是满足结果条件,在处理结果未包括在关键词集合中时可以认为是不满足结果条件。又例如,假设上述模型是用于识别出音频片段中的关键词,即如果音频片段中存在关键词则尽量识别出关键词,如果识别不出关键词则认为是噪声,则结果条件例如可以为处理结果不是噪声,具体的在处理结果不是噪声时可以认为是满足结果条件,在处理结果是噪声时可以认为是不满足结果条件。
可选的,以目标处理包括语音识别处理,且目标处理的目的是为了确定符合特定句式(例如,主谓宾句式)的音频片段,则所述结果条件可以用于判断音频片段是否符合特定句式,音频片段的处理结果满足结果条件表征所述音频片段符合主谓宾句式,音频片段的处理结果不满足结果条件表征所述音频片段不符合特定句式。类似的,可以根据语音识别结果的特点实现结果条件。例如,假设上述模型是用于识别出音频片段中的文字,即如果音频片段中存在文字则尽量识别出文字,如果识别不出文字则认为是噪声,则结果条件例如可以为特定句式条件。又例如,假设上述模型是用于识别出符合特定句式的音频片段,即如果音频片段符合特定句式则尽量识别出特定句式,如果识别不出特定句式则认为是噪声,则结果条件例如可以为处理结果不是噪声。
或者,可选的,以目标处理包括语音匹配处理,且目标处理的目的是为了确定出与目标语音匹配的音频片段为例,所述目标处理包括语音匹配处理,所述结果条件用于判断音频片段是否与目标语音匹配,音频片段的处理结果满足结果条件表征所述音频片段与目标语音匹配,音频片段的处理结果不满足结果条件表征所述音频片段不与目标语音匹配。可以理解的是,在目标处 理是语音匹配处理时,相应的,处理结果可以为语音匹配结果,具体可以根据语音匹配结果的特点实现结果条件。例如,假设上述模型是用于确定出音频片段与目标语音的匹配度,则结果条件例如可以为大于或等于匹配度阈值,在处理结果大于或等于匹配度阈值时可以认为是满足结果条件,在处理结果小于匹配度阈值时可以认为是不满足结果条件。又例如,假设上述模型是用于确定出音频片段是否与目标语音匹配,即处理结果可以为是或否两个中的任意一个,则结果条件具体可以为处理结果为是,具体的在处理结果为是时可以认为是满足结果条件,在处理结果为否时可以认为是不满足结果条件。
步骤504,若所述音频片段的处理结果不满足所述结果条件,则采用滑动窗口方法对所述音频片段进行目标处理。
本步骤中,所述目标处理与步骤504中的目标处理可以理解为相同处理,例如均为语音识别处理。与步骤503类似,本步骤中的目标处理可以包括提取所述音频片段的音频特征,并对所述语音特征利用预先训练的模型进行解码。为了避免由于除音频片段中噪声之外的其他因素导致音频片段为处理单元进行目标处理和采用滑动窗口方法进行目标处理所得到的处理结果的不一致的问题,步骤503中以音频片段为处理单元进行目标处理和本步骤中采用滑动窗口方法进行目标处理利用的模型相同。需要说明的是,本申请中音频片段中的噪声与进行处理结果描述的噪声,两者的含义不同,前者具体可以是指音频信号中待采集语音信号之外的能量,后者的含义可以见前述描述。
在步骤503中对于音频片段的处理结果不满足结果条件时,有两种可能,一种是音频片段中未包括满足结果条件的语音信号,另一种是音频片段中包括满足结果条件的音频信号但是由于音频片段中还包括噪声导致步骤503中对于音频片段的处理结果不满足结果条件。为了避免由于音频片段中噪声的影响导致处理结果不准确,因此,可以进一步的采用滑动窗口方法对音频片段进行目标处理。
需要说明的是,在音频片段的处理结果满足结果时,该处理结果即为该音频片段进行目标处理的最终处理结果。在音频片段的处理结果不满足结果时,该处理结果可能为该音频片段进行目标处理的最终处理结果,也可能不是最终处理结果,需要采用滑动窗口方法对音频片段进行目标处理从而进一步确定。
可选的,为了减小运算量,步骤504具体可以为:若所述音频片段的处理 结果不满足所述结果条件且所述音频片段满足时长条件,则采用滑动窗口方法对所述音频片段进行目标处理。其中,所述时长条件用于描述满足结果条件的音频片段的可能时长。可选的,考虑到人无法在极短时间内说话,所述时长条件可以包括时长大于或等于时长阈值。
为了进一步简化运算量,可选的,所述时长阈值与满足所述结果条件的音频片段的最短音频长度正相关。例如,时长阈值可以等于满足结果条件的音频片段的最短音频长度,或者时长阈值可以等于满足结果条件的音频片段的最短音频长度与偏移量之和。时长阈值例如可以为0.3秒。
可以理解的是,也可以根据设计需求对时长条件进行其他限制,例如限制音频片段的时长不能过长。这样可以在少影响甚至不影响最终处理效果的前提下,进一步降低处理所需运算资源。
需要说明的是,对于判断音频片段的处理结果是否满足结果条件以及判断音频片段是否满足时长条件两者之间的先后顺序,本申请可以不做限定。以判断是否满足结果条件在前,判断是否满足时长条件在后为例,在确定音频片段满足结果条件之后,进一步可以判断音频片段是否满足时长条件,并在满足时长条件之后采用滑动窗口方法对音频片段进行目标处理。
本实施例中,通过先以音频片段为处理单元,对音频片段进行目标处理,得到所述音频片段的处理结果,然后当所述音频片段的处理结果不满足所述结果条件时,进一步采用滑动窗口方法对所述音频片段进行目标处理,实现了仅对以音频片段为处理单元进行目标处理所得到的处理结果不满足结果条件时,再采用占用运算资源更多的滑动窗口方法对音频片段进行处理,从而在能够提高音频处理的准确性的基础上,进一步减小了运算资源的消耗。
需要说明的是,图5所示实施例中步骤501目标处理的范围,可以较图2所示实施例中目标处理的范围小。示例性的,图2所示实施例中目标处理可以包括语音端点检测处理,图5所示实施例中的目标处理可以不包括语音端点检测处理。
The embodiment shown in FIG. 5 is described by taking the case where one audio segment is extracted from the audio signal by the voice activity detection method in step 501; optionally, multiple audio segments may also be extracted from the audio signal in step 501. Accordingly, step 502 may specifically be: selecting one audio segment from the multiple audio segments, and performing the target processing on it with the segment as the processing unit to obtain its processing result; and after performing target processing on the segment by using the sliding window method in step 504, the method further includes returning to step 502 until a completion condition is satisfied.
It should be noted that this application does not limit the order in which one audio segment is selected from the multiple audio segments. Optionally, segments may be selected one by one according to their temporal order in the audio signal; or one segment may be selected in descending order of duration; or one segment may be selected in descending order of average energy; or one segment may be selected at random from the multiple audio segments.
It should be noted that the completion condition can be designed flexibly according to requirements and is not limited by this application. Optionally, the completion condition includes any one of the following: obtaining a target number of processing results that satisfy the result condition, performing the target processing a preset number of times, performing the target processing on a preset number of audio segments, or performing the target processing on all audio segments.
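The selection orders and completion conditions above might be sketched, as a non-limiting illustration, as follows; the Segment type and its fields are hypothetical placeholders introduced only to make the orderings concrete.

```python
# Illustrative selection orderings and completion loop; the Segment
# type and its fields are hypothetical placeholders.
import random
from dataclasses import dataclass

@dataclass
class Segment:
    start: float       # position in the audio signal, in seconds
    duration: float    # segment length, in seconds
    avg_energy: float  # mean frame energy

def order_segments(segments, strategy="temporal"):
    if strategy == "temporal":
        return sorted(segments, key=lambda s: s.start)
    if strategy == "duration":
        return sorted(segments, key=lambda s: s.duration, reverse=True)
    if strategy == "energy":
        return sorted(segments, key=lambda s: s.avg_energy, reverse=True)
    return random.sample(segments, len(segments))  # random order

def process_all(segments, process_fn, target_hits=1):
    hits = []
    for seg in order_segments(segments):
        result = process_fn(seg)          # step 502 (and step 504 if needed)
        if result is not None:
            hits.append(result)
        if len(hits) >= target_hits:      # one possible completion condition
            break
    return hits
```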
FIG. 6 is a schematic flowchart of an audio processing method provided by yet another embodiment of this application. On the basis of the embodiments of FIG. 2 and FIG. 5, this embodiment mainly describes an optional implementation of performing target processing on the audio segment by using the sliding window method. As shown in FIG. 6, the method of this embodiment may include:
Step 601: Extract the current sub-segment of the audio segment according to the length and position of a window.
In this step, as shown in FIG. 7A, the window may first be positioned at the start of the audio segment, at which point the current sub-segment is sub-segment F1 in FIG. 7A; then, from the window position of FIG. 7A, after the window is slid by one step, the window position may be as shown in FIG. 7B, at which point the current sub-segment is sub-segment F2 in FIG. 7B; next, from the window position of FIG. 7B, after the window is slid by one more step, the window position may be as shown in FIG. 7C, at which point the current sub-segment is sub-segment F3 in FIG. 7C; and so on.
This application does not limit how the window length is determined; for example, it may be specified by user input, or may be a preset value. Exemplarily, the window length may be related to the result condition; optionally, the window length is positively correlated with the longest audio length of an audio segment that satisfies the result condition. Here, positive correlation only expresses the trend that the longer the longest audio length satisfying the result condition, the greater the window length; the specific formula relating the two can be designed flexibly, for example the window length may equal the longest audio length of an audio segment satisfying the result condition. The window length may be, for example, 0.75 seconds, 0.8 seconds, or 1 second.
It should be noted that when the target processing is not voice endpoint detection, for example when it is speech recognition, the description of the result condition in the embodiment of FIG. 5 applies. When the target processing is voice endpoint detection, the result condition may specifically be a condition that can determine voice endpoints in the audio segment more accurately than voice activity detection. For example, the result condition may be that the total sound energy of the window is greater than the product of the background noise energy and the speech-onset signal-to-noise ratio.
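For the endpoint-detection condition just mentioned, a minimal numeric sketch follows, assuming NumPy is available; the signal-to-noise factor and the noise-energy estimate are illustrative assumptions.

```python
# Sketch of the energy-based endpoint condition: the window energy must
# exceed the background-noise energy times the speech-onset SNR.
# The default SNR factor below is an assumed value.
import numpy as np

def endpoint_condition(window: np.ndarray,
                       noise_energy: float,
                       onset_snr: float = 3.0) -> bool:
    # Total sound energy of the samples currently inside the window.
    window_energy = float(np.sum(window.astype(np.float64) ** 2))
    return window_energy > noise_energy * onset_snr
```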
Step 602: Perform target processing on the current sub-segment.
Step 603: If the processing result of the current sub-segment satisfies the result condition, take the processing result of the current sub-segment as the processing result of the audio segment.
Optionally, processing may end directly after step 603; or, after step 603, similarly to after step 604, the method may return to step 601 until the end condition is satisfied, that is, one audio segment may have multiple processing results.
Step 604: If the processing result of the current sub-segment does not satisfy the result condition, slide the window by one step.
After step 604 is executed, the method returns to step 601 until the end condition is satisfied, so as to complete the target processing of the audio segment.
Exemplarily, the end condition includes: the window moving to the end of the audio segment, and/or the number of window slides reaching a maximum number of slides. Taking an end condition that includes the window moving to the end of the audio segment as an example, processing may end when the window moves to the position shown in FIG. 7D. Taking an end condition that includes the number of slides reaching a maximum of 2 as an example, processing may end when the window moves to the position shown in FIG. 7C.
This application does not limit how the maximum number of slides and the step size are determined; for example, they may be specified by user input or may be preset values, and the two may be determined in different ways. Optionally, the maximum number of slides and/or the step size may be determined according to the desired processing precision. Exemplarily, the maximum number of slides is positively correlated with the desired processing precision, and/or the step size is negatively correlated with the desired processing precision. Here, positive correlation only expresses the trend that the higher the desired precision, the larger the maximum number of slides; the specific formula can be designed flexibly, for example the maximum number of slides may equal 100 multiplied by the desired processing precision (for example, 0.9). Similarly, negative correlation only expresses the trend that the higher the desired precision, the smaller the step size; the specific formula can likewise be designed flexibly.
The step size may be, for example, 0.01 seconds, 0.03 seconds, or 0.05 seconds. The maximum number of slides may be, for example, 10, 15, or 30.
The desired processing precision may refer to the precision with which performing target processing on the audio segment by the sliding window method can obtain a correct processing result.
It should be noted that FIGS. 7A to 7D take the window sliding from the start of the audio segment as an example; it can be understood that the window may also start sliding from another position of the segment, for example from its end, and this application does not limit this.
In this embodiment, the current sub-segment of the audio segment is extracted according to the length and position of the window, and target processing is performed on the current sub-segment; if the processing result of the current sub-segment satisfies the result condition, that result is taken as the processing result of the audio segment; if it does not, the window is slid by one step and the method returns to the step of extracting the current sub-segment according to the window's length and position, until the end condition is satisfied. Target processing of the audio segment by the sliding window method is thereby achieved.
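The loop of steps 601 to 604 might be sketched, as a non-limiting illustration, as follows; the window length, step size, and maximum number of slides use the example values given above, and process_fn stands in for the target processing, all of which are assumptions for illustration only.

```python
# Sketch of steps 601-604: slide a fixed-length window over the segment,
# process each sub-segment, and stop on a hit or on the end condition.
# Window/step/max-slide values are the examples above, not requirements.
def sliding_window_process(segment, sr, process_fn,
                           win_s=0.8, step_s=0.03, max_slides=30):
    win = int(win_s * sr)
    step = int(step_s * sr)
    pos, slides = 0, 0
    while pos + win <= len(segment) and slides <= max_slides:
        sub = segment[pos:pos + win]       # step 601: current sub-segment
        result = process_fn(sub)           # step 602: target processing
        if result is not None:             # step 603: result condition met
            return result
        pos += step                        # step 604: slide by one step
        slides += 1
    return None                            # end condition reached, no hit
```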
On the basis of the above embodiments, when a processing result of an audio segment that satisfies the result condition is obtained, further processing may be performed based on that result. Specifically, an embodiment of this application may further include the following step: when the processing result of the audio segment satisfies the result condition, performing subsequent processing. This application does not limit the specific content of the subsequent processing, which may be any processing that can further be performed after the result of the target processing is obtained.
Exemplarily, the processing result of the audio segment satisfying the result condition indicates that the audio segment matches a keyword; performing subsequent processing when the processing result of the audio segment satisfies the result condition includes: when the processing result of the audio segment matches the keyword, executing subsequent processing corresponding to that keyword. Optionally, the keyword may specifically be understood as a command word, and accordingly the subsequent processing may be understood as response processing for the command word.
In the embodiments of this application, subsequent processing is performed when the processing result of the audio segment satisfies the result condition; since the sliding window method improves the accuracy of the segment's processing result, the accuracy of the subsequent processing can also be improved.
Taking the target processing being speech recognition processing and the subsequent processing being human-computer interaction processing as an example, as shown in FIG. 8, an audio processing method provided by yet another embodiment of this application may include the following steps:
Step 801: Extract an audio segment from the audio signal based on the voice activity detection method.
Step 802: Extract the audio features of the audio segment, decode them by using a pre-trained model, and output the speech recognition result of the audio segment.
Specifically, the speech recognition result of the audio segment may be obtained by inputting the audio features of the segment into the model.
Step 803: Determine whether the speech recognition result of the audio segment is a command word.
In this step, if the speech recognition result of the audio segment is a command word, step 804 is executed. If the speech recognition result of the audio segment is not a command word, step 805 is executed.
If the speech recognition result of the audio segment is a command word, this can indicate that the speech in the segment extracted from the audio signal has been successfully recognized, and subsequent processing can further be performed according to the recognition result; in this embodiment, the subsequent processing may specifically be human-computer interaction processing. If the speech recognition result is not a command word (for example, the result is noise), this can indicate that the speech in the extracted segment was not successfully recognized, and recognition can further be performed by using the sliding window method.
Step 804: Perform human-computer interaction processing.
In this step, this application does not limit the specific content of the human-computer interaction processing. For example, for a smart lighting device, when the command word is "turn on the light", the lighting unit of the device is switched on; for an intelligent robot, when the command word is "move forward", the robot is controlled to move forward; and so on.
Step 805: Starting from the beginning of the audio segment, extract audio sub-segment 1 within the window, extract the audio features of audio sub-segment 1, decode them by using the model, and output the speech recognition result of audio sub-segment 1.
In this step, the model used may be the same as that in step 802. Specifically, the speech recognition result of audio sub-segment 1 may be obtained by inputting its audio features into the model.
Step 806: Determine whether the speech recognition result of audio sub-segment 1 is a command word.
In this step, if the speech recognition result of audio sub-segment 1 is a command word, step 804 is executed. If it is not a command word, step 807 is executed.
If the speech recognition result of audio sub-segment 1 is a command word, this can indicate that the speech in the audio segment has been successfully recognized by performing speech recognition on audio sub-segment 1, and subsequent processing can further be performed according to the recognition result. If it is not a command word, this can indicate that the speech in the segment was not successfully recognized from audio sub-segment 1, and the next audio sub-segment can further be recognized in order to recognize the speech in the segment.
Step 807: Move the window one step to the right, extract audio sub-segment 2 now within the window, extract the audio features of audio sub-segment 2, decode them by using the model, and output the speech recognition result of audio sub-segment 2.
In this step, the model used may be the same as that in step 802. Specifically, the speech recognition result of audio sub-segment 2 may be obtained by inputting its audio features into the model.
Step 808: Determine whether the speech recognition result of audio sub-segment 2 is a command word.
In this step, if the speech recognition result of audio sub-segment 2 is a command word, step 804 is executed. If it is not a command word, step 809 is executed.
If the speech recognition result of audio sub-segment 2 is a command word, this can indicate that the speech in the audio segment has been successfully recognized by performing speech recognition on audio sub-segment 2, and subsequent processing can further be performed according to the recognition result. If it is not a command word, this can indicate that the speech in the segment was not successfully recognized from audio sub-segment 2, and the next audio sub-segment can further be recognized in order to recognize the speech in the segment.
Step 809: Move the window one step to the right, extract audio sub-segment 3 now within the window, extract the audio features of audio sub-segment 3, decode them by using the model, and output the speech recognition result of audio sub-segment 3.
In this step, the model used may be the same as that in step 802. Specifically, the speech recognition result of audio sub-segment 3 may be obtained by inputting its audio features into the model.
Speech recognition continues in this loop over the audio sub-segments within the sliding window. If at any point the speech recognition result is a command word, the sliding-window processing of the audio segment ends and human-computer interaction processing begins. If no command word has been recognized when the end condition is satisfied, the audio segment is considered to genuinely contain no command-word speech, and the sliding-window processing of the segment ends.
In this embodiment, an audio segment is extracted from the audio signal based on the voice activity detection method, its audio features are extracted and decoded by the pre-trained model, and its speech recognition result is output. If the result is a command word, human-computer interaction processing is further performed; if it is not, the sliding window method is used to perform speech recognition on the segment. In this way, the sliding window method, which recognizes speech from the audio features of the audio sub-segments, is applied only when recognition based on the whole segment's audio features fails. Since speech recognition based on the whole segment's features is much faster than sliding-window recognition based on the features of its sub-segments, this both prevents noise from lowering recognition accuracy and shortens processing time.
In addition, the audio processing method of the embodiment shown in FIG. 8 is more accurate than the following prior-art approach 1 and faster than the following prior-art approach 2. Approach 1: extract an audio segment from the audio signal based on voice activity detection, perform speech recognition directly on the segment, and perform human-computer interaction processing according to the recognition result. Approach 2: perform speech recognition on the audio signal using a sliding window approach and perform human-computer interaction processing according to the recognition result.
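Purely as an illustration of the flow of FIG. 8, the following sketch chains the pieces above, reusing the sliding_window_process sketch given earlier; vad_extract_segment, extract_features, model.decode, and perform_interaction are hypothetical stand-ins for the VAD, feature-extraction, decoding, and interaction components described above.

```python
# Non-limiting sketch of the FIG. 8 flow; vad_extract_segment,
# extract_features, model.decode, and perform_interaction are
# hypothetical placeholders for the components described above.
COMMAND_WORDS = {"turn on the light", "move forward"}  # assumed command words

def recognize(audio, sr, model):
    features = extract_features(audio, sr)   # e.g., the MFCC sketch above
    text = model.decode(features)            # decoding with the same model
    return text if text in COMMAND_WORDS else None

def handle_audio(signal, sr, model):
    segment = vad_extract_segment(signal, sr)      # step 801 (hypothetical)
    command = recognize(segment, sr, model)        # steps 802-803
    if command is None:                            # step 805 onward
        command = sliding_window_process(          # reuses the sketch above
            segment, sr, lambda sub: recognize(sub, sr, model))
    if command is not None:
        perform_interaction(command)               # step 804 (hypothetical)
```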
FIG. 9 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of this application. As shown in FIG. 9, the apparatus 900 may include: a processor 901 and a memory 902.
The memory 902 is configured to store program code;
The processor 901 invokes the program code, and when the program code is executed, is configured to perform the following operations:
extracting an audio segment from an audio signal based on a voice activity detection method;
performing target processing on the audio segment by using a sliding window method to obtain a processing result of the audio segment.
The audio processing apparatus provided in this embodiment can be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar to those of the method embodiments and are not repeated here.
A person of ordinary skill in the art can understand that all or some of the steps of the above method embodiments can be implemented by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The foregoing storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (38)

  1. An audio processing method, characterized by comprising:
    extracting an audio segment from an audio signal based on a voice activity detection method;
    performing target processing on the audio segment by using a sliding window method to obtain a processing result of the audio segment.
  2. The method according to claim 1, characterized in that, before the performing target processing on the audio segment by using a sliding window method, the method further comprises:
    performing the target processing on the audio segment with the audio segment as a processing unit to obtain a processing result of the audio segment;
    determining whether the processing result of the audio segment satisfies a result condition;
    if the processing result of the audio segment does not satisfy the result condition, executing the step of performing target processing on the audio segment by using the sliding window method.
  3. The method according to claim 2, characterized in that the executing, if the processing result of the audio segment does not satisfy the result condition, the step of performing target processing on the audio segment by using the sliding window method comprises:
    if the processing result of the audio segment does not satisfy the result condition and the audio segment satisfies a duration condition, executing the step of performing target processing on the audio segment by using the sliding window method.
  4. The method according to claim 3, characterized in that the duration condition comprises a duration being greater than or equal to a duration threshold.
  5. The method according to claim 4, characterized in that the duration threshold is positively correlated with the shortest audio length of an audio segment that satisfies the result condition.
  6. The method according to claim 2, characterized in that the target processing comprises extracting audio features and decoding the audio features by using a pre-trained model.
  7. The method according to claim 6, characterized in that the same model is used for the target processing performed with the audio segment as the processing unit and the target processing performed by using the sliding window method.
  8. The method according to claim 2, characterized in that multiple audio segments are extracted from the audio signal;
    the performing the target processing on the audio segment with the audio segment as the processing unit to obtain the processing result of the audio segment comprises:
    selecting one audio segment from the multiple audio segments, and performing the target processing on the selected audio segment with the audio segment as the processing unit to obtain the processing result of the audio segment;
    after the performing target processing on the audio segment by using the sliding window method, the method further comprises: returning to execute the step of selecting one audio segment from the multiple audio segments and performing the target processing on the selected audio segment with the audio segment as the processing unit, until a completion condition is satisfied.
  9. The method according to claim 8, characterized in that the completion condition comprises any one of the following:
    obtaining a target number of processing results that satisfy the result condition, performing the target processing a preset number of times, performing the target processing on a preset number of audio segments, or performing the target processing on all audio segments.
  10. The method according to any one of claims 1 to 9, wherein the performing target processing on the audio segment by using a sliding window method comprises:
    extracting a current sub-segment of the audio segment according to a length and a position of a window;
    performing the target processing on the current sub-segment;
    if a processing result of the current sub-segment satisfies a result condition, taking the processing result of the current sub-segment as the processing result of the audio segment;
    if the processing result of the current sub-segment does not satisfy the result condition, sliding the window by one step and returning to execute the step of extracting the current sub-segment of the audio segment according to the length and the position of the window, until an end condition is satisfied, so as to complete the target processing of the audio segment.
  11. The method according to claim 10, characterized in that the length of the window is positively correlated with the longest audio length of an audio segment that satisfies the result condition.
  12. The method according to claim 10, characterized in that the end condition comprises: the window moving to an end of the audio segment, and/or the number of window slides reaching a maximum number of slides.
  13. The method according to claim 12, characterized in that the maximum number of slides is positively correlated with a desired processing precision; and/or the step size is negatively correlated with the desired processing precision.
  14. The method according to any one of claims 2 to 13, characterized in that the target processing comprises speech recognition processing, the result condition is used to determine whether an audio segment matches a keyword, the processing result of the audio segment satisfying the result condition indicates that the audio segment matches the keyword, and the processing result of the audio segment not satisfying the result condition indicates that the audio segment does not match the keyword.
  15. The method according to any one of claims 2 to 13, characterized in that the target processing comprises speech matching processing, the result condition is used to determine whether an audio segment matches a target speech, the processing result of the audio segment satisfying the result condition indicates that the audio segment matches the target speech, and the processing result of the audio segment not satisfying the result condition indicates that the audio segment does not match the target speech.
  16. The method according to any one of claims 2 to 15, characterized in that the method further comprises:
    when the processing result of the audio segment satisfies the result condition, performing subsequent processing.
  17. The method according to claim 16, characterized in that the processing result of the audio segment satisfying the result condition indicates that the audio segment matches a keyword; and the performing subsequent processing when the processing result of the audio segment satisfies the result condition comprises:
    when the processing result of the audio segment matches the keyword, executing subsequent processing corresponding to the keyword.
  18. The method according to claim 1, characterized in that the target processing comprises any one of the following:
    speech recognition processing, speech matching processing, or voice endpoint detection processing.
  19. An audio processing apparatus, characterized by comprising: a processor and a memory;
    the memory is configured to store program code;
    the processor invokes the program code, and when the program code is executed, is configured to perform the following operations:
    extracting an audio segment from an audio signal based on a voice activity detection method;
    performing target processing on the audio segment by using a sliding window method to obtain a processing result of the audio segment.
  20. The apparatus according to claim 19, characterized in that the processor is further configured to:
    perform the target processing on the audio segment with the audio segment as a processing unit to obtain a processing result of the audio segment;
    determine whether the processing result of the audio segment satisfies a result condition;
    if the processing result of the audio segment does not satisfy the result condition, execute the step of performing target processing on the audio segment by using the sliding window method.
  21. The apparatus according to claim 20, characterized in that the processor being configured to execute, if the processing result of the audio segment does not satisfy the result condition, the step of performing target processing on the audio segment by using the sliding window method specifically comprises:
    if the processing result of the audio segment does not satisfy the result condition and the audio segment satisfies a duration condition, executing the step of performing target processing on the audio segment by using the sliding window method.
  22. The apparatus according to claim 21, characterized in that the duration condition comprises a duration being greater than or equal to a duration threshold.
  23. The apparatus according to claim 22, characterized in that the duration threshold is positively correlated with the shortest audio length of an audio segment that satisfies the result condition.
  24. The apparatus according to claim 20, characterized in that the target processing comprises extracting audio features and decoding the audio features by using a pre-trained model.
  25. The apparatus according to claim 24, characterized in that the same model is used for the target processing performed with the audio segment as the processing unit and the target processing performed by using the sliding window method.
  26. The apparatus according to claim 20, characterized in that multiple audio segments are extracted from the audio signal;
    the processor being configured to perform the target processing on the audio segment with the audio segment as the processing unit to obtain the processing result of the audio segment specifically comprises:
    selecting one audio segment from the multiple audio segments, and performing the target processing on the selected audio segment with the audio segment as the processing unit to obtain the processing result of the audio segment;
    the processor is further configured to, after performing target processing on the audio segment by using the sliding window method, return to execute the step of selecting one audio segment from the multiple audio segments and performing the target processing on the selected audio segment with the audio segment as the processing unit, until a completion condition is satisfied.
  27. The apparatus according to claim 26, characterized in that the completion condition comprises any one of the following:
    obtaining a target number of processing results that satisfy the result condition, performing the target processing a preset number of times, performing the target processing on a preset number of audio segments, or performing the target processing on all audio segments.
  28. The apparatus according to any one of claims 19 to 27, wherein the processor being configured to perform target processing on the audio segment by using the sliding window method specifically comprises:
    extracting a current sub-segment of the audio segment according to a length and a position of a window;
    performing the target processing on the current sub-segment;
    if a processing result of the current sub-segment satisfies a result condition, taking the processing result of the current sub-segment as the processing result of the audio segment;
    if the processing result of the current sub-segment does not satisfy the result condition, sliding the window by one step and returning to execute the step of extracting the current sub-segment of the audio segment according to the length and the position of the window, until an end condition is satisfied, so as to complete the target processing of the audio segment.
  29. The apparatus according to claim 28, characterized in that the length of the window is positively correlated with the longest audio length of an audio segment that satisfies the result condition.
  30. The apparatus according to claim 28, characterized in that the end condition comprises: the window moving to an end of the audio segment, and/or the number of window slides reaching a maximum number of slides.
  31. The apparatus according to claim 30, characterized in that the maximum number of slides is positively correlated with a desired processing precision; and/or the step size is negatively correlated with the desired processing precision.
  32. The apparatus according to any one of claims 20 to 31, characterized in that the target processing comprises speech recognition processing, the result condition is used to determine whether an audio segment matches a keyword, the processing result of the audio segment satisfying the result condition indicates that the audio segment matches the keyword, and the processing result of the audio segment not satisfying the result condition indicates that the audio segment does not match the keyword.
  33. The apparatus according to any one of claims 20 to 31, characterized in that the target processing comprises speech matching processing, the result condition is used to determine whether an audio segment matches a target speech, the processing result of the audio segment satisfying the result condition indicates that the audio segment matches the target speech, and the processing result of the audio segment not satisfying the result condition indicates that the audio segment does not match the target speech.
  34. The apparatus according to any one of claims 20 to 33, characterized in that the processor is further configured to perform subsequent processing when the processing result of the audio segment satisfies the result condition.
  35. The apparatus according to claim 34, characterized in that the processing result of the audio segment satisfying the result condition indicates that the audio segment matches a keyword; and the processor being configured to perform subsequent processing when the processing result of the audio segment satisfies the result condition specifically comprises:
    when the processing result of the audio segment matches the keyword, executing subsequent processing corresponding to the keyword.
  36. The apparatus according to claim 19, characterized in that the target processing comprises any one of the following:
    speech recognition processing, speech matching processing, or voice endpoint detection processing.
  37. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises at least one piece of code, and the at least one piece of code is executable by a computer to control the computer to execute the audio processing method according to any one of claims 1 to 18.
  38. A computer program, characterized in that, when executed by a computer, the computer program is used to implement the audio processing method according to any one of claims 1 to 18.
PCT/CN2019/098613 2019-07-31 2019-07-31 Audio processing method and apparatus WO2021016925A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/098613 WO2021016925A1 (zh) 2019-07-31 2019-07-31 Audio processing method and apparatus
CN201980033584.3A CN112189232A (zh) 2019-07-31 2019-07-31 Audio processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/098613 WO2021016925A1 (zh) 2019-07-31 2019-07-31 Audio processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2021016925A1 true WO2021016925A1 (zh) 2021-02-04

Family

ID=73919010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/098613 WO2021016925A1 (zh) 2019-07-31 2019-07-31 Audio processing method and apparatus

Country Status (2)

Country Link
CN (1) CN112189232A (zh)
WO (1) WO2021016925A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951218B (zh) * 2021-03-22 2024-03-29 Bigo Technology (Singapore) Pte. Ltd. Speech processing method and apparatus based on a neural network model, and electronic device
CN114495907A (zh) * 2022-01-27 2022-05-13 Duoyi Network Co., Ltd. Adaptive voice activity detection method, apparatus, device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05313674A (ja) * 1992-05-14 1993-11-26 Sony Corp Noise reduction device
CN103731567A (zh) * 2012-10-11 2014-04-16 International Business Machines Corporation Method and system for reducing noise in a shared media session
CN103874002A (zh) * 2012-12-18 2014-06-18 Oticon A/S Audio processing apparatus including unnatural signal reduction
CN103986845A (zh) * 2013-02-07 2014-08-13 Lenovo (Beijing) Co., Ltd. Information processing method and information processing device
CN106297776A (zh) * 2015-05-22 2017-01-04 Institute of Acoustics, Chinese Academy of Sciences Speech keyword retrieval method based on audio templates
CN106782613A (zh) * 2016-12-22 2017-05-31 Guangzhou Kugou Computer Technology Co., Ltd. Signal detection method and apparatus
CN107967918A (zh) * 2016-10-19 2018-04-27 Henan Lanxin Technology Co., Ltd. Method for enhancing the clarity of a speech signal
CN108460633A (zh) * 2018-03-05 2018-08-28 Beijing Dianguangcong Information Technology Co., Ltd. Method for establishing an advertisement audio collection and recognition system and use thereof

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763844B (zh) * 2004-10-18 2010-05-05 Institute of Acoustics, Chinese Academy of Sciences Sliding-window-based endpoint detection method, apparatus, and speech recognition system
KR101697651B1 (ko) * 2012-12-13 2017-01-18 Electronics and Telecommunications Research Institute Method and apparatus for detecting a voice signal
US9524735B2 (en) * 2014-01-31 2016-12-20 Apple Inc. Threshold adaptation in two-channel noise estimation and voice activity detection
CN108831508A (zh) * 2018-06-13 2018-11-16 Baidu Online Network Technology (Beijing) Co., Ltd. Voice activity detection method, apparatus, and device
CN109545188B (zh) * 2018-12-07 2021-07-09 Shenzhen Youjie Zhixin Technology Co., Ltd. Real-time voice endpoint detection method and apparatus


Also Published As

Publication number Publication date
CN112189232A (zh) 2021-01-05

Similar Documents

Publication Publication Date Title
WO2021093449A1 (zh) Artificial-intelligence-based wake-up word detection method and apparatus, device, and medium
US10540961B2 (en) Convolutional recurrent neural networks for small-footprint keyword spotting
KR102134201B1 (ko) Method, apparatus, and storage medium for constructing a speech decoding network for digit speech recognition
JP6772198B2 (ja) Language model speech endpointing
CN110310623B (zh) Sample generation method, model training method, apparatus, medium, and electronic device
WO2020258661A1 (zh) Speaker separation method and apparatus based on recurrent neural network and acoustic features
US11727917B1 (en) Silent phonemes for tracking end of speech
US9070367B1 (en) Local speech recognition of frequent utterances
JP7336537B2 (ja) Combined endpoint determination and automatic speech recognition
CN104157285B (zh) Speech recognition method and apparatus, and electronic device
WO2020043160A1 (en) Method and system for detecting voice activity in noisy conditions
US11978478B2 (en) Direction based end-pointing for speech recognition
CN102376305A (zh) Speech recognition method and system
WO2015103836A1 (zh) Voice control method and apparatus
CN112599152B (zh) Speech data annotation method and system, electronic device, and storage medium
CN103778915A (zh) Speech recognition method and mobile terminal
WO2021016925A1 (zh) Audio processing method and apparatus
US20220068269A1 (en) Adaptive batching to reduce recognition latency
JPWO2010128560A1 (ja) Speech recognition device, speech recognition method, and speech recognition program
CN114708856A (zh) Speech processing method and related device
US9542939B1 (en) Duration ratio modeling for improved speech recognition
KR20200023893A (ko) Speaker authentication method, learning method for speaker authentication, and apparatuses therefor
CN112825250A (zh) Voice wake-up method, device, storage medium, and program product
CN111739515A (zh) Speech recognition method and device, electronic device and server, and related system
Wang et al. A fusion model for robust voice activity detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19940018; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 19940018; Country of ref document: EP; Kind code of ref document: A1