WO2021146857A1 - Audio processing method and device

Audio processing method and device

Info

Publication number
WO2021146857A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
window
audio signal
sampling window
voice information
Prior art date
Application number
PCT/CN2020/073292
Other languages
French (fr)
Chinese (zh)
Inventor
吴俊峰 (Wu Junfeng)
罗东阳 (Luo Dongyang)
童焦龙 (Tong Jiaolong)
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to CN202080004356.6A (CN112543972A)
Priority to PCT/CN2020/073292 (WO2021146857A1)
Publication of WO2021146857A1

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 - Speech recognition
                    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/04 - Segmentation; Word boundary detection
                        • G10L 15/05 - Word boundary detection
                    • G10L 15/08 - Speech classification or search
                        • G10L 15/14 - using statistical models, e.g. Hidden Markov Models [HMMs]
                            • G10L 15/142 - Hidden Markov Models [HMMs]
                                • G10L 15/144 - Training of HMMs
                    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
                    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/223 - Execution procedure of a spoken command
                • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/03 - characterised by the type of extracted parameters
                        • G10L 25/24 - the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of audio technology, and in particular to an audio processing method and device.
  • Voice interaction is a common way of human-computer interaction.
  • Before using voice recognition for human-computer interaction, it is necessary to perform voice endpoint detection first, that is, to detect which parts of the sound recorded by the microphone may contain voice.
  • The anti-interference ability and computing power consumption of different voice endpoint detection methods vary from one to another, and how to balance anti-interference ability against computing power consumption has become an urgent problem in voice endpoint detection.
  • The embodiments of the present application provide an audio processing method and device to solve the technical problem in the prior art that anti-interference ability and computing power consumption cannot both be taken into account during voice endpoint detection, resulting in poor anti-interference ability or high computing power consumption.
  • In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, based on the audio segment, whether to perform a window recognition operation. The window recognition operation includes: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory. The memory is used to store program code; the processor calls the program code, and when the program code is executed, it performs the following operations: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, based on the audio segment, whether to perform a window recognition operation, the window recognition operation including: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • In a third aspect, an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes at least one piece of code, and the at least one piece of code can be executed by a computer to control the computer to execute the audio processing method according to any one of the above-mentioned first aspects.
  • In a fourth aspect, an embodiment of the present application provides a computer program that, when executed by a computer, implements the audio processing method according to any one of the above-mentioned first aspects.
  • In summary, the embodiments of the present application provide an audio processing method and device that select and switch the voice endpoint detection method according to the characteristics of the specific application scenario during voice endpoint detection, so that both anti-interference ability and computing power consumption can be taken into account.
  • FIG. 1A is a schematic diagram 1 of an application scenario of an audio processing method provided by an embodiment of this application;
  • FIG. 1B is a second schematic diagram of an application scenario of the audio processing method provided by an embodiment of the application.
  • FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the application.
  • FIGS. 3A-3C are schematic diagrams of excluding the noise of an audio segment, provided by an embodiment of the application;
  • FIG. 4 is a schematic flowchart of an audio processing method provided by another embodiment of this application.
  • FIG. 5 is a schematic flowchart of a voice activity detection method provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of the application.
  • the audio processing method provided in the embodiments of the present application can be applied to any audio processing process that requires voice endpoint detection, and the audio processing method can be specifically executed by an audio processing device.
  • the audio processing device may be a device including an audio collection module (for example, a microphone).
  • A schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1A.
  • The audio collection module of the audio processing device can collect the voice spoken by the user to obtain audio signals, and the processor of the audio processing device can process the audio signals collected by the audio collection module using the audio processing method provided in the embodiments of the present application.
  • FIG. 1A is only a schematic diagram, and does not limit the structure of the audio processing device.
  • an amplifier may be connected between the microphone and the processor to amplify the audio signal collected by the microphone.
  • a filter may be connected between the microphone and the processor to filter the audio signal collected by the microphone.
  • the audio processing device may also be a device that does not include an audio collection module.
  • a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1B.
  • the communication interface of the audio processing device may receive audio signals collected by other devices or equipment, and the processor of the audio processing device may process the received audio signals using the audio processing method provided in the embodiments of the present application.
  • FIG. 1B is only a schematic diagram, and does not limit the structure of the audio processing device and the connection mode between the audio processing device and other devices or equipment.
  • the communication interface in the audio processing device can be replaced with a transceiver.
  • the type of equipment including the audio processing device may not be limited in the embodiments of the present application.
  • the equipment may be, for example, smart speakers, smart lighting devices, smart robots, mobile phones, tablet computers, and the like.
  • The audio processing method provided by the embodiments of the present application can automatically switch between the voice activity detection method and the sliding window detection method according to the characteristics of the specific application scenario, so that the anti-interference ability of voice endpoint detection can be improved as a whole and its computing power consumption can be reduced. That is, the anti-interference ability of voice endpoint detection can be stronger than when purely using the voice activity detection method, while its computing power consumption can be less than when purely using the sliding window detection method.
  • The audio processing method provided by the embodiments of this application can be applied to the field of voice recognition, smart hardware devices with voice recognition functions, the field of audio event detection, smart hardware devices with audio event detection functions, and so on, which is not limited in this application.
  • the execution subject of this embodiment may be an audio processing device, and specifically may be a processor of the audio processing device. As shown in FIG. 2, the method of this embodiment may include step S201 and step S202, for example.
  • In step S201, an audio segment is intercepted from the audio signal according to the audio feature information of the audio signal.
  • In step S201, the audio segment is intercepted based on audio feature information. Since the intercepted endpoints may be inaccurate, the intercepted audio segment usually contains noise; if such an audio segment is used directly for speech recognition, the accuracy is not high.
  • the audio features may include, for example, Mel Frequency Cepstrum Coefficient (MFCC) features, linear prediction coefficients (Linear Prediction Coefficients, LPC) features, and filter bank (Filter bank, Fbank) features.
  • a voice endpoint detection method with lower computing power consumption may be used to process the audio signal.
  • The voice activity detection method recognizes whether the audio signal is a voice signal through methods such as an energy and zero-crossing-rate double threshold or a noise/speech classification model, and a segment is intercepted for speech recognition only when voice is determined to be present. The voice activity detection method can therefore effectively distinguish silence from non-silence, and saves computing power when used for speech recognition.
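As an illustrative sketch (not part of the patent), the energy and zero-crossing-rate double threshold mentioned above can be approximated as follows; the frame length, hop, and both thresholds are assumed placeholder values:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (assumes len(x) >= frame_len)."""
    n = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def vad_double_threshold(x, frame_len=400, hop=160,
                         energy_thresh=0.01, zcr_thresh=0.3):
    """Mark each frame as speech (True) or non-speech (False).

    A frame counts as speech when its short-time energy exceeds
    energy_thresh, or when its energy is moderate (above a quarter of
    the threshold) but its zero-crossing rate is high, which helps keep
    unvoiced consonants. All values here are illustrative assumptions.
    """
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)                                # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    return (energy > energy_thresh) | ((energy > energy_thresh / 4) & (zcr > zcr_thresh))
```

On a signal consisting of silence followed by a tone, for example, the silent frames come back False and the tonal frames True, which is the silent/non-silent distinction the text describes.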
  • a voice activity detection method may be used to intercept audio segments in the audio signal according to the audio feature information of the audio signal.
  • However, the voice activity detection method cannot effectively distinguish speech from voice-type noise or from non-speech noise with slightly higher energy.
  • Therefore, if the voice activity detection method is used for command-word speech recognition, noise is easily mixed in before and after the command speech segment, which reduces the voice command recognition rate. The sliding window detection method, by contrast, extracts speech segments of a fixed length from the audio stream and performs speech recognition continuously. It can effectively prevent noise from being mixed in before and after the command-word speech segment, but because it performs speech recognition intensively, keeping it running on a smart hardware device consumes significant computing resources and power.
  • Accordingly, the voice activity detection method can be used for audio processing in application scenarios with low anti-interference requirements to reduce computing power consumption, while the sliding window detection method can be used in application scenarios with high anti-interference requirements to improve the anti-interference ability.
  • In step S202, it is determined, based on the audio segment obtained in step S201, whether to perform a window recognition operation.
  • the window recognition operation may include, for example, the following operations: moving a sampling window in the audio signal after the audio segment obtained in step S201, and performing voice recognition on the audio signal in the sampling window.
  • As shown in FIGS. 3A to 3C, the sliding window detection method can exclude, in one or more windows, the noise included in the audio segment cut out by voice activity detection: the noise in audio segment X1 (FIG. 3A), in audio segment X2 (FIG. 3B), and in audio segment X3 (FIG. 3C) can each be excluded. It should be noted that the parts filled with grid lines in FIGS. 3A to 3C represent noise.
  • In step S202, it can be determined whether the audio segment obtained in step S201 contains, or may contain, voice-type noise or non-speech-type noise with slightly higher energy, and then whether to perform a window recognition operation. If it is determined that the audio segment obtained in step S201 contains or may contain such noise, the window recognition operation is performed; otherwise, steps S201 and S202 continue to be performed.
  • For command-type voice information, noise is easily mixed in before and after the command, which easily leads to inaccurate speech segment interception and/or inaccurate speech recognition. Therefore, if command-type voice information is detected in the audio segment obtained in step S201, a window recognition operation is performed to eliminate the influence of noise.
  • Likewise, if non-speech-type noise with slightly higher energy is detected, a window recognition operation is performed to avoid the inaccurate audio segment interception and/or inaccurate speech recognition caused by the inability of the voice activity detection method to effectively distinguish this type of information.
  • In this way, when processing audio signals, the voice endpoint detection method can be selected according to the characteristics of the specific application scenario. In application scenarios with low anti-interference requirements, the voice activity detection method can be used to process the audio signal to save computing power; in application scenarios with high anti-interference requirements, the sliding window detection method can be used to improve the anti-interference ability. This combines the high anti-interference ability of the sliding window detection method with the low computing power consumption of the voice activity detection method, so that the anti-interference ability of voice endpoint detection is generally stronger than when purely using the voice activity detection method, while its computing power consumption is generally less than when purely using the sliding window detection method.
  • In step S202, it is possible to determine, in various ways, whether there is instruction-type voice information in the audio segment obtained in step S201.
  • For example, voice information may be extracted from the audio segment obtained in step S201, and the content and/or voice length of the extracted voice information may be used to determine whether instruction information exists or may exist, and then whether to perform the window recognition operation.
  • the method may further include, for example, extracting voice information from an audio segment.
  • For example, step S202 may include judging, based on the extracted voice information, whether to perform the window recognition operation.
  • The specific condition may be, for example, whether the voice information contains instruction-type information.
  • it may be determined whether to perform a window recognition operation based on the content and the length of the extracted voice information.
  • the sliding window detection method is used for voice endpoint detection to prevent inaccurate speech fragment interception and/or inaccurate speech recognition caused by mixing too much noise before and after the instruction.
  • the voice length corresponding to the voice information containing the instruction word is also limited. In this way, it is possible to roughly estimate whether or not there is an instruction word in the voice information according to the voice length of the extracted voice information, and then it is possible to determine whether to perform a window recognition operation.
  • In this way, for the audio segment obtained by voice endpoint detection, the voice information contained therein can be extracted first, and then the content and/or voice length of the voice information can be used to determine or estimate whether there is instruction information in the audio segment; the result of this determination or estimation decides whether to perform the window recognition operation, thereby avoiding the inaccuracy caused by excessive noise during voice segment interception and voice recognition of command-type voice information.
  • Optionally, the voice information may be extracted from the audio segment obtained in step S201, the presence or possible presence of instruction information can be determined according to the content of the extracted voice information and/or the duration of the audio segment, and then it can be determined whether to perform the window recognition operation.
  • the method may further include, for example, extracting voice information from the audio segment obtained in step S201.
  • For example, step S202 may include the following operations: it is first judged whether the extracted voice information matches the first voice information; if it does not match, it is then determined whether the duration of the audio segment is greater than or equal to the first duration threshold; and if the duration is less than the first duration threshold, steps S201 and S202 are executed for the audio signal after the audio segment.
  • Alternatively, step S202 may include the following operations: it is first determined whether the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold; if it is less than the first duration threshold, it is then determined whether the extracted voice information matches the first voice information; and if it does not match, steps S201 and S202 are executed for the audio signal after the audio segment.
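Both orderings above reach the same decision. A minimal Python sketch of this step-S202 logic, with a hypothetical command vocabulary standing in for the first voice information and an illustrative first duration threshold, might look like:

```python
# Hypothetical values: neither the vocabulary nor the threshold comes from the patent.
FIRST_VOICE_INFO = {"take off", "land", "turn left", "turn right"}
FIRST_DURATION_THRESHOLD = 2.0  # seconds

def should_run_window_recognition(extracted_text: str, segment_duration: float) -> bool:
    """Decide whether to switch to the window recognition operation."""
    # Match against the first voice information (instruction vocabulary).
    if extracted_text.strip().lower() in FIRST_VOICE_INFO:
        return True
    # A long segment may contain an instruction word plus surrounding noise.
    if segment_duration >= FIRST_DURATION_THRESHOLD:
        return True
    # Otherwise keep using voice activity detection (steps S201/S202).
    return False
```

Checking the cheap string match and the duration in either order gives the same result, which is why the patent text presents the two sequences as alternatives.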
  • the first voice information may be set according to the vocabulary of the characterizing instruction (such as the first control instruction).
  • the first voice information may be set according to a vocabulary that characterizes all control instructions applied to the smart hardware device.
  • the voice length corresponding to the voice information containing the instruction word is also limited, and the duration of the audio segment containing the instruction word is also limited. Therefore, it is possible to roughly estimate whether or not there is an instruction word in the audio segment according to the duration of the audio segment obtained in step S201, and then it can be determined whether to perform a window recognition operation.
  • the first duration threshold may be set according to the maximum duration, average duration, or any duration of the audio segment containing the instruction (such as the first control instruction).
  • In this way, for the audio segment obtained by voice endpoint detection, the voice information contained therein can be extracted, and based on the content of the voice information and the duration of the audio segment, it can be determined or estimated whether there is instruction information in the audio segment; the estimation result then determines whether to perform the window recognition operation. This likewise avoids the inaccuracy caused by excessive noise when performing voice segment interception and voice recognition on command-type voice information.
  • Optionally, the method may further include performing a window recognition operation when either of the following events occurs: the voice information extracted from the audio segment obtained in step S201 matches the first voice information (referred to as event 1); and/or the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold (referred to as event 2).
  • the setting of the first voice information and the first duration threshold is the same or similar to the setting method in the foregoing embodiment, which is not repeated here in the embodiment of the present application.
  • The voice information may be extracted from the audio segment obtained in step S201 and then matched against the first voice information, so as to determine whether the extracted voice information matches the first voice information.
  • The duration of the audio segment obtained in step S201 may be determined first, and then compared against the first duration threshold to check whether it is greater than or equal to that threshold.
  • In this way, one or more of event 1 and event 2 can be used to determine or estimate that the audio segment obtained in step S201 contains or may contain instruction information, so that the window recognition operation can be performed to intercept the voice segment. The noise before and after the instruction is thereby excluded, making the interception of the speech segment more accurate and the resulting speech recognition result more accurate as well.
  • In step S202, it is also possible to detect, in a variety of ways, whether there is non-voice information with slightly higher energy in the audio segment obtained in step S201.
  • For example, step S202 may include judging whether a preset audio event has occurred based on the audio segment obtained in step S201, so as to determine whether there is non-voice information with higher energy in the audio segment.
  • If the preset audio event occurs, the window recognition operation can be performed so that the noise is excluded when the audio segment is intercepted, making the interception result more accurate and, in turn, the resulting audio processing result more accurate.
  • If the preset audio event does not occur, it is considered that there is no higher-energy non-voice information in the audio segment, and the window recognition operation need not be performed.
  • the foregoing preset audio event may include, for example, one or more of the following sounds in the audio segment: percussion, applause, and clapping.
  • The above-mentioned preset audio events can be set according to non-voice information with slightly higher energy, such as the percussive, applause, and clapping sounds that often appear in audio segments.
  • Performing voice endpoint detection in this way can prevent excessive noise from being mixed in before and after the instruction, which would otherwise lead to inaccurate speech segment interception and/or inaccurate speech recognition.
  • In addition, during actual use the user may not speak continuously; after issuing an instruction, the user may not issue any further instructions for a long period of time.
  • For example, the user says "please go to sleep" to a smart hardware device with a voice recognition function (such as a robot), or says "please fly around A twice" to a smart hardware device with a voice recognition function (such as a drone).
  • In such cases, the user generally will not issue other control commands for a long period of time after the command is issued. Therefore, if these special command-type voice messages are detected, switching to the sliding window detection method for voice endpoint detection would not significantly improve the anti-interference ability but would increase the computing power consumption. Accordingly, in the embodiments of the present application, when these special circumstances occur, the switch to the sliding window detection method may be skipped, thereby avoiding the drawback that the switch brings no significant improvement in anti-interference ability while increasing computing power consumption.
  • Therefore, in the process of judging whether to perform the window recognition operation based on the audio segment obtained in step S201, it is also possible to detect whether specific voice information appears. If specific voice information is detected, steps S201 and S202 are executed for the audio signal after the audio segment obtained in step S201, without performing the window recognition operation.
  • For example, step S202 may include extracting voice information from the audio segment obtained in step S201 and determining whether the extracted voice information matches the second voice information. If it matches, steps S201 and S202 are executed for the audio signal after the audio segment obtained in step S201.
  • the above-mentioned second voice information may be set according to a vocabulary characterizing the second control instruction, for example.
  • the second control instruction may have any of the following features: the processor suspends execution of other control instructions during the execution of the second control instruction (feature 1 for short); the second control instruction represents the pause of voice interaction with the user (feature 2 for short).
  • the second control command may be, for example, "please fly around A twice" in the above example.
  • The second control command may be, for example, "please go to sleep" in the above example.
  • the second control instruction may be any control instruction having the above-mentioned feature 1 and/or feature 2, which is not limited in the embodiment of the present application.
  • In addition, after switching to the sliding window detection method, that is, after performing the window recognition operation, if the specific application scenario is found to change again, it is also possible to switch back from the sliding window detection method to the previously used voice endpoint detection method, so as to save as much computing power as possible.
  • For example, during the window recognition operation, that is, while moving the sampling window in the audio signal after the audio segment obtained in step S201 and performing voice recognition on the audio signal in the sampling window, steps S201 and S202 may be executed for the audio signal after the current position of the sampling window in response to a switching trigger event.
  • The aforementioned switching trigger event may include, for example, any one or more of the following: voice information matching the third voice information is detected while performing voice recognition on the audio signal in the sampling window; the number of movements of the sampling window reaches the maximum number of movements; or the processing duration of moving the sampling window and performing voice recognition on the audio signal in the sampling window reaches the second duration threshold.
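A hedged sketch of checking these switching trigger events; the default phrase, move limit, and duration threshold are illustrative assumptions, not values from the patent:

```python
def should_switch_back(recognized_text: str, moves_done: int, elapsed_seconds: float,
                       third_voice_info=frozenset({"please go to sleep"}),
                       max_moves: int = 20,
                       second_duration_threshold: float = 10.0) -> bool:
    """Return True when any switching trigger event has occurred, i.e. when the
    window recognition operation should stop and steps S201/S202 should resume
    for the audio signal after the sampling window's current position."""
    if recognized_text.strip().lower() in third_voice_info:
        return True  # matched the third voice information
    if moves_done >= max_moves:
        return True  # maximum number of sampling-window movements reached
    if elapsed_seconds >= second_duration_threshold:
        return True  # processing duration reached the second duration threshold
    return False
```

The caller would evaluate this after each window move; any single trigger suffices, matching the "any one or more of the following" phrasing above.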
  • the third voice information may be set according to a vocabulary characterizing the second control instruction, for example.
  • the second control instruction has any of the following characteristics: the processor suspends execution of other control instructions during the execution of the second control instruction; the second control instruction represents the suspension of voice interaction with the user.
  • the foregoing third voice information may be the same as or similar to the second voice information in the foregoing embodiment, and details are not described herein again in this embodiment of the present application.
  • the second control instruction in the embodiment of the present application is the same as the second control instruction in the foregoing embodiment, and the description of the embodiment of the present application is not repeated here.
  • the window recognition operation not only affects the overall anti-interference ability, but also affects the overall computing power consumption.
  • Windows with different parameters differ in their noise-elimination ability and in the computing power they consume. Therefore, in the process of performing the window recognition operation, the relevant parameters of the window can be dynamically adjusted according to the voice recognition result, so as to best balance anti-interference ability and computing power consumption.
  • Specifically, sliding window detection refers to setting a sliding window with a fixed window length of W milliseconds and, each time, taking only the audio signal within the sliding window for recognition; the sliding window gradually slides backward from the starting point of a certain segment of the audio signal in steps of S milliseconds.
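The sliding window just described can be sketched as a simple generator; the sample-rate argument and the millisecond-to-sample conversion are assumptions for illustration:

```python
import numpy as np

def sliding_windows(x, sample_rate, window_ms, step_ms):
    """Yield successive sampling-window slices of the audio signal x.

    window_ms is the fixed window length W in milliseconds and step_ms is
    the sliding step S; the window slides backward from the starting point
    of the signal until a full window no longer fits.
    """
    w = int(sample_rate * window_ms / 1000)  # window length in samples
    s = int(sample_rate * step_ms / 1000)    # step size in samples
    for start in range(0, len(x) - w + 1, s):
        yield x[start:start + w]
```

Each yielded slice would then be passed to the speech recognizer, with a total sliding count N capping how many slices are examined.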
  • The maximum number of movements of the sampling window corresponds to the total number of sliding times N, and the movement step size corresponds to the step size S.
  • dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window may include, for example, dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the number of instructions detected in the recognition result.
  • dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  • the original total number of sliding operations N can be increased to 2N or another value, so as to eliminate noise by performing more sliding window detections, thereby improving the anti-interference ability.
  • dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a second preset value, adjusting the movement step size of the sampling window from a first step size to a second step size smaller than the first step size.
  • the second preset value may be the same as or different from the first preset value in the foregoing embodiment, which is not limited in the embodiment of the present application.
  • the original sliding step size of S milliseconds can be reduced to S/2 milliseconds or another step size, so as to eliminate noise by performing sliding window detection with a smaller sliding step, thereby improving the anti-interference ability.
  • dynamically adjusting the movement step size of the sampling window may further include: in response to the number of detected instructions being lower than a third preset value (such as M3, where M3 < N), adjusting the movement step size of the sampling window from the first step size to a third step size larger than the first step size.
  • the third preset value may be smaller than the second preset value in the foregoing embodiment, for example.
  • if the number of detected instructions is lower than a certain value (such as M3, where M3 < N), it means that although the user exhibits speech behavior during this period, it may not be continuous speech behavior, or, even if it is continuous speech behavior, it is not behavior that continuously issues instructions.
  • the original sliding step size of S milliseconds can be increased to 2S milliseconds or another step size, so that noise can be eliminated by performing sliding window detection with a larger sliding step while computing power is saved, so as to balance anti-interference ability and computing power consumption as far as possible during the window recognition operation.
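The three adjustment rules above (N doubled, S halved, S doubled) can be sketched as one small function. The preset values and the function name are hypothetical stand-ins, not values from the patent.

```python
# Hedged sketch of the dynamic parameter adjustment described above.
# first/second/third preset values are arbitrary illustration choices.

def adjust_window_params(n_detected, max_moves, step_ms,
                         first_preset=3, second_preset=5, third_preset=1):
    # Detected instructions exceed the first preset value:
    # raise the maximum number of window movements (N -> 2N).
    if n_detected > first_preset:
        max_moves *= 2
    # Exceed the second preset value: shrink the step (S -> S/2)
    # for finer-grained, more noise-resistant detection.
    if n_detected > second_preset:
        step_ms //= 2
    # Below the third preset value: enlarge the step (S -> 2S)
    # to save computing power when instructions are rare.
    elif n_detected < third_preset:
        step_ms *= 2
    return max_moves, step_ms

print(adjust_window_params(6, 10, 200))  # (20, 100): more slides, smaller step
print(adjust_window_params(0, 10, 200))  # (10, 400): larger step saves compute
```

The design point mirrors the text: frequent instructions justify spending more computing power on anti-interference, while sparse instructions justify coarser, cheaper detection.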
  • step S201 may include the following steps S401 to S404, for example.
  • in step S401, the audio signal is first divided into frames, and then the short-term average energy of each frame is calculated frame by frame to obtain the short-term energy envelope.
  • the audio signal can be divided into frames of a fixed duration. Taking a fixed duration of 1 second as an example, the 0th to 1st second of the audio signal forms one frame, the 1st to 2nd second forms one frame, the 2nd to 3rd second forms one frame, and so on, thus completing the framing of the audio signal.
  • the short-term average energy of a frame is the average energy of the frame, and the short-term energy envelope can be obtained by connecting the short-term average energies of successive frames.
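The framing and short-term energy computation of step S401 can be sketched as follows. The frame length is expressed in samples and the signal values are made up for illustration; only the framing-plus-average-energy idea comes from the text.

```python
# Minimal sketch of step S401: split the signal into fixed-length frames
# and compute each frame's average energy (the short-term energy envelope).

def short_term_energy_envelope(samples, frame_len):
    """Return the average energy of each fixed-length frame; connecting
    these values frame by frame gives the short-term energy envelope."""
    envelope = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        envelope.append(sum(x * x for x in frame) / frame_len)
    return envelope

quiet, loud = [0.01] * 100, [0.5] * 100
env = short_term_energy_envelope(quiet + loud + quiet, frame_len=100)
print(env)  # low value, high value, low value
```

The middle frame's energy clearly dominates, which is what the thresholds T1 and T2 in the following steps exploit.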
  • in step S402, a higher threshold T1 is selected, and the first and last intersection points of T1 with the short-term energy envelope are marked as C and D.
  • the part higher than the threshold T1 is considered to have a higher probability of being speech.
  • in step S403, a lower threshold T2 is selected; its intersection point B with the short-term energy envelope lies to the left of C, and its intersection point E lies to the right of D.
  • T2 is less than T1.
  • in step S404, the short-term average zero-crossing rate of each frame is calculated and a threshold T3 is selected; the intersection point A of T3 with the short-term average zero-crossing-rate curve lies to the left of B, and the intersection point F lies to the right of E.
  • the short-term average zero-crossing rate of a frame is the average zero-crossing rate of the frame.
  • intersection points A and F are the two endpoints of the audio signal determined based on the voice activity detection method
  • the audio segments from the intersection point A to the intersection point F are the audio segments intercepted from the audio signal based on the voice activity detection method.
  • steps S401 to S404 take intercepting only one audio segment from a piece of audio signal as an example; it is understandable that multiple audio segments can also be intercepted from one piece of audio signal, which is not limited in this embodiment of the present application.
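Steps S402 to S404 can be sketched over per-frame features, with frame indices standing in for the intersection points A through F. The feature values and thresholds below are made-up illustrations, not values from the patent.

```python
# Hedged sketch of the double-threshold endpoint search (steps S402-S404):
# find the frames above the higher energy threshold T1, expand outward
# while energy stays above the lower threshold T2, then expand further
# while the zero-crossing rate stays above T3.

def endpoints(energy, zcr, t1, t2, t3):
    """Return (A, F) frame indices, or None if no frame exceeds T1."""
    # S402: first and last frames whose energy exceeds T1 (points C, D)
    above = [i for i, e in enumerate(energy) if e > t1]
    if not above:
        return None
    c, d = above[0], above[-1]
    # S403: expand outward while energy still exceeds T2 (points B, E)
    b, e = c, d
    while b > 0 and energy[b - 1] > t2:
        b -= 1
    while e < len(energy) - 1 and energy[e + 1] > t2:
        e += 1
    # S404: expand further while the zero-crossing rate exceeds T3 (A, F)
    a, f = b, e
    while a > 0 and zcr[a - 1] > t3:
        a -= 1
    while f < len(zcr) - 1 and zcr[f + 1] > t3:
        f += 1
    return a, f

energy = [0.1, 0.3, 2.0, 5.0, 5.0, 2.0, 0.3, 0.1]
zcr    = [0.1, 0.6, 0.7, 0.5, 0.5, 0.7, 0.6, 0.1]
print(endpoints(energy, zcr, t1=4.0, t2=1.0, t3=0.4))  # (1, 6)
```

Frames 1 through 6 would then be the audio segment intercepted by the voice activity detection method, mirroring the A-to-F interval in the text.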
  • when performing speech recognition, a speech recognition model may be used.
  • the speech recognition model can be, for example, a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, and the training process is as follows: use a number of command speech samples and negative samples as training data; perform MFCC feature extraction on the training speech; and then train the GMM-HMM model.
  • HMM parameter estimation can, for example, use the Baum-Welch algorithm, with several states set for each instruction-word category and for the noise category, together with the number of Gaussian components.
  • the speech recognition process can be: extracting the MFCC (Mel Frequency Cepstral Coefficient) features of the detected sound segment, performing Viterbi decoding with the pre-trained GMM-HMM model, and outputting the recognition result.
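The Viterbi decoding step mentioned above can be illustrated with a minimal decoder. The patent applies Viterbi to a GMM-HMM over MFCC features; here, discrete observation log-probabilities stand in for GMM likelihoods, and the states, observations, and probabilities below are entirely made up for illustration.

```python
# Minimal Viterbi decoder sketch (log domain to avoid underflow).
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    v = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t
            prev = max(states, key=lambda p: v[t - 1][p] + log_trans[p][s])
            v[t][s] = v[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

log = math.log
states = ["noise", "cmd"]
log_start = {"noise": log(0.8), "cmd": log(0.2)}
log_trans = {"noise": {"noise": log(0.7), "cmd": log(0.3)},
             "cmd":   {"noise": log(0.3), "cmd": log(0.7)}}
log_emit = {"noise": {"quiet": log(0.9), "loud": log(0.1)},
            "cmd":   {"quiet": log(0.2), "loud": log(0.8)}}
print(viterbi(["quiet", "loud", "loud"], states, log_start, log_trans, log_emit))
# ['noise', 'cmd', 'cmd']
```

In an actual GMM-HMM system, `log_emit` would be replaced by per-state GMM log-likelihoods of the MFCC frames, but the dynamic-programming recursion is the same.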
  • when the smart hardware device has just been turned on, for example, voice endpoint detection can be performed by the voice activity detection method by default; when certain conditions are met (that is, when the application scenario changes), the device switches to the sliding window detection method for a period of time and then returns to the voice activity detection method. The specific switching process is shown in FIG. 5 and may include, for example, the following steps.
  • Step S501 power on.
  • the intelligent hardware device activates the voice endpoint detection function, and performs voice endpoint detection through the voice activity detection method by default.
  • Step S502 Acquire an audio stream in real time.
  • step S503 the audio segment in the audio stream is intercepted by the voice activity detection method.
  • Step S504 Detect whether there is sound in the intercepted audio clip. If it exists, go to step S505; otherwise, go to step S503.
  • Step S505 Perform the first voice recognition to obtain voice information.
  • Step S506 Determine whether the voice information is a control instruction (first switching condition). If yes, go to step S507; otherwise, go to step S508.
  • Step S507 execute the instruction, and execute step S509 after the instruction is executed.
  • that is, if the result of the first speech recognition is a control command, human-computer interaction is performed; voice endpoint detection is then performed by the sliding window detection method for a period of t minutes (the total number of sliding operations in this period is N), and voice recognition is performed on the audio clip in each sliding window.
  • Step S508 Determine whether the voice length of the voice information is longer than L seconds (the second switching condition). If yes, go to step S509; otherwise, go to step S503.
  • it is determined whether the duration of the speech segment intercepted by the voice activity detection method exceeds L seconds (for example, 0.5 s). If the duration of the speech segment is less than L seconds, voice endpoint detection continues to be performed by the voice activity detection method, that is, the flow returns to step S503. If the duration of the speech segment is greater than or equal to L seconds, the user is considered likely to be speaking continuously at this time; therefore, voice endpoint detection is performed by the sliding window detection method for t minutes, voice recognition is performed on the audio segment in each sliding window during this period, and the flow proceeds as in steps S510 to S512.
  • Step S511, perform voice recognition.
  • Step S512: judge whether n < N. If yes, go to step S510; otherwise, go to step S503.
  • steps S510 to S512 describe the operating logic and exit logic during the sliding window detection period. If the voice recognition module recognizes a control command during sliding window detection and recognition, sliding window detection is suspended and the command action is executed instead. After execution of the command action ends, sliding window detection continues until the N sliding window detections for the period (i.e., t minutes) have all been executed. After the N-th sliding window detection ends, the default voice activity detection method is restored for voice endpoint detection, that is, the flow returns to step S503.
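The mode-switching logic of FIG. 5 can be sketched as a small state-transition function. The function name and signature are hypothetical; `is_command` corresponds to the first switching condition (step S506), `duration >= L` to the second switching condition (step S508), and the n < N check to step S512.

```python
# Hedged sketch of the switching between voice activity detection ("vad")
# and sliding window detection ("window") described in steps S503-S512.

def next_mode(mode, n, N, is_command, duration, L):
    """Return (mode, n) after one detection-and-recognition pass."""
    if mode == "vad":
        if is_command or duration >= L:
            return "window", 0        # enter the sliding-window period
        return "vad", 0               # stay with voice activity detection
    # Sliding-window mode: step S512 keeps sliding while n < N
    n += 1
    return ("window", n) if n < N else ("vad", 0)

print(next_mode("vad", 0, 4, True, 0.2, 0.5))     # ('window', 0)
print(next_mode("window", 1, 4, False, 0.0, 0.5)) # ('window', 2)
print(next_mode("window", 3, 4, False, 0.0, 0.5)) # ('vad', 0): N reached
```

The design mirrors the flow: sliding-window mode is a bounded detour of N detections, after which the cheaper voice activity detection method is restored by default.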
  • FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of this application. As shown in FIG. 6, the device 600 may include a processor 601 and a memory 602.
  • the memory 602 is used to store program codes
  • the processor 601 calls the program code, and when the program code is executed, is used to perform the following operations:
  • the window recognition operation includes the following operations: moving the sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • the audio processing device provided in this embodiment can be used to implement the technical solutions of the foregoing method embodiments, and its implementation principles and technical effects are similar to those of the method embodiments, and will not be repeated here.
  • a person of ordinary skill in the art can understand that all or part of the steps in the foregoing method embodiments can be implemented by a program instructing relevant hardware.
  • the aforementioned program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Abstract

An audio processing method and device. The method comprises: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, on the basis of the audio segment, whether to execute a window recognition operation, the window recognition operation comprising the following operations: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window. Since the voice activity detection method has the advantage of low computing power consumption and the sliding window method has the advantage of strong anti-interference capability, the present application automatically switches between the voice activity method and the sliding window method according to the application scenario when performing audio processing, thereby combining the advantages of the two methods, saving computing power, and improving audio processing accuracy.

Description

Audio processing method and device

Technical field

This application relates to the field of audio technology, and in particular to an audio processing method and device.

Background

Voice interaction is a common way of human-computer interaction. When using voice recognition for human-computer interaction, voice endpoint detection must be performed first, that is, detecting which parts of the sound recorded by the microphone may contain speech. Various voice endpoint detection methods differ from one another in anti-interference ability and computing power consumption; how to balance anti-interference ability against computing power consumption has become an urgent problem in voice endpoint detection.
Summary of the invention

The embodiments of the present application provide an audio processing method and device to solve the problem that, in the prior art, anti-interference ability and computing power consumption cannot both be taken into account during voice endpoint detection, which results in the technical problems of poor anti-interference ability and high computing power consumption.

In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following operations: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.

In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory, where the memory is used to store program code, and the processor calls the program code and, when the program code is executed, is used to perform the following operations: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following steps: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.

In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes at least one piece of code that can be executed by a computer to control the computer to execute the audio processing method according to any one of the above first aspects.

In a fourth aspect, an embodiment of the present application provides a computer program which, when executed by a computer, is used to implement the audio processing method according to any one of the above first aspects.

The embodiments of the present application provide an audio processing method and device. By selecting and switching the voice endpoint detection method according to the characteristics of the specific application scenario during voice endpoint detection, both anti-interference ability and computing power consumption can be taken into account.
Description of the drawings

In order to explain the technical solutions in the embodiments of the present application or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.

FIG. 1A is a first schematic diagram of an application scenario of the audio processing method provided by an embodiment of this application;

FIG. 1B is a second schematic diagram of an application scenario of the audio processing method provided by an embodiment of this application;

FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of this application;

FIG. 3A to FIG. 3C are schematic diagrams, provided by an embodiment of this application, of audio sub-segments excluding the noise of an audio segment;

FIG. 4 is a schematic flowchart of an audio processing method provided by another embodiment of this application;

FIG. 5 is a schematic flowchart of the voice activity detection method provided by an embodiment of this application; and

FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of this application.
Detailed description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The audio processing method provided in the embodiments of the present application can be applied to any audio processing process that requires voice endpoint detection, and the method can be executed by an audio processing device. The audio processing device may be a device that includes an audio collection module (for example, a microphone); accordingly, a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1A. Specifically, the audio collection module of the audio processing device can collect the user's speech to obtain an audio signal, and the processor of the audio processing device can process the audio signal collected by the audio collection module using the audio processing method provided in the embodiments of the present application. It should be noted that FIG. 1A is only a schematic diagram and does not limit the structure of the audio processing device. For example, an amplifier may be connected between the microphone and the processor to amplify the audio signal collected by the microphone; as another example, a filter may be connected between the microphone and the processor to filter the audio signal collected by the microphone.

Alternatively, the audio processing device may be a device that does not include an audio collection module; accordingly, a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1B. Specifically, the communication interface of the audio processing device may receive audio signals collected by other devices or equipment, and the processor of the audio processing device may process the received audio signals using the audio processing method provided in the embodiments of the present application. It should be noted that FIG. 1B is only a schematic diagram and does not limit the structure of the audio processing device or the connection mode between the audio processing device and other devices or equipment. For example, the communication interface in the audio processing device may be replaced with a transceiver.

It should be noted that the embodiments of the present application do not limit the type of equipment that includes the audio processing device; the equipment may be, for example, a smart speaker, a smart lighting device, a smart robot, a mobile phone, a tablet computer, or the like.

The audio processing method provided by the embodiments of the present application can automatically switch between the voice activity detection method and the sliding window detection method according to the characteristics of the specific application scenario, and can therefore improve the overall anti-interference ability of voice endpoint detection while reducing its computing power consumption. That is, it can make the anti-interference ability of voice endpoint detection stronger than that of using the voice activity detection method alone, while making the computing power consumption of voice endpoint detection less than that of using the sliding window detection method alone.

It should be noted that the audio processing method provided by the embodiments of this application can be applied to the field of voice recognition, smart hardware devices with voice recognition functions, the field of audio event detection, smart hardware devices with audio event detection functions, and the like, which is not limited in this application.

Hereinafter, some embodiments of the present application will be described in detail with reference to the accompanying drawings. Provided that no conflict arises, the following embodiments and the features in the embodiments can be combined with each other.
FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the application. The execution subject of this embodiment may be an audio processing device, and specifically may be a processor of the audio processing device. As shown in FIG. 2, the method of this embodiment may include, for example, step S201 and step S202.

In step S201, an audio segment is intercepted from the audio signal according to the audio feature information of the audio signal.

It should be noted that in the embodiment of the present application, the audio segment is intercepted in step S201 based on the audio feature information. Since the intercepted endpoints are inaccurate, the intercepted audio segment usually contains noise; if speech recognition is performed purely on an audio segment intercepted in this way, the accuracy is not high.

It should be noted that the embodiment of the present application does not limit the specific type of the audio feature information. Optionally, the audio features may include, for example, one or more of Mel Frequency Cepstrum Coefficient (MFCC) features, Linear Prediction Coefficients (LPC) features, and filter bank (Fbank) features.

Specifically, in the embodiment of the present application, in step S201, a voice endpoint detection method with lower computing power consumption may be used to process the audio signal.

In the process of implementing the embodiments of the present application, the inventor found that the voice activity detection method recognizes whether an audio signal is a speech signal through methods such as an energy and zero-crossing-rate double threshold or a noise-speech classification model, and intercepts the signal for speech recognition only when it is determined to be speech. The voice activity detection method can therefore effectively distinguish silence from non-silence and can save computing power when used for speech recognition.

Therefore, in one embodiment, in step S201, a voice activity detection method may be used to intercept audio segments from the audio signal according to the audio feature information of the audio signal.
In addition, in the process of implementing the embodiments of the present application, the inventor also found that the voice activity detection method cannot effectively distinguish speech-like noise or non-speech noise with slightly higher energy. For example, when the voice activity detection method is used for command-word speech recognition, noise is easily mixed in before and after the command speech segment, which reduces the voice command recognition rate. The sliding window detection method, by contrast, can extract speech segments of a certain length from the audio stream and continuously perform speech recognition on them; for command-word speech recognition, this method can effectively prevent noise from being mixed in before and after the command-word speech segment. However, because speech recognition is computationally intensive, constantly performing speech recognition with the sliding window detection method consumes a large amount of the smart hardware device's computing resources and power.
Therefore, in the embodiments of the present application, the voice activity detection method can be used for audio processing in application scenarios with low anti-interference requirements, to reduce computing power consumption, while the sliding window detection method can be used for audio processing in application scenarios with high anti-interference requirements, to improve the anti-interference ability.

In step S202, it is determined whether to perform a window recognition operation based on the audio segment obtained in step S201.

The window recognition operation may include, for example, the following operations: moving a sampling window in the audio signal after the audio segment obtained in step S201, and performing voice recognition on the audio signal in the sampling window.

The sliding window detection method can, in one or more windows, exclude the noise included in the audio segment intercepted by voice activity detection. For example, as shown in FIG. 3A, assuming that the beginning part of the audio signal includes noise, the audio segment X1 can exclude that noise. As another example, as shown in FIG. 3B, assuming that the middle part of the audio signal includes noise, the audio segment X2 can exclude that noise. As a further example, as shown in FIG. 3C, assuming that the end part of the audio signal includes noise, the audio segment X3 can exclude that noise. It should be noted that the parts filled with grid lines in FIG. 3A to FIG. 3C are used to represent noise.

Specifically, in step S202, it can be determined whether the audio segment obtained in step S201 contains, or may contain, speech-like noise or non-speech noise with slightly higher energy, and thereby whether to perform the window recognition operation. If it is determined that the audio segment obtained in step S201 contains or may contain speech-like noise or non-speech noise with slightly higher energy, the window recognition operation is performed. Otherwise, if it is determined that the audio segment obtained in step S201 does not contain, or is unlikely to contain, such noise, step S201 and step S202 continue to be executed.

For example, in one embodiment, since noise is easily mixed in before and after instruction-type voice information, which easily leads to inaccurate speech segment interception and/or inaccurate speech recognition, for the audio segment obtained in step S201, if instruction-type voice information is detected, the window recognition operation is performed to eliminate the influence of noise.

As another example, in another embodiment, for the audio segment obtained in step S201, if non-speech information with slightly higher energy is detected, for example if a preset audio event such as a specific knocking sound, applause, or clapping sound is detected, the window recognition operation is performed, so as to avoid inaccurate audio segment interception and/or inaccurate speech recognition caused by the voice activity detection method's inability to effectively distinguish this type of information.

Through the embodiments of the present application, when an audio signal is processed, the corresponding voice endpoint detection method can be selected according to the characteristics of the specific application scenario. For example, in application scenarios with low anti-interference requirements, the voice activity detection method can be used to process the audio signal to save computing power; in application scenarios with high anti-interference requirements, the sliding window detection method can be used to process the audio signal to improve the anti-interference ability. This combines the advantages of the sliding window detection method's high anti-interference ability and the voice activity detection method's low computing power consumption, so that the anti-interference ability of voice endpoint detection is on the whole stronger than when the voice activity detection method is used alone, while the computing power consumption of voice endpoint detection is on the whole less than when the sliding window detection method is used alone.
It should be noted that, in the embodiments of the present application, in step S202, whether instruction-type voice information exists or may exist in the audio segment obtained in step S201 can be determined in a variety of ways.

For example, in one embodiment, voice information may be extracted from the audio segment obtained in step S201, and whether instruction-type information exists or may exist may be judged according to the content and/or the voice length of the extracted voice information, so as to decide whether to perform the window recognition operation.

Specifically, in the embodiments of the present application, the method may further include, for example: extracting voice information from the audio segment. Correspondingly, step S202 may include, for example: judging, based on the extracted voice information, whether to perform the window recognition operation.
For example, in one embodiment, whether to perform the window recognition operation may be judged based on the content of the extracted voice information. Specifically, in this embodiment, it can be judged whether the content of the voice information satisfies a specific condition (for example, whether the voice information contains instruction-type information). If the content of the voice information satisfies the specific condition, the window recognition operation is performed; otherwise, if the content of the voice information does not satisfy the specific condition, the window recognition operation is not performed.

As another example, in another embodiment, whether to perform the window recognition operation may be judged based on the voice length of the extracted voice information. Specifically, in this embodiment, it can be judged whether the voice length of the voice information is less than or equal to a preset value. If the voice length of the voice information is less than or equal to the preset value, the window recognition operation is performed; otherwise, if the voice length of the voice information is greater than the preset value, the window recognition operation is not performed.

As yet another example, in another embodiment, whether to perform the window recognition operation may be judged based on both the content and the voice length of the extracted voice information.

Specifically, in this embodiment, it may first be judged whether the content of the voice information satisfies the specific condition. If the content of the voice information does not satisfy the specific condition, it is then judged whether the voice length of the voice information is less than or equal to the preset value. If the content of the voice information does not satisfy the specific condition and the voice length of the voice information is also not less than or equal to the preset value, the window recognition operation is not performed; otherwise, the window recognition operation is performed.

Alternatively, in this embodiment, it may first be judged whether the voice length of the voice information is less than or equal to the preset value. If the voice length of the voice information is not less than or equal to the preset value, it is then judged whether the content of the voice information satisfies the specific condition. If the voice length of the voice information is not less than or equal to the preset value and the content of the voice information also does not satisfy the specific condition, the window recognition operation is not performed; otherwise, the window recognition operation is performed.
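The two checking orders described in the last two paragraphs produce the same outcome, since the window recognition operation is skipped only when both checks fail. A minimal sketch (the predicate `contains_instruction`, the instruction vocabulary, and all concrete values are hypothetical, not part of the embodiments):

```python
def should_perform_window_recognition(content, voice_length, preset_length,
                                      content_satisfies_condition):
    """Skip the window recognition operation only when the content does not
    satisfy the specific condition AND the voice length exceeds the preset
    value; otherwise perform it. The order of the two checks is irrelevant."""
    return content_satisfies_condition(content) or voice_length <= preset_length


# Toy predicate over a hypothetical instruction vocabulary:
instructions = {"take off", "land", "hover"}
contains_instruction = lambda text: any(w in text for w in instructions)

print(should_perform_window_recognition("please land", 1.2, 2.0, contains_instruction))         # True
print(should_perform_window_recognition("background chatter", 5.0, 2.0, contains_instruction))  # False
```

Either branch alone (content match, or short voice length) is enough to trigger the operation, which is why the two orderings in the text are equivalent.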
It should be noted that, according to the content and/or the voice length of the extracted voice information, whether instruction-type information exists in the audio segment obtained in step S201 can be determined. If it is determined that instruction-type information exists in the audio segment obtained in step S201, then, considering that the user may speak continuously during actual use and may thus issue multiple instructions in succession, the sliding window detection method can be selected for voice endpoint detection in the processing of the audio signal following this audio segment, so as to prevent too much noise from being mixed in before and after an instruction, which would cause inaccurate speech segment interception and/or inaccurate speech recognition.

In addition, it should be noted that, since the length of an instruction word is limited, the voice length corresponding to voice information containing an instruction word is also limited. It is therefore possible to roughly estimate, from the voice length of the extracted voice information, whether an instruction word exists or may exist in the voice information, and then to judge whether to perform the window recognition operation.

Through the embodiments of the present application, for the audio segment obtained in step S201, the voice information contained therein can first be extracted, whether instruction-type information exists in the audio segment is then determined or estimated based on the content and/or the voice length of the voice information, and whether to perform the window recognition operation is judged from this determination or estimation result. This avoids the inaccuracy caused by too much noise being mixed in when performing speech segment interception and speech recognition on instruction-type voice information.
In addition, in another embodiment, voice information may be extracted from the audio segment obtained in step S201, and whether instruction-type information exists or may exist may be judged according to the content of the extracted voice information and/or the duration of the audio segment, so as to decide whether to perform the window recognition operation.

For example, in the embodiments of the present application, as one embodiment, whether to perform the window recognition operation may be determined according to both the content of the extracted voice information and the duration of the audio segment.
Specifically, in the embodiments of the present application, the method may further include, for example: extracting voice information from the audio segment obtained in step S201. Correspondingly, step S202 may include, for example, the following operations. First, it is judged whether the extracted voice information matches the first voice information. If it does not match, it is then judged whether the duration of the audio segment is greater than or equal to the first duration threshold. If it is not greater than or equal to the first duration threshold, step S201 and step S202 are executed for the audio signal following the audio segment.

Alternatively, in the embodiments of the present application, step S202 may also include, for example, the following operations. First, it is judged whether the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold. If it is not greater than or equal to the first duration threshold, it is then judged whether the extracted voice information matches the first voice information. If it does not match, step S201 and step S202 are executed for the audio signal following the audio segment.
It should be noted that, in the embodiments of the present application, the first voice information may be set according to the vocabulary characterizing an instruction (for example, the first control instruction). As an optional embodiment, when the audio processing method is applied to a smart hardware device with a speech recognition function, the first voice information may be set according to the vocabulary characterizing all control instructions applicable to the smart hardware device.

Similarly to how the limited length of an instruction word makes the voice length of voice information containing that instruction word limited, the duration of an audio segment containing an instruction word is also limited. It is therefore possible to roughly estimate, from the duration of the audio segment obtained in step S201, whether an instruction word exists or may exist in the audio segment, and then to judge whether to perform the window recognition operation.

It should be noted that, in the embodiments of the present application, the first duration threshold may be set according to the maximum duration, the average duration, or any other duration of audio segments containing an instruction (for example, the first control instruction).

Through the embodiments of the present application, for the audio segment obtained in step S201, the voice information contained therein can be extracted, whether instruction-type information exists in the audio segment is determined or estimated based on the content of the voice information and the duration of the audio segment, and whether to perform the window recognition operation is judged from this estimation result. This likewise avoids the inaccuracy caused by too much noise being mixed in when performing speech segment interception and speech recognition on instruction-type voice information.
As another example, in the embodiments of the present application, as another embodiment, whether to perform the window recognition operation may also be determined according to one or both of the content of the extracted voice information and the duration of the audio segment.

Specifically, the method may further include, for example, performing the window recognition operation when any of the following events occurs: the voice information extracted from the audio segment obtained in step S201 matches the first voice information (event 1 for short); and/or the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold (event 2 for short).

It should be noted that, in the embodiments of the present application, for the case where both event 1 and event 2 are required to occur, in manner 1, whether event 1 has occurred may be judged first, and then whether event 2 has occurred; or, in manner 2, whether event 2 has occurred may be judged first, and then whether event 1 has occurred. However, in both manner 1 and manner 2, the window recognition operation is performed only when both events have occurred.
In addition, in the embodiments of the present application, the setting of the first voice information and the first duration threshold is the same as or similar to the setting method in the foregoing embodiments, and is not repeated here.

In addition, in the embodiments of the present application, for event 1, voice information may first be extracted from the audio segment obtained in step S201, and the extracted voice information may then be matched against the first voice information, so as to determine whether the extracted voice information matches the first voice information.

Similarly, in the embodiments of the present application, for event 2, the duration of the audio segment obtained in step S201 may first be determined, and this duration may then be compared against the first duration threshold.

Through the embodiments of the present application, either or both of event 1 and event 2 can be used to determine or estimate that the audio segment obtained in step S201 contains or may contain instruction-type information. Performing the window recognition operation to intercept the speech segment can then exclude the noise before and after the instruction, making the interception result of the speech segment more accurate, and hence also making the resulting speech recognition result more accurate.
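The decision based on event 1 and event 2 can be expressed as a single predicate. The sketch below covers both the "either event suffices" reading and the "both events required" case discussed above; the variable names and the `require_both` switch are illustrative assumptions, not part of the claims:

```python
def window_recognition_triggered(extracted_text, first_voice_info,
                                 segment_duration, first_duration_threshold,
                                 require_both=False):
    # Event 1: the extracted voice information matches the first voice information.
    event1 = extracted_text == first_voice_info
    # Event 2: the segment duration reaches the first duration threshold.
    event2 = segment_duration >= first_duration_threshold
    # When both events are required (manner 1 / manner 2 above), the order in
    # which they are checked does not change the result.
    return (event1 and event2) if require_both else (event1 or event2)
```

For example, a short segment whose text matches the first voice information triggers the operation in the "either" reading but not in the "both required" reading.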
It should be noted that, in the embodiments of the present application, in step S202, whether non-speech information with relatively high energy exists in the audio segment obtained in step S201 can also be detected in a variety of ways.

Specifically, step S202 may include, for example: judging, based on the audio segment obtained in step S201, whether a preset audio event has occurred, so as to determine whether non-speech information with relatively high energy exists in the audio segment.

If it is judged that the preset audio event has occurred, it is considered that non-speech information with relatively high energy exists in the audio segment, and the window recognition operation can be performed; in this way, noise can be excluded when intercepting the audio segment, making the interception result of the audio segment more accurate, and hence also making the resulting audio processing result more accurate. Conversely, if it is judged that the preset audio event has not occurred, it is considered that no non-speech information with relatively high energy exists in the audio segment, and the window recognition operation need not be performed.

It should be noted that, in the embodiments of the present application, the foregoing preset audio event may include, for example, the presence of one or more of the following sounds in the audio segment: knocking, applauding, or clapping. The preset audio event may, for example, be set according to non-speech information with relatively high energy that often appears in audio segments, such as knocking, applauding, or clapping sounds.
It should be understood that, under normal circumstances, the user may speak continuously during actual use and thus issue multiple instructions in succession. Therefore, when performing voice endpoint detection, once instruction-type voice information is detected, it is possible to switch to the sliding window detection method for voice endpoint detection, so as to prevent too much noise from being mixed in before and after an instruction and causing inaccurate speech segment interception and/or inaccurate speech recognition. However, there are also some special cases. For example, the user may not speak continuously during actual use, but instead issues one instruction and then does not issue any further instruction for a long time. For example, the user says "please enter the sleep state" to a smart hardware device with a speech recognition function (such as a robot), or the user says "please fly two circles around A" to a smart hardware device with a speech recognition function (such as an unmanned aerial vehicle), and so on. In these cases, the user generally will not issue other control instructions for a long time after the instruction is issued. If such special instruction-type voice information is detected and the method nevertheless switches to the sliding window detection method for voice endpoint detection, not only is the anti-interference capability not noticeably improved, but the computing-power consumption is instead increased.

Therefore, in the embodiments of the present application, if these special cases occur, the method may refrain from switching to the sliding window detection method, thereby avoiding the drawback that, after switching, the anti-interference capability is not noticeably improved while the computing-power consumption is instead increased.
That is, in another embodiment, in the process of judging, based on the audio segment obtained in step S201, whether to perform the window recognition operation, it is also possible to detect whether specific voice information appears. If the appearance of specific voice information is detected, step S201 and step S202 are executed for the audio signal following the audio segment obtained in step S201, and the window recognition operation is not performed.

Specifically, step S202 may include, for example: extracting voice information from the audio segment obtained in step S201, and judging whether the extracted voice information matches the second voice information. If it matches, step S201 and step S202 are executed for the audio signal following the audio segment obtained in step S201.

It should be noted that, in the embodiments of the present application, the foregoing second voice information may, for example, be set according to the vocabulary characterizing the second control instruction. The second control instruction may, for example, have either of the following features: the processor suspends the execution of other control instructions while executing the second control instruction (feature 1 for short); or the second control instruction indicates that voice interaction with the user is suspended (feature 2 for short).

For feature 1, the second control instruction may be, for example, "please fly two circles around A" in the above example. For feature 2, the second control instruction may be, for example, "please enter the sleep state" in the above example. Specifically, the second control instruction may be any control instruction having the foregoing feature 1 and/or feature 2, which is not limited in the embodiments of the present application.
In addition, in the embodiments of the present application, after switching to the sliding window detection method, that is, after the window recognition operation is performed, if the specific application scenario is found to change again, it is also possible to switch back from the sliding window detection method to the previously used voice endpoint detection method, so as to save as much computing power as possible.

Specifically, in the embodiments of the present application, the method may further include, for example: in the process of moving the sampling window in the audio signal following the audio segment obtained in step S201 and performing speech recognition on the audio signal within the sampling window, that is, in the process of performing the window recognition operation, in response to a switching trigger event that occurs, executing step S201 and step S202 for the audio signal after the current position of the sampling window.

Specifically, in the embodiments of the present application, the foregoing switching trigger event may include, for example, any one or more of the following: voice information matching the third voice information is detected in the process of performing speech recognition on the audio signal within the sampling window; the number of movements of the sampling window reaches the maximum number of movements; or the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches the second duration threshold.
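The three switching trigger events enumerated above can be combined into a single check. A sketch (the state field names are hypothetical, introduced only for illustration):

```python
def should_switch_back(state):
    """Return True if any switching trigger event has occurred; the caller
    then executes steps S201/S202 on the audio signal after the sampling
    window's current position instead of continuing to slide the window."""
    return (
        state["matched_third_voice_info"]              # e.g. a "sleep" command detected
        or state["moves_done"] >= state["max_moves"]   # movement count exhausted
        or state["elapsed_ms"] >= state["second_duration_threshold_ms"]
    )


# Usage: a window that has slid 10 of 100 times for 3 s without a match.
print(should_switch_back({"matched_third_voice_info": False,
                          "moves_done": 10, "max_moves": 100,
                          "elapsed_ms": 3000,
                          "second_duration_threshold_ms": 60000}))  # False
```

Any single trigger is sufficient, matching the "any one or more" wording of the embodiment.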
Specifically, in the embodiments of the present application, the third voice information may, for example, be set according to the vocabulary characterizing the second control instruction, where the second control instruction has either of the following features: the processor suspends the execution of other control instructions while executing the second control instruction; or the second control instruction indicates that voice interaction with the user is suspended.

It should be noted that, in the embodiments of the present application, the foregoing third voice information may be the same as or similar to the second voice information in the foregoing embodiments, and is not repeated here.

In addition, the second control instruction in this embodiment is the same as the second control instruction in the foregoing embodiments, and is likewise not repeated here.
In the audio processing method provided by the embodiments of the present application, the window recognition operation affects not only the overall anti-interference capability but also the overall computing-power consumption, and windows with different parameters also differ in their noise-rejection capability and in the computing power they consume. Therefore, in the process of performing the window recognition operation, the relevant parameters of the window can also be dynamically adjusted according to the speech recognition result, so as to balance anti-interference capability and computing-power consumption to the greatest extent.

It should be noted that, in the embodiments of the present application, sliding window detection refers to: setting a sliding window with a fixed window length of W milliseconds, and taking only the audio signal within the sliding window for recognition each time; the sliding window can slide backward step by step from the starting point of a given audio signal with a step size of S milliseconds, thereby achieving sliding window detection. In one example, the relationship between the total number of slides N of the sliding window and a time period of t minutes can be expressed as: t*60*1000 = (W + S*N).
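Under the relation t*60*1000 = (W + S*N) given above, the total number of slides N needed to cover a t-minute signal follows directly. A quick sketch (the example window and step values are illustrative):

```python
def total_slides(t_minutes, window_ms, step_ms):
    """N such that t * 60 * 1000 = window_ms + step_ms * N
    (rounded down when the step does not divide evenly)."""
    return (t_minutes * 60 * 1000 - window_ms) // step_ms


# A 1-minute signal with a 2000 ms window sliding in 500 ms steps:
print(total_slides(1, 2000, 500))  # 116
```

Halving the step S roughly doubles N, which is why the step-size adjustments described below trade computing power against noise rejection.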
In the embodiments of the present application, in the process of moving the sampling window in the audio signal following the audio segment obtained in step S201 and performing speech recognition on the audio signal within the sampling window, that is, in the process of performing the window recognition operation, the maximum number of movements of the sampling window (that is, the total number of slides N) and/or the movement step size (that is, the step size S) can, for example, be dynamically adjusted based on the recognition result of performing speech recognition on the audio signal of the sampling window.
For example, in one embodiment, the maximum number of movements of the sampling window (that is, the total number of slides N) and/or the movement step size (that is, the step size S) can be dynamically adjusted based on the number of instructions detected during speech recognition.

Specifically, in the embodiments of the present application, dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the recognition result may include, for example: dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the number of instructions detected in the recognition result.

More specifically, in the embodiments of the present application, dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.

For example, during the recognition period of N slides of sliding window detection (that is, the total number of slides N), if the number of detected instructions reaches a certain value (such as M1, where M1 < N), this indicates that the user is speaking continuously during this period and is continuously issuing instructions. The originally planned total number of slides N can therefore be increased to 2N or another number, so that noise is excluded by performing more sliding window detections, thereby improving the anti-interference capability.

In addition, more specifically, in the embodiments of the present application, dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a second preset value, adjusting the movement step size of the sampling window from a first step size to a second step size smaller than the first step size.

It should be noted that, in the embodiments of the present application, the second preset value may be the same as or different from the first preset value in the foregoing embodiment, which is not limited in the embodiments of the present application.

For example, during the recognition period of N slides of sliding window detection, if the number of detected instructions reaches a certain value (such as M2, where M2 < N), this indicates that the user is speaking continuously during this period and is continuously issuing instructions. The originally planned sliding step size of S milliseconds can therefore be reduced to S/2 milliseconds or another step size, so that noise is excluded by performing sliding window detection with a smaller step size, thereby improving the anti-interference capability.
Alternatively, more specifically, dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may further include, for example: in response to the number of detected instructions falling below a third preset value (such as M3, where M3 < N), adjusting the movement step size of the sampling window from the first step size to a third step size greater than the first step size.

It should be noted that, in the embodiments of the present application, the third preset value may, for example, be smaller than the second preset value in the foregoing embodiment.

For example, during the recognition period of the first N/2 slides of sliding window detection, if the number of detected instructions is lower than a certain value (such as M3, where M3 < N), this indicates that although the user is speaking during this period, it may not be continuous speech, or, even if it is continuous speech, the user is not continuously issuing instructions. The originally planned sliding step size of S milliseconds can therefore be increased to 2S milliseconds or another step size, so that noise is excluded by performing sliding window detection with a larger step size while computing power is saved, thereby balancing anti-interference capability and computing-power consumption as much as possible during the window recognition operation.
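The adjustment rules above can be sketched as one function. The doubling and halving factors follow the examples in the text; the preset values M1, M2, M3 and all concrete numbers are illustrative assumptions:

```python
def adjust_window_params(n_detected, max_moves, step_ms, m1, m2, m3):
    """Adjust the sampling window's maximum number of movements and its
    step size from the count of instructions detected so far.
    Assumes m3 < m2, as stated in the text."""
    if n_detected > m1:            # continuous instruction issuing:
        max_moves = 2 * max_moves  # slide more times to reject noise
    if n_detected > m2:            # continuous instruction issuing:
        step_ms = step_ms // 2     # use a finer step to reject noise
    elif n_detected < m3:          # little instruction activity:
        step_ms = 2 * step_ms      # use a coarser step to save computing power
    return max_moves, step_ms


# A burst of instructions tightens the search; a quiet period relaxes it.
print(adjust_window_params(8, 100, 400, m1=5, m2=5, m3=2))  # (200, 200)
print(adjust_window_params(1, 100, 400, m1=5, m2=5, m3=2))  # (100, 800)
```

The middle regime (m3 ≤ count ≤ m2) leaves the step size unchanged, matching the text's goal of adjusting only when the recognition result clearly indicates more or less instruction activity.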
It should be noted that the embodiments of the present application do not limit the specific algorithm used to intercept audio segments from the audio signal based on the voice activity detection method; for example, one or more of the energy and zero-crossing-rate dual-threshold algorithm, a noise-speech classification model algorithm, the variance method, the spectral distance method, and the spectral entropy method may be used. Taking the energy and zero-crossing-rate dual-threshold algorithm as an example, as shown in FIG. 4, step S201 may include, for example, the following steps S401 to S404.
Step S401: first divide the audio signal into frames, then compute the short-time average energy of each frame, frame by frame, to obtain the short-time energy envelope.
Here, the audio signal may be divided into frames of a fixed duration. Taking a fixed duration of 1 second as an example, the audio signal from second 0 to second 1 forms one frame, second 1 to second 2 forms one frame, second 2 to second 3 forms one frame, and so on, completing the framing of the audio signal. The short-time average energy of a frame is the average energy of that frame; connecting the per-frame energies yields the short-time energy envelope.
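The framing and short-time energy computation of step S401 can be sketched as follows; the function names are illustrative, and `frame_len` is in samples rather than seconds to keep the sketch self-contained.

```python
def frame_signal(samples, frame_len):
    """Split a sample sequence into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Average energy of one frame: mean of squared samples."""
    return sum(s * s for s in frame) / len(frame)

def energy_envelope(samples, frame_len):
    """Per-frame short-time average energy; connecting these points
    gives the short-time energy envelope of step S401."""
    return [short_time_energy(f) for f in frame_signal(samples, frame_len)]

print(energy_envelope([0, 0, 0, 0, 2, 2, 2, 2], 4))  # [0.0, 4.0]
```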
Step S402: select a higher threshold T1, and mark the first and last intersection points of T1 with the short-time energy envelope as C and D.
Here, the portion above threshold T1 is considered to have a high probability of being speech.
Step S403: select a lower threshold T2; its intersection point B with the short-time energy envelope lies to the left of C, and its intersection point E lies to the right of D.
Here, T2 is smaller than T1.
Step S404: compute the short-time average zero-crossing rate of each frame and select a threshold T3; its intersection point A with the short-time average zero-crossing-rate curve lies to the left of B, and its intersection point F lies to the right of E.
Here, the short-time average zero-crossing rate of a frame is the average zero-crossing rate of that frame.
At this point, intersection points A and F are the two endpoints of the audio signal determined by the voice activity detection method, and the audio segment from A to F is the segment intercepted from the audio signal by the voice activity detection method.
It should be noted that steps S401 to S404 take intercepting only one audio segment from a stretch of audio signal as an example; it is understood that multiple audio segments may also be intercepted from one stretch of audio signal, and the embodiments of the present application do not limit this.
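The endpoint search of steps S402 to S404 can be sketched on per-frame energy and zero-crossing-rate sequences. The index-based outward expansion below is one plausible discrete reading of the intersection points C/D, B/E, and A/F; the thresholds in the example run are chosen only for illustration.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:])) / len(frame)

def dual_threshold_endpoints(energy, zcr, t1, t2, t3):
    """Return frame indices (A, F) of the intercepted segment, or None.

    energy: short-time energy per frame; zcr: short-time zero-crossing
    rate per frame; t1 > t2 are the energy thresholds, t3 the ZCR threshold.
    """
    high = [i for i, e in enumerate(energy) if e >= t1]
    if not high:
        return None                      # no frame clears the high threshold T1
    c, d = high[0], high[-1]             # step S402: first/last crossings of T1
    b, e = c, d                          # step S403: expand outward while energy >= T2
    while b > 0 and energy[b - 1] >= t2:
        b -= 1
    while e < len(energy) - 1 and energy[e + 1] >= t2:
        e += 1
    a, f = b, e                          # step S404: expand further while ZCR >= T3
    while a > 0 and zcr[a - 1] >= t3:
        a -= 1
    while f < len(energy) - 1 and zcr[f + 1] >= t3:
        f += 1
    return a, f

energy = [0.0, 1.0, 3.0, 5.0, 5.0, 3.0, 1.0, 0.0]
zcr = [0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.3, 0.1]
print(dual_threshold_endpoints(energy, zcr, t1=4.0, t2=2.0, t3=0.25))  # (1, 6)
```

The energy thresholds find the voiced core of the segment; the ZCR threshold then pulls in the low-energy, high-zero-crossing fricative edges that pure energy detection would miss.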
In addition, it should be noted that in the embodiments of the present application, a speech recognition model may be used when performing speech recognition. The speech recognition model may be, for example, a GMM-HMM (Gaussian mixture model combined with a hidden Markov model). The training process is as follows: use a number of command utterances and negative samples as training data; extract MFCC features from the training speech; then train the GMM-HMM model. During training, HMM parameter estimation may, for example, use the Baum-Welch algorithm, setting several states for each command-word category and for the noise category, with several Gaussian components per state; the GMM-HMM is trained for several iterations using the EM (Expectation Maximization) method to obtain the GMM-HMM model required by this application. The speech recognition process may then be: extract the MFCC (Mel-frequency cepstral coefficient) features of the detected sound segment, perform Viterbi decoding with the pre-trained GMM-HMM model, and output the recognition result.
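The Viterbi decoding step mentioned above can be sketched in log-probability form. This toy decoder assumes the per-frame state log-likelihoods have already been produced by a trained GMM-HMM; producing them (MFCC extraction, Baum-Welch training) is outside the scope of the sketch.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path for a sequence of per-frame log-likelihoods.

    obs_loglik[t][s]: log-likelihood of frame t under state s (e.g. from a GMM);
    log_trans[p][s]: log transition probability p -> s; log_init[s]: log prior.
    """
    n = len(log_init)
    delta = [log_init[s] + obs_loglik[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_loglik)):
        ptr, new_delta = [], []
        for s in range(n):
            p = max(range(n), key=lambda q: delta[q] + log_trans[q][s])
            ptr.append(p)
            new_delta.append(delta[p] + log_trans[p][s] + obs_loglik[t][s])
        back.append(ptr)
        delta = new_delta
    state = max(range(n), key=lambda s: delta[s])  # best final state
    path = [state]
    for ptr in reversed(back):                     # backtrack through pointers
        state = ptr[state]
        path.append(state)
    return path[::-1]

uniform = [[math.log(0.5)] * 2] * 2
obs = [[0.0, -10.0], [0.0, -10.0], [-10.0, 0.0]]  # frames favor state 0, 0, then 1
print(viterbi(obs, uniform, [math.log(0.5)] * 2))  # [0, 0, 1]
```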
Hereinafter, the present application is described in detail with a specific embodiment in conjunction with the accompanying drawings.
In this embodiment, when the smart hardware device has just been powered on, it may, for example, perform voice endpoint detection through the voice activity detection method by default; when certain conditions are met (that is, when the application scenario changes), it switches to the sliding-window detection method for a period of time, and then reverts to the voice activity detection method. The specific switching process is shown in FIG. 5 and may include, for example, the following steps.
Step S501: power on. The smart hardware device starts the voice endpoint detection function and, by default, performs voice endpoint detection through the voice activity detection method.
Step S502: acquire the audio stream in real time.
Step S503: intercept an audio segment from the audio stream through the voice activity detection method.
Step S504: detect whether sound is present in the intercepted audio segment. If so, perform step S505; otherwise, jump to step S503.
Step S505: perform the first speech recognition to obtain voice information.
Step S506: determine whether the voice information is a control instruction (first switching condition). If so, perform step S507; otherwise, perform step S508.
Step S507: execute the instruction, and after the instruction has been executed, perform step S509.
That is, if the result of the first speech recognition is a control instruction, human-machine interaction is performed. After the interaction ends, voice endpoint detection is performed through the sliding-window detection method for a period of t minutes (the total number of slides in this period is N). During sliding-window detection, speech recognition is performed on the audio segment within each sliding window.
Step S508: determine whether the voice length of the voice information is longer than L seconds (second switching condition). If so, perform step S509; otherwise, jump to step S503.
That is, if the result of the first speech recognition is not a control instruction, it is further determined whether the duration of the speech segment intercepted by the voice activity detection method exceeds a certain duration of L seconds (for example, 0.5 s). If the duration of the speech segment is less than L seconds, voice endpoint detection continues through the voice activity detection method, that is, the process returns to step S503. If the duration is greater than or equal to L seconds, the user is considered likely to be speaking continuously at this time. Therefore, voice endpoint detection is performed through the sliding-window detection method for t minutes, with speech recognition performed on the audio segment within each sliding window during the detection, as in steps S510 to S512.
Step S509: initialize the window slide count n = 0.
Step S510: perform one sliding-window detection and record the slide count n = n + 1.
Step S511: perform speech recognition.
Step S512: determine whether n < N. If so, perform step S510; otherwise, jump to step S503.
Here, steps S510 to S512 describe the operation and exit logic during the sliding-window detection period. If the speech recognition module recognizes a control instruction during sliding-window detection and recognition, sliding-window detection is suspended and the instruction action is executed instead. After the instruction action finishes, sliding-window detection continues until the N sliding-window detections for that period (i.e., t minutes) have been completed. After the N sliding-window detections end, the default voice activity detection method is restored for voice endpoint detection, that is, the process returns to step S503.
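The switching logic of FIG. 5 can be sketched as two small helpers: the mode-switch decision of steps S506/S508, and the bounded sliding-window phase of steps S509 to S512. The recognizer callable is a stand-in for the real speech recognition module, and the return conventions are illustrative assumptions.

```python
def should_enter_window_mode(is_command: bool, segment_secs: float,
                             l_secs: float = 0.5) -> bool:
    """Steps S506/S508: enter sliding-window mode if the first recognition
    yielded a control instruction, or the segment lasted at least L seconds."""
    return is_command or segment_secs >= l_secs

def sliding_window_phase(recognize, windows, n_max):
    """Steps S509-S512: run at most n_max window recognitions; a recognized
    command pauses detection, is executed, and detection then resumes.
    After n_max slides, control returns to the VAD method (step S503)."""
    executed = []
    for n, win in enumerate(windows, start=1):
        if n > n_max:                 # S512: n < N failed -> back to VAD
            break
        cmd = recognize(win)          # S511: per-window recognition
        if cmd is not None:
            executed.append(cmd)      # execute the instruction, then continue
    return executed

# Illustrative run with a stub recognizer:
recognize = lambda w: w if w.startswith("cmd") else None
print(should_enter_window_mode(False, 0.6))                            # True
print(sliding_window_phase(recognize, ["a", "cmd1", "b", "cmd2"], 3))  # ['cmd1']
```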
FIG. 6 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of this application. As shown in FIG. 6, the apparatus 600 may include a processor 601 and a memory 602.
The memory 602 is configured to store program code.
The processor 601 calls the program code and, when the program code is executed, performs the following operations:
intercepting an audio segment from an audio signal according to audio feature information of the audio signal;
determining, based on the audio segment, whether to perform a window recognition operation;
the window recognition operation comprising: moving a sampling window in the audio signal after the audio segment, and performing speech recognition on the audio signal within the sampling window.
The audio processing apparatus provided in this embodiment can be used to implement the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar to those of the method embodiments and are not repeated here.
A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments can be implemented by hardware related to program instructions. The foregoing program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (34)

  1. An audio processing method, characterized by comprising:
    intercepting an audio segment from an audio signal according to audio feature information of the audio signal;
    determining, based on the audio segment, whether to perform a window recognition operation;
    wherein the window recognition operation comprises: moving a sampling window in the audio signal after the audio segment, and performing speech recognition on the audio signal within the sampling window.
  2. The method according to claim 1, characterized in that the method further comprises:
    extracting voice information from the audio segment;
    wherein the determining, based on the audio segment, whether to perform a window recognition operation comprises:
    determining, based on the extracted voice information, whether to perform the window recognition operation.
  3. The method according to claim 1, characterized in that the method further comprises:
    extracting voice information from the audio segment;
    wherein the determining, based on the audio segment, whether to perform a window recognition operation comprises:
    determining whether the voice information matches the first voice information;
    if they do not match, determining whether the duration of the audio segment is greater than or equal to a first duration threshold;
    if it is not greater than or equal to the first duration threshold, performing, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  4. The method according to claim 1 or 3, characterized in that the determining, based on the audio segment, whether to perform a window recognition operation comprises performing the window recognition operation when any of the following events occurs:
    the voice information extracted from the audio segment matches first voice information; and/or,
    the duration of the audio segment is greater than or equal to a first duration threshold.
  5. The method according to claim 3 or 4, characterized in that the first voice information is set according to a vocabulary characterizing a first control instruction.
  6. The method according to any one of claims 1-5, characterized in that the determining, based on the audio segment, whether to perform a window recognition operation comprises:
    determining, based on the audio segment, whether a preset audio event occurs.
  7. The method according to claim 6, characterized in that the audio event comprises: one or more of the following sounds being present in the audio segment: knocking, applause, or hand clapping.
  8. The method according to any one of claims 2-7, characterized by further comprising:
    determining whether the voice information matches second voice information;
    if they match, performing, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  9. The method according to any one of claims 1-8, characterized by further comprising: in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    in response to a switching trigger event occurring, performing, for the audio signal after the position at which the sampling window is currently located, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  10. The method according to claim 9, characterized in that the switching trigger event comprises any one of the following:
    in the process of performing speech recognition on the audio signal within the sampling window, voice information matching third voice information is detected;
    the number of movements of the sampling window reaches a maximum number of movements; and
    the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches a second duration threshold.
  11. The method according to claim 8 or 10, characterized in that the second voice information or the third voice information is set according to a vocabulary characterizing a second control instruction;
    wherein the second control instruction has any of the following features:
    the processor suspends execution of other control instructions while executing the second control instruction;
    the second control instruction indicates that voice interaction with the user is suspended.
  12. The method according to claim 10 or 11, characterized in that, in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    the maximum number of movements and/or the movement step length of the sampling window are dynamically adjusted based on a recognition result of performing speech recognition on the audio signal of the sampling window.
  13. The method according to claim 12, characterized by further comprising:
    dynamically adjusting the maximum number of movements and/or the movement step length of the sampling window based on the number of instructions detected in the recognition result.
  14. The method according to claim 13, characterized in that the dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions comprises:
    in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  15. The method according to claim 13 or 14, characterized in that the dynamically adjusting the movement step length of the sampling window based on the number of detected instructions comprises:
    in response to the number of detected instructions exceeding a second preset value, adjusting the movement step length of the sampling window from a first step length to a second step length smaller than the first step length.
  16. The method according to any one of claims 13-15, characterized in that the dynamically adjusting the movement step length of the sampling window based on the number of detected instructions comprises:
    in response to the number of detected instructions being lower than a third preset value, adjusting the movement step length of the sampling window from the first step length to a third step length greater than the first step length.
  17. An audio processing apparatus, characterized by comprising: a processor and a memory;
    the memory being configured to store program code;
    the processor calling the program code and, when the program code is executed, performing the following operations:
    intercepting an audio segment from an audio signal according to audio feature information of the audio signal;
    determining, based on the audio segment, whether to perform a window recognition operation;
    wherein the window recognition operation comprises the following steps: moving a sampling window in the audio signal after the audio segment, and performing speech recognition on the audio signal within the sampling window.
  18. The apparatus according to claim 17, characterized in that the processor is further configured to:
    extract voice information from the audio segment;
    determine, based on the extracted voice information, whether to perform the window recognition operation.
  19. The apparatus according to claim 17, characterized in that the processor is further configured to:
    extract voice information from the audio segment;
    determine whether the voice information matches the first voice information;
    if they do not match, determine whether the duration of the audio segment is greater than or equal to a first duration threshold;
    if it is not greater than or equal to the first duration threshold, perform, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  20. The apparatus according to claim 17 or 19, characterized in that the processor is further configured to perform the window recognition operation when any of the following events occurs:
    the voice information extracted from the audio segment matches first voice information; and/or,
    the duration of the audio segment is greater than or equal to a first duration threshold.
  21. The apparatus according to claim 19 or 20, characterized in that the first voice information is set according to a vocabulary characterizing a first control instruction.
  22. The apparatus according to any one of claims 17-21, characterized in that the processor is further configured to:
    determine, based on the audio segment, whether a preset audio event occurs.
  23. The apparatus according to claim 22, characterized in that the audio event comprises: one or more of the following sounds being present in the audio segment: knocking, applause, or hand clapping.
  24. The apparatus according to any one of claims 17-23, characterized in that the processor is further configured to:
    determine whether the voice information matches second voice information;
    if they match, perform, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  25. The apparatus according to any one of claims 17-24, characterized in that the processor is further configured to: in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    in response to a switching trigger event occurring, perform, for the audio signal after the position at which the sampling window is currently located, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  26. The apparatus according to claim 25, characterized in that the switching trigger event comprises any one of the following:
    in the process of performing speech recognition on the audio signal within the sampling window, voice information matching third voice information is detected;
    the number of movements of the sampling window reaches a maximum number of movements; and
    the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches a second duration threshold.
  27. The apparatus according to claim 24 or 26, characterized in that the second voice information or the third voice information is set according to a vocabulary characterizing a second control instruction;
    wherein the second control instruction has any of the following features:
    the processor suspends execution of other control instructions while executing the second control instruction;
    the second control instruction indicates that voice interaction with the user is suspended.
  28. The apparatus according to claim 26 or 27, characterized in that, in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    the maximum number of movements and/or the movement step length of the sampling window are dynamically adjusted based on a recognition result of performing speech recognition on the audio signal of the sampling window.
  29. The apparatus according to claim 28, characterized in that the processor is further configured to:
    dynamically adjust the maximum number of movements and/or the movement step length of the sampling window based on the number of instructions detected in the recognition result.
  30. The apparatus according to claim 29, characterized in that the processor is further configured to:
    in response to the number of detected instructions exceeding a first preset value, adjust the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  31. The apparatus according to claim 29 or 30, characterized in that the processor is further configured to:
    in response to the number of detected instructions exceeding a second preset value, adjust the movement step length of the sampling window from a first step length to a second step length smaller than the first step length.
  32. The apparatus according to any one of claims 29-31, characterized in that the processor is further configured to:
    in response to the number of detected instructions being lower than a third preset value, adjust the movement step length of the sampling window from the first step length to a third step length greater than the first step length.
  33. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program contains at least one piece of code, and the at least one piece of code is executable by a computer to control the computer to perform the audio processing method according to any one of claims 1-16.
  34. A computer program, characterized in that, when the computer program is executed by a computer, it is used to implement the audio processing method according to any one of claims 1-16.
PCT/CN2020/073292 2020-01-20 2020-01-20 Audio processing method and device WO2021146857A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080004356.6A CN112543972A (en) 2020-01-20 2020-01-20 Audio processing method and device
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Publications (1)

Publication Number Publication Date
WO2021146857A1 (en)

Family

ID=75017359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Country Status (2)

Country Link
CN (1) CN112543972A (en)
WO (1) WO2021146857A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238607B (en) * 2021-12-17 2022-11-22 北京斗米优聘科技发展有限公司 Deep interactive AI intelligent job-searching consultant method, system and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US20090222264A1 (en) * 2008-02-29 2009-09-03 Broadcom Corporation Sub-band codec with native voice activity detection
CN107305774A (en) * 2016-04-22 2017-10-31 腾讯科技(深圳)有限公司 Speech detection method and device
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device
CN110277087A (en) * 2019-07-03 2019-09-24 四川大学 A kind of broadcast singal anticipation preprocess method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8650029B2 * | 2011-02-25 | 2014-02-11 | Microsoft Corporation | Leveraging speech recognizer feedback for voice activity detection
CN106601230B * | 2016-12-19 | 2020-06-02 | 苏州金峰物联网技术有限公司 | Logistics sorting place name voice recognition method and system based on continuous Gaussian mixture HMM model and logistics sorting system
US10431242B1 * | 2017-11-02 | 2019-10-01 | Gopro, Inc. | Systems and methods for identifying speech based on spectral features

Also Published As

Publication number | Publication date
CN112543972A (en) | 2021-03-23

Similar Documents

Publication Publication Date Title
EP3577645B1 (en) End of query detection
Li et al. Robust endpoint detection and energy normalization for real-time speech and speaker recognition
EP2898510B1 (en) Method, system and computer program for adaptive control of gain applied to an audio signal
US11037584B2 (en) Direction based end-pointing for speech recognition
JP2016505888A (en) Speech recognition power management
WO2017154282A1 (en) Voice processing device and voice processing method
WO2015103836A1 (en) Voice control method and device
JPH11153999A (en) Speech recognition device and information processor using the same
US20230298575A1 (en) Freeze Words
US11348579B1 (en) Volume initiated communications
WO2021146857A1 (en) Audio processing method and device
US11676594B2 (en) Decaying automated speech recognition processing results
US11341988B1 (en) Hybrid learning-based and statistical processing techniques for voice activity detection
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
CN111739515A (en) Voice recognition method, device, electronic device, server and related system
KR20120111510A (en) A system of robot controlling of using voice recognition
CN112435691B (en) Online voice endpoint detection post-processing method, device, equipment and storage medium
WO2021016925A1 (en) Audio processing method and apparatus
Abdulkader et al. Multiple-instance, cascaded classification for keyword spotting in narrow-band audio
US20240079004A1 (en) System and method for receiving a voice command
US11081114B2 (en) Control method, voice interaction apparatus, voice recognition server, non-transitory storage medium, and control system
KR100281582B1 (en) Speech Recognition Method Using the Recognizer Resource Efficiently
CN111768800A (en) Voice signal processing method, apparatus and storage medium
CN116052643A (en) Voice recognition method, device, storage medium and equipment

Legal Events

Code | Description
121 | EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20916099; Country of ref document: EP; Kind code of ref document: A1)
NENP | Non-entry into the national phase (Ref country code: DE)
122 | EP: PCT application non-entry in European phase (Ref document number: 20916099; Country of ref document: EP; Kind code of ref document: A1)