WO2021146857A1 - Audio processing method and device - Google Patents

Audio processing method and device

Info

Publication number
WO2021146857A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
window
audio signal
sampling window
voice information
Prior art date
Application number
PCT/CN2020/073292
Other languages
English (en)
Chinese (zh)
Inventor
吴俊峰
罗东阳
童焦龙
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/073292 priority Critical patent/WO2021146857A1/fr
Priority to CN202080004356.6A priority patent/CN112543972A/zh
Publication of WO2021146857A1 publication Critical patent/WO2021146857A1/fr

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • This application relates to the field of audio technology, and in particular to an audio processing method and device.
  • Voice interaction is a common way of human-computer interaction.
  • When performing voice recognition for human-computer interaction, it is necessary to perform voice endpoint detection first, that is, to detect which parts of the sound recorded by the microphone may contain voice.
  • The anti-interference ability and computing power consumption of various voice endpoint detection methods differ from one another; how to balance anti-interference ability against computing power consumption has become an urgent problem in voice endpoint detection.
  • In view of this, the embodiments of the present application provide an audio processing method and device to solve the technical problem in the prior art that anti-interference ability and computing power consumption cannot both be taken into account during voice endpoint detection, resulting in poor anti-interference ability or high computing power consumption.
  • In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from the audio signal according to audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following operations: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory, where the memory is used to store program code, and the processor calls the program code; when the program code is executed, it is used to perform the following operations: intercepting an audio segment from the audio signal according to the audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following operations: moving the sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes at least one piece of code, and the at least one piece of code can be executed by a computer to control the computer to execute the audio processing method according to any one of the above-mentioned first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer program, which, when executed by a computer, is used to implement the audio processing method according to any one of the above-mentioned first aspect.
  • The embodiments of the present application provide an audio processing method and device. By selecting and switching the voice endpoint detection method according to the characteristics of the specific application scenario during voice endpoint detection, both anti-interference ability and computing power consumption can be taken into account.
  • FIG. 1A is a first schematic diagram of an application scenario of an audio processing method provided by an embodiment of this application;
  • FIG. 1B is a second schematic diagram of an application scenario of the audio processing method provided by an embodiment of the application.
  • FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the application.
  • FIGS. 3A-3C are schematic diagrams, provided by an embodiment of the application, of audio sub-segments excluding the noise of an audio segment;
  • FIG. 4 is a schematic flowchart of an audio processing method provided by another embodiment of this application.
  • FIG. 5 is a schematic flowchart of a voice activity detection method provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of the application.
  • the audio processing method provided in the embodiments of the present application can be applied to any audio processing process that requires voice endpoint detection, and the audio processing method can be specifically executed by an audio processing device.
  • the audio processing device may be a device including an audio collection module (for example, a microphone).
  • A schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1A.
  • The audio collection module of the audio processing device can collect the voice spoken by the user to obtain audio signals, and the processor of the audio processing device can process the audio signals collected by the audio collection module using the audio processing method provided in the embodiments of the present application.
  • FIG. 1A is only a schematic diagram, and does not limit the structure of the audio processing device.
  • For example, an amplifier may be connected between the microphone and the processor to amplify the audio signal collected by the microphone. For another example, a filter may be connected between the microphone and the processor to filter the audio signal collected by the microphone.
  • the audio processing device may also be a device that does not include an audio collection module.
  • a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1B.
  • the communication interface of the audio processing device may receive audio signals collected by other devices or equipment, and the processor of the audio processing device may process the received audio signals using the audio processing method provided in the embodiments of the present application.
  • FIG. 1B is only a schematic diagram, and does not limit the structure of the audio processing device and the connection mode between the audio processing device and other devices or equipment.
  • the communication interface in the audio processing device can be replaced with a transceiver.
  • The type of equipment that includes the audio processing device is not limited in the embodiments of the present application.
  • the equipment may be, for example, smart speakers, smart lighting devices, smart robots, mobile phones, tablet computers, and the like.
  • The audio processing method provided by the embodiments of the present application can automatically switch between the voice activity detection method and the sliding window detection method according to the characteristics of the specific application scenario, so that the anti-interference ability of voice endpoint detection can be improved as a whole and the computing power consumption of voice endpoint detection can be reduced. That is, the anti-interference ability of voice endpoint detection can be stronger than that of purely using the voice activity detection method, and at the same time, the computing power consumption of voice endpoint detection can be less than that of purely using the sliding window detection method.
  • The audio processing method provided by the embodiments of this application can be applied to the field of voice recognition, smart hardware devices with voice recognition functions, the field of audio event detection, smart hardware devices with audio event detection functions, and the like, which is not limited here.
  • the execution subject of this embodiment may be an audio processing device, and specifically may be a processor of the audio processing device. As shown in FIG. 2, the method of this embodiment may include step S201 and step S202, for example.
  • In step S201, an audio segment is intercepted from the audio signal according to the audio feature information of the audio signal.
  • In step S201, the audio segment is intercepted based on the audio feature information. Since the intercepted endpoints may be inaccurate, the intercepted audio segment usually contains noise; if such an audio segment is used directly for speech recognition, the accuracy is not high.
  • The audio features may include, for example, Mel Frequency Cepstrum Coefficient (MFCC) features, Linear Prediction Coefficients (LPC) features, and filter bank (Fbank) features.
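  • As a concrete illustration, the three feature types above could be computed with the open-source librosa library as sketched below; the library choice, the file name, and all parameter values are illustrative assumptions, not part of the patent disclosure.

```python
# Sketch: extracting MFCC, Fbank, and LPC features with librosa.
# Library choice and parameter values are illustrative assumptions.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# MFCC features, shape (n_mfcc, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Fbank (log mel filter bank) features, shape (n_mels, n_frames)
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))

# LPC coefficients of order 12 for the whole signal
lpc = librosa.lpc(y, order=12)

print(mfcc.shape, fbank.shape, lpc.shape)
```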
  • a voice endpoint detection method with lower computing power consumption may be used to process the audio signal.
  • The voice activity detection method recognizes whether the audio signal is a voice signal through methods such as the energy and zero-crossing-rate double-threshold method or a noise-speech classification model, and intercepts a complete segment at a time for speech recognition. The voice activity detection method can therefore effectively distinguish silence from non-silence and can save computing power when used for speech recognition.
  • a voice activity detection method may be used to intercept audio segments in the audio signal according to the audio feature information of the audio signal.
  • However, the voice activity detection method cannot effectively distinguish voice-type noise or non-speech noise with slightly higher energy.
  • When the voice activity detection method is used for command-word speech recognition, noise is easily mixed before and after the command speech segment, thereby reducing the voice command recognition rate. In contrast, the sliding window detection method can extract speech segments of a certain length from the audio stream and continuously perform speech recognition on them. This method can effectively prevent noise from being mixed before and after the command-word speech segment; however, because the sliding window detection method performs speech recognition intensively, keeping it running on a smart hardware device consumes significant computing resources and power. Therefore, the voice activity detection method can be used for audio processing in application scenarios with low anti-interference requirements to reduce computing power consumption, while the sliding window detection method can be used in application scenarios with high anti-interference requirements to improve the anti-interference ability.
  • In step S202, it is determined whether to perform a window recognition operation based on the audio segment obtained in step S201.
  • the window recognition operation may include, for example, the following operations: moving a sampling window in the audio signal after the audio segment obtained in step S201, and performing voice recognition on the audio signal in the sampling window.
  • For example, as shown in FIGS. 3A to 3C, the sliding window detection method can exclude, in one or more windows, the noise included in the audio segments X1, X2, and X3 cut out by voice activity detection. It should be noted that the parts filled with grid lines in FIGS. 3A to 3C represent noise.
  • In step S202, it can be determined whether the audio segment obtained in step S201 contains, or may contain, voice-type noise or non-speech noise with slightly higher energy, and then whether to perform a window recognition operation is determined accordingly. If it is determined that the audio segment obtained in step S201 contains or may contain such noise, a window recognition operation is performed; otherwise, step S201 and step S202 continue to be performed.
  • For command-type voice information, noise is easily mixed before and after it, which easily leads to inaccurate speech segment interception and/or inaccurate speech recognition. Therefore, if command-type voice information is detected in the audio segment obtained in step S201, a window recognition operation is performed to eliminate the influence of noise.
  • In such cases, a window recognition operation is performed to avoid inaccurate audio segment interception and/or inaccurate speech recognition caused by the inability of the voice activity detection method to effectively distinguish this type of information.
  • That is, when processing audio signals, the corresponding voice endpoint detection method can be selected according to the characteristics of the specific application scenario. In application scenarios with low anti-interference requirements, the voice activity detection method can be used to process the audio signal to save computing power; in application scenarios with high anti-interference requirements, the sliding window detection method can be used to process the audio signal to improve the anti-interference ability. This takes into account both the high anti-interference ability of the sliding window detection method and the low computing power consumption of the voice activity detection method. Therefore, the anti-interference ability of voice endpoint detection is generally stronger than that of purely using the voice activity detection method, and at the same time, the computing power consumption of voice endpoint detection is generally less than that of purely using the sliding window detection method.
  • In step S202, whether instruction-type voice information exists in the audio segment obtained in step S201 can be determined in various ways.
  • For example, voice information may be extracted from the audio segment obtained in step S201, and the content and/or voice length of the extracted voice information may be used to determine whether instruction information exists or may exist; whether to execute the window recognition operation is then determined.
  • the method may further include, for example, extracting voice information from an audio segment.
  • Correspondingly, step S202 may include, for example, judging whether to perform a window recognition operation based on whether the extracted voice information satisfies a specific condition (for example, whether the voice information contains instruction-type information).
  • it may be determined whether to perform a window recognition operation based on the content and the length of the extracted voice information.
  • If the extracted voice information contains instruction-type information, the sliding window detection method is used for voice endpoint detection to prevent inaccurate speech segment interception and/or inaccurate speech recognition caused by too much noise being mixed before and after the instruction.
  • In addition, because the vocabulary of instruction words is limited, the voice length corresponding to voice information containing an instruction word is also limited. In this way, whether an instruction word exists in the voice information can be roughly estimated from the voice length of the extracted voice information, and whether to perform a window recognition operation can then be determined.
  • In other words, for the audio segment obtained in step S201, the voice information contained therein can be extracted first; then, based on the content and/or voice length of the voice information, whether instruction information exists in the audio segment is determined or estimated, and whether to perform the window recognition operation is determined from the result. This avoids the inaccuracy caused by excessive noise during speech segment interception and speech recognition of command-type voice information.
  • For another example, voice information may be extracted from the audio segment obtained in step S201, and whether instruction information exists or may exist can be determined according to the content of the extracted voice information and/or the duration of the audio segment; whether to perform the window recognition operation is then determined.
  • the method may further include, for example, extracting voice information from the audio segment obtained in step S201.
  • Correspondingly, step S202 may include the following operations. It is first judged whether the extracted voice information matches the first voice information. If it does not match, it is then determined whether the duration of the audio segment is greater than or equal to the first duration threshold. If the duration is less than the first duration threshold, step S201 and step S202 are executed for the audio signal after the audio segment.
  • Alternatively, step S202 may include the following operations. It is first determined whether the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold. If the duration is less than the first duration threshold, it is then determined whether the extracted voice information matches the first voice information. If it does not match, step S201 and step S202 are executed for the audio signal after the audio segment.
  • The first voice information may be set according to vocabulary characterizing an instruction (such as the first control instruction).
  • the first voice information may be set according to a vocabulary that characterizes all control instructions applied to the smart hardware device.
  • Because the vocabulary of instruction words is limited, the voice length corresponding to voice information containing an instruction word is also limited, and the duration of an audio segment containing an instruction word is also limited. Therefore, whether an instruction word exists in the audio segment can be roughly estimated from the duration of the audio segment obtained in step S201, and whether to perform a window recognition operation can then be determined.
  • the first duration threshold may be set according to the maximum duration, average duration, or any duration of the audio segment containing the instruction (such as the first control instruction).
  • In other words, for the audio segment obtained in step S201, the voice information contained therein can be extracted, and based on the content of the voice information and the duration of the audio segment, whether instruction information exists in the audio segment is determined or estimated; whether to perform the window recognition operation is then determined from the estimation result. In this way, the inaccuracy caused by excessive noise during speech segment interception and speech recognition of command-type voice information can also be avoided.
  • The method may further include, for example, performing a window recognition operation when either of the following events occurs: the voice information extracted from the audio segment obtained in step S201 matches the first voice information (referred to as event 1); and/or the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold (referred to as event 2).
  • the setting of the first voice information and the first duration threshold is the same or similar to the setting method in the foregoing embodiment, which is not repeated here in the embodiment of the present application.
  • Specifically, the voice information may be extracted from the audio segment obtained in step S201, and the extracted voice information may then be matched against the first voice information, so as to determine whether the extracted voice information matches the first voice information.
  • Similarly, the duration of the audio segment obtained in step S201 may be determined first and then compared against the first duration threshold.
  • In this way, one or more of event 1 and event 2 can be used to determine or estimate that the audio segment obtained in step S201 contains or may contain instruction information, so that the window recognition operation can be performed to intercept the speech segment. The noise before and after the instruction is thereby excluded, making the speech segment interception result more accurate and, in turn, the speech recognition result obtained from it more accurate. A minimal sketch of this decision rule is shown below.
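  • The following sketch condenses the decision rule above (perform the window recognition operation on event 1 or event 2); the instruction vocabulary and the threshold value are hypothetical placeholders, not values prescribed by the patent.

```python
# Sketch of the switching decision: perform the window recognition
# operation if the extracted voice information matches the first voice
# information (event 1) or the segment duration reaches the first
# duration threshold (event 2). Vocabulary and threshold are assumed.
FIRST_VOICE_INFO = {"take off", "land", "turn left"}  # hypothetical instruction vocabulary
FIRST_DURATION_THRESHOLD = 0.5  # seconds, hypothetical value

def should_perform_window_recognition(voice_text: str,
                                      segment_duration: float) -> bool:
    event1 = voice_text in FIRST_VOICE_INFO                 # event 1: content match
    event2 = segment_duration >= FIRST_DURATION_THRESHOLD   # event 2: duration
    return event1 or event2
```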
  • In step S202, whether non-speech information with slightly higher energy exists in the audio segment obtained in step S201 can also be detected in a variety of ways.
  • For example, step S202 may include judging whether a preset audio event has occurred based on the audio segment obtained in step S201, so as to determine whether non-speech information with higher energy exists in the audio segment.
  • If the preset audio event occurs, the window recognition operation can be performed, so that the noise can be excluded when the audio segment is intercepted. This makes the interception result more accurate and, in turn, the resulting audio processing result more accurate.
  • If the preset audio event does not occur, it is considered that no non-speech information with higher energy exists in the audio segment, and the window recognition operation may not be performed.
  • The foregoing preset audio event may include, for example, the occurrence of one or more of the following sounds in the audio segment: percussive sounds, applause, and clapping sounds.
  • The above-mentioned preset audio events can be set according to non-speech information with slightly higher energy that often appears in the application scenario, such as percussive sounds, applause, and clapping sounds.
  • Switching to the sliding window detection method for voice endpoint detection can prevent excessive noise from being mixed before and after the instruction, which would otherwise lead to inaccurate speech segment interception and/or inaccurate speech recognition.
  • However, during actual use, the user may not speak continuously; instead, after issuing an instruction, the user may not issue any further instructions for a long period of time.
  • For example, the user says “please go to sleep” to a smart hardware device with a voice recognition function (such as a robot), or the user says “please fly around A twice” to a smart hardware device with a voice recognition function (such as a drone), and so on.
  • In these cases, the user generally will not issue other control instructions for a long period after the instruction is issued. Therefore, if these special command-type voice messages are detected, switching to the sliding window detection method for voice endpoint detection would not significantly improve the anti-interference ability and would instead increase computing power consumption. Therefore, in the embodiments of the present application, when these special circumstances occur, the method may refrain from switching to the sliding window detection method, thereby avoiding the drawback that switching would not significantly improve the anti-interference ability while increasing computing power consumption.
  • In the process of judging whether to perform the window recognition operation based on the audio segment obtained in step S201, it is also possible to detect whether specific voice information appears. If specific voice information is detected, step S201 and step S202 are executed for the audio signal after the audio segment obtained in step S201, without performing the window recognition operation.
  • Correspondingly, step S202 may include, for example, extracting voice information from the audio segment obtained in step S201, and determining whether the extracted voice information matches the second voice information. If it matches, step S201 and step S202 are executed for the audio signal after the audio segment obtained in step S201.
  • the above-mentioned second voice information may be set according to a vocabulary characterizing the second control instruction, for example.
  • The second control instruction may have either of the following features: the processor suspends execution of other control instructions during the execution of the second control instruction (feature 1 for short); or the second control instruction represents a pause in voice interaction with the user (feature 2 for short).
  • The second control instruction with feature 1 may be, for example, “please fly around A twice” in the above example.
  • The second control instruction with feature 2 may be, for example, “please go to sleep” in the above example.
  • the second control instruction may be any control instruction having the above-mentioned feature 1 and/or feature 2, which is not limited in the embodiment of the present application.
  • In some embodiments, after switching to the sliding window detection method (that is, after the window recognition operation begins), if the specific application scenario is found to change again, the method can also switch back from the sliding window detection method to the previously used voice endpoint detection method, so as to save as much computing power as possible.
  • Correspondingly, the method may further include: in the process of moving the sampling window in the audio signal after the audio segment obtained in step S201 and performing voice recognition on the audio signal in the sampling window (that is, in the process of performing the window recognition operation), in response to a switching trigger event, executing step S201 and step S202 for the audio signal after the current position of the sampling window.
  • The aforementioned switching trigger event may include, for example, any one or more of the following: voice information matching the third voice information is detected in the process of performing voice recognition on the audio signal in the sampling window; the number of movements of the sampling window reaches the maximum number of movements; or the processing duration of moving the sampling window and performing voice recognition on the audio signal in the sampling window reaches the second duration threshold.
  • the third voice information may be set according to a vocabulary characterizing the second control instruction, for example.
  • the second control instruction has any of the following characteristics: the processor suspends execution of other control instructions during the execution of the second control instruction; the second control instruction represents the suspension of voice interaction with the user.
  • the foregoing third voice information may be the same as or similar to the second voice information in the foregoing embodiment, and details are not described herein again in this embodiment of the present application.
  • the second control instruction in the embodiment of the present application is the same as the second control instruction in the foregoing embodiment, and the description of the embodiment of the present application is not repeated here.
  • the window recognition operation not only affects the overall anti-interference ability, but also affects the overall computing power consumption.
  • Windows with different related parameters differ in their noise-exclusion capabilities and in the computing power they consume. Therefore, in the process of performing the window recognition operation, the relevant parameters of the window can be dynamically adjusted according to the voice recognition result, so as to best balance anti-interference ability and computing power consumption.
  • Sliding window detection refers to setting a sliding window with a fixed window length of W milliseconds and, each time, taking only the audio signal in the sliding window for recognition; the sliding window gradually slides backward from the starting point of a certain segment of audio signal in steps of S milliseconds, so as to achieve sliding window detection.
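  • A minimal sketch of this sliding window loop, assuming a NumPy signal array and a stand-in recognizer, might look as follows; the default values of W, S, and N are illustrative only.

```python
# Sketch of sliding window detection: a window of W milliseconds slides
# backward through the signal in steps of S milliseconds, and only the
# samples inside the window are passed to speech recognition each time.
# `recognize` is a stand-in for any speech recognizer.
import numpy as np

def sliding_window_detect(signal: np.ndarray, sr: int, recognize,
                          W: int = 1000, S: int = 250, N: int = 40):
    win = int(sr * W / 1000)   # window length in samples
    step = int(sr * S / 1000)  # step size in samples
    results = []
    start = 0
    for _ in range(N):         # at most N moves of the sampling window
        frame = signal[start:start + win]
        if len(frame) < win:
            break              # ran out of audio
        results.append(recognize(frame))
        start += step
    return results
```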
  • The relevant parameters of the window may include, for example, the maximum number of movements of the sampling window (that is, the total number of sliding times N) and/or the movement step size (that is, the step size S).
  • Dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window may include, for example, dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the number of instructions detected in the recognition result.
  • Specifically, dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  • For example, the original total number of sliding times N can be increased to 2N or another number, so as to exclude noise by performing more sliding window detections, thereby improving the anti-interference ability.
  • Similarly, dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a second preset value, adjusting the movement step size of the sampling window from a first step size to a second step size smaller than the first step size.
  • the second preset value may be the same as or different from the first preset value in the foregoing embodiment, which is not limited in the embodiment of the present application.
  • For example, the original sliding step size of S milliseconds can be reduced to S/2 milliseconds or another step size, so as to exclude noise by performing sliding window detection with a smaller sliding step size, thereby improving the anti-interference ability.
  • In addition, dynamically adjusting the movement step size of the sampling window may further include: in response to the number of detected instructions being lower than a third preset value (such as M3, where M3 ≤ N), adjusting the movement step size of the sampling window from the first step size to a third step size larger than the first step size.
  • the third preset value may be smaller than the second preset value in the foregoing embodiment, for example.
  • If the number of detected instructions is lower than a certain value (such as M3, where M3 ≤ N), it means that although the user exhibits speaking behavior during this period, it may not be continuous speaking behavior, or, even if it is continuous, it is not a behavior of continuously issuing instructions.
  • In this case, the original sliding step size of S milliseconds can be increased to 2S milliseconds or another step size, so that noise can still be excluded by sliding window detection with a larger sliding step size while computing power is saved, thereby balancing anti-interference ability and computing power consumption as much as possible during the window recognition operation.
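  • The adjustment rules above can be condensed into a short sketch; the preset values M1, M2, and M3 are hypothetical, while the concrete adjustments (2N, S/2, 2S) follow the examples given in the text.

```python
# Sketch of the dynamic parameter adjustment described above.
# M1, M2, M3 are assumed preset values; the adjustments 2N, S//2 and 2S
# follow the examples in the text.
def adjust_window_params(num_instructions: int, N: int, S: int,
                         M1: int = 5, M2: int = 5, M3: int = 1):
    if num_instructions > M1:
        N = 2 * N      # more slides -> stronger noise exclusion
    if num_instructions > M2:
        S = S // 2     # finer step -> stronger anti-interference
    elif num_instructions < M3:
        S = 2 * S      # coarser step -> lower computing power consumption
    return N, S
```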
  • In some embodiments, step S201 may include, for example, the following steps S401 to S404.
  • In step S401, the audio signal is first divided into frames, and the short-term average energy of each frame is then calculated frame by frame to obtain the short-term energy envelope.
  • Specifically, the audio signal can be divided into frames of a fixed duration. Taking a fixed duration of 1 second as an example, the 0th to 1st second of the audio signal forms one frame, the 1st to 2nd second forms one frame, the 2nd to 3rd second forms one frame, and so on, thus completing the framing of the audio signal.
  • The short-term average energy of a frame is the average energy of the frame, and the short-term energy envelope can be obtained by connecting the short-term average energy values of successive frames.
  • In step S402, a higher threshold T1 is selected, and the first and last intersection points of T1 with the short-term energy envelope are marked as C and D.
  • The part higher than the threshold T1 is considered to have a higher probability of being speech.
  • In step S403, a lower threshold T2 is selected, and the intersection point B of T2 with the short-term energy envelope to the left of C and the intersection point E to the right of D are located.
  • T2 is less than T1.
  • In step S404, the short-term average zero-crossing rate of each frame is calculated, a threshold T3 is selected, and the intersection point A of T3 with the short-term average zero-crossing rate curve to the left of B and the intersection point F to the right of E are located.
  • the short-term average zero-crossing rate of a frame is the average zero-crossing rate of the frame.
  • Intersection points A and F are the two endpoints of the audio signal determined based on the voice activity detection method, and the audio segment from intersection point A to intersection point F is the audio segment intercepted from the audio signal based on the voice activity detection method.
  • Steps S401 to S404 take intercepting only one audio segment from a segment of audio signal as an example. It is understandable that multiple audio segments can also be intercepted from a segment of audio signal, which is not limited in the embodiments of the present application.
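  • A compact sketch of steps S401 to S404 is given below; the frame size and the threshold values T1, T2, and T3 are illustrative assumptions, since the patent does not prescribe concrete values.

```python
# Sketch of double-threshold endpoint detection (steps S401-S404):
# frame the signal, compute the short-term energy envelope and the
# zero-crossing rate, then widen the speech region found with the high
# energy threshold T1 using the lower thresholds T2 (energy) and T3
# (zero-crossing rate). Frame size and thresholds are assumptions.
import numpy as np

def detect_endpoints(signal: np.ndarray, sr: int, frame_ms: int = 25,
                     t1: float = 0.1, t2: float = 0.02, t3: float = 0.1):
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, n)]
    energy = np.array([np.mean(f ** 2) for f in frames])            # S401
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])

    above_t1 = np.where(energy > t1)[0]
    if len(above_t1) == 0:
        return None                     # no frame exceeds T1: no speech
    c, d = above_t1[0], above_t1[-1]                                 # S402
    b = c
    while b > 0 and energy[b - 1] > t2:                              # S403
        b -= 1
    e = d
    while e < len(frames) - 1 and energy[e + 1] > t2:
        e += 1
    a = b
    while a > 0 and zcr[a - 1] > t3:                                 # S404
        a -= 1
    f = e
    while f < len(frames) - 1 and zcr[f + 1] > t3:
        f += 1
    return a * n, (f + 1) * n           # sample indices of endpoints A and F
```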
  • In some embodiments, when performing speech recognition, a speech recognition model may be used.
  • The speech recognition model can be, for example, a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, and the training process is as follows: use several command speech samples and negative samples as training data; perform MFCC feature extraction on the training speech; and then train the GMM-HMM model.
  • For HMM parameter estimation, the Baum-Welch algorithm can be used, for example, with several states set for the instruction-word categories and the noise category, as well as a chosen number of Gaussians.
  • The speech recognition process can be: extracting the MFCC (Mel Frequency Cepstral Coefficient) features of the detected sound segment, performing Viterbi decoding using the pre-trained GMM-HMM model, and outputting the recognition result.
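  • The training and recognition pipeline above could be sketched with librosa and hmmlearn as follows; the library choice, the state count, and the Gaussian count are assumptions, since the patent does not prescribe an implementation. In hmmlearn, fit() performs Baum-Welch (EM) estimation and decode() performs Viterbi decoding; here score() (log-likelihood) is used to pick the best class model.

```python
# Sketch of a GMM-HMM command-word recognizer: librosa for MFCC
# extraction, hmmlearn for the GMM-HMM. State and Gaussian counts are
# illustrative assumptions.
import librosa
import numpy as np
from hmmlearn.hmm import GMMHMM

def mfcc_features(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

def train_class_model(wav_paths, n_states: int = 5, n_mix: int = 3) -> GMMHMM:
    feats = [mfcc_features(p) for p in wav_paths]
    X = np.vstack(feats)
    lengths = [len(f) for f in feats]
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)
    model.fit(X, lengths)  # Baum-Welch parameter estimation
    return model

def recognize(segment_path: str, models: dict) -> str:
    # `models` maps a label (each command word, plus "noise" trained on
    # negative samples) to its trained GMMHMM; the best-scoring label wins.
    feats = mfcc_features(segment_path)
    return max(models, key=lambda label: models[label].score(feats))
```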
  • For example, when the smart hardware device is just turned on, the voice activity detection method can be used by default to perform voice endpoint detection; when certain conditions are met (that is, when the application scenario changes), the device switches to the sliding window detection method for a period of time and then returns to the voice activity detection method. The specific switching process is shown in FIG. 5 and may include, for example, the following steps.
  • Step S501: Power on. The smart hardware device activates the voice endpoint detection function and performs voice endpoint detection through the voice activity detection method by default.
  • Step S502: Acquire an audio stream in real time.
  • Step S503: Intercept the audio segment in the audio stream by the voice activity detection method.
  • Step S504: Detect whether there is sound in the intercepted audio segment. If so, go to step S505; otherwise, go to step S503.
  • Step S505: Perform the first voice recognition to obtain voice information.
  • Step S506: Determine whether the voice information is a control instruction (the first switching condition). If so, go to step S507; otherwise, go to step S508.
  • Step S507: Execute the instruction, and execute step S509 after the instruction is executed. That is, if the result of the first voice recognition is a control instruction, the instruction is executed and human-computer interaction is performed; after the instruction is executed, voice endpoint detection is performed through the sliding window detection method within a period of t minutes (the total number of sliding times in this period is N), and voice recognition is performed on the audio segments in each sliding window.
  • Step S508: Determine whether the voice length of the voice information is longer than L seconds (the second switching condition). If so, go to step S509; otherwise, go to step S503.
  • The second switching condition is that the duration of the speech segment intercepted by the voice activity detection method exceeds a certain duration of L seconds (for example, 0.5 s). If the duration of the speech segment is less than L seconds, voice endpoint detection continues to be performed through the voice activity detection method, that is, the flow returns to step S503. If the duration of the speech segment is greater than or equal to L seconds, it is considered that the user is likely to exhibit continuous speaking behavior at this time. Therefore, voice endpoint detection is performed by the sliding window detection method within t minutes, and voice recognition is performed on the audio segments in each sliding window during the sliding window detection period; the flow is as in steps S510 to S512.
  • Step S511: Perform voice recognition.
  • Step S512: Judge whether n < N. If so, go to step S510; otherwise, go to step S503.
  • Steps S510 to S512 describe the operation logic and exit logic during the sliding window detection period. If the voice recognition module recognizes a control instruction during sliding window detection and recognition, sliding window detection is suspended and the instruction action is executed instead. After the execution of the instruction action ends, sliding window detection continues until the N sliding window detections for the period (that is, t minutes) have been executed. After the N sliding window detections end, the default voice activity detection method is restored for voice endpoint detection, that is, the flow returns to step S503.
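  • The switching flow of FIG. 5 can be condensed into the following sketch; all helper callables and threshold values are stand-ins for the components described above, not the patent's reference implementation.

```python
# Sketch of the FIG. 5 flow (steps S501-S512): start in VAD mode, switch
# to sliding window detection when a control instruction is recognized or
# the intercepted segment exceeds L seconds, run at most N sliding
# windows (about t minutes), then fall back to VAD. Every callable
# argument is a stand-in for a component described in the text.
L_SECONDS = 0.5   # second switching condition (example value from the text)
N_SLIDES = 40     # total sliding times N per sliding-window period (assumed)

def endpoint_detection_loop(vad_intercept, recognize, execute,
                            next_window, is_control_instruction):
    while True:
        # S503/S504: VAD interception; returns (None, 0.0) when no sound
        segment, duration = vad_intercept()
        if segment is None:
            continue
        text = recognize(segment)                 # S505: first recognition
        if is_control_instruction(text):          # S506: first condition
            execute(text)                         # S507: run the instruction
        elif duration <= L_SECONDS:               # S508: second condition fails
            continue                              # stay with VAD (back to S503)
        for _ in range(N_SLIDES):                 # S509-S512: sliding window period
            cmd = recognize(next_window())        # S510/S511: slide and recognize
            if is_control_instruction(cmd):
                execute(cmd)                      # pause sliding, run the command
        # after N slides, fall back to the VAD loop (back to S503)
```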
  • FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of this application. As shown in FIG. 6, the device 600 may include a processor 601 and a memory 602.
  • The memory 602 is used to store program code. The processor 601 calls the program code, and when the program code is executed, it is used to perform the following operations: intercepting an audio segment from the audio signal according to the audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following operations: moving the sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • the audio processing device provided in this embodiment can be used to implement the technical solutions of the foregoing method embodiments, and its implementation principles and technical effects are similar to those of the method embodiments, and will not be repeated here.
  • A person of ordinary skill in the art can understand that all or part of the steps in the foregoing method embodiments can be implemented by a program instructing relevant hardware. The aforementioned program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to an audio processing method and device. The method comprises: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, according to the audio segment, whether to perform a window recognition operation, the window recognition operation comprising the following operations: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window. Since the voice activity detection method has the advantage of lower computing power consumption and the sliding window method has the advantage of strong anti-interference capability, the present invention automatically switches between the voice activity method and the sliding window method according to the application scenario for audio processing, which makes it possible to combine the advantages of the voice activity method and the sliding window method, to save computing power, and to improve audio processing accuracy.
PCT/CN2020/073292 2020-01-20 2020-01-20 Audio processing method and device WO2021146857A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/073292 WO2021146857A1 (fr) 2020-01-20 2020-01-20 Audio processing method and device
CN202080004356.6A CN112543972A (zh) 2020-01-20 2020-01-20 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073292 WO2021146857A1 (fr) 2020-01-20 2020-01-20 Audio processing method and device

Publications (1)

Publication Number Publication Date
WO2021146857A1 true WO2021146857A1 (fr) 2021-07-29

Family

ID=75017359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/073292 WO2021146857A1 (fr) 2020-01-20 2020-01-20 Audio processing method and device

Country Status (2)

Country Link
CN (1) CN112543972A (fr)
WO (1) WO2021146857A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238607B (zh) * 2021-12-17 2022-11-22 北京斗米优聘科技发展有限公司 Deep interactive AI intelligent job-hunting consultant method, system, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090222264A1 (en) * 2008-02-29 2009-09-03 Broadcom Corporation Sub-band codec with native voice activity detection
CN107305774A (zh) * 2016-04-22 2017-10-31 腾讯科技(深圳)有限公司 Voice detection method and device
CN109545188A (zh) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 Real-time voice endpoint detection method and device
CN110277087A (zh) * 2019-07-03 2019-09-24 四川大学 Broadcast signal pre-judgment and preprocessing method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8650029B2 (en) * 2011-02-25 2014-02-11 Microsoft Corporation Leveraging speech recognizer feedback for voice activity detection
CN106601230B (zh) * 2016-12-19 2020-06-02 苏州金峰物联网技术有限公司 Logistics-sorting place-name speech recognition method and system based on continuous Gaussian mixture HMM models, and logistics sorting system
US10431242B1 (en) * 2017-11-02 2019-10-01 Gopro, Inc. Systems and methods for identifying speech based on spectral features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090222264A1 (en) * 2008-02-29 2009-09-03 Broadcom Corporation Sub-band codec with native voice activity detection
CN107305774A (zh) * 2016-04-22 2017-10-31 腾讯科技(深圳)有限公司 Voice detection method and device
CN109545188A (zh) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 Real-time voice endpoint detection method and device
CN110277087A (zh) * 2019-07-03 2019-09-24 四川大学 Broadcast signal pre-judgment and preprocessing method

Also Published As

Publication number Publication date
CN112543972A (zh) 2021-03-23

Similar Documents

Publication Publication Date Title
EP3577645B1 (fr) End-of-query detection
Li et al. Robust endpoint detection and energy normalization for real-time speech and speaker recognition
US6615170B1 (en) Model-based voice activity detection system and method using a log-likelihood ratio and pitch
EP2898510B1 (fr) Method, system, and computer program for adaptive gain control applied to an audio signal
US11978478B2 (en) Direction based end-pointing for speech recognition
JP2016505888A (ja) Speech recognition power management
JPH11153999A Speech recognition device and information processing device using the same
US11341988B1 (en) Hybrid learning-based and statistical processing techniques for voice activity detection
US11676594B2 (en) Decaying automated speech recognition processing results
US12073826B2 (en) Freeze words
WO2021016925A1 (fr) Audio processing method and apparatus
US11348579B1 (en) Volume initiated communications
WO2021146857A1 (fr) Audio processing method and device
CN112435691B (zh) Online voice endpoint detection post-processing method, apparatus, device, and storage medium
US20240079004A1 (en) System and method for receiving a voice command
US12080276B2 (en) Adapting automated speech recognition parameters based on hotword properties
KR20120111510A (ko) Robot control system through interactive voice recognition
Abdulkader et al. Multiple-instance, cascaded classification for keyword spotting in narrow-band audio
US11081114B2 (en) Control method, voice interaction apparatus, voice recognition server, non-transitory storage medium, and control system
KR100281582B1 (ko) Voice recognition method using recognizer resources efficiently
CN118841010A (en) Audio processing method, device, electronic equipment and storage medium
CN111768800A (zh) Voice signal processing method, device, and storage medium
CN118235199A (zh) Speech recognition method, speech recognition device, and system
Li et al. Robust Endpoint Detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20916099

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20916099

Country of ref document: EP

Kind code of ref document: A1