WO2021146857A1 - Audio processing method and device

Audio processing method and device

Info

Publication number
WO2021146857A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
window
audio signal
sampling window
voice information
Prior art date
Application number
PCT/CN2020/073292
Other languages
French (fr)
Chinese (zh)
Inventor
吴俊峰 (Wu Junfeng)
罗东阳 (Luo Dongyang)
童焦龙 (Tong Jiaolong)
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to CN202080004356.6A (CN112543972A)
Priority to PCT/CN2020/073292 (WO2021146857A1)
Publication of WO2021146857A1

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 - Speech recognition
                    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/04 - Segmentation; Word boundary detection
                        • G10L 15/05 - Word boundary detection
                    • G10L 15/08 - Speech classification or search
                        • G10L 15/14 - using statistical models, e.g. Hidden Markov Models [HMMs]
                            • G10L 15/142 - Hidden Markov Models [HMMs]
                                • G10L 15/144 - Training of HMMs
                    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
                    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/223 - Execution procedure of a spoken command
                • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/03 - characterised by the type of extracted parameters
                        • G10L 25/24 - the extracted parameters being the cepstrum

Definitions

  • This application relates to the field of audio technology, and in particular to an audio processing method and device.
  • Voice interaction is a common way of human-computer interaction.
  • Before using voice recognition for human-computer interaction, it is necessary to perform voice endpoint detection first, that is, to detect which parts of the sound recorded by the microphone may contain voice.
  • The anti-interference ability and computing power consumption of different voice endpoint detection methods vary from one to another, and how to balance anti-interference ability against computing power consumption has become an urgent problem in voice endpoint detection.
  • The embodiments of the present application provide an audio processing method and device to solve the technical problem in the prior art that anti-interference ability and computing power consumption cannot both be taken into account during voice endpoint detection, resulting in poor anti-interference ability or high computing power consumption.
  • In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, based on the audio segment, whether to perform a window recognition operation. The window recognition operation includes: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory. The memory is used to store program code; the processor calls the program code, and when the program code is executed, it performs the following operations: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, based on the audio segment, whether to perform a window recognition operation, the window recognition operation including: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • In a third aspect, an embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, the computer program includes at least one piece of code, and the at least one piece of code can be executed by a computer to control the computer to execute the audio processing method according to any one of the above-mentioned first aspects.
  • In a fourth aspect, an embodiment of the present application provides a computer program that, when executed by a computer, implements the audio processing method according to any one of the above-mentioned first aspects.
  • In summary, the embodiments of the present application provide an audio processing method and device that select and switch the voice endpoint detection method according to the characteristics of the specific application scenario during voice endpoint detection, so that both anti-interference ability and computing power consumption can be taken into account.
  • FIG. 1A is a schematic diagram 1 of an application scenario of an audio processing method provided by an embodiment of this application;
  • FIG. 1B is a second schematic diagram of an application scenario of the audio processing method provided by an embodiment of the application.
  • FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the application.
  • FIGS. 3A-3C are schematic diagrams of excluding the noise of an audio segment, provided by an embodiment of the application;
  • FIG. 4 is a schematic flowchart of an audio processing method provided by another embodiment of this application.
  • FIG. 5 is a schematic flowchart of a voice activity detection method provided by an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of the application.
  • the audio processing method provided in the embodiments of the present application can be applied to any audio processing process that requires voice endpoint detection, and the audio processing method can be specifically executed by an audio processing device.
  • the audio processing device may be a device including an audio collection module (for example, a microphone).
  • A schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1A.
  • The audio collection module of the audio processing device can collect the voice spoken by the user to obtain audio signals, and the processor of the audio processing device can process the audio signals collected by the audio collection module using the audio processing method provided in the embodiments of the present application.
  • FIG. 1A is only a schematic diagram, and does not limit the structure of the audio processing device.
  • an amplifier may be connected between the microphone and the processor to amplify the audio signal collected by the microphone.
  • a filter may be connected between the microphone and the processor to filter the audio signal collected by the microphone.
  • the audio processing device may also be a device that does not include an audio collection module.
  • a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1B.
  • the communication interface of the audio processing device may receive audio signals collected by other devices or equipment, and the processor of the audio processing device may process the received audio signals using the audio processing method provided in the embodiments of the present application.
  • FIG. 1B is only a schematic diagram, and does not limit the structure of the audio processing device and the connection mode between the audio processing device and other devices or equipment.
  • the communication interface in the audio processing device can be replaced with a transceiver.
  • the type of equipment including the audio processing device may not be limited in the embodiments of the present application.
  • the equipment may be, for example, smart speakers, smart lighting devices, smart robots, mobile phones, tablet computers, and the like.
  • The audio processing method provided by the embodiments of the present application can automatically switch between the voice activity detection method and the sliding window detection method according to the characteristics of the specific application scenario, so that the anti-interference ability of voice endpoint detection can be improved as a whole and its computing power consumption can be reduced. That is, the anti-interference ability of voice endpoint detection can be stronger than when purely using the voice activity detection method, while its computing power consumption can be less than when purely using the sliding window detection method.
  • The audio processing method provided by the embodiments of this application can be applied to the field of voice recognition, smart hardware devices with voice recognition functions, the field of audio event detection, smart hardware devices with audio event detection functions, and so on, which is not limited in this application.
  • the execution subject of this embodiment may be an audio processing device, and specifically may be a processor of the audio processing device. As shown in FIG. 2, the method of this embodiment may include step S201 and step S202, for example.
  • In step S201, an audio segment is intercepted from the audio signal according to the audio feature information of the audio signal.
  • In step S201, the audio segment is intercepted based on audio feature information. Since the intercepted endpoints may be inaccurate, the intercepted audio segment usually contains noise; if such an audio segment is used directly for speech recognition, the accuracy is not high.
  • the audio features may include, for example, Mel Frequency Cepstrum Coefficient (MFCC) features, linear prediction coefficients (Linear Prediction Coefficients, LPC) features, and filter bank (Filter bank, Fbank) features.
  • a voice endpoint detection method with lower computing power consumption may be used to process the audio signal.
  • The voice activity detection method recognizes whether the audio signal is a voice signal through methods such as an energy and zero-crossing-rate double threshold or a noise/speech classification model, and a segment is intercepted for speech recognition only when voice is determined to be present. The voice activity detection method can therefore effectively distinguish silence from non-silence, and saves computing power when used for speech recognition.
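As an illustrative sketch (not part of the patent), the energy and zero-crossing-rate double threshold mentioned above can be approximated as follows; the frame length, hop, and both thresholds are assumed placeholder values:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (assumes len(x) >= frame_len)."""
    n = (len(x) - frame_len) // hop + 1
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def vad_double_threshold(x, frame_len=400, hop=160,
                         energy_thresh=0.01, zcr_thresh=0.3):
    """Mark each frame as speech (True) or non-speech (False).

    A frame counts as speech when its short-time energy exceeds
    energy_thresh, or when its energy is moderate (above a quarter of
    the threshold) but its zero-crossing rate is high, which helps keep
    unvoiced consonants. All values here are illustrative assumptions.
    """
    frames = frame_signal(x, frame_len, hop)
    energy = np.mean(frames ** 2, axis=1)                                # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    return (energy > energy_thresh) | ((energy > energy_thresh / 4) & (zcr > zcr_thresh))
```

On a signal consisting of silence followed by a tone, for example, the silent frames come back False and the tonal frames True, which is the silent/non-silent distinction the text describes.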
  • a voice activity detection method may be used to intercept audio segments in the audio signal according to the audio feature information of the audio signal.
  • However, the voice activity detection method cannot effectively distinguish speech from voice-type noise or from non-speech noise with slightly higher energy.
  • Therefore, if the voice activity detection method is used for command-word speech recognition, noise is easily mixed in before and after the command speech segment, which reduces the voice command recognition rate. The sliding window detection method, by contrast, extracts speech segments of a fixed length from the audio stream and performs speech recognition continuously. It can effectively prevent noise from being mixed in before and after the command-word speech segment, but because it performs speech recognition intensively, keeping it running on a smart hardware device consumes significant computing resources and power.
  • Accordingly, the voice activity detection method can be used for audio processing in application scenarios with low anti-interference requirements to reduce computing power consumption, while the sliding window detection method can be used in application scenarios with high anti-interference requirements to improve the anti-interference ability.
  • In step S202, it is determined, based on the audio segment obtained in step S201, whether to perform a window recognition operation.
  • the window recognition operation may include, for example, the following operations: moving a sampling window in the audio signal after the audio segment obtained in step S201, and performing voice recognition on the audio signal in the sampling window.
  • As shown in FIGS. 3A to 3C, the sliding window detection method can exclude, in one or more windows, the noise included in the audio segment cut out by voice activity detection: the noise in audio segment X1 (FIG. 3A), in audio segment X2 (FIG. 3B), and in audio segment X3 (FIG. 3C) can each be excluded. It should be noted that the parts filled with grid lines in FIGS. 3A to 3C represent noise.
  • In step S202, it can be determined whether the audio segment obtained in step S201 contains, or may contain, voice-type noise or non-speech-type noise with slightly higher energy, and then whether to perform a window recognition operation. If it is determined that the audio segment obtained in step S201 contains or may contain such noise, the window recognition operation is performed; otherwise, steps S201 and S202 continue to be performed.
  • For command-type voice information, noise is easily mixed in before and after the command, which easily leads to inaccurate speech segment interception and/or inaccurate speech recognition. Therefore, if command-type voice information is detected in the audio segment obtained in step S201, a window recognition operation is performed to eliminate the influence of noise.
  • Likewise, if non-speech-type noise with slightly higher energy is detected, a window recognition operation is performed to avoid the inaccurate audio segment interception and/or inaccurate speech recognition caused by the inability of the voice activity detection method to effectively distinguish this type of information.
  • In this way, when processing audio signals, the voice endpoint detection method can be selected according to the characteristics of the specific application scenario. In application scenarios with low anti-interference requirements, the voice activity detection method can be used to process the audio signal to save computing power; in application scenarios with high anti-interference requirements, the sliding window detection method can be used to improve the anti-interference ability. This combines the high anti-interference ability of the sliding window detection method with the low computing power consumption of the voice activity detection method, so that the anti-interference ability of voice endpoint detection is generally stronger than when purely using the voice activity detection method, while its computing power consumption is generally less than when purely using the sliding window detection method.
  • In step S202, it is possible to determine, in various ways, whether there is instruction-type voice information in the audio segment obtained in step S201.
  • For example, voice information may be extracted from the audio segment obtained in step S201, and the content and/or voice length of the extracted voice information may be used to determine whether instruction information exists or may exist, and then whether to perform the window recognition operation.
  • the method may further include, for example, extracting voice information from an audio segment.
  • For example, step S202 may include judging, based on the extracted voice information, whether to perform the window recognition operation.
  • The specific condition may be, for example, whether the voice information contains instruction-type information.
  • it may be determined whether to perform a window recognition operation based on the content and the length of the extracted voice information.
  • the sliding window detection method is used for voice endpoint detection to prevent inaccurate speech fragment interception and/or inaccurate speech recognition caused by mixing too much noise before and after the instruction.
  • the voice length corresponding to the voice information containing the instruction word is also limited. In this way, it is possible to roughly estimate whether or not there is an instruction word in the voice information according to the voice length of the extracted voice information, and then it is possible to determine whether to perform a window recognition operation.
  • In this way, for the audio segment obtained by voice endpoint detection, the voice information contained therein can be extracted first, and then the content and/or voice length of the voice information can be used to determine or estimate whether there is instruction information in the audio segment; the result of this determination or estimation decides whether to perform the window recognition operation, thereby avoiding the inaccuracy caused by excessive noise during voice segment interception and voice recognition of command-type voice information.
  • Optionally, the voice information may be extracted from the audio segment obtained in step S201, the presence or possible presence of instruction information can be determined according to the content of the extracted voice information and/or the duration of the audio segment, and then it can be determined whether to perform the window recognition operation.
  • the method may further include, for example, extracting voice information from the audio segment obtained in step S201.
  • For example, step S202 may include the following operations: it is first judged whether the extracted voice information matches the first voice information; if it does not match, it is then determined whether the duration of the audio segment is greater than or equal to the first duration threshold; and if the duration is less than the first duration threshold, steps S201 and S202 are executed for the audio signal after the audio segment.
  • Alternatively, step S202 may include the following operations: it is first determined whether the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold; if it is less than the first duration threshold, it is then determined whether the extracted voice information matches the first voice information; and if it does not match, steps S201 and S202 are executed for the audio signal after the audio segment.
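Both orderings above reach the same decision. A minimal Python sketch of this step-S202 logic, with a hypothetical command vocabulary standing in for the first voice information and an illustrative first duration threshold, might look like:

```python
# Hypothetical values: neither the vocabulary nor the threshold comes from the patent.
FIRST_VOICE_INFO = {"take off", "land", "turn left", "turn right"}
FIRST_DURATION_THRESHOLD = 2.0  # seconds

def should_run_window_recognition(extracted_text: str, segment_duration: float) -> bool:
    """Decide whether to switch to the window recognition operation."""
    # Match against the first voice information (instruction vocabulary).
    if extracted_text.strip().lower() in FIRST_VOICE_INFO:
        return True
    # A long segment may contain an instruction word plus surrounding noise.
    if segment_duration >= FIRST_DURATION_THRESHOLD:
        return True
    # Otherwise keep using voice activity detection (steps S201/S202).
    return False
```

Checking the cheap string match and the duration in either order gives the same result, which is why the patent text presents the two sequences as alternatives.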
  • the first voice information may be set according to the vocabulary of the characterizing instruction (such as the first control instruction).
  • the first voice information may be set according to a vocabulary that characterizes all control instructions applied to the smart hardware device.
  • the voice length corresponding to the voice information containing the instruction word is also limited, and the duration of the audio segment containing the instruction word is also limited. Therefore, it is possible to roughly estimate whether or not there is an instruction word in the audio segment according to the duration of the audio segment obtained in step S201, and then it can be determined whether to perform a window recognition operation.
  • the first duration threshold may be set according to the maximum duration, average duration, or any duration of the audio segment containing the instruction (such as the first control instruction).
  • In this way, for the audio segment obtained by voice endpoint detection, the voice information contained therein can be extracted, and based on the content of the voice information and the duration of the audio segment, it can be determined or estimated whether there is instruction information in the audio segment; the estimation result then determines whether to perform the window recognition operation. This likewise avoids the inaccuracy caused by excessive noise when performing voice segment interception and voice recognition on command-type voice information.
  • Optionally, the method may further include performing a window recognition operation when either of the following events occurs: the voice information extracted from the audio segment obtained in step S201 matches the first voice information (referred to as event 1); and/or the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold (referred to as event 2).
  • the setting of the first voice information and the first duration threshold is the same or similar to the setting method in the foregoing embodiment, which is not repeated here in the embodiment of the present application.
  • The voice information may be extracted from the audio segment obtained in step S201 and then matched against the first voice information, so as to determine whether the extracted voice information matches the first voice information.
  • The duration of the audio segment obtained in step S201 may be determined first, and then compared against the first duration threshold to check whether it is greater than or equal to that threshold.
  • In this way, one or more of event 1 and event 2 can be used to determine or estimate that the audio segment obtained in step S201 contains or may contain instruction information, so that the window recognition operation can be performed to intercept the voice segment. The noise before and after the instruction is thereby excluded, making the interception of the speech segment more accurate and the resulting speech recognition result more accurate as well.
  • In step S202, it is also possible to detect, in a variety of ways, whether there is non-voice information with slightly higher energy in the audio segment obtained in step S201.
  • For example, step S202 may include judging whether a preset audio event has occurred based on the audio segment obtained in step S201, so as to determine whether there is non-voice information with higher energy in the audio segment.
  • If the preset audio event occurs, the window recognition operation can be performed so that the noise is excluded when the audio segment is intercepted, making the interception result more accurate and, in turn, the resulting audio processing result more accurate.
  • If the preset audio event does not occur, it is considered that there is no higher-energy non-voice information in the audio segment, and the window recognition operation need not be performed.
  • the foregoing preset audio event may include, for example, one or more of the following sounds in the audio segment: percussion, applause, and clapping.
  • The above-mentioned preset audio events can be set according to non-voice information with slightly higher energy, such as the percussive, applause, and clapping sounds that often appear in audio segments.
  • Performing voice endpoint detection in this way can prevent excessive noise from being mixed in before and after the instruction, which would otherwise lead to inaccurate speech segment interception and/or inaccurate speech recognition.
  • In addition, during actual use the user may not speak continuously; after issuing an instruction, the user may not issue any further instructions for a long period of time.
  • For example, the user says "please go to sleep" to a smart hardware device with a voice recognition function (such as a robot), or says "please fly around A twice" to a smart hardware device with a voice recognition function (such as a drone).
  • In such cases, the user generally will not issue other control commands for a long period of time after the command is issued. Therefore, if these special command-type voice messages are detected, switching to the sliding window detection method for voice endpoint detection would not significantly improve the anti-interference ability but would increase the computing power consumption. Accordingly, in the embodiments of the present application, when these special circumstances occur, the switch to the sliding window detection method may be skipped, thereby avoiding the drawback that the switch brings no significant improvement in anti-interference ability while increasing computing power consumption.
  • Therefore, in the process of judging whether to perform the window recognition operation based on the audio segment obtained in step S201, it is also possible to detect whether specific voice information appears. If specific voice information is detected, steps S201 and S202 are executed for the audio signal after the audio segment obtained in step S201, without performing the window recognition operation.
  • For example, step S202 may include extracting voice information from the audio segment obtained in step S201 and determining whether the extracted voice information matches the second voice information. If it matches, steps S201 and S202 are executed for the audio signal after the audio segment obtained in step S201.
  • the above-mentioned second voice information may be set according to a vocabulary characterizing the second control instruction, for example.
  • the second control instruction may have any of the following features: the processor suspends execution of other control instructions during the execution of the second control instruction (feature 1 for short); the second control instruction represents the pause of voice interaction with the user (feature 2 for short).
  • the second control command may be, for example, "please fly around A twice" in the above example.
  • The second control command may be, for example, "please go to sleep" in the above example.
  • the second control instruction may be any control instruction having the above-mentioned feature 1 and/or feature 2, which is not limited in the embodiment of the present application.
  • In addition, after switching to the sliding window detection method, that is, after performing the window recognition operation, if the specific application scenario is found to change again, it is also possible to switch back from the sliding window detection method to the previously used voice endpoint detection method, so as to save as much computing power as possible.
  • For example, during the window recognition operation, that is, while moving the sampling window in the audio signal after the audio segment obtained in step S201 and performing voice recognition on the audio signal in the sampling window, steps S201 and S202 may be executed for the audio signal after the current position of the sampling window in response to a switching trigger event.
  • The aforementioned switching trigger event may include, for example, any one or more of the following: voice information matching the third voice information is detected while performing voice recognition on the audio signal in the sampling window; the number of movements of the sampling window reaches the maximum number of movements; or the processing duration of moving the sampling window and performing voice recognition on the audio signal in the sampling window reaches the second duration threshold.
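A hedged sketch of checking these switching trigger events; the default phrase, move limit, and duration threshold are illustrative assumptions, not values from the patent:

```python
def should_switch_back(recognized_text: str, moves_done: int, elapsed_seconds: float,
                       third_voice_info=frozenset({"please go to sleep"}),
                       max_moves: int = 20,
                       second_duration_threshold: float = 10.0) -> bool:
    """Return True when any switching trigger event has occurred, i.e. when the
    window recognition operation should stop and steps S201/S202 should resume
    for the audio signal after the sampling window's current position."""
    if recognized_text.strip().lower() in third_voice_info:
        return True  # matched the third voice information
    if moves_done >= max_moves:
        return True  # maximum number of sampling-window movements reached
    if elapsed_seconds >= second_duration_threshold:
        return True  # processing duration reached the second duration threshold
    return False
```

The caller would evaluate this after each window move; any single trigger suffices, matching the "any one or more of the following" phrasing above.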
  • the third voice information may be set according to a vocabulary characterizing the second control instruction, for example.
  • the second control instruction has any of the following characteristics: the processor suspends execution of other control instructions during the execution of the second control instruction; the second control instruction represents the suspension of voice interaction with the user.
  • the foregoing third voice information may be the same as or similar to the second voice information in the foregoing embodiment, and details are not described herein again in this embodiment of the present application.
  • the second control instruction in the embodiment of the present application is the same as the second control instruction in the foregoing embodiment, and the description of the embodiment of the present application is not repeated here.
  • the window recognition operation not only affects the overall anti-interference ability, but also affects the overall computing power consumption.
  • Windows with different parameters differ in their noise-elimination ability and in the computing power they consume. Therefore, in the process of performing the window recognition operation, the relevant parameters of the window can be dynamically adjusted according to the voice recognition result, so as to best balance anti-interference ability and computing power consumption.
  • Specifically, sliding window detection refers to setting a sliding window with a fixed window length of W milliseconds and, each time, taking only the audio signal within the sliding window for recognition; the sliding window gradually slides backward from the starting point of a certain segment of the audio signal in steps of S milliseconds.
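The sliding window just described can be sketched as a simple generator; the sample-rate argument and the millisecond-to-sample conversion are assumptions for illustration:

```python
import numpy as np

def sliding_windows(x, sample_rate, window_ms, step_ms):
    """Yield successive sampling-window slices of the audio signal x.

    window_ms is the fixed window length W in milliseconds and step_ms is
    the sliding step S; the window slides backward from the starting point
    of the signal until a full window no longer fits.
    """
    w = int(sample_rate * window_ms / 1000)  # window length in samples
    s = int(sample_rate * step_ms / 1000)    # step size in samples
    for start in range(0, len(x) - w + 1, s):
        yield x[start:start + w]
```

Each yielded slice would then be passed to the speech recognizer, with a total sliding count N capping how many slices are examined.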
  • The maximum number of movements of the sampling window corresponds to the total number of sliding times N, and the movement step size corresponds to the step size S.
  • dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window may include, for example, dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the number of instructions detected in the recognition result.
  • dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  • the original total number of sliding operations N can be increased to 2N or another value, so as to eliminate noise by performing more sliding window detections, thereby improving the anti-interference ability.
  • dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a second preset value, adjusting the movement step size of the sampling window from a first step size to a second step size smaller than the first step size.
  • the second preset value may be the same as or different from the first preset value in the foregoing embodiment, which is not limited in the embodiment of the present application.
  • the original sliding step size of S milliseconds can be reduced to S/2 milliseconds or another step size, so as to eliminate noise by performing sliding window detection with a smaller sliding step, thereby improving the anti-interference ability.
  • dynamically adjusting the movement step size of the sampling window may further include: in response to the number of detected instructions being lower than a third preset value (such as M3, where M3 < N), adjusting the movement step size of the sampling window from the first step size to a third step size larger than the first step size.
  • the third preset value may be smaller than the second preset value in the foregoing embodiment, for example.
  • if the number of detected instructions is lower than a certain value (such as M3, where M3 < N), it means that although the user exhibits speech behavior during this period, it may not be continuous speech behavior, or, even if it is continuous speech behavior, it is not behavior that continuously issues instructions.
  • the original sliding step size of S milliseconds can be increased to 2S milliseconds or another step size, so that noise can be eliminated by performing sliding window detection with a larger sliding step while computing power is saved, so as to balance anti-interference ability and computing power consumption as far as possible during the window recognition operation.
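The three adjustment rules above (N doubled, S halved, S doubled) can be sketched as one small function. The preset values and the function name are hypothetical stand-ins, not values from the patent.

```python
# Hedged sketch of the dynamic parameter adjustment described above.
# first/second/third preset values are arbitrary illustration choices.

def adjust_window_params(n_detected, max_moves, step_ms,
                         first_preset=3, second_preset=5, third_preset=1):
    # Detected instructions exceed the first preset value:
    # raise the maximum number of window movements (N -> 2N).
    if n_detected > first_preset:
        max_moves *= 2
    # Exceed the second preset value: shrink the step (S -> S/2)
    # for finer-grained, more noise-resistant detection.
    if n_detected > second_preset:
        step_ms //= 2
    # Below the third preset value: enlarge the step (S -> 2S)
    # to save computing power when instructions are rare.
    elif n_detected < third_preset:
        step_ms *= 2
    return max_moves, step_ms

print(adjust_window_params(6, 10, 200))  # (20, 100): more slides, smaller step
print(adjust_window_params(0, 10, 200))  # (10, 400): larger step saves compute
```

The design point mirrors the text: frequent instructions justify spending more computing power on anti-interference, while sparse instructions justify coarser, cheaper detection.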
  • step S201 may include the following steps S401 to S404, for example.
  • in step S401, the audio signal is first divided into frames, and then the short-term average energy of each frame is calculated frame by frame to obtain the short-term energy envelope.
  • the audio signal can be divided into frames of a fixed duration. Taking a fixed duration of 1 second as an example, the 0th to 1st second of the audio signal forms one frame, the 1st to 2nd second forms one frame, the 2nd to 3rd second forms one frame, and so on, thus completing the framing of the audio signal.
  • the short-term average energy of a frame is the average energy of the frame, and the short-term energy envelope can be obtained by connecting the short-term average energies of successive frames.
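The framing and short-term energy computation of step S401 can be sketched as follows. The frame length is expressed in samples and the signal values are made up for illustration; only the framing-plus-average-energy idea comes from the text.

```python
# Minimal sketch of step S401: split the signal into fixed-length frames
# and compute each frame's average energy (the short-term energy envelope).

def short_term_energy_envelope(samples, frame_len):
    """Return the average energy of each fixed-length frame; connecting
    these values frame by frame gives the short-term energy envelope."""
    envelope = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        envelope.append(sum(x * x for x in frame) / frame_len)
    return envelope

quiet, loud = [0.01] * 100, [0.5] * 100
env = short_term_energy_envelope(quiet + loud + quiet, frame_len=100)
print(env)  # low value, high value, low value
```

The middle frame's energy clearly dominates, which is what the thresholds T1 and T2 in the following steps exploit.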
  • in step S402, a higher threshold T1 is selected, and the first and last intersection points of T1 with the short-term energy envelope are marked as C and D.
  • the part higher than the threshold T1 is considered to have a higher probability of being speech.
  • in step S403, a lower threshold T2 is selected; its intersection point B with the short-term energy envelope lies to the left of C, and its intersection point E lies to the right of D.
  • T2 is less than T1.
  • in step S404, the short-term average zero-crossing rate of each frame is calculated and a threshold T3 is selected; the intersection point A of T3 with the short-term average zero-crossing-rate curve lies to the left of B, and the intersection point F lies to the right of E.
  • the short-term average zero-crossing rate of a frame is the average zero-crossing rate of the frame.
  • intersection points A and F are the two endpoints of the audio signal determined based on the voice activity detection method
  • the audio segments from the intersection point A to the intersection point F are the audio segments intercepted from the audio signal based on the voice activity detection method.
  • steps S401 to S404 take intercepting only one audio segment from a piece of audio signal as an example; it is understandable that multiple audio segments can also be intercepted from one piece of audio signal, which is not limited in this embodiment of the present application.
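Steps S402 to S404 can be sketched over per-frame features, with frame indices standing in for the intersection points A through F. The feature values and thresholds below are made-up illustrations, not values from the patent.

```python
# Hedged sketch of the double-threshold endpoint search (steps S402-S404):
# find the frames above the higher energy threshold T1, expand outward
# while energy stays above the lower threshold T2, then expand further
# while the zero-crossing rate stays above T3.

def endpoints(energy, zcr, t1, t2, t3):
    """Return (A, F) frame indices, or None if no frame exceeds T1."""
    # S402: first and last frames whose energy exceeds T1 (points C, D)
    above = [i for i, e in enumerate(energy) if e > t1]
    if not above:
        return None
    c, d = above[0], above[-1]
    # S403: expand outward while energy still exceeds T2 (points B, E)
    b, e = c, d
    while b > 0 and energy[b - 1] > t2:
        b -= 1
    while e < len(energy) - 1 and energy[e + 1] > t2:
        e += 1
    # S404: expand further while the zero-crossing rate exceeds T3 (A, F)
    a, f = b, e
    while a > 0 and zcr[a - 1] > t3:
        a -= 1
    while f < len(zcr) - 1 and zcr[f + 1] > t3:
        f += 1
    return a, f

energy = [0.1, 0.3, 2.0, 5.0, 5.0, 2.0, 0.3, 0.1]
zcr    = [0.1, 0.6, 0.7, 0.5, 0.5, 0.7, 0.6, 0.1]
print(endpoints(energy, zcr, t1=4.0, t2=1.0, t3=0.4))  # (1, 6)
```

Frames 1 through 6 would then be the audio segment intercepted by the voice activity detection method, mirroring the A-to-F interval in the text.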
  • when performing speech recognition, a speech recognition model may be used.
  • the speech recognition model can be, for example, a GMM-HMM (Gaussian Mixture Model-Hidden Markov Model) model, and the training process is as follows: use a number of command speech samples and negative samples as training data; perform MFCC feature extraction on the training speech; and then train the GMM-HMM model.
  • HMM parameter estimation can, for example, use the Baum-Welch algorithm, with several states set for each instruction-word category and for the noise category, together with the number of Gaussian components.
  • the speech recognition process can be: extracting the MFCC (Mel Frequency Cepstral Coefficient) features of the detected sound segment, performing Viterbi decoding with the pre-trained GMM-HMM model, and outputting the recognition result.
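The Viterbi decoding step mentioned above can be illustrated with a minimal decoder. The patent applies Viterbi to a GMM-HMM over MFCC features; here, discrete observation log-probabilities stand in for GMM likelihoods, and the states, observations, and probabilities below are entirely made up for illustration.

```python
# Minimal Viterbi decoder sketch (log domain to avoid underflow).
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    v = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t
            prev = max(states, key=lambda p: v[t - 1][p] + log_trans[p][s])
            v[t][s] = v[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

log = math.log
states = ["noise", "cmd"]
log_start = {"noise": log(0.8), "cmd": log(0.2)}
log_trans = {"noise": {"noise": log(0.7), "cmd": log(0.3)},
             "cmd":   {"noise": log(0.3), "cmd": log(0.7)}}
log_emit = {"noise": {"quiet": log(0.9), "loud": log(0.1)},
            "cmd":   {"quiet": log(0.2), "loud": log(0.8)}}
print(viterbi(["quiet", "loud", "loud"], states, log_start, log_trans, log_emit))
# ['noise', 'cmd', 'cmd']
```

In an actual GMM-HMM system, `log_emit` would be replaced by per-state GMM log-likelihoods of the MFCC frames, but the dynamic-programming recursion is the same.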
  • when the smart hardware device has just been turned on, for example, voice endpoint detection can be performed by the voice activity detection method by default; when certain conditions are met (that is, when the application scenario changes), the device switches to the sliding window detection method for a period of time and then returns to the voice activity detection method. The specific switching process is shown in FIG. 5 and may include, for example, the following steps.
  • Step S501 power on.
  • the intelligent hardware device activates the voice endpoint detection function, and performs voice endpoint detection through the voice activity detection method by default.
  • Step S502 Acquire an audio stream in real time.
  • step S503 the audio segment in the audio stream is intercepted by the voice activity detection method.
  • Step S504 Detect whether there is sound in the intercepted audio clip. If it exists, go to step S505; otherwise, go to step S503.
  • Step S505 Perform the first voice recognition to obtain voice information.
  • Step S506 Determine whether the voice information is a control instruction (first switching condition). If yes, go to step S507; otherwise, go to step S508.
  • Step S507 execute the instruction, and execute step S509 after the instruction is executed.
  • that is, if the result of the first speech recognition is a control command, human-computer interaction is performed; voice endpoint detection is then performed by the sliding window detection method for a period of t minutes (the total number of sliding operations in this period is N), and voice recognition is performed on the audio clip in each sliding window.
  • Step S508 Determine whether the voice length of the voice information is longer than L seconds (the second switching condition). If yes, go to step S509; otherwise, go to step S503.
  • it is determined whether the duration of the speech segment intercepted by the voice activity detection method exceeds L seconds (for example, 0.5 s). If the duration of the speech segment is less than L seconds, voice endpoint detection continues to be performed by the voice activity detection method, that is, the flow returns to step S503. If the duration of the speech segment is greater than or equal to L seconds, the user is considered likely to be speaking continuously at this time; therefore, voice endpoint detection is performed by the sliding window detection method for t minutes, voice recognition is performed on the audio segment in each sliding window during this period, and the flow proceeds as in steps S510 to S512.
  • Step S511, perform voice recognition.
  • Step S512: judge whether n < N. If yes, go to step S510; otherwise, go to step S503.
  • steps S510 to S512 describe the operating logic and exit logic during the sliding window detection period. If the voice recognition module recognizes a control command during sliding window detection and recognition, sliding window detection is suspended and the command action is executed instead. After execution of the command action ends, sliding window detection continues until the N sliding window detections for the period (i.e., t minutes) have all been executed. After the N-th sliding window detection ends, the default voice activity detection method is restored for voice endpoint detection, that is, the flow returns to step S503.
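The mode-switching logic of FIG. 5 can be sketched as a small state-transition function. The function name and signature are hypothetical; `is_command` corresponds to the first switching condition (step S506), `duration >= L` to the second switching condition (step S508), and the n < N check to step S512.

```python
# Hedged sketch of the switching between voice activity detection ("vad")
# and sliding window detection ("window") described in steps S503-S512.

def next_mode(mode, n, N, is_command, duration, L):
    """Return (mode, n) after one detection-and-recognition pass."""
    if mode == "vad":
        if is_command or duration >= L:
            return "window", 0        # enter the sliding-window period
        return "vad", 0               # stay with voice activity detection
    # Sliding-window mode: step S512 keeps sliding while n < N
    n += 1
    return ("window", n) if n < N else ("vad", 0)

print(next_mode("vad", 0, 4, True, 0.2, 0.5))     # ('window', 0)
print(next_mode("window", 1, 4, False, 0.0, 0.5)) # ('window', 2)
print(next_mode("window", 3, 4, False, 0.0, 0.5)) # ('vad', 0): N reached
```

The design mirrors the flow: sliding-window mode is a bounded detour of N detections, after which the cheaper voice activity detection method is restored by default.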
  • FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of this application. As shown in FIG. 6, the device 600 may include a processor 601 and a memory 602.
  • the memory 602 is used to store program codes
  • the processor 601 calls the program code, and when the program code is executed, is used to perform the following operations:
  • the window recognition operation includes the following operations: moving the sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.
  • the audio processing device provided in this embodiment can be used to implement the technical solutions of the foregoing method embodiments, and its implementation principles and technical effects are similar to those of the method embodiments, and will not be repeated here.
  • a person of ordinary skill in the art can understand that all or part of the steps in the foregoing method embodiments can be implemented by a program instructing relevant hardware.
  • the aforementioned program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Abstract

An audio processing method and device. The method comprises: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining, on the basis of the audio segment, whether to execute a window recognition operation, the window recognition operation comprising the following operations: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window. Since the voice activity detection method has the advantage of low computing power consumption and the sliding window method has the advantage of strong anti-interference capability, the present application automatically switches between the voice activity method and the sliding window method according to the application scenario when performing audio processing, thereby combining the advantages of the two methods, saving computing power, and improving audio processing accuracy.

Description

Audio processing method and device

Technical field

This application relates to the field of audio technology, and in particular to an audio processing method and device.

Background

Voice interaction is a common way of human-computer interaction. When using voice recognition for human-computer interaction, voice endpoint detection must be performed first, that is, detecting which parts of the sound recorded by the microphone may contain speech. Various voice endpoint detection methods differ from one another in anti-interference ability and computing power consumption; how to balance anti-interference ability against computing power consumption has become an urgent problem in voice endpoint detection.
Summary of the invention

The embodiments of the present application provide an audio processing method and device to solve the problem that, in the prior art, anti-interference ability and computing power consumption cannot both be taken into account during voice endpoint detection, which results in the technical problems of poor anti-interference ability and high computing power consumption.

In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following operations: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.

In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory, where the memory is used to store program code, and the processor calls the program code and, when the program code is executed, is used to perform the following operations: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and determining whether to perform a window recognition operation based on the audio segment, where the window recognition operation includes the following steps: moving a sampling window in the audio signal after the audio segment, and performing voice recognition on the audio signal in the sampling window.

In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes at least one piece of code that can be executed by a computer to control the computer to execute the audio processing method according to any one of the above first aspects.

In a fourth aspect, an embodiment of the present application provides a computer program which, when executed by a computer, is used to implement the audio processing method according to any one of the above first aspects.

The embodiments of the present application provide an audio processing method and device. By selecting and switching the voice endpoint detection method according to the characteristics of the specific application scenario during voice endpoint detection, both anti-interference ability and computing power consumption can be taken into account.
Description of the drawings

In order to explain the technical solutions in the embodiments of the present application or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.

FIG. 1A is a first schematic diagram of an application scenario of the audio processing method provided by an embodiment of this application;

FIG. 1B is a second schematic diagram of an application scenario of the audio processing method provided by an embodiment of this application;

FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of this application;

FIG. 3A to FIG. 3C are schematic diagrams, provided by an embodiment of this application, of audio sub-segments excluding the noise of an audio segment;

FIG. 4 is a schematic flowchart of an audio processing method provided by another embodiment of this application;

FIG. 5 is a schematic flowchart of the voice activity detection method provided by an embodiment of this application; and

FIG. 6 is a schematic structural diagram of an audio processing device provided by an embodiment of this application.
Detailed description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The audio processing method provided in the embodiments of the present application can be applied to any audio processing process that requires voice endpoint detection, and the method can be executed by an audio processing device. The audio processing device may be a device that includes an audio collection module (for example, a microphone); accordingly, a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1A. Specifically, the audio collection module of the audio processing device can collect the user's speech to obtain an audio signal, and the processor of the audio processing device can process the audio signal collected by the audio collection module using the audio processing method provided in the embodiments of the present application. It should be noted that FIG. 1A is only a schematic diagram and does not limit the structure of the audio processing device. For example, an amplifier may be connected between the microphone and the processor to amplify the audio signal collected by the microphone; as another example, a filter may be connected between the microphone and the processor to filter the audio signal collected by the microphone.

Alternatively, the audio processing device may be a device that does not include an audio collection module; accordingly, a schematic diagram of an application scenario of the audio processing method provided in an embodiment of the present application may be as shown in FIG. 1B. Specifically, the communication interface of the audio processing device may receive audio signals collected by other devices or equipment, and the processor of the audio processing device may process the received audio signals using the audio processing method provided in the embodiments of the present application. It should be noted that FIG. 1B is only a schematic diagram and does not limit the structure of the audio processing device or the connection mode between the audio processing device and other devices or equipment. For example, the communication interface in the audio processing device may be replaced with a transceiver.

It should be noted that the embodiments of the present application do not limit the type of equipment that includes the audio processing device; the equipment may be, for example, a smart speaker, a smart lighting device, a smart robot, a mobile phone, a tablet computer, or the like.

The audio processing method provided by the embodiments of the present application can automatically switch between the voice activity detection method and the sliding window detection method according to the characteristics of the specific application scenario, and can therefore improve the overall anti-interference ability of voice endpoint detection while reducing its computing power consumption. That is, it can make the anti-interference ability of voice endpoint detection stronger than that of using the voice activity detection method alone, while making the computing power consumption of voice endpoint detection less than that of using the sliding window detection method alone.

It should be noted that the audio processing method provided by the embodiments of this application can be applied to the field of voice recognition, smart hardware devices with voice recognition functions, the field of audio event detection, smart hardware devices with audio event detection functions, and the like, which is not limited in this application.

Hereinafter, some embodiments of the present application will be described in detail with reference to the accompanying drawings. Provided that no conflict arises, the following embodiments and the features in the embodiments can be combined with each other.
FIG. 2 is a schematic flowchart of an audio processing method provided by an embodiment of the application. The execution subject of this embodiment may be an audio processing device, and specifically may be a processor of the audio processing device. As shown in FIG. 2, the method of this embodiment may include, for example, step S201 and step S202.

In step S201, an audio segment is intercepted from the audio signal according to the audio feature information of the audio signal.

It should be noted that in the embodiment of the present application, the audio segment is intercepted in step S201 based on the audio feature information. Since the intercepted endpoints are inaccurate, the intercepted audio segment usually contains noise; if speech recognition is performed purely on an audio segment intercepted in this way, the accuracy is not high.

It should be noted that the embodiment of the present application does not limit the specific type of the audio feature information. Optionally, the audio features may include, for example, one or more of Mel Frequency Cepstrum Coefficient (MFCC) features, Linear Prediction Coefficients (LPC) features, and filter bank (Fbank) features.

Specifically, in the embodiment of the present application, in step S201, a voice endpoint detection method with lower computing power consumption may be used to process the audio signal.

In the process of implementing the embodiments of the present application, the inventor found that the voice activity detection method recognizes whether an audio signal is a speech signal through methods such as an energy and zero-crossing-rate double threshold or a noise-speech classification model, and intercepts the signal for speech recognition only when it is determined to be speech. The voice activity detection method can therefore effectively distinguish silence from non-silence and can save computing power when used for speech recognition.

Therefore, in one embodiment, in step S201, a voice activity detection method may be used to intercept audio segments from the audio signal according to the audio feature information of the audio signal.
In addition, in the process of implementing the embodiments of the present application, the inventor also found that the voice activity detection method cannot effectively distinguish speech-like noise or non-speech noise with slightly higher energy. For example, when the voice activity detection method is used for command-word speech recognition, noise is easily mixed in before and after the command speech segment, which reduces the voice command recognition rate. The sliding window detection method, by contrast, can extract speech segments of a certain length from the audio stream and continuously perform speech recognition on them; for command-word speech recognition, this method can effectively prevent noise from being mixed in before and after the command-word speech segment. However, because speech recognition is computationally intensive, constantly performing speech recognition with the sliding window detection method consumes a large amount of the smart hardware device's computing resources and power.
Therefore, in the embodiments of the present application, the voice activity detection method can be used for audio processing in application scenarios with low anti-interference requirements, to reduce computing power consumption, while the sliding window detection method can be used for audio processing in application scenarios with high anti-interference requirements, to improve the anti-interference ability.

In step S202, it is determined whether to perform a window recognition operation based on the audio segment obtained in step S201.

The window recognition operation may include, for example, the following operations: moving a sampling window in the audio signal after the audio segment obtained in step S201, and performing voice recognition on the audio signal in the sampling window.

The sliding window detection method can, in one or more windows, exclude the noise included in the audio segment intercepted by voice activity detection. For example, as shown in FIG. 3A, assuming that the beginning part of the audio signal includes noise, the audio segment X1 can exclude that noise. As another example, as shown in FIG. 3B, assuming that the middle part of the audio signal includes noise, the audio segment X2 can exclude that noise. As a further example, as shown in FIG. 3C, assuming that the end part of the audio signal includes noise, the audio segment X3 can exclude that noise. It should be noted that the parts filled with grid lines in FIG. 3A to FIG. 3C are used to represent noise.

Specifically, in step S202, it can be determined whether the audio segment obtained in step S201 contains, or may contain, speech-like noise or non-speech noise with slightly higher energy, and thereby whether to perform the window recognition operation. If it is determined that the audio segment obtained in step S201 contains or may contain speech-like noise or non-speech noise with slightly higher energy, the window recognition operation is performed. Otherwise, if it is determined that the audio segment obtained in step S201 does not contain, or is unlikely to contain, such noise, step S201 and step S202 continue to be executed.

For example, in one embodiment, since noise is easily mixed in before and after instruction-type voice information, which easily leads to inaccurate speech segment interception and/or inaccurate speech recognition, for the audio segment obtained in step S201, if instruction-type voice information is detected, the window recognition operation is performed to eliminate the influence of noise.

As another example, in another embodiment, for the audio segment obtained in step S201, if non-speech information with slightly higher energy is detected, for example if a preset audio event such as a specific knocking sound, applause, or clapping sound is detected, the window recognition operation is performed, so as to avoid inaccurate audio segment interception and/or inaccurate speech recognition caused by the voice activity detection method's inability to effectively distinguish this type of information.

Through the embodiments of the present application, when an audio signal is processed, the corresponding voice endpoint detection method can be selected according to the characteristics of the specific application scenario. For example, in application scenarios with low anti-interference requirements, the voice activity detection method can be used to process the audio signal to save computing power; in application scenarios with high anti-interference requirements, the sliding window detection method can be used to process the audio signal to improve the anti-interference ability. This combines the advantages of the sliding window detection method's high anti-interference ability and the voice activity detection method's low computing power consumption, so that the anti-interference ability of voice endpoint detection is on the whole stronger than when the voice activity detection method is used alone, while the computing power consumption of voice endpoint detection is on the whole less than when the sliding window detection method is used alone.
It should be noted that, in the embodiments of the present application, in step S202, whether instruction-type voice information exists or may exist in the audio segment obtained in step S201 can be determined in a variety of ways.

For example, in one embodiment, voice information may be extracted from the audio segment obtained in step S201, and whether instruction-type information exists or may exist may be judged according to the content and/or the voice length of the extracted voice information, so as to decide whether to perform the window recognition operation.

Specifically, in the embodiments of the present application, the method may further include, for example: extracting voice information from the audio segment. Correspondingly, step S202 may include, for example: judging, based on the extracted voice information, whether to perform the window recognition operation.
For example, in one embodiment, whether to perform the window recognition operation may be judged based on the content of the extracted voice information. Specifically, in this embodiment, it can be judged whether the content of the voice information satisfies a specific condition (for example, whether the voice information contains instruction-type information). If the content of the voice information satisfies the specific condition, the window recognition operation is performed; otherwise, if the content of the voice information does not satisfy the specific condition, the window recognition operation is not performed.

As another example, in another embodiment, whether to perform the window recognition operation may be judged based on the voice length of the extracted voice information. Specifically, in this embodiment, it can be judged whether the voice length of the voice information is less than or equal to a preset value. If the voice length of the voice information is less than or equal to the preset value, the window recognition operation is performed; otherwise, if the voice length of the voice information is greater than the preset value, the window recognition operation is not performed.

As yet another example, in another embodiment, whether to perform the window recognition operation may be judged based on both the content and the voice length of the extracted voice information.

Specifically, in this embodiment, it may first be judged whether the content of the voice information satisfies the specific condition. If the content of the voice information does not satisfy the specific condition, it is then judged whether the voice length of the voice information is less than or equal to the preset value. If the content of the voice information does not satisfy the specific condition and the voice length of the voice information is also not less than or equal to the preset value, the window recognition operation is not performed; otherwise, the window recognition operation is performed.

Alternatively, in this embodiment, it may first be judged whether the voice length of the voice information is less than or equal to the preset value. If the voice length of the voice information is not less than or equal to the preset value, it is then judged whether the content of the voice information satisfies the specific condition. If the voice length of the voice information is not less than or equal to the preset value and the content of the voice information also does not satisfy the specific condition, the window recognition operation is not performed; otherwise, the window recognition operation is performed.
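The two checking orders described in the last two paragraphs produce the same outcome, since the window recognition operation is skipped only when both checks fail. A minimal sketch (the predicate `contains_instruction`, the instruction vocabulary, and all concrete values are hypothetical, not part of the embodiments):

```python
def should_perform_window_recognition(content, voice_length, preset_length,
                                      content_satisfies_condition):
    """Skip the window recognition operation only when the content does not
    satisfy the specific condition AND the voice length exceeds the preset
    value; otherwise perform it. The order of the two checks is irrelevant."""
    return content_satisfies_condition(content) or voice_length <= preset_length


# Toy predicate over a hypothetical instruction vocabulary:
instructions = {"take off", "land", "hover"}
contains_instruction = lambda text: any(w in text for w in instructions)

print(should_perform_window_recognition("please land", 1.2, 2.0, contains_instruction))         # True
print(should_perform_window_recognition("background chatter", 5.0, 2.0, contains_instruction))  # False
```

Either branch alone (content match, or short voice length) is enough to trigger the operation, which is why the two orderings in the text are equivalent.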
It should be noted that, according to the content and/or the voice length of the extracted voice information, whether instruction-type information exists in the audio segment obtained in step S201 can be determined. If it is determined that instruction-type information exists in the audio segment obtained in step S201, then, considering that the user may speak continuously during actual use and may thus issue multiple instructions in succession, the sliding window detection method can be selected for voice endpoint detection in the processing of the audio signal following this audio segment, so as to prevent too much noise from being mixed in before and after an instruction, which would cause inaccurate speech segment interception and/or inaccurate speech recognition.

In addition, it should be noted that, since the length of an instruction word is limited, the voice length corresponding to voice information containing an instruction word is also limited. It is therefore possible to roughly estimate, from the voice length of the extracted voice information, whether an instruction word exists or may exist in the voice information, and then to judge whether to perform the window recognition operation.

Through the embodiments of the present application, for the audio segment obtained in step S201, the voice information contained therein can first be extracted, whether instruction-type information exists in the audio segment is then determined or estimated based on the content and/or the voice length of the voice information, and whether to perform the window recognition operation is judged from this determination or estimation result. This avoids the inaccuracy caused by too much noise being mixed in when performing speech segment interception and speech recognition on instruction-type voice information.
In addition, in another embodiment, voice information may be extracted from the audio segment obtained in step S201, and whether instruction-type information exists or may exist may be judged according to the content of the extracted voice information and/or the duration of the audio segment, so as to decide whether to perform the window recognition operation.

For example, in the embodiments of the present application, as one embodiment, whether to perform the window recognition operation may be determined according to both the content of the extracted voice information and the duration of the audio segment.
Specifically, in the embodiments of the present application, the method may further include, for example: extracting voice information from the audio segment obtained in step S201. Correspondingly, step S202 may include, for example, the following operations. First, it is judged whether the extracted voice information matches the first voice information. If it does not match, it is then judged whether the duration of the audio segment is greater than or equal to the first duration threshold. If it is not greater than or equal to the first duration threshold, step S201 and step S202 are executed for the audio signal following the audio segment.

Alternatively, in the embodiments of the present application, step S202 may also include, for example, the following operations. First, it is judged whether the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold. If it is not greater than or equal to the first duration threshold, it is then judged whether the extracted voice information matches the first voice information. If it does not match, step S201 and step S202 are executed for the audio signal following the audio segment.
It should be noted that, in the embodiments of the present application, the first voice information may be set according to the vocabulary characterizing an instruction (for example, the first control instruction). As an optional embodiment, when the audio processing method is applied to a smart hardware device with a speech recognition function, the first voice information may be set according to the vocabulary characterizing all control instructions applicable to the smart hardware device.

Similarly to how the limited length of an instruction word makes the voice length of voice information containing that instruction word limited, the duration of an audio segment containing an instruction word is also limited. It is therefore possible to roughly estimate, from the duration of the audio segment obtained in step S201, whether an instruction word exists or may exist in the audio segment, and then to judge whether to perform the window recognition operation.

It should be noted that, in the embodiments of the present application, the first duration threshold may be set according to the maximum duration, the average duration, or any other duration of audio segments containing an instruction (for example, the first control instruction).

Through the embodiments of the present application, for the audio segment obtained in step S201, the voice information contained therein can be extracted, whether instruction-type information exists in the audio segment is determined or estimated based on the content of the voice information and the duration of the audio segment, and whether to perform the window recognition operation is judged from this estimation result. This likewise avoids the inaccuracy caused by too much noise being mixed in when performing speech segment interception and speech recognition on instruction-type voice information.
As another example, in the embodiments of the present application, as another embodiment, whether to perform the window recognition operation may also be determined according to one or both of the content of the extracted voice information and the duration of the audio segment.

Specifically, the method may further include, for example, performing the window recognition operation when any of the following events occurs: the voice information extracted from the audio segment obtained in step S201 matches the first voice information (event 1 for short); and/or the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold (event 2 for short).

It should be noted that, in the embodiments of the present application, for the case where both event 1 and event 2 are required to occur, in manner 1, whether event 1 has occurred may be judged first, and then whether event 2 has occurred; or, in manner 2, whether event 2 has occurred may be judged first, and then whether event 1 has occurred. However, in both manner 1 and manner 2, the window recognition operation is performed only when both events have occurred.
In addition, in the embodiments of the present application, the setting of the first voice information and the first duration threshold is the same as or similar to the setting method in the foregoing embodiments, and is not repeated here.

In addition, in the embodiments of the present application, for event 1, voice information may first be extracted from the audio segment obtained in step S201, and the extracted voice information may then be matched against the first voice information, so as to determine whether the extracted voice information matches the first voice information.

Similarly, in the embodiments of the present application, for event 2, the duration of the audio segment obtained in step S201 may first be determined, and this duration may then be compared against the first duration threshold.

Through the embodiments of the present application, either or both of event 1 and event 2 can be used to determine or estimate that the audio segment obtained in step S201 contains or may contain instruction-type information. Performing the window recognition operation to intercept the speech segment can then exclude the noise before and after the instruction, making the interception result of the speech segment more accurate, and hence also making the resulting speech recognition result more accurate.
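The decision based on event 1 and event 2 can be expressed as a single predicate. The sketch below covers both the "either event suffices" reading and the "both events required" case discussed above; the variable names and the `require_both` switch are illustrative assumptions, not part of the claims:

```python
def window_recognition_triggered(extracted_text, first_voice_info,
                                 segment_duration, first_duration_threshold,
                                 require_both=False):
    # Event 1: the extracted voice information matches the first voice information.
    event1 = extracted_text == first_voice_info
    # Event 2: the segment duration reaches the first duration threshold.
    event2 = segment_duration >= first_duration_threshold
    # When both events are required (manner 1 / manner 2 above), the order in
    # which they are checked does not change the result.
    return (event1 and event2) if require_both else (event1 or event2)
```

For example, a short segment whose text matches the first voice information triggers the operation in the "either" reading but not in the "both required" reading.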
It should be noted that, in the embodiments of the present application, in step S202, whether non-speech information with relatively high energy exists in the audio segment obtained in step S201 can also be detected in a variety of ways.

Specifically, step S202 may include, for example: judging, based on the audio segment obtained in step S201, whether a preset audio event has occurred, so as to determine whether non-speech information with relatively high energy exists in the audio segment.

If it is judged that the preset audio event has occurred, it is considered that non-speech information with relatively high energy exists in the audio segment, and the window recognition operation can be performed; in this way, noise can be excluded when intercepting the audio segment, making the interception result of the audio segment more accurate, and hence also making the resulting audio processing result more accurate. Conversely, if it is judged that the preset audio event has not occurred, it is considered that no non-speech information with relatively high energy exists in the audio segment, and the window recognition operation need not be performed.

It should be noted that, in the embodiments of the present application, the foregoing preset audio event may include, for example, the presence of one or more of the following sounds in the audio segment: knocking, applauding, or clapping. The preset audio event may, for example, be set according to non-speech information with relatively high energy that often appears in audio segments, such as knocking, applauding, or clapping sounds.
It should be understood that, under normal circumstances, the user may speak continuously during actual use and thus issue multiple instructions in succession. Therefore, when performing voice endpoint detection, once instruction-type voice information is detected, it is possible to switch to the sliding window detection method for voice endpoint detection, so as to prevent too much noise from being mixed in before and after an instruction and causing inaccurate speech segment interception and/or inaccurate speech recognition. However, there are also some special cases. For example, the user may not speak continuously during actual use, but instead issues one instruction and then does not issue any further instruction for a long time. For example, the user says "please enter the sleep state" to a smart hardware device with a speech recognition function (such as a robot), or the user says "please fly two circles around A" to a smart hardware device with a speech recognition function (such as an unmanned aerial vehicle), and so on. In these cases, the user generally will not issue other control instructions for a long time after the instruction is issued. If such special instruction-type voice information is detected and the method nevertheless switches to the sliding window detection method for voice endpoint detection, not only is the anti-interference capability not noticeably improved, but the computing-power consumption is instead increased.

Therefore, in the embodiments of the present application, if these special cases occur, the method may refrain from switching to the sliding window detection method, thereby avoiding the drawback that, after switching, the anti-interference capability is not noticeably improved while the computing-power consumption is instead increased.
That is, in another embodiment, in the process of judging, based on the audio segment obtained in step S201, whether to perform the window recognition operation, it is also possible to detect whether specific voice information appears. If the appearance of specific voice information is detected, step S201 and step S202 are executed for the audio signal following the audio segment obtained in step S201, and the window recognition operation is not performed.

Specifically, step S202 may include, for example: extracting voice information from the audio segment obtained in step S201, and judging whether the extracted voice information matches the second voice information. If it matches, step S201 and step S202 are executed for the audio signal following the audio segment obtained in step S201.

It should be noted that, in the embodiments of the present application, the foregoing second voice information may, for example, be set according to the vocabulary characterizing the second control instruction. The second control instruction may, for example, have either of the following features: the processor suspends the execution of other control instructions while executing the second control instruction (feature 1 for short); or the second control instruction indicates that voice interaction with the user is suspended (feature 2 for short).

For feature 1, the second control instruction may be, for example, "please fly two circles around A" in the above example. For feature 2, the second control instruction may be, for example, "please enter the sleep state" in the above example. Specifically, the second control instruction may be any control instruction having the foregoing feature 1 and/or feature 2, which is not limited in the embodiments of the present application.
In addition, in the embodiments of the present application, after switching to the sliding window detection method, that is, after the window recognition operation is performed, if the specific application scenario is found to change again, it is also possible to switch back from the sliding window detection method to the previously used voice endpoint detection method, so as to save as much computing power as possible.

Specifically, in the embodiments of the present application, the method may further include, for example: in the process of moving the sampling window in the audio signal following the audio segment obtained in step S201 and performing speech recognition on the audio signal within the sampling window, that is, in the process of performing the window recognition operation, in response to a switching trigger event that occurs, executing step S201 and step S202 for the audio signal after the current position of the sampling window.

Specifically, in the embodiments of the present application, the foregoing switching trigger event may include, for example, any one or more of the following: voice information matching the third voice information is detected in the process of performing speech recognition on the audio signal within the sampling window; the number of movements of the sampling window reaches the maximum number of movements; or the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches the second duration threshold.
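The three switching trigger events enumerated above can be combined into a single check. A sketch (the state field names are hypothetical, introduced only for illustration):

```python
def should_switch_back(state):
    """Return True if any switching trigger event has occurred; the caller
    then executes steps S201/S202 on the audio signal after the sampling
    window's current position instead of continuing to slide the window."""
    return (
        state["matched_third_voice_info"]              # e.g. a "sleep" command detected
        or state["moves_done"] >= state["max_moves"]   # movement count exhausted
        or state["elapsed_ms"] >= state["second_duration_threshold_ms"]
    )


# Usage: a window that has slid 10 of 100 times for 3 s without a match.
print(should_switch_back({"matched_third_voice_info": False,
                          "moves_done": 10, "max_moves": 100,
                          "elapsed_ms": 3000,
                          "second_duration_threshold_ms": 60000}))  # False
```

Any single trigger is sufficient, matching the "any one or more" wording of the embodiment.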
Specifically, in the embodiments of the present application, the third voice information may, for example, be set according to the vocabulary characterizing the second control instruction, where the second control instruction has either of the following features: the processor suspends the execution of other control instructions while executing the second control instruction; or the second control instruction indicates that voice interaction with the user is suspended.

It should be noted that, in the embodiments of the present application, the foregoing third voice information may be the same as or similar to the second voice information in the foregoing embodiments, and is not repeated here.

In addition, the second control instruction in this embodiment is the same as the second control instruction in the foregoing embodiments, and is likewise not repeated here.
In the audio processing method provided by the embodiments of the present application, the window recognition operation affects not only the overall anti-interference capability but also the overall computing-power consumption, and windows with different parameters also differ in their noise-rejection capability and in the computing power they consume. Therefore, in the process of performing the window recognition operation, the relevant parameters of the window can also be dynamically adjusted according to the speech recognition result, so as to balance anti-interference capability and computing-power consumption to the greatest extent.

It should be noted that, in the embodiments of the present application, sliding window detection refers to: setting a sliding window with a fixed window length of W milliseconds, and taking only the audio signal within the sliding window for recognition each time; the sliding window can slide backward step by step from the starting point of a given audio signal with a step size of S milliseconds, thereby achieving sliding window detection. In one example, the relationship between the total number of slides N of the sliding window and a time period of t minutes can be expressed as: t*60*1000 = (W + S*N).
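Under the relation t*60*1000 = (W + S*N) given above, the total number of slides N needed to cover a t-minute signal follows directly. A quick sketch (the example window and step values are illustrative):

```python
def total_slides(t_minutes, window_ms, step_ms):
    """N such that t * 60 * 1000 = window_ms + step_ms * N
    (rounded down when the step does not divide evenly)."""
    return (t_minutes * 60 * 1000 - window_ms) // step_ms


# A 1-minute signal with a 2000 ms window sliding in 500 ms steps:
print(total_slides(1, 2000, 500))  # 116
```

Halving the step S roughly doubles N, which is why the step-size adjustments described below trade computing power against noise rejection.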
In the embodiments of the present application, in the process of moving the sampling window in the audio signal following the audio segment obtained in step S201 and performing speech recognition on the audio signal within the sampling window, that is, in the process of performing the window recognition operation, the maximum number of movements of the sampling window (that is, the total number of slides N) and/or the movement step size (that is, the step size S) can, for example, be dynamically adjusted based on the recognition result of performing speech recognition on the audio signal of the sampling window.
For example, in one embodiment, the maximum number of movements of the sampling window (that is, the total number of slides N) and/or the movement step size (that is, the step size S) can be dynamically adjusted based on the number of instructions detected during speech recognition.

Specifically, in the embodiments of the present application, dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the recognition result may include, for example: dynamically adjusting the maximum number of movements and/or the movement step size of the sampling window based on the number of instructions detected in the recognition result.

More specifically, in the embodiments of the present application, dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.

For example, during the recognition period of N slides of sliding window detection (that is, the total number of slides N), if the number of detected instructions reaches a certain value (such as M1, where M1 < N), this indicates that the user is speaking continuously during this period and is continuously issuing instructions. The originally planned total number of slides N can therefore be increased to 2N or another number, so that noise is excluded by performing more sliding window detections, thereby improving the anti-interference capability.

In addition, more specifically, in the embodiments of the present application, dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a second preset value, adjusting the movement step size of the sampling window from a first step size to a second step size smaller than the first step size.

It should be noted that, in the embodiments of the present application, the second preset value may be the same as or different from the first preset value in the foregoing embodiment, which is not limited in the embodiments of the present application.

For example, during the recognition period of N slides of sliding window detection, if the number of detected instructions reaches a certain value (such as M2, where M2 < N), this indicates that the user is speaking continuously during this period and is continuously issuing instructions. The originally planned sliding step size of S milliseconds can therefore be reduced to S/2 milliseconds or another step size, so that noise is excluded by performing sliding window detection with a smaller step size, thereby improving the anti-interference capability.
Alternatively, more specifically, dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may further include, for example: in response to the number of detected instructions falling below a third preset value (such as M3, where M3 < N), adjusting the movement step size of the sampling window from the first step size to a third step size greater than the first step size.

It should be noted that, in the embodiments of the present application, the third preset value may, for example, be smaller than the second preset value in the foregoing embodiment.

For example, during the recognition period of the first N/2 slides of sliding window detection, if the number of detected instructions is lower than a certain value (such as M3, where M3 < N), this indicates that although the user is speaking during this period, it may not be continuous speech, or, even if it is continuous speech, the user is not continuously issuing instructions. The originally planned sliding step size of S milliseconds can therefore be increased to 2S milliseconds or another step size, so that noise is excluded by performing sliding window detection with a larger step size while computing power is saved, thereby balancing anti-interference capability and computing-power consumption as much as possible during the window recognition operation.
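The adjustment rules above can be sketched as one function. The doubling and halving factors follow the examples in the text; the preset values M1, M2, M3 and all concrete numbers are illustrative assumptions:

```python
def adjust_window_params(n_detected, max_moves, step_ms, m1, m2, m3):
    """Adjust the sampling window's maximum number of movements and its
    step size from the count of instructions detected so far.
    Assumes m3 < m2, as stated in the text."""
    if n_detected > m1:            # continuous instruction issuing:
        max_moves = 2 * max_moves  # slide more times to reject noise
    if n_detected > m2:            # continuous instruction issuing:
        step_ms = step_ms // 2     # use a finer step to reject noise
    elif n_detected < m3:          # little instruction activity:
        step_ms = 2 * step_ms      # use a coarser step to save computing power
    return max_moves, step_ms


# A burst of instructions tightens the search; a quiet period relaxes it.
print(adjust_window_params(8, 100, 400, m1=5, m2=5, m3=2))  # (200, 200)
print(adjust_window_params(1, 100, 400, m1=5, m2=5, m3=2))  # (100, 800)
```

The middle regime (m3 ≤ count ≤ m2) leaves the step size unchanged, matching the text's goal of adjusting only when the recognition result clearly indicates more or less instruction activity.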
It should be noted that the embodiments of the present application do not limit the specific algorithm used to intercept audio segments from the audio signal based on the voice activity detection method; for example, one or more of the energy and zero-crossing-rate dual-threshold algorithm, a noise-speech classification model algorithm, the variance method, the spectral distance method, and the spectral entropy method may be used. Taking the energy and zero-crossing-rate dual-threshold algorithm as an example, as shown in FIG. 4, step S201 may include, for example, the following steps S401 to S404.
Step S401: first divide the audio signal into frames, then compute the short-time average energy of each frame, frame by frame, to obtain the short-time energy envelope.
Here, the audio signal may be divided into frames of a fixed duration. Taking a fixed duration of 1 second as an example, the audio signal from second 0 to second 1 forms one frame, second 1 to second 2 forms one frame, second 2 to second 3 forms one frame, and so on, completing the framing of the audio signal. The short-time average energy of a frame is the average energy of that frame; connecting the per-frame energies yields the short-time energy envelope.
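The framing and short-time energy computation of step S401 can be sketched as follows; the function names are illustrative, and `frame_len` is in samples rather than seconds to keep the sketch self-contained.

```python
def frame_signal(samples, frame_len):
    """Split a sample sequence into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Average energy of one frame: mean of squared samples."""
    return sum(s * s for s in frame) / len(frame)

def energy_envelope(samples, frame_len):
    """Per-frame short-time average energy; connecting these points
    gives the short-time energy envelope of step S401."""
    return [short_time_energy(f) for f in frame_signal(samples, frame_len)]

print(energy_envelope([0, 0, 0, 0, 2, 2, 2, 2], 4))  # [0.0, 4.0]
```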
Step S402: select a higher threshold T1, and mark the first and last intersection points of T1 with the short-time energy envelope as C and D.
Here, the portion above threshold T1 is considered to have a high probability of being speech.
Step S403: select a lower threshold T2; its intersection point B with the short-time energy envelope lies to the left of C, and its intersection point E lies to the right of D.
Here, T2 is smaller than T1.
Step S404: compute the short-time average zero-crossing rate of each frame and select a threshold T3; its intersection point A with the short-time average zero-crossing-rate curve lies to the left of B, and its intersection point F lies to the right of E.
Here, the short-time average zero-crossing rate of a frame is the average zero-crossing rate of that frame.
At this point, intersection points A and F are the two endpoints of the audio signal determined by the voice activity detection method, and the audio segment from A to F is the segment intercepted from the audio signal by the voice activity detection method.
It should be noted that steps S401 to S404 take intercepting only one audio segment from a stretch of audio signal as an example; it is understood that multiple audio segments may also be intercepted from one stretch of audio signal, and the embodiments of the present application do not limit this.
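The endpoint search of steps S402 to S404 can be sketched on per-frame energy and zero-crossing-rate sequences. The index-based outward expansion below is one plausible discrete reading of the intersection points C/D, B/E, and A/F; the thresholds in the example run are chosen only for illustration.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:])) / len(frame)

def dual_threshold_endpoints(energy, zcr, t1, t2, t3):
    """Return frame indices (A, F) of the intercepted segment, or None.

    energy: short-time energy per frame; zcr: short-time zero-crossing
    rate per frame; t1 > t2 are the energy thresholds, t3 the ZCR threshold.
    """
    high = [i for i, e in enumerate(energy) if e >= t1]
    if not high:
        return None                      # no frame clears the high threshold T1
    c, d = high[0], high[-1]             # step S402: first/last crossings of T1
    b, e = c, d                          # step S403: expand outward while energy >= T2
    while b > 0 and energy[b - 1] >= t2:
        b -= 1
    while e < len(energy) - 1 and energy[e + 1] >= t2:
        e += 1
    a, f = b, e                          # step S404: expand further while ZCR >= T3
    while a > 0 and zcr[a - 1] >= t3:
        a -= 1
    while f < len(energy) - 1 and zcr[f + 1] >= t3:
        f += 1
    return a, f

energy = [0.0, 1.0, 3.0, 5.0, 5.0, 3.0, 1.0, 0.0]
zcr = [0.1, 0.3, 0.0, 0.0, 0.0, 0.0, 0.3, 0.1]
print(dual_threshold_endpoints(energy, zcr, t1=4.0, t2=2.0, t3=0.25))  # (1, 6)
```

The energy thresholds find the voiced core of the segment; the ZCR threshold then pulls in the low-energy, high-zero-crossing fricative edges that pure energy detection would miss.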
In addition, it should be noted that in the embodiments of the present application, a speech recognition model may be used when performing speech recognition. The speech recognition model may be, for example, a GMM-HMM (Gaussian mixture model combined with a hidden Markov model). The training process is as follows: use a number of command utterances and negative samples as training data; extract MFCC features from the training speech; then train the GMM-HMM model. During training, HMM parameter estimation may, for example, use the Baum-Welch algorithm, setting several states for each command-word category and for the noise category, with several Gaussian components per state; the GMM-HMM is trained for several iterations using the EM (Expectation Maximization) method to obtain the GMM-HMM model required by this application. The speech recognition process may then be: extract the MFCC (Mel-frequency cepstral coefficient) features of the detected sound segment, perform Viterbi decoding with the pre-trained GMM-HMM model, and output the recognition result.
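The Viterbi decoding step mentioned above can be sketched in log-probability form. This toy decoder assumes the per-frame state log-likelihoods have already been produced by a trained GMM-HMM; producing them (MFCC extraction, Baum-Welch training) is outside the scope of the sketch.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state path for a sequence of per-frame log-likelihoods.

    obs_loglik[t][s]: log-likelihood of frame t under state s (e.g. from a GMM);
    log_trans[p][s]: log transition probability p -> s; log_init[s]: log prior.
    """
    n = len(log_init)
    delta = [log_init[s] + obs_loglik[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_loglik)):
        ptr, new_delta = [], []
        for s in range(n):
            p = max(range(n), key=lambda q: delta[q] + log_trans[q][s])
            ptr.append(p)
            new_delta.append(delta[p] + log_trans[p][s] + obs_loglik[t][s])
        back.append(ptr)
        delta = new_delta
    state = max(range(n), key=lambda s: delta[s])  # best final state
    path = [state]
    for ptr in reversed(back):                     # backtrack through pointers
        state = ptr[state]
        path.append(state)
    return path[::-1]

uniform = [[math.log(0.5)] * 2] * 2
obs = [[0.0, -10.0], [0.0, -10.0], [-10.0, 0.0]]  # frames favor state 0, 0, then 1
print(viterbi(obs, uniform, [math.log(0.5)] * 2))  # [0, 0, 1]
```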
Hereinafter, the present application is described in detail with a specific embodiment in conjunction with the accompanying drawings.
In this embodiment, when the smart hardware device has just been powered on, it may, for example, perform voice endpoint detection through the voice activity detection method by default; when certain conditions are met (that is, when the application scenario changes), it switches to the sliding-window detection method for a period of time, and then reverts to the voice activity detection method. The specific switching process is shown in FIG. 5 and may include, for example, the following steps.
Step S501: power on. The smart hardware device starts the voice endpoint detection function and, by default, performs voice endpoint detection through the voice activity detection method.
Step S502: acquire the audio stream in real time.
Step S503: intercept an audio segment from the audio stream through the voice activity detection method.
Step S504: detect whether sound is present in the intercepted audio segment. If so, perform step S505; otherwise, jump to step S503.
Step S505: perform the first speech recognition to obtain voice information.
Step S506: determine whether the voice information is a control instruction (first switching condition). If so, perform step S507; otherwise, perform step S508.
Step S507: execute the instruction, and after the instruction has been executed, perform step S509.
That is, if the result of the first speech recognition is a control instruction, human-machine interaction is performed. After the interaction ends, voice endpoint detection is performed through the sliding-window detection method for a period of t minutes (the total number of slides in this period is N). During sliding-window detection, speech recognition is performed on the audio segment within each sliding window.
Step S508: determine whether the voice length of the voice information is longer than L seconds (second switching condition). If so, perform step S509; otherwise, jump to step S503.
That is, if the result of the first speech recognition is not a control instruction, it is further determined whether the duration of the speech segment intercepted by the voice activity detection method exceeds a certain duration of L seconds (for example, 0.5 s). If the duration of the speech segment is less than L seconds, voice endpoint detection continues through the voice activity detection method, that is, the process returns to step S503. If the duration is greater than or equal to L seconds, the user is considered likely to be speaking continuously at this time. Therefore, voice endpoint detection is performed through the sliding-window detection method for t minutes, with speech recognition performed on the audio segment within each sliding window during the detection, as in steps S510 to S512.
Step S509: initialize the window slide count n = 0.
Step S510: perform one sliding-window detection and record the slide count n = n + 1.
Step S511: perform speech recognition.
Step S512: determine whether n < N. If so, perform step S510; otherwise, jump to step S503.
Here, steps S510 to S512 describe the operation and exit logic during the sliding-window detection period. If the speech recognition module recognizes a control instruction during sliding-window detection and recognition, sliding-window detection is suspended and the instruction action is executed instead. After the instruction action finishes, sliding-window detection continues until the N sliding-window detections for that period (i.e., t minutes) have been completed. After the N sliding-window detections end, the default voice activity detection method is restored for voice endpoint detection, that is, the process returns to step S503.
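The switching logic of FIG. 5 can be sketched as two small helpers: the mode-switch decision of steps S506/S508, and the bounded sliding-window phase of steps S509 to S512. The recognizer callable is a stand-in for the real speech recognition module, and the return conventions are illustrative assumptions.

```python
def should_enter_window_mode(is_command: bool, segment_secs: float,
                             l_secs: float = 0.5) -> bool:
    """Steps S506/S508: enter sliding-window mode if the first recognition
    yielded a control instruction, or the segment lasted at least L seconds."""
    return is_command or segment_secs >= l_secs

def sliding_window_phase(recognize, windows, n_max):
    """Steps S509-S512: run at most n_max window recognitions; a recognized
    command pauses detection, is executed, and detection then resumes.
    After n_max slides, control returns to the VAD method (step S503)."""
    executed = []
    for n, win in enumerate(windows, start=1):
        if n > n_max:                 # S512: n < N failed -> back to VAD
            break
        cmd = recognize(win)          # S511: per-window recognition
        if cmd is not None:
            executed.append(cmd)      # execute the instruction, then continue
    return executed

# Illustrative run with a stub recognizer:
recognize = lambda w: w if w.startswith("cmd") else None
print(should_enter_window_mode(False, 0.6))                            # True
print(sliding_window_phase(recognize, ["a", "cmd1", "b", "cmd2"], 3))  # ['cmd1']
```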
FIG. 6 is a schematic structural diagram of an audio processing apparatus provided by an embodiment of this application. As shown in FIG. 6, the apparatus 600 may include a processor 601 and a memory 602.
The memory 602 is configured to store program code.
The processor 601 calls the program code and, when the program code is executed, performs the following operations:
intercepting an audio segment from an audio signal according to audio feature information of the audio signal;
determining, based on the audio segment, whether to perform a window recognition operation;
the window recognition operation comprising: moving a sampling window in the audio signal after the audio segment, and performing speech recognition on the audio signal within the sampling window.
The audio processing apparatus provided in this embodiment can be used to implement the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar to those of the method embodiments and are not repeated here.
A person of ordinary skill in the art can understand that all or part of the steps of the foregoing method embodiments can be implemented by hardware related to program instructions. The foregoing program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (34)

  1. An audio processing method, characterized by comprising:
    intercepting an audio segment from an audio signal according to audio feature information of the audio signal;
    determining, based on the audio segment, whether to perform a window recognition operation;
    wherein the window recognition operation comprises: moving a sampling window in the audio signal after the audio segment, and performing speech recognition on the audio signal within the sampling window.
  2. The method according to claim 1, characterized in that the method further comprises:
    extracting voice information from the audio segment;
    wherein the determining, based on the audio segment, whether to perform a window recognition operation comprises:
    determining, based on the extracted voice information, whether to perform the window recognition operation.
  3. The method according to claim 1, characterized in that the method further comprises:
    extracting voice information from the audio segment;
    wherein the determining, based on the audio segment, whether to perform a window recognition operation comprises:
    determining whether the voice information matches the first voice information;
    if they do not match, determining whether the duration of the audio segment is greater than or equal to a first duration threshold;
    if it is not greater than or equal to the first duration threshold, performing, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  4. The method according to claim 1 or 3, characterized in that the determining, based on the audio segment, whether to perform a window recognition operation comprises performing the window recognition operation when any of the following events occurs:
    the voice information extracted from the audio segment matches first voice information; and/or,
    the duration of the audio segment is greater than or equal to a first duration threshold.
  5. The method according to claim 3 or 4, characterized in that the first voice information is set according to a vocabulary characterizing a first control instruction.
  6. The method according to any one of claims 1-5, characterized in that the determining, based on the audio segment, whether to perform a window recognition operation comprises:
    determining, based on the audio segment, whether a preset audio event occurs.
  7. The method according to claim 6, characterized in that the audio event comprises: one or more of the following sounds being present in the audio segment: knocking, applause, or hand clapping.
  8. The method according to any one of claims 2-7, characterized by further comprising:
    determining whether the voice information matches second voice information;
    if they match, performing, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  9. The method according to any one of claims 1-8, characterized by further comprising: in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    in response to a switching trigger event occurring, performing, for the audio signal after the position at which the sampling window is currently located, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  10. The method according to claim 9, characterized in that the switching trigger event comprises any one of the following:
    in the process of performing speech recognition on the audio signal within the sampling window, voice information matching third voice information is detected;
    the number of movements of the sampling window reaches a maximum number of movements; and
    the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches a second duration threshold.
  11. The method according to claim 8 or 10, characterized in that the second voice information or the third voice information is set according to a vocabulary characterizing a second control instruction;
    wherein the second control instruction has any of the following features:
    the processor suspends execution of other control instructions while executing the second control instruction;
    the second control instruction indicates that voice interaction with the user is suspended.
  12. The method according to claim 10 or 11, characterized in that, in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    the maximum number of movements and/or the movement step length of the sampling window are dynamically adjusted based on a recognition result of performing speech recognition on the audio signal of the sampling window.
  13. The method according to claim 12, characterized by further comprising:
    dynamically adjusting the maximum number of movements and/or the movement step length of the sampling window based on the number of instructions detected in the recognition result.
  14. The method according to claim 13, characterized in that the dynamically adjusting the maximum number of movements of the sampling window based on the number of detected instructions comprises:
    in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  15. The method according to claim 13 or 14, characterized in that the dynamically adjusting the movement step length of the sampling window based on the number of detected instructions comprises:
    in response to the number of detected instructions exceeding a second preset value, adjusting the movement step length of the sampling window from a first step length to a second step length smaller than the first step length.
  16. The method according to any one of claims 13-15, characterized in that the dynamically adjusting the movement step length of the sampling window based on the number of detected instructions comprises:
    in response to the number of detected instructions being lower than a third preset value, adjusting the movement step length of the sampling window from the first step length to a third step length greater than the first step length.
  17. An audio processing apparatus, characterized by comprising: a processor and a memory;
    the memory being configured to store program code;
    the processor calling the program code and, when the program code is executed, performing the following operations:
    intercepting an audio segment from an audio signal according to audio feature information of the audio signal;
    determining, based on the audio segment, whether to perform a window recognition operation;
    wherein the window recognition operation comprises the following steps: moving a sampling window in the audio signal after the audio segment, and performing speech recognition on the audio signal within the sampling window.
  18. The apparatus according to claim 17, characterized in that the processor is further configured to:
    extract voice information from the audio segment;
    determine, based on the extracted voice information, whether to perform the window recognition operation.
  19. The apparatus according to claim 17, characterized in that the processor is further configured to:
    extract voice information from the audio segment;
    determine whether the voice information matches the first voice information;
    if they do not match, determine whether the duration of the audio segment is greater than or equal to a first duration threshold;
    if it is not greater than or equal to the first duration threshold, perform, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  20. The apparatus according to claim 17 or 19, characterized in that the processor is further configured to perform the window recognition operation when any of the following events occurs:
    the voice information extracted from the audio segment matches first voice information; and/or,
    the duration of the audio segment is greater than or equal to a first duration threshold.
  21. The apparatus according to claim 19 or 20, characterized in that the first voice information is set according to a vocabulary characterizing a first control instruction.
  22. The apparatus according to any one of claims 17-21, characterized in that the processor is further configured to:
    determine, based on the audio segment, whether a preset audio event occurs.
  23. The apparatus according to claim 22, characterized in that the audio event comprises: one or more of the following sounds being present in the audio segment: knocking, applause, or hand clapping.
  24. The apparatus according to any one of claims 17-23, characterized in that the processor is further configured to:
    determine whether the voice information matches second voice information;
    if they match, perform, for the audio signal after the audio segment, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  25. The apparatus according to any one of claims 17-24, characterized in that the processor is further configured to: in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    in response to a switching trigger event occurring, perform, for the audio signal after the position at which the sampling window is currently located, the steps of intercepting an audio segment according to audio feature information of the audio signal and determining, based on the audio segment, whether to perform a window recognition operation.
  26. The apparatus according to claim 25, characterized in that the switching trigger event comprises any one of the following:
    in the process of performing speech recognition on the audio signal within the sampling window, voice information matching third voice information is detected;
    the number of movements of the sampling window reaches a maximum number of movements; and
    the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches a second duration threshold.
  27. The apparatus according to claim 24 or 26, characterized in that the second voice information or the third voice information is set according to a vocabulary characterizing a second control instruction;
    wherein the second control instruction has any of the following features:
    the processor suspends execution of other control instructions while executing the second control instruction;
    the second control instruction indicates that voice interaction with the user is suspended.
  28. The apparatus according to claim 26 or 27, characterized in that, in the process of performing the steps of moving a sampling window in the audio signal after the audio segment and performing speech recognition on the audio signal within the sampling window,
    the maximum number of movements and/or the movement step length of the sampling window are dynamically adjusted based on a recognition result of performing speech recognition on the audio signal of the sampling window.
  29. The apparatus according to claim 28, characterized in that the processor is further configured to:
    dynamically adjust the maximum number of movements and/or the movement step length of the sampling window based on the number of instructions detected in the recognition result.
  30. The apparatus according to claim 29, characterized in that the processor is further configured to:
    in response to the number of detected instructions exceeding a first preset value, adjust the maximum number of movements of the sampling window from a first value to a second value greater than the first value.
  31. The apparatus according to claim 29 or 30, characterized in that the processor is further configured to:
    in response to the number of detected instructions exceeding a second preset value, adjust the movement step length of the sampling window from a first step length to a second step length smaller than the first step length.
  32. The apparatus according to any one of claims 29-31, characterized in that the processor is further configured to:
    in response to the number of detected instructions being lower than a third preset value, adjust the movement step length of the sampling window from the first step length to a third step length greater than the first step length.
  33. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program contains at least one piece of code, and the at least one piece of code is executable by a computer to control the computer to perform the audio processing method according to any one of claims 1-16.
  34. A computer program, characterized in that, when the computer program is executed by a computer, it is used to implement the audio processing method according to any one of claims 1-16.
PCT/CN2020/073292 2020-01-20 2020-01-20 Audio processing method and device WO2021146857A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080004356.6A CN112543972A (en) 2020-01-20 2020-01-20 Audio processing method and device
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Publications (1)

Publication Number Publication Date
WO2021146857A1 (en)

Family

ID=75017359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Country Status (2)

Country Link
CN (1) CN112543972A (en)
WO (1) WO2021146857A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238607B (en) * 2021-12-17 2022-11-22 北京斗米优聘科技发展有限公司 Deep interactive AI intelligent job-searching consultant method, system and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US20090222264A1 (en) * 2008-02-29 2009-09-03 Broadcom Corporation Sub-band codec with native voice activity detection
CN107305774A (en) * 2016-04-22 2017-10-31 腾讯科技(深圳)有限公司 Speech detection method and device
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device
CN110277087A (en) * 2019-07-03 2019-09-24 四川大学 A kind of broadcast singal anticipation preprocess method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8650029B2 * | 2011-02-25 | 2014-02-11 | Microsoft Corporation | Leveraging speech recognizer feedback for voice activity detection
CN106601230B * | 2016-12-19 | 2020-06-02 | 苏州金峰物联网技术有限公司 | Logistics sorting place name voice recognition method and system based on continuous Gaussian mixture HMM model and logistics sorting system
US10431242B1 * | 2017-11-02 | 2019-10-01 | Gopro, Inc. | Systems and methods for identifying speech based on spectral features

Also Published As

Publication number | Publication date
CN112543972A (en) | 2021-03-23

Similar Documents

Publication Publication Date Title
EP3577645B1 (en) End of query detection
Li et al. Robust endpoint detection and energy normalization for real-time speech and speaker recognition
EP2898510B1 (en) Method, system and computer program for adaptive control of gain applied to an audio signal
US11037584B2 (en) Direction based end-pointing for speech recognition
JP2016505888A (en) Speech recognition power management
WO2017154282A1 (en) Voice processing device and voice processing method
WO2015103836A1 (en) Voice control method and device
JPH11153999A (en) Speech recognition device and information processor using the same
US20230298575A1 (en) Freeze Words
US11348579B1 (en) Volume initiated communications
WO2021146857A1 (en) Audio processing method and device
US11676594B2 (en) Decaying automated speech recognition processing results
US11341988B1 (en) Hybrid learning-based and statistical processing techniques for voice activity detection
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
CN111739515A (en) Voice recognition method, device, electronic device, server and related system
KR20120111510A (en) A system of robot controlling of using voice recognition
CN112435691B (en) Online voice endpoint detection post-processing method, device, equipment and storage medium
WO2021016925A1 (en) Audio processing method and apparatus
Abdulkader et al. Multiple-instance, cascaded classification for keyword spotting in narrow-band audio
US20240079004A1 (en) System and method for receiving a voice command
US11081114B2 (en) Control method, voice interaction apparatus, voice recognition server, non-transitory storage medium, and control system
KR100281582B1 (en) Speech Recognition Method Using the Recognizer Resource Efficiently
CN111768800A (en) Voice signal processing method, apparatus and storage medium
CN116052643A (en) Voice recognition method, device, storage medium and equipment

Legal Events

Code | Description
121 | EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20916099; Country of ref document: EP; Kind code of ref document: A1)
NENP | Non-entry into the national phase (Ref country code: DE)
122 | EP: PCT application non-entry in European phase (Ref document number: 20916099; Country of ref document: EP; Kind code of ref document: A1)