CN112543972A - Audio processing method and device - Google Patents

Audio processing method and device

Info

Publication number
CN112543972A
CN112543972A
Authority
CN
China
Prior art keywords
audio
window
sampling window
audio signal
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080004356.6A
Other languages
Chinese (zh)
Inventor
吴俊峰
罗东阳
童焦龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SZ DJI Technology Co Ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd filed Critical SZ DJI Technology Co Ltd
Publication of CN112543972A publication Critical patent/CN112543972A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/05 Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/144 Training of HMMs
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An audio processing method and device. The method comprises the following steps: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and judging, based on the audio segment, whether to perform a window recognition operation, where the window recognition operation comprises moving a sampling window through the audio signal following the audio segment and performing speech recognition on the audio signal within the sampling window. Because the voice activity detection method has the advantage of low computational cost, while the sliding window method has the advantage of strong interference resistance, automatically switching between the two according to the application scenario combines the advantages of both, so that computation can be saved and the accuracy of audio processing improved.

Description

Audio processing method and device
Technical Field
The present application relates to the field of audio technologies, and in particular, to an audio processing method and apparatus.
Background
Voice interaction is a common method of human-machine interaction. When speech recognition is used for human-machine interaction, voice endpoint detection must first be performed, that is, it must be determined which parts of the sound recorded by the microphone may contain speech. Different voice endpoint detection methods differ in their interference resistance and computational cost, and how to balance the two has become an urgent problem in voice endpoint detection.
Disclosure of Invention
The embodiments of the present application provide an audio processing method and an audio processing device, to address the technical problem that voice endpoint detection in the prior art cannot balance interference resistance against computational cost, resulting in poor interference resistance or high computational cost.
In a first aspect, an embodiment of the present application provides an audio processing method, including: intercepting an audio segment from an audio signal according to audio feature information of the audio signal; and judging, based on the audio segment, whether to perform a window recognition operation, where the window recognition operation includes: moving a sampling window through the audio signal following the audio segment, and performing speech recognition on the audio signal within the sampling window.
In a second aspect, an embodiment of the present application provides an audio processing device, including a processor and a memory, where the memory is configured to store program code, and the processor, when invoking and executing the program code, is configured to: intercept an audio segment from an audio signal according to audio feature information of the audio signal; and judge, based on the audio segment, whether to perform a window recognition operation, where the window recognition operation includes: moving a sampling window through the audio signal following the audio segment, and performing speech recognition on the audio signal within the sampling window.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, where the computer program includes at least one piece of code, and the at least one piece of code is executable by a computer to control the computer to execute the audio processing method according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer program, which is configured to implement the audio processing method according to any one of the above first aspects when the computer program is executed by a computer.
The embodiments of the present application provide an audio processing method and device in which, during voice endpoint detection, the detection method is selected and switched according to the characteristics of the specific application scenario, so that interference resistance and computational cost can both be accommodated.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1A is a schematic view of a first application scenario of an audio processing method according to an embodiment of the present application;
fig. 1B is a schematic view of a second application scenario of an audio processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application;
figs. 3A-3C are schematic diagrams illustrating how an audio sub-segment excludes noise from the audio segment, according to an embodiment of the present application;
fig. 4 is a flowchart illustrating an audio processing method according to another embodiment of the present application;
FIG. 5 is a schematic flow chart of a voice activity detection method provided in an embodiment of the present application; and
fig. 6 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The audio processing method provided by the embodiment of the application can be applied to any audio processing process needing voice endpoint detection, and the audio processing method can be specifically executed by an audio processing device. The audio processing apparatus may be an apparatus including an audio acquisition module (e.g., a microphone), and accordingly, an application scenario diagram of the audio processing method provided by the embodiment of the present application may be as shown in fig. 1A. Specifically, the audio acquisition module of the audio processing device can acquire speech of a user to obtain an audio signal, and the processor of the audio processing device can process the audio signal acquired by the audio acquisition module by using the audio processing method provided by the embodiment of the application. It should be noted that fig. 1A is only a schematic diagram, and does not limit the structure of the audio processing apparatus. For example, an amplifier may be connected between the microphone and the processor for amplifying the audio signal collected by the microphone. For another example, a filter may be connected between the microphone and the processor for filtering the audio signal collected by the microphone.
Or, the audio processing apparatus may also be an apparatus that does not include an audio acquisition module, and accordingly, an application scenario schematic diagram of the audio processing method provided in the embodiment of the present application may be as shown in fig. 1B. Specifically, the communication interface of the audio processing apparatus may receive an audio signal acquired by another apparatus or device, and the processor of the audio processing apparatus may process the received audio signal by using the audio processing method provided in the embodiment of the present application. It should be noted that fig. 1B is a schematic diagram, and the structure of the audio processing apparatus and the connection manner between the audio processing apparatus and other apparatuses or devices are not limited, for example, the communication interface in the audio processing apparatus may be replaced by a transceiver.
It should be noted that the embodiments of the present application do not limit the type of device that includes the audio processing apparatus; the device may be, for example, a smart speaker, a smart lighting device, an intelligent robot, a mobile phone, a tablet computer, or the like.
The audio processing method provided by the embodiments of the present application can automatically switch between the voice activity detection method and the sliding window detection method according to the characteristics of the specific application scenario, which improves the overall interference resistance of voice endpoint detection while reducing its computational cost. The resulting interference resistance is stronger than that of using the voice activity detection method alone, and the computational cost is lower than that of using the sliding window detection method alone.
It should be noted that the audio processing method provided in the embodiment of the present application may be applied to the field of speech recognition, an intelligent hardware device with a speech recognition function, the field of audio event detection, an intelligent hardware device with an audio event detection function, and the like, and the present application is not limited herein.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Fig. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present application. The execution subject of this embodiment may be an audio processing device, and specifically a processor of the audio processing device. As shown in fig. 2, the method of this embodiment may include step S201 and step S202.
Step S201, intercepting an audio segment from the audio signal according to the audio feature information of the audio signal.
It should be noted that, in step S201, the audio segment is intercepted based on the audio feature information. Because the endpoints at which the audio segment is intercepted are not accurate, the intercepted segment usually contains noise, and speech recognition performed directly on such a segment has low accuracy.
It should be noted that the embodiments of the present application do not limit the specific type of the audio feature information. Optionally, the audio features may include, for example, one or more of Mel-Frequency Cepstrum Coefficient (MFCC) features, Linear Prediction Coefficient (LPC) features, and filter bank (Fbank) features.
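As an illustration of one such feature, a log Mel filterbank (Fbank) computation for a single frame can be sketched as follows. This is a minimal numpy sketch with assumed example parameters (16 kHz sample rate, 26 filters, 512-point FFT), not the patent's implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank_features(frame, sr=16000, n_filters=26, n_fft=512):
    """Log Mel filterbank (Fbank) energies for a single audio frame:
    power spectrum, then triangular filters spaced evenly on the Mel scale."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    feats = np.zeros(n_filters)
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):       # rising edge of triangle
            feats[i] += spec[k] * (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of triangle
            feats[i] += spec[k] * (right - k) / max(right - center, 1)
    return np.log(feats + 1e-10)            # floor avoids log(0)

# example: Fbank features of one 25 ms frame of a 1 kHz tone at 16 kHz
frame = np.sin(2 * np.pi * 1000.0 * np.arange(400) / 16000.0)
feats = fbank_features(frame)
```

Taking the discrete cosine transform of these log filterbank energies would yield MFCCs, which is why the two features are often mentioned together.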
Specifically, in the embodiments of the present application, step S201 may process the audio signal using a voice endpoint detection method with low computational cost.
In the process of implementing the embodiments of the present application, the inventors found that the voice activity detection method identifies whether the audio signal is a speech signal using, for example, an energy and zero-crossing-rate dual threshold or a noise-speech classification model, and intercepts the signal for speech recognition only when it is determined to be speech. The voice activity detection method can therefore effectively distinguish silence from non-silence, and saves computation when used for speech recognition.
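The energy and zero-crossing-rate dual-threshold idea mentioned above can be sketched roughly as follows; the frame sizes and thresholds are assumed example values, not taken from the patent:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def dual_threshold_vad(x, energy_thr=0.01, zcr_thr=0.3):
    """Flag a frame as speech when its short-time energy exceeds the energy
    threshold and its zero-crossing rate stays below the ZCR threshold
    (voiced speech tends to be high-energy and low-ZCR; a crude heuristic)."""
    frames = frame_signal(x)
    energy = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (energy > energy_thr) & (zcr < zcr_thr)

# toy signal: silence, a low-frequency "voiced" burst, then silence again
sr = 16000
t = np.arange(sr // 2) / sr
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 200.0 * t),
                      np.zeros(sr // 2)])
flags = dual_threshold_vad(sig)  # True only on frames inside the burst
```

The run of True flags marks the endpoints of the candidate speech segment; this is exactly the cheap decision that can fail on speech-like or high-energy noise, motivating the switch described below.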
Thus, in one embodiment, step S201 may use the voice activity detection method to intercept an audio segment from the audio signal according to the audio feature information of the audio signal.
In addition, in the process of implementing the embodiments of the present application, the inventors also found that the voice activity detection method cannot discriminate effectively when faced with speech-like noise, or with non-speech noise of relatively high energy. For example, when the voice activity detection method is used for instruction-word speech recognition, noise is easily mixed in before and after the instruction speech segment, reducing the recognition rate of voice instructions. The sliding window detection method, by contrast, continuously extracts speech segments of a certain length from the audio stream for speech recognition; for instruction-word recognition, it can effectively prevent noise from being mixed in before and after the instruction-word speech segment. However, because the amount of speech recognition computation is large and recognition runs continuously, the sliding window detection method consumes considerable computing resources and battery power on intelligent hardware devices.
Therefore, in the embodiment of the application, in an application scenario with a low requirement on anti-interference performance, a speech activity detection method may be used for audio processing to reduce computational power consumption, but in an application scenario with a high requirement on anti-interference performance, a sliding window detection method may be used for audio processing to improve anti-interference performance.
In step S202, it is determined, based on the audio segment obtained in step S201, whether to perform a window recognition operation.
The window recognition operation may include the following: moving a sampling window through the audio signal following the audio segment obtained in step S201, and performing speech recognition on the audio signal within the sampling window.
Because the sliding window detection method is used, noise that would be included in the audio segment intercepted by voice activity detection can be excluded in one or more windows. For example, as shown in fig. 3A, if the beginning portion of the audio signal contains noise, the audio sub-segment X1 can exclude that noise. As shown in fig. 3B, if the middle portion of the audio signal contains noise, the audio sub-segment X2 can exclude it. As shown in fig. 3C, if the end portion of the audio signal contains noise, the audio sub-segment X3 can exclude it. Note that the portions filled with grid lines in figs. 3A to 3C represent noise.
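The window recognition operation described here can be sketched as a loop that slides a fixed-length sampling window through the signal following the intercepted segment, running recognition on each window. The window length, hop, and `recognize` callback are illustrative placeholders:

```python
import numpy as np

def window_recognition(signal, start, win_len, hop, recognize):
    """Move a sampling window of `win_len` samples through `signal`,
    beginning at `start` (the position following the intercepted audio
    segment), and apply the `recognize` callback to each window."""
    results = []
    pos = start
    while pos + win_len <= len(signal):
        results.append(recognize(signal[pos:pos + win_len]))
        pos += hop  # consecutive windows overlap when hop < win_len
    return results

# toy usage: a stand-in "recognizer" that reports whether a window is non-silent
sig = np.concatenate([np.zeros(100), np.ones(100), np.zeros(300)])
hits = window_recognition(sig, start=0, win_len=200, hop=100,
                          recognize=lambda w: bool(np.any(w != 0)))
```

Because each window is a fixed-length slice, noise lying outside a given window simply never enters that recognition pass, which is the exclusion effect figs. 3A-3C illustrate.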
Specifically, in step S202, it may be determined whether the audio segment obtained in step S201 contains, or may contain, speech-like noise or non-speech noise of relatively high energy, and thus whether to perform the window recognition operation. If the segment is determined to contain, or possibly contain, such noise, the window recognition operation is performed; otherwise, steps S201 and S202 continue to be performed.
For example, in one embodiment, speech segment interception and/or speech recognition are prone to inaccuracy because noise is easily mixed in before and after instruction-like speech information. Therefore, if instruction-like speech information is detected in the audio segment obtained in step S201, the window recognition operation is performed to exclude the influence of noise.
For another example, in another embodiment, if relatively high-energy non-speech information is detected in the audio segment obtained in step S201, for example a preset audio event such as a specific tapping or clapping sound, the window recognition operation is performed, to avoid inaccurate segment interception and/or speech recognition caused by the inability of the voice activity detection method to discriminate such information effectively.
With the embodiments of the present application, when the audio signal is processed, a corresponding voice endpoint detection method can be selected according to the characteristics of the specific application scenario. In an application scenario with low interference-resistance requirements, the voice activity detection method may be used to save computation; in a scenario with high interference-resistance requirements, the sliding window detection method may be used to improve interference resistance. The strong interference resistance of the sliding window detection method and the low computational cost of the voice activity detection method can thus both be obtained: the overall interference resistance of voice endpoint detection is higher than with the voice activity detection method alone, while the overall computational cost is lower than with the sliding window detection method alone.
It should be noted that, in step S202, whether instruction-like voice information is, or may be, present in the audio segment obtained in step S201 may be determined in various ways.
For example, in one embodiment, the voice information may be extracted from the audio segment obtained in step S201, and whether instruction-like information is, or may be, present may be judged from the content and/or the voice length of the extracted information, thereby deciding whether to perform the window recognition operation.
Specifically, in the embodiments of the present application, the method may further include: extracting voice information from the audio segment. Correspondingly, step S202 may include: determining, based on the extracted voice information, whether to perform the window recognition operation.
For example, in one embodiment, whether to perform the window recognition operation may be determined based on the content of the extracted voice information. Specifically, it may be judged whether the content of the voice information satisfies a specific condition (for example, whether it includes instruction-like information). If the content satisfies the condition, the window recognition operation is performed; otherwise, it is not performed.
For another example, in another embodiment, whether to perform the window recognition operation may be determined based on the voice length of the extracted voice information. Specifically, it may be judged whether the voice length is less than or equal to a preset value. If so, the window recognition operation is performed; otherwise, it is not performed.
For another example, in another embodiment, whether to perform the window recognition operation may be determined based on the content and the voice length of the extracted voice information.
Specifically, in this embodiment, it may first be judged whether the content of the voice information satisfies the specific condition. If it does not, it is then judged whether the voice length of the voice information is less than or equal to the preset value. If the content does not satisfy the specific condition and the voice length is greater than the preset value, the window recognition operation is not performed; otherwise, it is performed.
Alternatively, in this embodiment, it may first be judged whether the voice length of the voice information is less than or equal to the preset value. If it is not, it is then judged whether the content of the voice information satisfies the specific condition. If the voice length is greater than the preset value and the content does not satisfy the specific condition, the window recognition operation is not performed; otherwise, it is performed.
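The content- and length-based judgments described in this and the preceding paragraphs reduce to a simple predicate. A minimal sketch follows; the instruction vocabulary and the preset length are assumed example values, not taken from the patent:

```python
# Assumed example values: an instruction vocabulary and a preset maximum
# speech length (seconds) at or below which an instruction word is plausible.
INSTRUCTION_WORDS = {"take off", "land", "start recording"}
PRESET_MAX_LEN = 2.0

def should_perform_window_recognition(text, speech_len):
    """Perform the window recognition operation unless the extracted speech
    both fails the content condition and exceeds the preset length."""
    content_ok = any(word in text for word in INSTRUCTION_WORDS)
    length_ok = speech_len <= PRESET_MAX_LEN
    return content_ok or length_ok
```

Checking the content first or the length first, as the two orderings in the text describe, yields the same outcome; only the short-circuit order differs.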
It should be noted that, according to the content and/or the voice length of the extracted voice information, it may be determined whether instruction-like information is present in the audio segment obtained in step S201. If it is, then, considering that a user may speak continuously during actual use and thus issue several instructions in succession, the sliding window detection method may be selected for voice endpoint detection of the audio signal following the segment, preventing inaccurate speech segment interception and/or speech recognition caused by excessive noise mixed in before and after the instructions.
In addition, since the length of an instruction word is limited, the length of speech corresponding to voice information containing an instruction word is also limited. Whether instruction words are, or may be, present in the voice information can therefore be roughly estimated from the voice length of the extracted voice information, and it can then be judged whether to perform the window recognition operation.
With the embodiments of the present application, for the audio segment obtained in step S201, the voice information it contains can first be extracted; whether instruction-like information is present can then be determined or estimated from the content and/or voice length of that information; and whether to perform the window recognition operation can be decided from the result. Inaccuracy caused by excessive noise mixed in during segment interception and speech recognition of instruction-like voice information can thus be avoided.
In addition, in another embodiment, the voice information may be extracted from the audio segment obtained in step S201, and whether instruction-like information is, or may be, present may be judged from the content of the extracted voice information and/or the duration of the audio segment, thereby deciding whether to perform the window recognition operation.
For example, in one implementation, whether to perform the window recognition operation may be determined jointly from the content of the extracted voice information and the duration of the audio segment.
Specifically, in the embodiments of the present application, the method may further include: extracting voice information from the audio segment obtained in step S201. Correspondingly, step S202 may include the following operations. First, it is judged whether the extracted voice information matches the first voice information. If it does not match, it is then judged whether the duration of the audio segment is greater than or equal to a first duration threshold. If it is not, steps S201 and S202 are performed on the audio signal following the audio segment.
Alternatively, in the embodiments of the present application, step S202 may include the following operations. It is first judged whether the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold. If it is not, it is then judged whether the extracted voice information matches the first voice information. If it does not match, steps S201 and S202 are performed on the audio signal following the audio segment.
It should be noted that, in the embodiments of the present application, the first voice information may be set according to a vocabulary characterizing an instruction (e.g., a first control instruction). As an optional implementation, when the audio processing method is applied to an intelligent hardware device with a speech recognition function, the first voice information may be set according to the vocabulary characterizing all the control instructions used by that device.
Similarly, since the length of speech corresponding to voice information containing an instruction word is limited, the duration of an audio segment containing an instruction word is also limited. Whether an instruction word is, or may be, present in the audio segment can therefore be roughly estimated from the duration of the segment obtained in step S201, and it can then be judged whether to perform the window recognition operation.
It should be noted that, in the embodiment of the present application, the first duration threshold may be set according to a maximum duration, an average duration, or any duration of an audio segment containing an instruction (e.g., a first control instruction).
With the embodiments of the present application, for the audio segment obtained in step S201, the voice information it contains may be extracted, whether instruction-like information is present may be determined or estimated from the content of that information together with the duration of the segment, and whether to perform the window recognition operation may be decided from the result. Inaccuracy caused by excessive noise mixed in during segment interception and speech recognition of instruction-like voice information can thus be avoided.
For another example, in another embodiment, whether to perform the window recognition operation may also be determined according to either or both of the content of the extracted voice information and the duration of the audio segment.
Specifically, the method may further include, for example, performing the window recognition operation in the case that the following events occur: the voice information extracted from the audio segment obtained in step S201 matches the first voice information (event 1 for short); and/or the duration of the audio segment obtained in step S201 is greater than or equal to the first duration threshold (event 2 for short).
It should be noted that, in the embodiment of the present application, for the case where both event 1 and event 2 are required, mode 1 may first determine whether event 1 occurs and then determine whether event 2 occurs. Alternatively, mode 2 may first determine whether event 2 occurs and then determine whether event 1 occurs. In either mode 1 or mode 2, both events must occur for the window recognition operation to be performed.
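As an illustrative sketch (the function and parameter names below are hypothetical, not taken from the patent), mode 1 and mode 2 for the "both events" case differ only in the order of the two checks:

```python
def should_run_window_recognition(speech_text, duration_ms,
                                  first_voice_info, first_duration_threshold_ms,
                                  check_duration_first=False):
    """Return True only if both event 1 (vocabulary match) and
    event 2 (duration >= threshold) occur; the check order
    corresponds to mode 1 / mode 2 described above."""
    event1 = lambda: speech_text in first_voice_info              # event 1
    event2 = lambda: duration_ms >= first_duration_threshold_ms   # event 2
    if check_duration_first:          # mode 2: event 2 first, then event 1
        return event2() and event1()
    return event1() and event2()      # mode 1: event 1 first, then event 2
```

Either order yields the same result; short-circuiting merely decides which check may be skipped when the first one fails.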
In addition, in the embodiment of the present application, the setting of the first voice information and the first duration threshold is the same as or similar to the setting method in the foregoing embodiment, and is not repeated herein.
In addition, in the embodiment of the present application, for event 1, whether the extracted voice information matches the first voice information may be determined by extracting the voice information from the audio segment obtained in step S201 and then matching it against the first voice information.
Similarly, in the embodiment of the present application, for event 2, the duration of the audio segment obtained in step S201 may be determined and then compared with the first duration threshold to determine whether it is greater than or equal to that threshold.
In the embodiment of the present application, it may be determined or estimated through one or both of event 1 and event 2 that the audio segment obtained in step S201 contains, or may contain, instruction-class information. Noise before and after an instruction can then be excluded by performing the window recognition operation when intercepting the voice segment, making the interception result of the voice segment more accurate and, in turn, the speech recognition result more accurate.
In the embodiment of the present application, in step S202, whether the audio segment obtained in step S201 contains relatively high-energy non-speech information may also be detected in various ways.
Specifically, step S202 may include, for example: determining whether a preset audio event occurs based on the audio segment obtained in step S201, so as to determine whether relatively high-energy non-speech information exists in the audio segment.
If it is determined that the preset audio event occurs, relatively high-energy non-speech information exists in the audio segment, and the window recognition operation may be performed, so that noise can be excluded when the audio segment is intercepted; the interception result of the audio segment is therefore more accurate, and so is the resulting audio processing result. Conversely, if it is determined that the preset audio event does not occur, the audio segment is considered to contain no relatively high-energy non-speech information, and the window recognition operation need not be performed.
It should be noted that, in the embodiment of the present application, the preset audio event may include, for example, the presence of one or more of the following sounds in the audio segment: a tapping sound or a clapping sound. The preset audio event may be set according to relatively high-energy non-speech information, such as tapping and clapping sounds, that often occurs in audio segments.
It should be understood that, in general, a user may speak continuously during actual use and thus issue a plurality of instructions in succession. Therefore, when performing voice endpoint detection, once instruction-class voice information is detected, the method may switch to the sliding window detection method for voice endpoint detection, so as to prevent excessive noise from being mixed in before and after an instruction and causing inaccurate voice segment interception and/or voice recognition. However, there are special cases: the user may not speak continuously during actual use, and may not issue any further instruction for a long time after issuing one. For example, a user may say "please enter a sleep state" to an intelligent hardware device with a voice recognition function (e.g., a robot), or "please fly 2 circles around A" to an intelligent hardware device with a voice recognition function (e.g., an unmanned aerial vehicle). In such cases, the user generally does not issue other control instructions for a long time after issuing the instruction. If such special instruction voice information is detected and the method nevertheless switches to the sliding window detection method for voice endpoint detection, the anti-interference capability is not significantly improved while the computational power consumption increases. Therefore, in the embodiment of the present application, if these special situations occur, the method may refrain from switching to the sliding window detection method, thereby avoiding the drawback that switching would increase computational power consumption without significantly improving the anti-interference capability.
That is, in another embodiment, in the process of determining whether to perform the window recognition operation based on the audio segment obtained in step S201, whether specific voice information appears may also be detected. If the specific voice information is detected, steps S201 and S202 are performed on the audio signal following the audio segment obtained in step S201, without performing the window recognition operation.
Specifically, step S202 may include, for example: voice information is extracted from the audio clip obtained in step S201, and it is determined whether the extracted voice information matches the second voice information. If the audio signal is matched with the audio signal after the audio clip obtained in step S201, step S201 and step S202 are executed.
It should be noted that, in the embodiment of the present application, the second speech information may be set, for example, according to a vocabulary characterizing the second control instruction. The second control instruction may have either of the following features: during execution of the second control instruction by the processor, execution of other control instructions is paused (feature 1 for short); or the second control instruction characterizes a pause in the voice interaction with the user (feature 2 for short).
For feature 1, the second control instruction may be, for example, "please fly 2 circles around A" in the above example. For feature 2, the second control instruction may be, for example, "please enter a sleep state" in the above example. Specifically, the second control instruction may be any control instruction having feature 1 and/or feature 2, which is not limited in the embodiment of the present application.
In addition, in the embodiment of the present application, after switching to the sliding window detection method, that is, while the window recognition operation is being performed, if the application scene is found to change again, the method may switch back from the sliding window detection method to the previous voice endpoint detection method, so as to save computational power as much as possible.
Specifically, in the embodiment of the present application, the method may further include, for example: in the process of executing the step of moving the sampling window in the audio signal after the audio clip obtained in step S201 and performing the speech recognition step on the audio signal in the sampling window, that is, in the process of executing the window recognition operation, in response to the occurrence of the switching trigger event, step S201 and step S202 are executed for the audio signal after the position where the sampling window is currently located.
Specifically, in the embodiment of the present application, the switching trigger event may include any one or more of the following: voice information matching the third voice information is detected during speech recognition of the audio signal within the sampling window; the number of movements of the sampling window reaches the maximum number of movements; and the processing duration of moving the sampling window and performing speech recognition on the audio signal within it reaches a second duration threshold.
Specifically, in the embodiment of the present application, the third voice information may be set according to a vocabulary characterizing the second control instruction, for example. Wherein the second control instruction has any of the following characteristics: suspending the execution of other control instructions during the execution of the second control instruction by the processor; the second control instruction characterizes a pause in voice interaction with the user.
It should be noted that, in the embodiment of the present application, the third speech information may be the same as or similar to the second speech information in the foregoing embodiment, and details of the embodiment of the present application are not repeated herein.
In addition, the second control instruction in the embodiment of the present application is the same as the second control instruction in the foregoing embodiment, and the description of the embodiment of the present application is also omitted here.
In the audio processing method provided by the embodiment of the present application, the window recognition operation affects not only the overall anti-interference capability but also the overall computational power consumption. Moreover, windows with different parameters differ in their noise-exclusion capability and in their computational cost. Therefore, during the window recognition operation, the window parameters can be dynamically adjusted according to the speech recognition results, so that anti-interference capability and computational power consumption are balanced as far as possible.
In the embodiment of the present application, sliding window detection means the following: a sliding window with a fixed window length of W milliseconds is set, and only the audio signal within the sliding window is taken for recognition each time; the sliding window slides backwards step by step from the starting point of a given audio signal with a step length of S milliseconds, thereby achieving sliding window detection. In one example, the relationship between the total sliding number N of the sliding window and a time period of t minutes can be expressed as: t × 60 × 1000 = W + S × N.
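Rearranging the relation t × 60 × 1000 = W + S × N gives N = (t × 60 × 1000 − W) / S. This can be sketched as follows (function names are illustrative, not from the patent):

```python
def total_slides(t_minutes, window_ms, step_ms):
    """Solve t * 60 * 1000 = W + S * N for the total sliding number N."""
    return (t_minutes * 60 * 1000 - window_ms) // step_ms

def window_offsets(step_ms, n_slides):
    """Start offset (ms) of each sampling window position: the initial
    position plus n_slides subsequent slides."""
    return [i * step_ms for i in range(n_slides + 1)]
```

For example, a W = 2000 ms window sliding in S = 500 ms steps over t = 1 minute gives N = (60000 − 2000) / 500 = 116 slides.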
In the embodiment of the present application, during the steps of moving the sampling window in the audio signal following the audio segment obtained in step S201 and performing speech recognition on the audio signal within the sampling window, that is, while performing the window recognition operation, the maximum number of movements of the sampling window (that is, the total sliding number N) and/or the movement step size (that is, the step size S) may, for example, be dynamically adjusted based on the result of performing speech recognition on the audio signal within the sampling window.
For example, in one embodiment, the maximum number of moves of the sampling window (i.e., the total number of slips N) and/or the move step size (i.e., the step size S) may be dynamically adjusted based on the number of instructions detected in speech recognition.
Specifically, in the embodiment of the present application, dynamically adjusting the maximum moving number and/or the moving step of the sampling window based on the recognition result of performing the speech recognition on the audio signal of the sampling window may include, for example: and dynamically adjusting the maximum moving times and/or the moving step length of the sampling window based on the number of the detected instructions in the identification result.
More specifically, in this embodiment of the present application, dynamically adjusting the maximum number of times of moving the sampling window based on the number of detected instructions may include, for example: in response to the number of detected instructions exceeding a first preset value, the maximum number of movements of the sampling window is adjusted from a first value to a second value greater than the first value.
For example, during the recognition of N sliding window detections (that is, a total sliding number of N), if the number of detected instructions reaches a certain value (e.g., M1, where M1 < N), it means that there is continuous speaking behavior during this period and instructions are being issued continuously. Therefore, the total sliding number N can be increased to 2N or another value, so that more sliding window detections are performed to exclude noise, thereby improving the anti-interference capability.
Furthermore, more specifically, in the embodiment of the present application, dynamically adjusting the moving step size of the sampling window based on the number of detected instructions may include, for example: and adjusting the moving step length of the sampling window from the first step length to a second step length smaller than the first step length in response to the number of the detected instructions exceeding a second preset value.
It should be noted that, in the embodiment of the present application, the second preset value may be the same as or different from the first preset value in the foregoing embodiment, and the embodiment of the present application is not limited herein.
For example, during the recognition of N sliding window detections (that is, a total sliding number of N), if the number of detected instructions reaches a certain value (e.g., M2, where M2 < N), it means that there is continuous speaking behavior during this period and instructions are being issued continuously. The preset sliding step size of S milliseconds may be reduced to S/2 milliseconds or another step size, so that sliding window detection with a smaller sliding step excludes noise, thereby improving the anti-interference capability.
Alternatively, more specifically, dynamically adjusting the movement step size of the sampling window based on the number of detected instructions may further include: in response to the number of detected instructions being below a third preset value (e.g., M3, where M3 < N), adjusting the movement step size of the sampling window from the first step size to a third step size larger than the first step size.
It should be noted that, in the embodiment of the present application, the third preset value may be smaller than the second preset value in the foregoing embodiment, for example.
For example, during the recognition of the first N/2 sliding window detections (out of a total sliding number of N), if the number of detected instructions is below a certain value (e.g., M3, where M3 < N), it means that although the user spoke during this period, the speech may not be continuous, or even if it is continuous, instructions are not being issued continuously. Therefore, the preset sliding step size of S milliseconds can be increased to 2S milliseconds or another step size, so that sliding window detection with a larger sliding step excludes noise while saving computational power, thereby balancing anti-interference capability and computational power consumption as much as possible during the window recognition operation.
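The three adjustment rules above (increasing the total sliding number, reducing the step size, and enlarging the step size) can be combined into one hedged sketch. The names m1, m2 and m3 stand in for the first, second and third preset values, and all names are illustrative only:

```python
def adjust_window_params(num_instructions, n_total, step_ms, m1, m2, m3):
    """Dynamically adjust the total sliding number and sliding step size
    based on the number of instructions detected so far.
    Returns (new_n_total, new_step_ms)."""
    if num_instructions > m1:      # continuous instructions: more slides
        n_total = 2 * n_total      # e.g. N -> 2N
    if num_instructions > m2:      # continuous instructions: finer step
        step_ms = step_ms // 2     # e.g. S -> S/2
    elif num_instructions < m3:    # sparse instructions: coarser step, save compute
        step_ms = step_ms * 2      # e.g. S -> 2S
    return n_total, step_ms
```

For instance, with (N, S) = (100, 500) and thresholds m1 = 5, m2 = 8, m3 = 2, detecting 10 instructions yields (200, 250), while detecting only 1 instruction yields (100, 1000).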
It should be noted that the specific algorithm used to intercept an audio segment from an audio signal based on a voice activity detection method is not limited in the embodiment of the present application; for example, one or more of an energy and zero-crossing-rate dual-threshold algorithm, a noise-voice classification model algorithm, a variance method, a spectral distance method, and a spectral entropy method may be used. Taking the energy and zero-crossing-rate dual-threshold algorithm as an example, as shown in fig. 4, step S201 may include, for example, the following steps S401 to S404.
Step S401: first frame the audio signal, and then calculate the short-time average energy of each frame, frame by frame, to obtain the short-time energy envelope.
The audio signal may be framed with a fixed duration. For example, with a fixed duration of 1 second, the audio signal may be divided into one frame from the 0th to the 1st second, one frame from the 1st to the 2nd second, one frame from the 2nd to the 3rd second, and so on, to complete the framing of the audio signal. The short-time average energy of a frame is the average energy of that frame, and connecting the short-time energies of successive frames yields the short-time energy envelope.
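A minimal sketch of step S401, framing a sample sequence and computing the short-time energy envelope (pure Python; function names are illustrative, not from the patent):

```python
def frame_signal(samples, frame_len):
    """Split samples into fixed-length frames (any trailing partial
    frame is dropped for simplicity)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Average energy of one frame: mean of squared amplitudes."""
    return sum(x * x for x in frame) / len(frame)

def energy_envelope(samples, frame_len):
    """Short-time energy envelope: one energy value per frame."""
    return [short_time_energy(f) for f in frame_signal(samples, frame_len)]
```

In practice the frame length would be derived from the sampling rate and the fixed frame duration (e.g. 1 second of samples per frame).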
In step S402, a higher threshold T1 is selected, and the first and last intersections of T1 with the short-time energy envelope are marked as C and D.
The portion above the threshold T1 can be considered speech with relatively high probability.
In step S403, a lower threshold T2 is selected; the intersection B of the short-time energy envelope with this threshold is located to the left of C, and the intersection E is located to the right of D.
Wherein T2 is less than T1.
Step S404: calculate the short-time average zero-crossing rate of each frame and select a threshold T3; the intersection A of this threshold with the short-time average zero-crossing-rate curve is located to the left of B, and the intersection F is located to the right of E.
Wherein, the short-time average zero-crossing rate of a frame is the average zero-crossing rate of the frame.
So far, the intersection points A and F are the two endpoints of the audio signal determined by the voice activity detection method, and the segment from intersection point A to intersection point F is the audio segment intercepted from the audio signal based on the voice activity detection method.
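Steps S402 to S404 can be sketched as follows, operating on per-frame energies and zero-crossing rates. The thresholds t1, t2, t3 and all function names are illustrative, and edge cases are handled only in the simplest way:

```python
def zero_crossing_rate(frame):
    """Short-time average zero-crossing rate of one frame."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / len(frame)

def dual_threshold_endpoints(energies, zcrs, t1, t2, t3):
    """Step S402: locate C/D via the high energy threshold t1.
    Step S403: widen to B/E via the low energy threshold t2.
    Step S404: widen further to A/F via the zero-crossing-rate
    threshold t3.  Returns frame indices (a, f), or None if no
    frame exceeds t1."""
    above = [i for i, e in enumerate(energies) if e >= t1]
    if not above:
        return None
    c, d = above[0], above[-1]                              # C and D
    b = c
    while b > 0 and energies[b - 1] >= t2:                  # B left of C
        b -= 1
    e = d
    while e + 1 < len(energies) and energies[e + 1] >= t2:  # E right of D
        e += 1
    a = b
    while a > 0 and zcrs[a - 1] >= t3:                      # A left of B
        a -= 1
    f = e
    while f + 1 < len(zcrs) and zcrs[f + 1] >= t3:          # F right of E
        f += 1
    return a, f
```

The returned frame indices (a, f) delimit the intercepted audio segment from intersection point A to intersection point F.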
It should be noted that steps S401 to S404 illustrate, for example, intercepting only one audio segment from one audio signal; it can be understood that a plurality of audio segments may also be intercepted from one audio signal, which is not limited in this embodiment of the present application.
In addition, in the embodiment of the present application, a speech recognition model may be used when performing speech recognition. The speech recognition model may be, for example, a GMM-HMM model (Gaussian Mixture Model-Hidden Markov Model), and its training procedure is as follows: a plurality of instruction voices and negative samples are used as training data; MFCC features are extracted from the training speech; then the GMM-HMM model is trained. In the training process, for example, the Baum-Welch algorithm may be adopted for HMM parameter estimation, a number of states are set for each instruction-word category and for the noise category, each with a number of Gaussians, and the GMM-HMM model is trained over several iterations using the EM (Expectation Maximization) method, thereby obtaining the GMM-HMM model required by the application. Accordingly, the speech recognition process may be: extract the MFCC (Mel-frequency cepstral coefficient) features of the detected sound segment, perform Viterbi decoding using the pre-trained GMM-HMM model, and output the recognition result.
The present application is described in detail below with reference to one embodiment and the accompanying drawings.
In this embodiment, when the intelligent hardware device has just been turned on, voice endpoint detection may, for example, be performed by the voice activity detection method by default. When a certain condition is satisfied (that is, when the application scene changes), the method switches to the sliding window detection method, runs for a period of time, and then reverts to the voice activity detection method. The specific switching process is shown in fig. 5 and may include, for example, the following steps.
Step S501: start up. The intelligent hardware device enables the voice endpoint detection function and, by default, performs voice endpoint detection by the voice activity detection method.
Step S502, audio stream is obtained in real time.
In step S503, an audio segment in the audio stream is intercepted by a voice activity detection method.
Step S504, whether sound exists in the intercepted audio clip is detected. If so, executing step S505; otherwise, it jumps to step S503.
And step S505, performing first voice recognition to obtain voice information.
In step S506, it is determined whether the voice information is a control instruction (first switching condition). If yes, go to step S507; otherwise, go to step S508.
In step S507, the instruction is executed, and after the instruction is executed, step S509 is executed.
That is, if the result of the first voice recognition is a control instruction, human-machine interaction is performed. After the interaction ends, voice endpoint detection is performed by the sliding window detection method for a period of t minutes (the total sliding number during this period is N). During sliding window detection, speech recognition is performed on the audio segment within each sliding window.
Step S508: determine whether the voice length of the voice information is greater than L seconds (second switching condition). If so, go to step S509; otherwise, jump back to step S503.
That is, if the result of the first voice recognition is not a control instruction, it is further determined whether the duration of the voice segment intercepted by the voice activity detection method exceeds a certain duration of L seconds (for example, 0.5 s). If the duration of the voice segment is less than L seconds, voice endpoint detection continues by the voice activity detection method, that is, the flow returns to step S503. If the duration of the voice segment is greater than or equal to L seconds, it is considered highly likely that the user is speaking continuously. Therefore, voice endpoint detection is performed by the sliding window detection method for t minutes, during which speech recognition is performed on the audio segment within each sliding window; the flow is as in steps S510 to S512.
In step S509, the window sliding count n is initialized to 0.
In step S510, one sliding window detection is performed, and the window sliding count is incremented: n = n + 1.
In step S511, speech recognition is performed.
Step S512: determine whether n < N. If yes, go to step S510; otherwise, jump back to step S503.
Steps S510 to S512 describe the operation logic and exit logic during sliding window detection. If the speech recognition module recognizes a control instruction during sliding window detection and recognition, sliding window detection is suspended and the instruction action is executed. After the instruction action has been executed, sliding window detection continues until it has been performed N times within the period (that is, t minutes). After the N sliding window detections are finished, the flow returns to the default voice activity detection method for voice endpoint detection, that is, returns to step S503.
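The mode switching of fig. 5 (steps S503 to S512) can be condensed into a small state-transition sketch; the mode labels "vad" and "sliding" and the function name are illustrative, not from the patent:

```python
def next_mode(current_mode, is_instruction, voice_len_s, n, n_total,
              min_voice_len_s=0.5):
    """One transition of the fig. 5 flow.  Returns (mode, slide_count).
    In 'vad' mode, a recognized instruction (S506) or a voice segment
    of at least L seconds (S508) switches to 'sliding' mode with the
    slide count reset (S509).  In 'sliding' mode, the flow stays until
    n slides out of n_total are done (S512), then falls back to 'vad'."""
    if current_mode == "vad":
        if is_instruction or voice_len_s >= min_voice_len_s:
            return "sliding", 0       # S506/S508 satisfied -> S509
        return "vad", 0               # stay in VAD mode (back to S503)
    if n < n_total:                   # S512: keep sliding (S510)
        return "sliding", n
    return "vad", 0                   # N slides done -> back to S503
```

This captures only the switching logic; actual interception, recognition, and instruction execution would happen between transitions.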
Fig. 6 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, and as shown in fig. 6, the apparatus 600 may include: a processor 601 and a memory 602.
The memory 602 for storing program code;
the processor 601, which invokes the program code, is configured to perform the following operations when the program code is executed:
intercepting an audio clip in the audio signal according to the audio characteristic information of the audio signal;
judging whether to execute window identification operation based on the audio clip;
the window identifying operation includes the following operations: the sampling window is moved in the audio signal following the audio segment and speech recognition is performed on the audio signal within the sampling window.
The audio processing apparatus provided in this embodiment may be configured to execute the technical solution of the foregoing method embodiment, and the implementation principle and technical effect of the audio processing apparatus are similar to those of the method embodiment, and are not described herein again.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (34)

1. An audio processing method, comprising:
intercepting an audio clip in the audio signal according to audio characteristic information of the audio signal;
judging whether to execute window identification operation based on the audio clip;
the window identifying operation includes the following operations: moving a sampling window in the audio signal following the audio segment and performing speech recognition on the audio signal within the sampling window.
2. The method of claim 1, further comprising:
extracting voice information from the audio clip;
the determining whether to perform a window identification operation based on the audio clip includes:
and judging whether to execute the window recognition operation or not based on the extracted voice information.
3. The method of claim 1, further comprising:
extracting voice information from the audio clip;
the determining whether to perform a window identification operation based on the audio clip includes:
judging whether the voice information is matched with the first voice information;
if not, judging whether the duration of the audio clip is greater than or equal to a first duration threshold value;
if the duration is not greater than or equal to the first duration threshold, performing, for the audio signal following the audio segment, the step of intercepting an audio segment according to the audio feature information of the audio signal, and determining whether to perform the window recognition operation based on the audio segment.
4. The method according to claim 1 or 3, wherein the determining whether to perform a window recognition operation based on the audio segment comprises performing a window recognition operation if any of the following events occurs:
matching the voice information extracted from the audio clip with the first voice information; and/or the presence of a gas in the gas,
the duration of the audio segment is greater than or equal to a first duration threshold.
5. Method according to claim 3 or 4, characterized in that the first speech information is arranged according to a vocabulary characterizing the first control commands.
6. The method of any of claims 1-5, wherein the determining whether to perform a window identification operation based on the audio clip comprises:
and judging whether a preset audio event occurs or not based on the audio clip.
7. The method of claim 6, wherein the audio event comprises: one or more of the following sounds are present in the audio piece: tapping sound, clapping sound.
8. The method according to any one of claims 2-7, further comprising:
judging whether the voice information is matched with second voice information;
and if so, executing the step of intercepting the audio frequency segment according to the audio frequency characteristic information of the audio frequency signal and judging whether to execute window identification operation or not based on the audio frequency segment aiming at the audio frequency signal behind the audio frequency segment.
9. The method according to any one of claims 1-8, further comprising: in performing the step of moving a sampling window in the audio signal following the audio segment and performing speech recognition on the audio signal within the sampling window,
and in response to the switching trigger event, for the audio signal after the current position of the sampling window, executing the step of intercepting an audio clip according to the audio characteristic information of the audio signal, and judging whether to execute window identification operation or not based on the audio clip.
10. The method of claim 9, wherein the handover trigger event comprises any one of:
detecting voice information matched with third voice information in the process of carrying out voice recognition on the audio signal in the sampling window;
the moving times of the sampling window reach the maximum moving times; and
and moving the sampling window and enabling the processing time length of voice recognition of the audio signals in the sampling window to reach a second time length threshold value.
11. The method according to claim 8 or 10, characterized in that the second speech information or the third speech information is arranged according to a vocabulary characterizing a second control instruction;
wherein the second control instruction has any of the following characteristics:
suspending the execution of other control instructions during the execution of the second control instruction by the processor;
the second control instruction characterizes a pause in voice interaction with the user.
12. The method according to claim 10 or 11, wherein, in performing the steps of moving a sampling window in the audio signal after the audio piece and performing speech recognition on the audio signal within the sampling window,
and dynamically adjusting the maximum moving times and/or the moving step length of the sampling window based on the recognition result of performing voice recognition on the audio signal of the sampling window.
13. The method of claim 12, further comprising:
and dynamically adjusting the maximum moving times and/or the moving step length of the sampling window based on the number of the detected instructions in the identification result.
14. The method of claim 13, wherein dynamically adjusting the maximum number of moves of the sampling window based on the number of detected instructions comprises:
in response to the number of detected instructions exceeding a first preset value, adjusting the maximum number of moves of the sampling window from a first value to a second value greater than the first value.
15. The method of claim 13 or 14, wherein dynamically adjusting the moving step size of the sampling window based on the number of detected instructions comprises:
in response to the number of detected instructions exceeding a second preset value, adjusting the moving step size of the sampling window from a first step size to a second step size smaller than the first step size.
16. The method of any of claims 13-15, wherein dynamically adjusting the moving step size of the sampling window based on the number of detected instructions comprises:
in response to the number of detected instructions being lower than a third preset value, adjusting the moving step size of the sampling window from a first step size to a third step size greater than the first step size.
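The dynamic adjustments recited in claims 14-16 can be summarized in a hypothetical sketch. The function name, parameter names, and all numeric preset values below are illustrative assumptions, not part of the claims:

```python
def adjust_window_params(num_instructions, max_moves, step,
                         first_preset=3,    # claim 14's first preset value (assumed)
                         second_preset=3,   # claim 15's second preset value (assumed)
                         third_preset=1,    # claim 16's third preset value (assumed)
                         larger_max=10, smaller_step=0.1, larger_step=0.4):
    """Return adjusted (max_moves, step) for the sampling window."""
    # Many detected instructions: allow more window moves (claim 14)
    # and move with a finer step (claim 15); few detected instructions:
    # move with a coarser step (claim 16).
    if num_instructions > first_preset:
        max_moves = larger_max       # second value greater than the first value
    if num_instructions > second_preset:
        step = smaller_step          # second step size smaller than the first
    elif num_instructions < third_preset:
        step = larger_step           # third step size greater than the first
    return max_moves, step
```

The intuition is that a dense run of commands justifies spending more recognition effort (more moves, finer steps), while silence or non-command audio justifies skipping ahead faster.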
17. An audio processing apparatus, comprising: a processor and a memory;
the memory is configured to store program code; and
the processor is configured to invoke the program code and, when the program code is executed, to:
intercept an audio segment from the audio signal according to audio feature information of the audio signal; and
determine, based on the audio segment, whether to perform a window recognition operation;
wherein the window recognition operation comprises: moving a sampling window in the audio signal following the audio segment and performing speech recognition on the audio signal within the sampling window.
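The window recognition operation recited above amounts to a sliding-window pass over the audio that follows the intercepted segment. A minimal illustrative sketch follows; the recognizer is a stand-in callable, and none of the names below come from the disclosure:

```python
def window_identification(samples, window_len, step, recognize):
    """Yield (start_index, recognition_result) for each sampling-window position.

    samples    : sequence of audio samples following the intercepted segment
    window_len : sampling-window length, in samples
    step       : moving step size, in samples
    recognize  : stand-in for a speech recognizer applied to one window
    """
    start = 0
    while start + window_len <= len(samples):
        window = samples[start:start + window_len]
        yield start, recognize(window)
        start += step  # move the sampling window by the step size
```

In practice the loop would also stop on a switching trigger event (claims 9-10), handing control back to the segment-interception stage.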
18. The apparatus of claim 17, wherein the processor is further configured to:
extract voice information from the audio segment; and
determine, based on the extracted voice information, whether to perform the window recognition operation.
19. The apparatus of claim 17, wherein the processor is further configured to:
extract voice information from the audio segment;
determine whether the voice information matches first voice information;
if not, determine whether the duration of the audio segment is greater than or equal to a first duration threshold; and
if the duration is less than the first duration threshold, perform, for the audio signal following the audio segment, the steps of intercepting an audio segment according to the audio feature information of the audio signal and determining, based on the audio segment, whether to perform the window recognition operation.
20. The apparatus of claim 17 or 19, wherein the processor is further configured to perform the window recognition operation if any of the following occurs:
the voice information extracted from the audio segment matches the first voice information; and/or
the duration of the audio segment is greater than or equal to the first duration threshold.
21. The apparatus of claim 19 or 20, wherein the first voice information is arranged according to a vocabulary characterizing a first control instruction.
22. The apparatus according to any of claims 17-21, wherein the processor is further configured to:
determine, based on the audio segment, whether a preset audio event has occurred.
23. The apparatus of claim 22, wherein the audio event comprises the presence of one or more of the following sounds in the audio segment: a tapping sound, a clapping sound.
24. The apparatus according to any of claims 17-23, wherein the processor is further configured to:
determine whether the voice information matches second voice information; and
if so, perform, for the audio signal following the audio segment, the steps of intercepting an audio segment according to the audio feature information of the audio signal and determining, based on the audio segment, whether to perform the window recognition operation.
25. The apparatus of any of claims 17-24, wherein the processor is further configured to: while performing the steps of moving a sampling window in the audio signal following the audio segment and performing speech recognition on the audio signal within the sampling window,
in response to a switching trigger event, perform, for the audio signal following the current position of the sampling window, the steps of intercepting an audio segment according to the audio feature information of the audio signal and determining, based on the audio segment, whether to perform the window recognition operation.
26. The apparatus of claim 25, wherein the switching trigger event comprises any one of:
voice information matching third voice information is detected while performing speech recognition on the audio signal within the sampling window;
the number of moves of the sampling window reaches a maximum number of moves; and
the processing duration of moving the sampling window and performing speech recognition on the audio signal within the sampling window reaches a second duration threshold.
27. The apparatus of claim 24 or 26, wherein the second voice information or the third voice information is arranged according to a vocabulary characterizing a second control instruction;
wherein the second control instruction has any of the following characteristics:
the processor suspends execution of other control instructions while executing the second control instruction; and
the second control instruction indicates a pause in the voice interaction with the user.
28. The apparatus of claim 26 or 27, wherein, in performing the steps of moving a sampling window in the audio signal following the audio segment and performing speech recognition on the audio signal within the sampling window,
the maximum number of moves and/or the moving step size of the sampling window are dynamically adjusted based on a recognition result of performing speech recognition on the audio signal within the sampling window.
29. The apparatus of claim 28, wherein the processor is further configured to:
dynamically adjust the maximum number of moves and/or the moving step size of the sampling window based on the number of instructions detected in the recognition result.
30. The apparatus of claim 29, wherein the processor is further configured to:
in response to the number of detected instructions exceeding a first preset value, adjust the maximum number of moves of the sampling window from a first value to a second value greater than the first value.
31. The apparatus of claim 29 or 30, wherein the processor is further configured to:
in response to the number of detected instructions exceeding a second preset value, adjust the moving step size of the sampling window from a first step size to a second step size smaller than the first step size.
32. The apparatus according to any of claims 29-31, wherein the processor is further configured to:
in response to the number of detected instructions being lower than a third preset value, adjust the moving step size of the sampling window from a first step size to a third step size greater than the first step size.
33. A computer-readable storage medium having stored therein a computer program, the computer program comprising at least one code segment executable by a computer to control the computer to perform the audio processing method of any one of claims 1 to 16.
34. A computer program which, when executed by a computer, implements the audio processing method of any one of claims 1 to 16.
CN202080004356.6A 2020-01-20 2020-01-20 Audio processing method and device Pending CN112543972A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073292 WO2021146857A1 (en) 2020-01-20 2020-01-20 Audio processing method and device

Publications (1)

Publication Number Publication Date
CN112543972A true CN112543972A (en) 2021-03-23

Family

ID=75017359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080004356.6A Pending CN112543972A (en) 2020-01-20 2020-01-20 Audio processing method and device

Country Status (2)

Country Link
CN (1) CN112543972A (en)
WO (1) WO2021146857A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221330A1 (en) * 2011-02-25 2012-08-30 Microsoft Corporation Leveraging speech recognizer feedback for voice activity detection
CN106601230A (en) * 2016-12-19 2017-04-26 苏州金峰物联网技术有限公司 Logistics sorting place name speech recognition method, system and logistics sorting system based on continuous Gaussian mixture HMM
CN110277087A (en) * 2019-07-03 2019-09-24 四川大学 A kind of broadcast singal anticipation preprocess method
US10431242B1 (en) * 2017-11-02 2019-10-01 Gopro, Inc. Systems and methods for identifying speech based on spectral features

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8190440B2 (en) * 2008-02-29 2012-05-29 Broadcom Corporation Sub-band codec with native voice activity detection
CN107305774B (en) * 2016-04-22 2020-11-03 腾讯科技(深圳)有限公司 Voice detection method and device
CN109545188B (en) * 2018-12-07 2021-07-09 深圳市友杰智新科技有限公司 Real-time voice endpoint detection method and device


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238607A (en) * 2021-12-17 2022-03-25 北京斗米优聘科技发展有限公司 Deep interactive AI intelligent job-searching consultant method, system and storage medium
CN114238607B (en) * 2021-12-17 2022-11-22 北京斗米优聘科技发展有限公司 Deep interactive AI intelligent job-searching consultant method, system and storage medium

Also Published As

Publication number Publication date
WO2021146857A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
US11710478B2 (en) Pre-wakeword speech processing
US9818407B1 (en) Distributed endpointing for speech recognition
US10134425B1 (en) Direction-based speech endpointing
JP6171617B2 (en) Response target speech determination apparatus, response target speech determination method, and response target speech determination program
CN105190746B (en) Method and apparatus for detecting target keyword
CN1950882B (en) Detection of end of utterance in speech recognition system
US7941313B2 (en) System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system
US11651780B2 (en) Direction based end-pointing for speech recognition
CN108172242B (en) Improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method
JP3886024B2 (en) Voice recognition apparatus and information processing apparatus using the same
WO2015047815A1 (en) Speech recognizer with multi-directional decoding
EP2898510B1 (en) Method, system and computer program for adaptive control of gain applied to an audio signal
CN108346425A (en) A kind of method and apparatus of voice activity detection, the method and apparatus of speech recognition
US11341988B1 (en) Hybrid learning-based and statistical processing techniques for voice activity detection
CN111816216A (en) Voice activity detection method and device
JP5988077B2 (en) Utterance section detection apparatus and computer program for detecting an utterance section
CN112543972A (en) Audio processing method and device
CN112189232A (en) Audio processing method and device
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
Varela et al. Combining pulse-based features for rejecting far-field speech in a HMM-based voice activity detector
CN111739515A (en) Voice recognition method, device, electronic device, server and related system
CN112614506B (en) Voice activation detection method and device
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
CN115831109A (en) Voice awakening method and device, storage medium and electronic equipment
CN111354358B (en) Control method, voice interaction device, voice recognition server, storage medium, and control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination