WO2021253235A1

WO2021253235A1 - Voice activity detection method and apparatus

Info

Publication number: WO2021253235A1
Application number: PCT/CN2020/096392
Authority: WO
Inventors: 柯波; 任博; 鄢展鹏; 王纪会
Original assignee: 华为技术有限公司
Priority date: 2020-06-16
Filing date: 2020-06-16
Publication date: 2021-12-23
Also published as: CN115699173A

Abstract

A voice activity detection (VAD) method and apparatus, which relate to the field of voice detection and are used for improving the accuracy of VAD. The voice activity detection method comprises: acquiring N paths of audio data per frame, wherein N is an integer greater than or equal to 2 (S501); for each frame, calculating an autocorrelation coefficient of each path of audio data on a high-frequency sub-band (S502); and for each frame, selecting, according to autocorrelation coefficients of the N paths of audio data, at least one path of audio data in the N paths of audio data and performing VAD on the path of audio data (S503).

Description

Voice activity detection method and device

Technical field

This application relates to the field of voice detection, and in particular to a voice activity detection (VAD) method and device.

Background technique

There are more and more terminal devices with voice wake-up function in the market, such as mobile phones, smart speakers, smart watches, etc. The terminal device acquires audio data through a microphone (MIC), and continuously performs voice wake-up recognition on the audio data. When the wake-up word spoken by the user is detected, the terminal device switches to a working state and waits for further voice commands from the user. For example, when the user says "Xiaoyi Xiaoyi", the voice wake-up application (APP) of the mobile phone will respond and prompt the user to speak further voice commands.

The core algorithm of voice wake-up is voice recognition. The object of voice recognition is an effective voice signal. If you do not distinguish whether voice signals are included, and perform voice wake-up recognition on all input audio data, the recognition effect will be poor and power consumption will increase. . For this reason, VAD can be used to find the start and end points and end points of the voice signal from the input audio data, so as to further extract the characteristics of the voice signal. Therefore, VAD can also be called voice endpoint detection and voice boundary detection.

In order to reduce environmental interference and perform active noise reduction, the terminal device can be equipped with multiple microphones, including a main microphone and a noise reduction microphone. The terminal equipment generally uses the audio data collected through the main microphone for VAD. When the main microphone is blocked, the energy of the audio data is too low and the voice information is lost, which will affect the accuracy of VAD.

Summary of the invention

The embodiments of the present application provide a VAD method and device, which are used to improve the accuracy of VAD.

In order to achieve the foregoing objectives, the following technical solutions are adopted in the embodiments of the present application:

In the first aspect, a VAD method for voice activity detection is provided, including: acquiring N channels of audio data by frame, where N is an integer greater than or equal to 2; For each frame, according to the autocorrelation coefficient of the N channels of audio data, select to perform VAD on at least one channel of the N channels of audio data.

In the voice activity detection method provided by the embodiments of the application, after obtaining N channels of audio data by frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband is calculated for each frame, and the autocorrelation coefficient of each channel of audio data is selected according to the autocorrelation coefficient of the N channels of audio data. VAD is performed on at least one of the N channels of audio data to detect that each frame of audio data includes a voice signal. For each frame, the autocorrelation coefficient of the speech signal is larger than that of the silent signal (or steady-state noise), so that it can be determined whether the frame of audio data may include a speech signal. The selection may include voice signals for VAD to determine whether this frame of audio data may include voice signals. VAD is performed on audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved. In addition, VAD can be performed normally even if some microphones are blocked.

In a possible implementation manner, the N channels of audio data include the i-th channel of audio data, where i is a positive integer less than or equal to N; according to the autocorrelation coefficient of the N channels of audio data, at least one channel of the N channels of audio data is selected to be VAD of the data includes: if the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold, then the i-th audio data is selected for VAD. The autocorrelation coefficient of the speech signal is larger than that of the silent signal (or steady-state noise), so that it can be determined whether the frame of audio data may include a speech signal. VAD is performed on audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved.

In a possible implementation manner, selecting to perform VAD on at least one of the N channels of audio data according to the autocorrelation coefficients of the N channels of audio data includes: according to the autocorrelation coefficients of the N channels of audio data and the N channels of audio data In the energy value of the high frequency subband, VAD is selected to perform VAD on at least one of the N channels of audio data. The penetration ability of the high-frequency subband is weak. If the microphone is blocked, the corresponding energy value will be very low, so it can be determined whether the microphone may be blocked. Select audio data that may include a voice signal and the corresponding microphone may not be blocked for VAD to determine whether this frame of audio data may include a voice signal. It avoids performing VAD on the audio data of the blocked microphone, but performs VAD on the audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved.

In a possible implementation, the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, i≠j, i and j are positive integers less than or equal to N; according to the autocorrelation coefficient of the N channels of audio data And the energy value of the N channels of audio data in the high-frequency subband, selecting to perform VAD on at least one of the N channels of audio data, including: if the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold, the j-th audio If the energy value of the data is less than the first energy threshold, and the difference between the energy value of the i-th audio data and the j-th audio data is greater than the second energy threshold, the i-th audio data is selected for VAD. That is, it is detected that the microphone corresponding to the j-th channel of audio data may be blocked, the i-th channel of audio data may include a voice signal and the corresponding microphone may not be blocked.

In a possible implementation manner, the method further includes: when the number of frames detected as including the voice signal in the M frame meets the condition, determining that the M frame includes the voice signal, and M is a positive integer. At this time, M frames of audio data can be used in various aspects such as voice wake-up, voice detection and recognition, on the one hand, it can improve the accuracy rate, on the other hand, it can reduce power consumption.

In a possible implementation manner, the number of frames detected as including a voice signal in M frames meets the condition, including: at least m1 frames in the M frame are detected as including a voice signal, and m1 is less than or equal to M.

In a possible implementation manner, the number of frames detected as including a voice signal in M frames meets the condition, including: at least consecutive m2 frames in the M frame are detected as including a voice signal, and m2 is less than or equal to M.

In a possible implementation manner, the method further includes: performing voice wake-up recognition on N channels of audio data of M frames when the number of frames detected as including a voice signal in the M frames meets the condition. At this time, the M frames include voice signals, and performing voice wake-up recognition at this time can improve accuracy and reduce power consumption.

In a possible implementation manner, each channel of audio data is collected by a microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones. Can reduce power consumption.

In a second aspect, a VAD device for voice activity detection is provided, including: an acquisition module for acquiring N channels of audio data by frame, where N is an integer greater than or equal to 2; and a calculation module for each frame, Calculate the autocorrelation coefficient of each channel of audio data in the high-frequency subband; the selection module is used to select VAD for at least one channel of the N channels of audio data according to the autocorrelation coefficient of the N channels of audio data for each frame.

In a possible implementation manner, the selection module is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, select the i-th channel of audio data for VAD.

In a possible implementation manner, the selection module is specifically configured to: according to the autocorrelation coefficient of the N channels of audio data and the energy value of the N channels of audio data in the high-frequency subband, select the audio data for at least one channel of the N channels of audio data. Perform VAD.

In a possible implementation manner, the N channels of audio data include the i-th audio data and the j-th audio data, i≠j, i and j are positive integers less than or equal to N; the selection module is specifically used to: The autocorrelation coefficient of channel audio data is greater than the correlation coefficient threshold, the energy value of the j-th audio data is less than the first energy threshold, and the difference between the energy value of the i-th audio data and the energy value of the j-th audio data is greater than the second energy Threshold, select the i-th audio data for VAD.

In a possible implementation manner, a determining module is further included, configured to: when the number of frames detected as including the voice signal in the M frame meets the condition, determine that the M frame includes the voice signal, and M is a positive integer.

In a possible implementation manner, a voice wake-up module is further included, which is configured to perform voice wake-up recognition on N channels of audio data of M frames when the number of frames detected as including voice signals in the M frame meets the condition.

In a possible implementation manner, each channel of audio data is collected by a microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones.

In a third aspect, a voice activity detection device is provided, including a processor, the processor is connected to a memory, the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device executes the same as in the first aspect and The method according to any one of the embodiments.

In a fourth aspect, a computer-readable storage medium is provided, and a computer program is stored in the computer-readable storage medium, and when it runs on a computer, the computer can execute the method described in the first aspect and any one of its implementations. method.

In a fifth aspect, a computer program product containing instructions is provided. When the instructions are executed on a computer or a processor, the computer or the processor executes the method described in the first aspect and any one of the implementation manners.

The technical effects of the second aspect to the fifth aspect refer to the content of the first aspect and any of the embodiments, and will not be repeated here.

Description of the drawings

FIG. 1 is a schematic diagram of voice wake-up in a black screen scenario provided by an embodiment of the application;

2 is a schematic diagram of voice wake-up in a lock screen scenario provided by an embodiment of the application;

FIG. 3 is a schematic structural diagram of a terminal device provided by an embodiment of this application;

FIG. 4 is a schematic diagram of a main microphone and a noise reduction microphone of a terminal device according to an embodiment of the application;

FIG. 5 is a schematic flowchart of a VAD method provided by an embodiment of this application;

FIG. 6 is a schematic structural diagram of another terminal device provided by an embodiment of this application;

FIG. 7 is a schematic flowchart of another VAD method provided by an embodiment of the application;

FIG. 8 is a schematic structural diagram of a voice activity detection device provided by an embodiment of this application;

FIG. 9 is a schematic structural diagram of another voice activity detection device provided by an embodiment of this application.

detailed description

As used in this application, the terms "component", "module", "system", etc. are intended to refer to a computer-related entity, which can be hardware, firmware, a combination of hardware and software, software, or running software. software. For example, a component may be, but is not limited to: a process running on a processor, a processor, an object, an executable file, an executing thread, a program, and/or a computer. As an example, both an application running on a computing device and the computing device may be components. One or more components may exist in an executing process and/or thread, and the components may be located in one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures thereon. These components can be based on, for example, having one or more data packets (for example, data from a component that interacts with another component in a local system, a distributed system, and/or via signals such as the Internet). The network interacts with other systems) signals to communicate in a local and/or remote process.

This application will present various aspects, embodiments, or features around a system that may include multiple devices, components, modules, and the like. It should be understood and understood that each system may include additional devices, components, modules, etc., and/or may not include all the devices, components, modules, etc. discussed in conjunction with the accompanying drawings. In addition, a combination of these schemes can also be used.

In addition, in the embodiments of the present application, the term "exemplary" is used to mean serving as an example, illustration, or illustration. Any embodiment or design solution described as an "example" in this application should not be construed as being more preferable or advantageous than other embodiments or design solutions. To be precise, the term example is used to present the concept in a concrete way.

In the embodiments of this application, information, signal, message, and channel can sometimes be used together. It should be noted that the meanings to be expressed are the same when the differences are not emphasized. "的 (of)", "corresponding (relevant)" and "corresponding (corresponding)" can sometimes be used together. It should be pointed out that the meanings to be expressed are the same when the difference is not emphasized.

Voice wakeup means that the terminal device enters the dormant state after it is started. When the user speaks a specific wake-up word, the terminal device will be awakened, switch to the working state, and wait to receive further voice commands from the user. In this process, users do not need to touch with their hands, and can directly operate with voice. At the same time, using the voice wake-up mechanism, the terminal device does not need to be in the working state all the time, thereby saving energy consumption. Different terminal devices have different wake-up words, and when users need to wake up the device, they need to speak a specific wake-up word.

Voice wake-up has a wide range of application areas, such as robots, mobile phones, wearable devices, smart homes, and in-vehicle devices. Almost many terminal devices with voice functions will need voice wake-up technology as the entrance to human-computer interaction.

Exemplarily, taking the terminal device as a mobile phone as an example, when the voice wake-up function is turned on, in the black screen scenario as shown in Figure 1 or in the lock screen scenario as shown in Figure 2, the mobile phone detects that the user speaks a specific wake-up The word "Xiaoyi Xiaoyi", the voice wake-up APP of the mobile phone is awakened, the display interface of the voice wake-up APP is displayed, and the user is prompted to speak further voice commands, such as displaying text or playing a sound "Hello, how can I help? you".

The terminal device involved in the embodiment of the present application may be a device that includes a wireless transceiver function and can cooperate with a network device to provide communication services for users. Specifically, terminal equipment may refer to user equipment (UE), access terminal, user unit, user station, mobile station, mobile station, remote station, remote terminal, mobile equipment, user terminal, terminal, wireless communication device, User agent or user device. For example, the terminal device can be a mobile phone, a smart speaker, a smart watch, a handheld device with wireless communication function, a computing device or other processing device connected to a wireless modem, a robot, a drone, a smart driving vehicle, a smart home, a vehicle-mounted device, Medical equipment, smart logistics equipment, wearable equipment, terminal equipment in future 5G networks or networks after 5G, etc., are not limited in the embodiment of the present application.

As shown in Fig. 3, the terminal device is taken as an example to illustrate the structure of the terminal device.

The terminal device 100 may include: a radio frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (Wi-Fi) module 170, and a processor 180, Bluetooth module 181, power supply 190 and other components.

The RF circuit 110 can be used to receive and send signals in the process of sending and receiving information or talking. It can receive downlink data from the base station and then forward it to the processor 180 for processing; and can send uplink data to the base station. Generally, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and other devices.

The memory 120 can be used to store software programs and data. The processor 180 executes various functions and data processing of the terminal device 100 by running a software program or data stored in the memory 120. The memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. The memory 120 stores an operating system that enables the terminal device 100 to run, such as the one developed by Apple

Operating system, developed by Google

Open source operating system, developed by Microsoft

Operating system, etc. In the present application, the memory 120 may store an operating system and various application programs, and may also store codes for executing the methods in the embodiments of the present application.

The input unit 130 (for example, a touch screen) may be used to receive input digital or character information, and generate signal input related to user settings and function control of the terminal device 100. Specifically, the input unit 130 may include a touch screen 131 provided on the front of the terminal device 100, and may collect user touch operations on or near it.

The display unit 140 (ie, display screen) may be used to display information input by the user or information provided to the user, as well as a graphical user interface (GUI) of various menus of the terminal device 100. The display unit 140 may include a display screen 141 provided on the front of the terminal device 100. Among them, the display screen 141 may be configured in the form of a liquid crystal display, a light emitting diode, or the like. The display unit 140 may be used to display various graphical user interfaces described in this application. The touch screen 131 may be covered on the display screen 141, or the touch screen 131 and the display screen 141 may be integrated to realize the input and output functions of the terminal device 100. After integration, it may be referred to as a touch display screen.

The terminal device 100 may also include at least one sensor 150, such as an acceleration sensor 155, a light sensor, and a motion sensor. The terminal device 100 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor.

Wi-Fi is a short-distance wireless transmission technology. The terminal device 100 can help users send and receive emails, browse webpages, and access streaming media through the Wi-Fi module 170. It provides users with wireless broadband Internet access.

The processor 180 is the control center of the terminal device 100. It uses various interfaces and lines to connect various parts of the entire terminal, and executes the terminal device by running or executing software programs stored in the memory 120 and calling data stored in the memory 120. 100's various functions and processing data. The processor 180 in this application may refer to one or more processors, and the processor 180 may include one or more processing units; the processor 180 may also integrate an application processor and a baseband processor, where the application processor mainly processes operations For systems, user interfaces, and applications, the baseband processor mainly handles wireless communications. It can be understood that the aforementioned baseband processor may not be integrated into the processor 180. The processor 180 in this application can run an operating system, application programs, user interface display and touch response, as well as the communication method described in the embodiments of this application.

The Bluetooth module 181 is used to exchange information with other Bluetooth devices having a Bluetooth module through the Bluetooth protocol. For example, the terminal device 100 can establish a Bluetooth connection with a wearable electronic device (such as a smart watch) that also has a Bluetooth module through the Bluetooth module 181, so as to perform data interaction.

The terminal device 100 also includes a power source 190 (such as a battery) for supplying power to various components. The power supply can be logically connected to the processor 180 through the power management system, so that functions such as charging, discharging, and power consumption can be managed through the power management system.

The audio circuit 160, the speaker 161, and the microphone 162 may provide an audio interface between the user and the terminal device 100. The audio circuit 160 can transmit the electrical signal converted from the received audio data to the speaker 161, which is converted into a sound signal for output by the speaker 161; on the other hand, the microphone 162 converts the collected sound signal into an electrical signal, and the audio circuit 160 After being received, it is converted into audio data, and then the audio data is output to the RF circuit 110 to be sent to, for example, another terminal, or the audio data is output to the memory 120 for further processing.

The terminal device may include multiple microphones. As shown in FIG. 4, taking the terminal device 200 as a mobile phone as an example, the lower end of the terminal device 200 may include at least one microphone 201, which may serve as the main microphone, and the upper end of the terminal device may include at least one microphone 202. , Can be used as a noise reduction microphone.

As mentioned above, the prior art generally uses audio data collected through the main microphone for VAD when performing VAD on audio data. When the main microphone is blocked, the energy of the collected audio data is too low, which will affect VAD. The accuracy rate.

The embodiment of the present application provides a VAD method, which performs VAD by selecting at least one channel of audio data with a higher autocorrelation coefficient (and higher energy value) of a high-frequency subband from multiple channels of audio data collected by multiple microphones. Thereby improving the accuracy of VAD. Further, the detection result can be applied to voice wake-up recognition.

As shown in Figure 5, the VAD method includes:

S501: The terminal device obtains N channels of audio data according to frames.

As shown in FIG. 6, it is assumed that the terminal device includes at least N microphones 601, at least N analog to digital converters (ADC) 602, and a processor 603. N is an integer greater than or equal to 2. Optionally, a buffer 604 may also be included. The output end of each microphone 601 is electrically connected to the input end of an analog-to-digital converter (ADC) 602, and the output end of each ADC 602 is electrically connected to the input end of the VAD module 6031 in the processor 603, and the VAD module 6031 may be a hardware circuit in the processor 603. Then the VAD module 6031 can perform this step.

Wherein, each channel of audio data can be collected by one microphone, or can be collected by multiple microphones and then synthesized, or can be obtained in other ways, such as from other devices, which is not limited in this application.

It should be noted that the N microphones can be selected from multiple microphones with the highest signal-to-noise ratio, which can reduce power consumption.

The analog audio signal collected by each microphone 601 undergoes analog-to-digital conversion by the corresponding ADC 602 to obtain digital audio data.

Because the voice signal has short-term stability, it can be considered that the voice signal in the range of 10-30ms is stable, so that the audio data can be divided into frames, and VAD is performed on each channel of audio data by frame. Generally 5-20ms is a frame, and the frame is divided by overlapping segmentation, that is, there is overlap between the previous frame and the next frame. The overlapping part is called frame shift, and the ratio of frame shift to frame length is generally 0～0.5. For example, assuming 20ms as a frame, if the sampling frequency of ADC 602 is 8KHz, then in one frame, each channel of audio data corresponds to 160 sampling points; if the sampling frequency of ADC 602 is 16KHz, then in a frame, every Channel audio data corresponds to 320 sampling points. Each microphone 601 of the N microphones 601 collects one channel of audio signal, and N channels of audio data can be obtained in total.

The N channels of audio data can be sent to the VAD module 6031 for VAD on the one hand, and can be stored in the buffer 604 on the other hand to prevent loss of audio data, for example, for subsequent voice wake-up recognition. The storage depth of the buffer 604 is determined by the delay of the VAD algorithm, and the greater the delay, the deeper the storage depth.

S502. The terminal device calculates the autocorrelation coefficient of each channel of audio data in the high-frequency subband for each frame.

This step can be performed by the VAD module 6031.

Assuming that the sampling frequency of ADC 602 is 8KHz, that is, the analog audio signal gets 8KHz digital audio data after analog-to-digital conversion and data filtering. According to the sampling theorem, the sampling frequency is at least twice the frequency of the original signal to correctly restore the original signal. Therefore, The frequency bandwidth of the analog audio signal that can be represented by 8KHz digital audio data is 0-4KHz.

Therefore, the audio data can be sub-band filtered to obtain the audio data of the high-frequency sub-band. Still taking the sampling frequency of 8KHz as an example, the audio data of 0-4KHz is divided into multiple sub-bands, for example, divided into four sub-bands. 0-4KHz can be divided into 0-1KHz, 1KHz-2KHz, 2KHz-3KHz, 3KHz-4KHz, and the high frequency subband refers to 2-4KHz.

In a possible implementation manner, for a piece of audio data of the nth frame, it can be based on two adjacent frames (for example, the nth frame and the n-1th frame) or three adjacent frames (for example, the nth frame and the n-1th frame). n-1 frame and n-2th frame) audio data to calculate the autocorrelation coefficient of one channel of audio data of the nth frame in the high-frequency subband. The autocorrelation coefficient characterizes the degree of correlation (ie similarity) of audio data at two different moments. When the audio data at two different moments have the same periodic component, the maximum value of the autocorrelation coefficient reflects this period. Compared with the voice signal, the silent signal (or steady-state noise) has poorer autocorrelation, and the autocorrelation coefficient is relatively low. When the autocorrelation coefficient of a channel of audio data in the high-frequency subband is relatively large, it can be determined that the channel of audio data may include a voice signal.

It should be noted that this article uses a similar description of "possibly" because one frame of audio data has a small amount of data, and there may be differences between different frames. It is necessary to combine multiple frames to finally determine that the microphone is really not blocked. And audio data includes voice signals.

Exemplarily, the above-mentioned autocorrelation coefficient r _xx can be obtained by formula 1.

Among them, τ represents the time delay between sampling points. When τ is 1～N, it means the autocorrelation coefficient between two adjacent frames of the nth frame. When τ is 1～2N, it means the first The autocorrelation coefficient between three adjacent frames of n frames. The meaning of this formula is to get the autocorrelation coefficient when the value range of τ is traversed to maximize the calculation result of formula 1. energy(N) represents the energy of one channel of audio data of one frame in the high-frequency subband, and its common calculation formula is shown in formula 2:

Among them, N is the number of sampling points of one frame of audio data, and x(i) is the value of the i-th sampling point.

S503: For each frame, the terminal device selects to perform VAD on at least one channel of the N channels of audio data according to the autocorrelation coefficient of the N channels of audio data.

Specifically, suppose that in a certain frame, N channels of audio data include the i-th channel of audio data, and i is a positive integer less than or equal to N. If the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold (for example, 0.6), the i-th audio data is selected for VAD.

Optionally, if the microphone is blocked, due to the weak penetration of high-frequency signals, the energy value of the obtained audio data in the high-frequency subband (for example, 2-4KHz) will be lower. The energy value of the subband determines whether the microphone may be blocked. Therefore, for each frame, the terminal device can select at least one of the N channels of audio data according to the autocorrelation coefficient of the N channels of audio data in the high frequency subband and the energy value of the N channels of audio data in the high frequency subband. Perform VAD.

Specifically, suppose that in a certain frame, N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, where i≠j, and i and j are positive integers less than or equal to N.

If the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold (e.g. 0.6), the energy value of the j-th audio data is less than the first energy threshold (e.g. -50dB), and the energy value of the i-th audio data is the same as the j-th If the difference between the energy values of the audio data is greater than the second energy threshold (for example, 20dB), the i-th audio data is selected for VAD.

It should be noted that the above calculations must be performed on N channels of audio data for each frame, and the selected audio data may be different for each frame. For example, the first frame may select the first channel of audio data for VAD, and the next frame may Select the second and third audio data for VAD.

The autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, that is, the auto-correlation coefficient of the i-th channel of audio data is larger, and it can be determined that the i-th channel of audio data may include a voice signal.

For the energy value of the j-th audio data being less than the first energy threshold, and the difference between the energy value of the i-th audio data and the energy value of the j-th audio data is greater than the second energy threshold, it can be determined that the j-th audio data corresponds to The microphone of may be blocked, and the microphone corresponding to the i-th audio data may not be blocked. If the microphone corresponding to the jth channel of audio data and the microphone corresponding to the ith channel of audio data are both blocked, or the two channels of audio data only include mute signals or steady-state noise, the energy value of the ith channel of audio data is not satisfied The difference between the energy value and the energy value of the j-th channel of audio data is greater than the second energy threshold.

In summary, it can be determined that in this frame, it is detected that the microphone corresponding to the j-th audio data may be blocked, the i-th audio data may include a voice signal and the corresponding microphone may not be blocked.

Exemplarily, assuming N=4, that is, four microphones correspond to four channels of audio data, the energy values of the first to fourth channels of audio data in the high frequency subband are Rms1, Rms2, Rms3, Rms4, and the autocorrelation coefficients are in order Rel1, Rel2, Rel3, Rel4. The magnitude relationship of these energy values satisfies: Rms0<Rms1<Rms2<Rms3; the magnitude relationship of these autocorrelation coefficients satisfies: Rel0<Rel1<Rel2<Rel3. If both Rel3 and Rel4 are greater than the correlation coefficient threshold, Rms0 and Rms1 are both less than the first energy threshold, and Rms2-Rms0, Rms2-Rms1, Rms3-Rms0, and Rms3-Rms1 are all greater than the second energy threshold, then the first channel of audio data can be determined The corresponding microphone and the microphone corresponding to the second channel of audio data may be blocked, and the microphone corresponding to the third channel of audio data and the fourth channel of audio data include voice signals and the corresponding microphone may not be blocked. The first road and the second road here are equivalent to the j-th road described above, and the third and fourth roads are equivalent to the i-th road described above.

For each frame, the process of performing VAD on at least one channel of audio data includes, but is not limited to: calculating the voice level, calculating the signal-to-noise ratio, calculating the voice level, and calculating the cross-correlation coefficient.

The speech level is used to characterize the amplitude of the signal, and the speech level rms can be obtained by formula 3:

rms=20*log10(energy(N)) Formula 3

The signal-to-noise ratio is used to characterize the proportion of signal and noise in speech. The signal-to-noise ratio snr can be obtained by formula 4:

The speech level is used to characterize the confidence level of the speech, and the speech level SpeechLevel can be obtained by formula 5:

The cross-correlation coefficient is used to characterize the similarity level of speech, and the cross-correlation coefficient r _xy can be obtained by formula 6:

When the speech level, signal-to-noise ratio, speech level, and cross-correlation coefficient all exceed the corresponding threshold, it can be determined that the audio data of the frame includes a speech signal.

Optionally, as shown in Figure 7, the VAD method further includes:

S701: When the number of frames detected as including the voice signal in the M frame meets the condition, determine that the M frame includes the voice signal.

Among them, M is a positive integer. Exemplary M=20.

In a possible implementation manner, if at least m1 (for example, 12) frames in the M frames are detected as including a voice signal, it is determined that the M frame includes a voice signal. m1 is less than or equal to M. That is, for one channel of audio data, it is not limited that consecutive multiple frames of audio data are detected as including a voice signal, as long as at least m1 frames of audio data in the overall M frames are detected as including a voice signal.

In another possible implementation manner, if at least consecutive m2 (for example, 8) frames in the M frame are detected as including a voice signal, it is determined that the M frame includes a voice signal. m2 is less than or equal to M. That is, for one channel of audio data, at least m2 consecutive frames of audio data are detected as including voice signals.

Optionally, as shown in Figure 7, the VAD method further includes:

S702: When the number of frames detected as including the voice signal in the M frame meets the condition, perform voice wake-up recognition on the N channels of audio data of the M frame.

Voice wake-up recognition can be implemented by the processor 603 shown in FIG. 6 executing software. When the number of times that at least one of the audio data in the M frame is detected as including a voice signal meets the condition, the VAD module 6031 can generate an interrupt, thereby triggering the processor 603 performs voice wake-up recognition on N channels of audio data of M frames.

That is to say, when it is determined that the audio data of M frames includes a voice signal, the voice wake-up recognition is performed on the N channels of audio data of the M frame. If the voice wake-up recognition is successful, the voice wake-up display interface is displayed, and the user is prompted to say further Voice commands. When the above conditions are not met, it is not necessary to perform voice wake-up recognition on audio data, which can save energy consumption and reduce the false alarm rate of voice wake-up.

It can be understood that, in each of the above embodiments, the methods and/or steps implemented by the terminal device may also be implemented by components (for example, a chip or a circuit) that can be used for the terminal device.

The embodiment of the present application also provides a voice activity detection device, which is used to implement the above-mentioned various methods. The voice activity detection device may be the terminal device in the foregoing method embodiment, or a device including the foregoing terminal device, or a chip or functional module in the terminal device.

It can be understood that, in order to realize the above-mentioned functions, the voice activity detection device includes hardware structures and/or software modules corresponding to various functions. Those skilled in the art should easily realize that in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of this application.

The embodiments of the present application may divide the voice activity detection device into functional modules according to the foregoing method embodiments. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The above-mentioned integrated modules can be implemented in the form of hardware or software function modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.

For example, take the voice activity detection device as the terminal device in the foregoing method embodiment as an example. FIG. 8 shows a schematic structural diagram of a voice activity detection device 80. The voice activity detection device 80 includes an acquisition module 801, a calculation module 802, a selection module 803, and optionally, a determination module 804 and a voice wake-up module 805. The obtaining module 801 can execute step S501 in FIG. 5 and step S501 in FIG. 7. The calculation module 802 can execute step S502 in FIG. 5 and step S502 in FIG. 7. The selection module 803 can perform step S503 in FIG. 5 and step S503 in FIG. 7. The determining module 804 can execute step S701 in FIG. 7. The voice wake-up module 805 can perform step S702 in FIG. 7.

Exemplarily, the obtaining module 801 is used to obtain N channels of audio data by frame, where N is an integer greater than or equal to 2; the calculation module 802 is used to calculate the high frequency subband of each channel of audio data for each frame The selection module 803 is used for each frame, according to the autocorrelation coefficient of the N channels of audio data, to select at least one channel of audio data in the N channels of audio data for VAD.

In a possible implementation manner, the selection module 803 is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, select the i-th channel of audio data for VAD.

In a possible implementation manner, the selection module 803 is specifically configured to: according to the autocorrelation coefficients of the N channels of audio data and the energy value of the N channels of audio data in the high-frequency subband, select for at least one channel of the N channels of audio data. The data is VAD.

In a possible implementation, the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, i≠j, i and j are positive integers less than or equal to N; the selection module 803 is specifically configured to: The autocorrelation coefficient of the i-channel audio data is greater than the correlation coefficient threshold, the energy value of the j-th audio data is less than the first energy threshold, and the difference between the energy value of the i-th audio data and the energy value of the j-th audio data is greater than the second Energy threshold, select the i-th audio data for VAD.

In a possible implementation manner, the determining module 804 is configured to: when the number of frames detected as including the voice signal in the M frame meets the condition, determine that the M frame includes the voice signal, and M is a positive integer.

In a possible implementation manner, the voice wake-up module 805 is configured to perform voice wake-up recognition on N channels of audio data of M frames when the number of frames detected as including voice signals in the M frame meets the condition.

In this embodiment, the voice activity detection device 80 is presented in the form of dividing various functional modules in an integrated manner. The "module" here may refer to a specific ASIC, a circuit, a processor and memory that executes one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above-mentioned functions.

Specifically, the function/implementation process of each module in FIG. 8 can be implemented by the processor in the terminal device calling the computer execution instructions stored in the memory.

Since the voice activity detection device 80 provided in this embodiment can perform the above-mentioned method, the technical effects that can be obtained can refer to the above-mentioned method embodiment, which will not be repeated here.

As shown in FIG. 9, an embodiment of the present application also provides a voice activity detection device. The voice activity detection device 90 includes a processor 901 and a memory 902. The processor 901 is coupled to the memory 902. When the processor 901 executes the memory 902 When the computer program or instruction is executed, the corresponding method in Fig. 5 and Fig. 7 is executed.

The embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when it runs on a computer or a processor, the computer or the processor executes the steps shown in FIG. 5 and FIG. 7 The corresponding method.

The embodiments of the present application also provide a computer program product containing instructions, which when the instructions run on a computer or a processor, cause the computer or the processor to execute the corresponding methods in FIG. 5 and FIG. 7.

The embodiment of the present application provides a chip system including a processor for the voice activity detection device to execute the corresponding methods in FIG. 5 and FIG. 7.

In a possible design, the chip system also includes a memory for storing necessary program instructions and data. The chip system may include a chip, an integrated circuit, or may include a chip and other discrete devices, which is not specifically limited in the embodiment of the present application.

Among them, the voice activity detection device, chip, computer storage medium, computer program product, or chip system provided in this application are all used to execute the above-mentioned method. Therefore, the beneficial effects that can be achieved can refer to the above-mentioned The beneficial effects in the implementation manner will not be repeated here.

The processor involved in the embodiment of the present application may be a chip. For example, it can be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a system on chip (SoC), or a central processing unit. The central processor unit (CPU) can also be a network processor (NP), a digital signal processing circuit (digital signal processor, DSP), or a microcontroller (microcontroller unit, MCU) It can also be a programmable logic device (PLD) or other integrated chips.

The memory involved in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. Among them, the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), a Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of exemplary but not restrictive description, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), and synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous connection dynamic random access memory (synchlink DRAM, SLDRAM) ) And direct memory bus random access memory (direct rambus RAM, DR RAM). It should be noted that the memories of the systems and methods described herein are intended to include, but are not limited to, these and any other suitable types of memories.

It should be understood that, in the various embodiments of the present application, the size of the sequence numbers of the above-mentioned processes does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not correspond to the embodiments of the present application. The implementation process constitutes any limitation.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the above-described system, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or It can be integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be indirect couplings or communication connections between devices or units through some interfaces, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

In the foregoing embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, it can be implemented in the form of a computer program product in whole or in part. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center. Transmission to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (such as infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer or include one or more data storage devices such as servers, data centers, etc. that can be integrated with the medium. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

The above are only specific implementations of this application, but the protection scope of this application is not limited to this. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in this application. Should be covered within the scope of protection of this application. Therefore, the protection scope of this application should be subject to the protection scope of the claims.

Claims

A VAD method for voice activity detection, which is characterized in that it includes:

Acquire N channels of audio data in frames, where N is an integer greater than or equal to 2;

For each frame, calculate the autocorrelation coefficient of each channel of audio data in the high-frequency subband;

For each frame, according to the autocorrelation coefficient of the N channels of audio data, select to perform VAD on at least one channel of the N channels of audio data.
The method according to claim 1, wherein the N channels of audio data include the i-th channel of audio data, and i is a positive integer less than or equal to N; the autocorrelation coefficient according to the N channels of audio data , Selecting to perform VAD on at least one of the N channels of audio data includes:

If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, then the i-th channel of audio data is selected for VAD.
The method according to claim 1, wherein the selecting to perform VAD on at least one of the N channels of audio data according to the autocorrelation coefficients of the N channels of audio data comprises:

According to the autocorrelation coefficient of the N channels of audio data and the energy value of the N channels of audio data in the high frequency subband, VAD is selected for at least one channel of the N channels of audio data.
The method according to claim 3, wherein the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, i≠j, i and j are positive integers less than or equal to N; The autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband are selected to perform VAD on at least one channel of the N channels of audio data, including:

If the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold, the energy value of the j-th audio data is less than the first energy threshold, and the energy value of the i-th audio data is the same as the energy value of the i-th audio data. If the difference between the energy values of the j channels of audio data is greater than the second energy threshold, the i-th channel of audio data is selected for VAD.
The method according to any one of claims 1-4, further comprising:

When the number of frames detected as including the voice signal in the M frame meets the condition, it is determined that the M frame includes the voice signal, and M is a positive integer.
The method according to claim 5, wherein the number of frames detected as including a voice signal in the M frames meets a condition, comprising:

At least m1 of the M frames is detected as including a voice signal, and m1 is less than or equal to M.
The method according to claim 5, wherein the number of frames detected as including a voice signal in the M frames meets a condition, comprising:

At least consecutive m2 frames in the M frames are detected as including a voice signal, and m2 is less than or equal to M.
The method according to any one of claims 1-7, further comprising:

When the number of frames detected as including a voice signal in the M frame meets the condition, perform voice wake-up recognition on the N channels of audio data of the M frame.
The method according to any one of claims 1-8, wherein each channel of the audio data is collected by a microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones.
A VAD device for voice activity detection, which is characterized in that it comprises:

The acquisition module is used to acquire N channels of audio data by frame, where N is an integer greater than or equal to 2;

The calculation module is used to calculate the autocorrelation coefficient of each channel of audio data in the high-frequency subband for each frame;

The selection module is configured to, for each frame, select to perform VAD on at least one of the N channels of audio data according to the autocorrelation coefficient of the N channels of audio data.
The device according to claim 10, wherein the selection module is specifically configured to:

If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, then the i-th channel of audio data is selected for VAD.
The device according to claim 10, wherein the selection module is specifically configured to:

According to the autocorrelation coefficient of the N channels of audio data and the energy value of the N channels of audio data in the high frequency subband, VAD is selected for at least one channel of the N channels of audio data.
The device according to claim 12, wherein the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, i≠j, i and j are positive integers less than or equal to N; the selection The module is specifically used for:

If the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold, the energy value of the j-th audio data is less than the first energy threshold, and the energy value of the i-th audio data is the same as the energy value of the i-th audio data. If the difference between the energy values of the j channels of audio data is greater than the second energy threshold, the i-th channel of audio data is selected for VAD.
The device according to any one of claims 10-13, further comprising a determining module, configured to:

When the number of frames detected as including the voice signal in the M frame meets the condition, it is determined that the M frame includes the voice signal, and M is a positive integer.
The apparatus according to claim 14, wherein the number of frames detected as including a voice signal in the M frames meets a condition, comprising:

At least m1 of the M frames is detected as including a voice signal, and m1 is less than or equal to M.
The apparatus according to claim 14, wherein the number of frames detected as including a voice signal in the M frames meets a condition, comprising:

At least consecutive m2 frames in the M frames are detected as including a voice signal, and m2 is less than or equal to M.
The device according to any one of claims 10-16, further comprising a voice wake-up module for:

When the number of frames detected as including a voice signal in the M frame meets the condition, perform voice wake-up recognition on the N channels of audio data of the M frame.
The device according to any one of claims 10-17, wherein each channel of the audio data is collected by a microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones.
A voice activity detection device, which is characterized in that it comprises:

Memory, used to store computer programs;

The processor connected to the memory is configured to call the computer program stored in the memory to cause the voice activity detection device to execute the method according to any one of claims 1 to 9.
A computer-readable storage medium, characterized by comprising a computer program, which when the computer program runs on a computer, causes the computer to execute the method according to any one of claims 1 to 9.