WO2021253235A1 - Voice activity detection method and apparatus - Google Patents

Voice activity detection method and apparatus

Info

Publication number
WO2021253235A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
channels
channel
frames
vad
Application number
PCT/CN2020/096392
Other languages
French (fr)
Chinese (zh)
Inventor
柯波
任博
鄢展鹏
王纪会
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to CN202080101920.6A (published as CN115699173A)
Priority to PCT/CN2020/096392 (published as WO2021253235A1)
Publication of WO2021253235A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • This application relates to the field of voice detection, and in particular to a voice activity detection (VAD) method and device.
  • The terminal device acquires audio data through a microphone (MIC) and continuously performs voice wake-up recognition on the audio data.
  • When the wake-up word spoken by the user is detected, the terminal device switches to a working state and waits for further voice commands from the user. For example, when the user says "Xiaoyi Xiaoyi", the voice wake-up application (APP) of the mobile phone responds and prompts the user to speak further voice commands.
  • The core algorithm of voice wake-up is voice recognition.
  • The object of voice recognition is an effective voice signal. If voice wake-up recognition is performed on all input audio data without distinguishing whether a voice signal is included, the recognition effect will be poor and power consumption will increase.
  • For this reason, VAD can be used to find the start point and end point of the voice signal in the input audio data, so as to further extract the characteristics of the voice signal. Therefore, VAD can also be called voice endpoint detection or voice boundary detection.
  • In order to reduce environmental interference and perform active noise reduction, the terminal device can be equipped with multiple microphones, including a main microphone and a noise reduction microphone.
  • The terminal device generally uses the audio data collected through the main microphone for VAD. When the main microphone is blocked, the energy of the audio data is too low and the voice information is lost, which affects the accuracy of VAD.
  • The embodiments of the present application provide a VAD method and device, which are used to improve the accuracy of VAD.
  • In a first aspect, a voice activity detection (VAD) method is provided, including: acquiring N channels of audio data by frame, where N is an integer greater than or equal to 2; for each frame, calculating the autocorrelation coefficient of each channel of audio data in the high-frequency subband; and, for each frame, selecting, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD.
  • In the VAD method provided by the embodiments of this application, after N channels of audio data are acquired by frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband is calculated for each frame, and at least one of the N channels of audio data is selected, according to the autocorrelation coefficients of the N channels of audio data, for VAD, so as to detect whether each frame of audio data includes a voice signal.
  • For each frame, the autocorrelation coefficient of a speech signal is larger than that of a silent signal (or steady-state noise), so it can be determined whether the frame of audio data may include a speech signal.
  • The audio data that may include a voice signal is selected for VAD to determine whether this frame of audio data includes a voice signal.
  • VAD is performed on the audio data that is more likely to include a voice signal, so the accuracy of VAD can be improved. In addition, VAD can be performed normally even if some microphones are blocked.
  • In a possible implementation, the N channels of audio data include the i-th channel of audio data, where i is a positive integer less than or equal to N; selecting, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD includes: if the autocorrelation coefficient of the i-th channel of audio data is greater than a correlation coefficient threshold, selecting the i-th channel of audio data for VAD.
  • the autocorrelation coefficient of the speech signal is larger than that of the silent signal (or steady-state noise), so that it can be determined whether the frame of audio data may include a speech signal.
  • VAD is performed on audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved.
  • In a possible implementation, selecting, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD includes: selecting, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD.
  • The penetration ability of the high-frequency subband is weak, so if a microphone is blocked, the corresponding energy value will be very low, and it can therefore be determined whether the microphone may be blocked.
  • Audio data that may include a voice signal and whose corresponding microphone may not be blocked is selected for VAD to determine whether this frame of audio data includes a voice signal. This avoids performing VAD on the audio data of a blocked microphone; instead, VAD is performed on the audio data that is more likely to include a voice signal, so the accuracy of VAD can be improved.
  • In a possible implementation, the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N; selecting, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD includes: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, selecting the i-th channel of audio data for VAD. That is, it is detected that the microphone corresponding to the j-th channel of audio data may be blocked, while the i-th channel of audio data may include a voice signal and its corresponding microphone may not be blocked.
  • In a possible implementation, the method further includes: when the number of frames detected as including a voice signal in M frames meets a condition, determining that the M frames include a voice signal, where M is a positive integer.
  • At this time, the M frames of audio data can be used for voice wake-up, voice detection and recognition, and so on, which can improve accuracy on the one hand and reduce power consumption on the other.
  • In a possible implementation, the condition that the number of frames detected as including a voice signal in the M frames meets includes: at least m1 frames of the M frames are detected as including a voice signal, where m1 is less than or equal to M.
  • In a possible implementation, the condition that the number of frames detected as including a voice signal in the M frames meets includes: at least m2 consecutive frames of the M frames are detected as including a voice signal, where m2 is less than or equal to M.
  • In a possible implementation, the method further includes: performing voice wake-up recognition on the N channels of audio data of the M frames when the number of frames detected as including a voice signal in the M frames meets the condition.
  • At this time, the M frames include a voice signal, so performing voice wake-up recognition at this time can improve accuracy and reduce power consumption.
  • In a possible implementation, each channel of audio data is collected by one microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones. This can reduce power consumption.
  • In a second aspect, a voice activity detection (VAD) device is provided, including: an acquisition module, configured to acquire N channels of audio data by frame, where N is an integer greater than or equal to 2; a calculation module, configured to calculate, for each frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband; and a selection module, configured to select, for each frame and according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD.
  • the selection module is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, select the i-th channel of audio data for VAD.
  • In a possible implementation, the selection module is specifically configured to: select, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD.
  • In a possible implementation, the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N; the selection module is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, select the i-th channel of audio data for VAD.
  • In a possible implementation, a determining module is further included, configured to: when the number of frames detected as including a voice signal in M frames meets a condition, determine that the M frames include a voice signal, where M is a positive integer.
  • In a possible implementation, the condition includes: at least m1 frames of the M frames are detected as including a voice signal, where m1 is less than or equal to M.
  • In a possible implementation, the condition includes: at least m2 consecutive frames of the M frames are detected as including a voice signal, where m2 is less than or equal to M.
  • In a possible implementation, a voice wake-up module is further included, configured to perform voice wake-up recognition on the N channels of audio data of the M frames when the number of frames detected as including a voice signal in the M frames meets the condition.
  • each channel of audio data is collected by a microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones.
  • A voice activity detection device is also provided, including a processor connected to a memory, where the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the device performs the method according to the first aspect and any one of its implementations.
  • A computer-readable storage medium is also provided, storing a computer program that, when run on a computer, causes the computer to perform the method according to the first aspect and any one of its implementations.
  • A computer program product containing instructions is also provided.
  • When the instructions are run on a computer or a processor, the computer or the processor performs the method according to the first aspect and any one of its implementations.
  • FIG. 1 is a schematic diagram of voice wake-up in a black screen scenario provided by an embodiment of the application
  • FIG. 2 is a schematic diagram of voice wake-up in a lock screen scenario provided by an embodiment of the application
  • FIG. 3 is a schematic structural diagram of a terminal device provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of a main microphone and a noise reduction microphone of a terminal device according to an embodiment of the application;
  • FIG. 5 is a schematic flowchart of a VAD method provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of another terminal device provided by an embodiment of this application.
  • FIG. 7 is a schematic flowchart of another VAD method provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a voice activity detection device provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of another voice activity detection device provided by an embodiment of this application.
  • a component may be, but is not limited to: a process running on a processor, a processor, an object, an executable file, an executing thread, a program, and/or a computer.
  • an application running on a computing device and the computing device may be components.
  • One or more components may exist in an executing process and/or thread, and the components may be located in one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures thereon.
  • These components can communicate through local and/or remote processes based on, for example, signals having one or more data packets (for example, data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems across a network such as the Internet).
  • The term "exemplary" is used to mean serving as an example, instance, or illustration. Any embodiment or design solution described as an "example" in this application should not be construed as being more preferable or advantageous than other embodiments or design solutions. To be precise, the term "example" is used to present a concept in a concrete way.
  • The terms "information", "signal", "message", and "channel" can sometimes be used interchangeably. It should be noted that the meanings to be expressed are the same when the differences are not emphasized. Likewise, "of", "relevant", and "corresponding" can sometimes be used interchangeably; the meanings to be expressed are the same when the difference is not emphasized.
  • Voice wake-up means that after the terminal device is started, it enters a dormant state.
  • When the user speaks a specific wake-up word, the terminal device is awakened, switches to the working state, and waits to receive further voice commands from the user. In this process, the user does not need to touch the device with the hands and can operate it directly by voice.
  • The terminal device does not need to be in the working state all the time, thereby saving energy.
  • Different terminal devices have different wake-up words, and when users need to wake up the device, they need to speak a specific wake-up word.
  • Voice wake-up has a wide range of application areas, such as robots, mobile phones, wearable devices, smart homes, and in-vehicle devices. Almost all terminal devices with voice functions need voice wake-up technology as the entrance to human-computer interaction.
  • For example, when the voice wake-up function is turned on, in the black screen scenario shown in Figure 1 or the lock screen scenario shown in Figure 2, the mobile phone detects that the user speaks the specific wake-up word "Xiaoyi Xiaoyi", the voice wake-up APP of the mobile phone is awakened, the display interface of the voice wake-up APP is displayed, and the user is prompted to speak further voice commands, for example by displaying text or playing a sound such as "Hello, how can I help you?".
  • the terminal device involved in the embodiment of the present application may be a device that includes a wireless transceiver function and can cooperate with a network device to provide communication services for users.
  • Terminal equipment may also be referred to as user equipment (UE), an access terminal, a user unit, a user station, a mobile station, a remote station, a remote terminal, mobile equipment, a user terminal, a terminal, a wireless communication device, a user agent, or a user device.
  • The terminal device can be a mobile phone, a smart speaker, a smart watch, a handheld device with a wireless communication function, a computing device or other processing device connected to a wireless modem, a robot, a drone, a smart driving vehicle, a smart home device, a vehicle-mounted device, medical equipment, smart logistics equipment, a wearable device, a terminal device in a future 5G network or a network after 5G, or the like, which is not limited in the embodiments of the present application.
  • The structure of the terminal device is described below by taking the terminal device 100 shown in FIG. 3 as an example.
  • the terminal device 100 may include: a radio frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (Wi-Fi) module 170, and a processor 180, Bluetooth module 181, power supply 190 and other components.
  • the RF circuit 110 can be used to receive and send signals in the process of sending and receiving information or talking. It can receive downlink data from the base station and then forward it to the processor 180 for processing; and can send uplink data to the base station.
  • the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and other devices.
  • the memory 120 can be used to store software programs and data.
  • the processor 180 executes various functions and data processing of the terminal device 100 by running a software program or data stored in the memory 120.
  • The memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The memory 120 stores an operating system that enables the terminal device 100 to run, for example, an operating system developed by Apple, an open-source operating system developed by Google, or an operating system developed by Microsoft.
  • the memory 120 may store an operating system and various application programs, and may also store codes for executing the methods in the embodiments of the present application.
  • the input unit 130 may be used to receive input digital or character information, and generate signal input related to user settings and function control of the terminal device 100.
  • the input unit 130 may include a touch screen 131 provided on the front of the terminal device 100, and may collect user touch operations on or near it.
  • the display unit 140 may be used to display information input by the user or information provided to the user, as well as a graphical user interface (GUI) of various menus of the terminal device 100.
  • the display unit 140 may include a display screen 141 provided on the front of the terminal device 100.
  • the display screen 141 may be configured in the form of a liquid crystal display, a light emitting diode, or the like.
  • the display unit 140 may be used to display various graphical user interfaces described in this application.
  • the touch screen 131 may be covered on the display screen 141, or the touch screen 131 and the display screen 141 may be integrated to realize the input and output functions of the terminal device 100. After integration, it may be referred to as a touch display screen.
  • the terminal device 100 may also include at least one sensor 150, such as an acceleration sensor 155, a light sensor, and a motion sensor.
  • the terminal device 100 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor.
  • Wi-Fi is a short-distance wireless transmission technology.
  • the terminal device 100 can help users send and receive emails, browse webpages, and access streaming media through the Wi-Fi module 170. It provides users with wireless broadband Internet access.
  • The processor 180 is the control center of the terminal device 100. It uses various interfaces and lines to connect the various parts of the entire terminal, and performs the various functions of the terminal device 100 and processes data by running or executing software programs stored in the memory 120 and calling data stored in the memory 120.
  • The processor 180 in this application may refer to one or more processors, and the processor 180 may include one or more processing units; the processor 180 may also integrate an application processor and a baseband processor, where the application processor mainly handles the operating system, user interfaces, and applications, and the baseband processor mainly handles wireless communication. It can be understood that the aforementioned baseband processor may not be integrated into the processor 180.
  • the processor 180 in this application can run an operating system, application programs, user interface display and touch response, as well as the communication method described in the embodiments of this application.
  • the Bluetooth module 181 is used to exchange information with other Bluetooth devices having a Bluetooth module through the Bluetooth protocol.
  • the terminal device 100 can establish a Bluetooth connection with a wearable electronic device (such as a smart watch) that also has a Bluetooth module through the Bluetooth module 181, so as to perform data interaction.
  • the terminal device 100 also includes a power source 190 (such as a battery) for supplying power to various components.
  • the power supply can be logically connected to the processor 180 through the power management system, so that functions such as charging, discharging, and power consumption can be managed through the power management system.
  • the audio circuit 160, the speaker 161, and the microphone 162 may provide an audio interface between the user and the terminal device 100.
  • On the one hand, the audio circuit 160 can transmit the electrical signal converted from received audio data to the speaker 161, and the speaker 161 converts it into a sound signal for output; on the other hand, the microphone 162 converts the collected sound signal into an electrical signal, which is received by the audio circuit 160 and converted into audio data, and the audio data is then output to the RF circuit 110 to be sent to, for example, another terminal, or output to the memory 120 for further processing.
  • The terminal device may include multiple microphones. As shown in FIG. 4, taking the terminal device 200 being a mobile phone as an example, the lower end of the terminal device 200 may include at least one microphone 201, which may serve as the main microphone, and the upper end of the terminal device may include at least one microphone 202, which may serve as a noise reduction microphone.
  • the prior art generally uses audio data collected through the main microphone for VAD when performing VAD on audio data.
  • the main microphone is blocked, the energy of the collected audio data is too low, which will affect VAD.
  • The reason is that the main microphone is fixedly used as the microphone for VAD.
  • For this reason, the embodiment of the present application provides a VAD method, which performs VAD by selecting, from multiple channels of audio data collected by multiple microphones, at least one channel of audio data with a higher autocorrelation coefficient (and a higher energy value) in the high-frequency subband, thereby improving the accuracy of VAD. Further, the detection result can be applied to voice wake-up recognition.
  • As shown in FIG. 5, the VAD method includes:
  • S501: The terminal device acquires N channels of audio data by frame.
  • the terminal device includes at least N microphones 601, at least N analog to digital converters (ADC) 602, and a processor 603.
  • N is an integer greater than or equal to 2.
  • a buffer 604 may also be included.
  • the output end of each microphone 601 is electrically connected to the input end of an analog-to-digital converter (ADC) 602, and the output end of each ADC 602 is electrically connected to the input end of the VAD module 6031 in the processor 603, and the VAD module 6031 may be a hardware circuit in the processor 603. Then the VAD module 6031 can perform this step.
  • each channel of audio data can be collected by one microphone, or can be collected by multiple microphones and then synthesized, or can be obtained in other ways, such as from other devices, which is not limited in this application.
  • Optionally, the N microphones can be the microphones with the highest signal-to-noise ratios selected from the multiple microphones, which can reduce power consumption.
  • the analog audio signal collected by each microphone 601 undergoes analog-to-digital conversion by the corresponding ADC 602 to obtain digital audio data.
  • Since the voice signal has short-term stability, it can be considered that the voice signal is stable within a range of 10-30 ms, so the audio data can be divided into frames, and VAD is performed on each channel of audio data frame by frame.
  • Generally, 5-20 ms is taken as one frame, and framing uses overlapping segmentation, that is, there is an overlap between the previous frame and the next frame.
  • The overlapping part is called the frame shift, and the ratio of the frame shift to the frame length is generally 0~0.5.
  • For example, taking a 20 ms frame as an example, if the sampling frequency of the ADC 602 is 8 kHz, then in one frame each channel of audio data corresponds to 160 sampling points; if the sampling frequency of the ADC 602 is 16 kHz, then in one frame each channel of audio data corresponds to 320 sampling points.
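  • As an illustration of the framing described above, the following sketch (Python with NumPy, not part of the application) splits one channel of sampled audio into overlapping frames. The 20 ms frame length, 5 ms overlap, and 8 kHz sampling rate are example values consistent with the text; the function name and defaults are assumptions made for the sketch.

```python
import numpy as np

def split_into_frames(samples, sample_rate=8000, frame_ms=20, overlap_ms=5):
    """Split one channel of audio samples into overlapping frames.

    Example values only: a 20 ms frame at 8 kHz is 160 sampling points, and the
    overlapping part (the "frame shift" in the description above) is kept below
    half the frame length. None of these numbers is mandated by the application.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 160 points at 8 kHz
    overlap = int(sample_rate * overlap_ms / 1000)   # overlapping points between adjacent frames
    hop = frame_len - overlap                        # step between frame start positions
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]

# Usage: one second of (placeholder) audio yields about 66 frames of 160 samples each.
audio = np.zeros(8000, dtype=np.float32)
frames = split_into_frames(audio)
```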
  • Each microphone 601 of the N microphones 601 collects one channel of audio signal, and N channels of audio data can be obtained in total.
  • the N channels of audio data can be sent to the VAD module 6031 for VAD on the one hand, and can be stored in the buffer 604 on the other hand to prevent loss of audio data, for example, for subsequent voice wake-up recognition.
  • the storage depth of the buffer 604 is determined by the delay of the VAD algorithm, and the greater the delay, the deeper the storage depth.
  • S502: For each frame, the terminal device calculates the autocorrelation coefficient of each channel of audio data in the high-frequency subband.
  • This step can be performed by the VAD module 6031.
  • For example, if the sampling frequency of the ADC 602 is 8 kHz, the analog audio signal yields 8 kHz digital audio data after analog-to-digital conversion and data filtering.
  • According to the sampling theorem, the sampling frequency must be at least twice the highest frequency of the original signal in order to correctly restore the original signal. Therefore, the frequency bandwidth of the analog audio signal that can be represented by 8 kHz digital audio data is 0-4 kHz.
  • The audio data can be sub-band filtered to obtain the audio data of the high-frequency subband.
  • For example, the 0-4 kHz audio data is divided into multiple subbands, for example, into four subbands.
  • That is, 0-4 kHz can be divided into 0-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz, and the high-frequency subband refers to 2-4 kHz.
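  • A minimal sketch of this sub-band filtering step, assuming the 2-4 kHz high-frequency subband of 8 kHz audio is extracted with an ordinary Butterworth high-pass filter from SciPy. The application does not prescribe a particular filter design, so the filter type and order here are illustrative only.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def highband_2_4khz(frame, sample_rate=8000, order=4):
    """Keep only the 2-4 kHz high-frequency subband of an 8 kHz frame.

    With an 8 kHz sampling rate the representable band is 0-4 kHz, so the
    2-4 kHz subband is obtained here with a high-pass filter at 2 kHz.
    A Butterworth design is used purely as an example; for streaming use the
    filter state would be carried across frames (the zi argument of sosfilt).
    """
    sos = butter(order, 2000, btype='highpass', fs=sample_rate, output='sos')
    return sosfilt(sos, frame)

# Usage on one 160-sample frame:
frame = np.random.randn(160).astype(np.float32)
hf_frame = highband_2_4khz(frame)
```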
  • For example, the audio data of the (n-1)-th frame and the (n-2)-th frame can be combined to calculate the autocorrelation coefficient of one channel of audio data of the n-th frame in the high-frequency subband.
  • The autocorrelation coefficient characterizes the degree of correlation (i.e., similarity) of audio data at two different moments. When the audio data at two different moments have the same periodic component, the maximum value of the autocorrelation coefficient reflects this period.
  • Compared with a speech signal, a silent signal (or steady-state noise) has poorer autocorrelation, and its autocorrelation coefficient is relatively low.
  • Therefore, if the autocorrelation coefficient of a channel of audio data in the high-frequency subband is relatively large, it can be determined that the channel of audio data may include a voice signal.
  • The above-mentioned autocorrelation coefficient $r_{xx}$ can be obtained by formula 1:
  • $r_{xx}(k) = \dfrac{\sum_{i=1}^{N} x(i)\,x(i+k)}{\mathrm{energy}(N)}$  (formula 1)
  • where $k$ represents the time delay between sampling points, $i$ runs over $1 \sim N$, and the delayed index $i+k$ runs over $1 \sim 2N$, the additional sampling points being taken from the adjacent frames.
  • $\mathrm{energy}(N)$ represents the energy of one channel of audio data of one frame in the high-frequency subband, and its common calculation formula is shown in formula 2:
  • $\mathrm{energy}(N) = \sum_{i=1}^{N} x(i)^{2}$  (formula 2)
  • where $N$ is the number of sampling points of one frame of audio data, and $x(i)$ is the value of the $i$-th sampling point.
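  • The sketch below computes, for one high-frequency-subband frame, a standard normalized autocorrelation coefficient and the frame energy. It is a stand-in for formulas 1 and 2 rather than a reproduction of them, and the dB form of the energy (used later against the -50 dB and 20 dB example thresholds) is an assumption about how those thresholds would be compared; the lag search range is likewise illustrative.

```python
import numpy as np

def frame_energy(x):
    """energy(N): sum of squared sample values over one frame (formula 2)."""
    x = np.asarray(x, dtype=np.float64)
    return float(np.sum(x * x))

def autocorr_coefficient(x, max_lag=40):
    """Normalized autocorrelation of one frame in the high-frequency subband.

    Stand-in for formula 1: the peak of sum_i x(i) * x(i + k) over lags
    k = 1..max_lag, divided by the frame energy. Voiced speech tends toward
    values near 1; silence or steady-state noise gives low values.
    """
    x = np.asarray(x, dtype=np.float64)
    e = frame_energy(x)
    if e == 0.0:
        return 0.0
    best = 0.0
    for k in range(1, min(max_lag, len(x) - 1) + 1):
        best = max(best, float(np.dot(x[:-k], x[k:])) / e)
    return best

def energy_db(x, eps=1e-12):
    """Frame energy on a dB scale (assumed form for the -50 dB / 20 dB
    comparisons mentioned later; samples assumed normalized to [-1, 1])."""
    x = np.asarray(x, dtype=np.float64)
    return 10.0 * np.log10(np.mean(x * x) + eps)
```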
  • S503: For each frame, the terminal device selects, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD.
  • For example, the N channels of audio data include the i-th channel of audio data, where i is a positive integer less than or equal to N. If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold (for example, 0.6), the i-th channel of audio data is selected for VAD.
  • Alternatively, the terminal device can select, according to the autocorrelation coefficients of the N channels of audio data in the high-frequency subband and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD.
  • For example, the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N. If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold (for example, 0.6), the energy value of the j-th channel of audio data is less than the first energy threshold (for example, -50 dB), and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold (for example, 20 dB), then the i-th channel of audio data is selected for VAD.
  • It should be noted that the above calculation needs to be performed on the N channels of audio data for each frame, and the audio data selected may be different for each frame.
  • For example, the first channel of audio data may be selected for VAD in one frame, while the second and third channels of audio data may be selected for VAD in the next frame.
  • If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, that is, the autocorrelation coefficient of the i-th channel of audio data is relatively large, it can be determined that the i-th channel of audio data may include a voice signal.
  • If the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, it can be determined that the microphone corresponding to the j-th channel of audio data may be blocked and the microphone corresponding to the i-th channel of audio data may not be blocked. If the microphone corresponding to the j-th channel of audio data and the microphone corresponding to the i-th channel of audio data are both blocked, or the two channels of audio data include only mute signals or steady-state noise, the condition that the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold is not satisfied.
  • In other words, it is detected that the microphone corresponding to the j-th channel of audio data may be blocked, while the i-th channel of audio data may include a voice signal and its corresponding microphone may not be blocked.
  • For example, assume there are four channels of audio data, the energy values of the first to fourth channels of audio data in the high-frequency subband are Rms1, Rms2, Rms3, and Rms4, and their autocorrelation coefficients are Rel1, Rel2, Rel3, and Rel4, respectively.
  • Assume the magnitude relationship of these energy values satisfies Rms1 < Rms2 < Rms3 < Rms4, and the magnitude relationship of these autocorrelation coefficients satisfies Rel1 < Rel2 < Rel3 < Rel4.
  • Then it can be determined that the microphone corresponding to the first channel of audio data and the microphone corresponding to the second channel of audio data may be blocked, while the third channel of audio data and the fourth channel of audio data may include voice signals and their corresponding microphones may not be blocked.
  • The first channel and the second channel here correspond to the j-th channel described above, and the third channel and the fourth channel correspond to the i-th channel described above.
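  • Putting the selection rule of S503 together, the sketch below (reusing the helper functions from the previous sketch) picks, for one frame, the channels whose autocorrelation coefficient exceeds the correlation coefficient threshold and whose microphone does not appear blocked relative to some other channel. The thresholds 0.6, -50 dB, and 20 dB are the examples given above; the fallback used when no microphone appears blocked is an assumption of the sketch, not part of the described method.

```python
def select_channels_for_vad(hf_frames,
                            corr_threshold=0.6,    # correlation coefficient threshold (example)
                            low_energy_db=-50.0,   # first energy threshold (example)
                            energy_gap_db=20.0):   # second energy threshold (example)
    """Return indices of channels selected for VAD in one frame.

    hf_frames: list of N per-channel frames already filtered to the 2-4 kHz subband.
    Channel i is selected if its autocorrelation exceeds corr_threshold and there
    is some other channel j whose energy is below low_energy_db while channel i is
    at least energy_gap_db stronger (channel j's microphone looks blocked, channel
    i's does not). This is one reading of the rule above, not the authoritative
    algorithm.
    """
    coeffs = [autocorr_coefficient(f) for f in hf_frames]
    energies = [energy_db(f) for f in hf_frames]
    selected = []
    for i, (r, e_i) in enumerate(zip(coeffs, energies)):
        if r <= corr_threshold:
            continue
        blocked_elsewhere = any(
            energies[j] < low_energy_db and (e_i - energies[j]) > energy_gap_db
            for j in range(len(hf_frames)) if j != i)
        if blocked_elsewhere:
            selected.append(i)
    if not selected:
        # Assumed fallback: if no microphone looks blocked, rely on the
        # autocorrelation rule alone.
        selected = [i for i, r in enumerate(coeffs) if r > corr_threshold]
    return selected
```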
  • The process of performing VAD on at least one channel of audio data includes, but is not limited to: calculating the speech level, calculating the signal-to-noise ratio, calculating the speech confidence level, and calculating the cross-correlation coefficient.
  • The speech level is used to characterize the amplitude of the signal, and the speech level rms can be obtained by formula 3.
  • The signal-to-noise ratio is used to characterize the proportion of signal to noise in the speech.
  • The signal-to-noise ratio snr can be obtained by formula 4.
  • The speech confidence level is used to characterize the confidence that the audio is speech, and the speech confidence level SpeechLevel can be obtained by formula 5.
  • The cross-correlation coefficient is used to characterize the similarity of the speech, and the cross-correlation coefficient $r_{xy}$ can be obtained by formula 6.
  • If the speech level, signal-to-noise ratio, speech confidence level, and cross-correlation coefficient all exceed their corresponding thresholds, it can be determined that the audio data of the frame includes a speech signal.
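  • Formulas 3 to 6 themselves are not reproduced above, so the sketch below uses common textbook definitions as stand-ins: RMS amplitude for the speech level, a ratio against an externally tracked noise floor for the signal-to-noise ratio, and a normalized cross-correlation between the frame and a reference frame (which signals the application correlates is not specified here). The speech confidence level of formula 5 has no standard form and is omitted, and all thresholds are placeholders.

```python
import numpy as np

def rms_level(x):
    """Speech level as RMS amplitude (a common stand-in for formula 3)."""
    x = np.asarray(x, dtype=np.float64)
    return float(np.sqrt(np.mean(x * x)))

def snr_db(x, noise_rms):
    """Signal-to-noise ratio against a separately tracked noise floor
    (a common stand-in for formula 4)."""
    return 20.0 * np.log10((rms_level(x) + 1e-12) / (noise_rms + 1e-12))

def cross_correlation(x, y):
    """Normalized cross-correlation r_xy between two equal-length frames
    (a common stand-in for formula 6)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y)) + 1e-12
    return float(np.dot(x, y) / denom)

def frame_has_speech(frame, ref_frame, noise_rms,
                     rms_th=0.01, snr_th=6.0, xcorr_th=0.5):
    """Declare speech only if all measures exceed their (placeholder) thresholds,
    mirroring the 'all exceed the corresponding threshold' rule above."""
    return (rms_level(frame) > rms_th and
            snr_db(frame, noise_rms) > snr_th and
            cross_correlation(frame, ref_frame) > xcorr_th)
```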
  • In the VAD method provided by the embodiments of this application, after N channels of audio data are acquired by frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband is calculated for each frame, and at least one of the N channels of audio data is selected, according to the autocorrelation coefficients of the N channels of audio data, for VAD, so as to detect whether each frame of audio data includes a voice signal.
  • For each frame, the autocorrelation coefficient of a speech signal is larger than that of a silent signal (or steady-state noise), so it can be determined whether the frame of audio data may include a speech signal.
  • The audio data that may include a voice signal is selected for VAD to determine whether this frame of audio data includes a voice signal.
  • VAD is performed on the audio data that is more likely to include a voice signal, so the accuracy of VAD can be improved. In addition, VAD can be performed normally even if some microphones are blocked.
  • As shown in FIG. 7, optionally, the VAD method further includes:
  • S701: When the number of frames detected as including a voice signal in M frames meets a condition, the terminal device determines that the M frames include a voice signal, where M is a positive integer.
  • Exemplarily, M may be 20.
  • In one implementation, at least m1 frames (for example, 12 frames) of the M frames are detected as including a voice signal, where m1 is less than or equal to M. That is, for one channel of audio data, it is not required that consecutive frames of audio data be detected as including a voice signal, as long as at least m1 frames of audio data among the M frames overall are detected as including a voice signal.
  • In another implementation, at least m2 consecutive frames (for example, 8 frames) of the M frames are detected as including a voice signal, where m2 is less than or equal to M. That is, for one channel of audio data, at least m2 consecutive frames of audio data are detected as including a voice signal.
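  • The two frame-counting conditions above are described as alternative implementations; the sketch below simply checks both, returning true if at least m1 of the M per-frame results are positive or if at least m2 consecutive results are positive. The values M = 20, m1 = 12, and m2 = 8 are the examples from the text.

```python
def m_frames_include_speech(frame_flags, m1=12, m2=8):
    """frame_flags: list of M booleans, one per frame, True if that frame was
    detected as including a voice signal. Returns True if either example
    condition holds: at least m1 positive frames overall, or at least m2
    consecutive positive frames."""
    if sum(frame_flags) >= m1:
        return True
    run = 0
    for flag in frame_flags:
        run = run + 1 if flag else 0
        if run >= m2:
            return True
    return False

# Usage with M = 20 per-frame VAD results: 9 consecutive positives >= m2 = 8.
flags = [False] * 6 + [True] * 9 + [False] * 5
print(m_frames_include_speech(flags))   # True
```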
  • Optionally, the VAD method further includes:
  • S702: When the number of frames detected as including a voice signal in the M frames meets the condition, the terminal device performs voice wake-up recognition on the N channels of audio data of the M frames.
  • Voice wake-up recognition can be implemented by the processor 603 shown in FIG. 6 executing software.
  • When the number of frames detected as including a voice signal in the M frames meets the condition, the VAD module 6031 can generate an interrupt, thereby triggering the processor 603 to perform voice wake-up recognition on the N channels of audio data of the M frames.
  • Voice wake-up recognition is performed on the N channels of audio data of the M frames; if the voice wake-up recognition succeeds, the voice wake-up display interface is displayed and the user is prompted to speak further voice commands. When the above condition is not met, voice wake-up recognition does not need to be performed on the audio data, which saves energy and reduces the false alarm rate of voice wake-up.
  • the methods and/or steps implemented by the terminal device may also be implemented by components (for example, a chip or a circuit) that can be used for the terminal device.
  • the embodiment of the present application also provides a voice activity detection device, which is used to implement the above-mentioned various methods.
  • the voice activity detection device may be the terminal device in the foregoing method embodiment, or a device including the foregoing terminal device, or a chip or functional module in the terminal device.
  • the voice activity detection device includes hardware structures and/or software modules corresponding to various functions.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered as going beyond the scope of this application.
  • the embodiments of the present application may divide the voice activity detection device into functional modules according to the foregoing method embodiments.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software function modules. It should be noted that the division of modules in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • FIG. 8 shows a schematic structural diagram of a voice activity detection device 80.
  • the voice activity detection device 80 includes an acquisition module 801, a calculation module 802, a selection module 803, and optionally, a determination module 804 and a voice wake-up module 805.
  • the obtaining module 801 can execute step S501 in FIG. 5 and step S501 in FIG. 7.
  • the calculation module 802 can execute step S502 in FIG. 5 and step S502 in FIG. 7.
  • the selection module 803 can perform step S503 in FIG. 5 and step S503 in FIG. 7.
  • the determining module 804 can execute step S701 in FIG. 7.
  • the voice wake-up module 805 can perform step S702 in FIG. 7.
  • the obtaining module 801 is used to obtain N channels of audio data by frame, where N is an integer greater than or equal to 2; the calculation module 802 is used to calculate, for each frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband;
  • the selection module 803 is used for each frame, according to the autocorrelation coefficient of the N channels of audio data, to select at least one channel of audio data in the N channels of audio data for VAD.
  • the selection module 803 is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, select the i-th channel of audio data for VAD.
  • the selection module 803 is specifically configured to: select, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD.
  • the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N; the selection module 803 is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, select the i-th channel of audio data for VAD.
  • the determining module 804 is configured to: when the number of frames detected as including a voice signal in M frames meets a condition, determine that the M frames include a voice signal, where M is a positive integer.
  • the condition includes: at least m1 frames of the M frames are detected as including a voice signal, where m1 is less than or equal to M.
  • the condition includes: at least m2 consecutive frames of the M frames are detected as including a voice signal, where m2 is less than or equal to M.
  • the voice wake-up module 805 is configured to perform voice wake-up recognition on the N channels of audio data of the M frames when the number of frames detected as including a voice signal in the M frames meets the condition.
  • each channel of audio data is collected by a microphone, and the microphone is a microphone with the highest signal-to-noise ratio selected from a plurality of microphones.
  • the voice activity detection device 80 is presented in the form of dividing various functional modules in an integrated manner.
  • the "module" here may refer to an application-specific integrated circuit (ASIC), a circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above-mentioned functions.
  • each module in FIG. 8 can be implemented by the processor in the terminal device calling the computer execution instructions stored in the memory.
  • the voice activity detection device 80 provided in this embodiment can perform the above-mentioned method, the technical effects that can be obtained can refer to the above-mentioned method embodiment, which will not be repeated here.
  • an embodiment of the present application also provides a voice activity detection device.
  • the voice activity detection device 90 includes a processor 901 and a memory 902.
  • the processor 901 is coupled to the memory 902.
  • When the processor 901 executes the computer program or instructions stored in the memory 902, the corresponding methods in FIG. 5 and FIG. 7 are executed.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when it runs on a computer or a processor, the computer or the processor executes the steps shown in FIG. 5 and FIG. 7 The corresponding method.
  • the embodiments of the present application also provide a computer program product containing instructions, which when the instructions run on a computer or a processor, cause the computer or the processor to execute the corresponding methods in FIG. 5 and FIG. 7.
  • the embodiment of the present application provides a chip system including a processor for the voice activity detection device to execute the corresponding methods in FIG. 5 and FIG. 7.
  • the chip system also includes a memory for storing necessary program instructions and data.
  • the chip system may include a chip, an integrated circuit, or may include a chip and other discrete devices, which is not specifically limited in the embodiment of the present application.
  • Since the voice activity detection device, chip, computer storage medium, computer program product, and chip system provided in this application are all used to execute the above-mentioned methods, for the beneficial effects that can be achieved, refer to the beneficial effects in the foregoing implementations; details are not repeated here.
  • the processor involved in the embodiment of the present application may be a chip.
  • For example, it can be a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller unit (MCU), a programmable logic device (PLD), or another integrated chip.
  • the memory involved in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous connection dynamic random access memory (SLDRAM), and direct rambus RAM.
  • It should be understood that the size of the sequence numbers of the above-mentioned processes does not mean the order of execution.
  • The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of the present application.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual couplings or direct couplings or communication connections may be indirect couplings or communication connections between devices or units through some interfaces, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Abstract

A voice activity detection (VAD) method and apparatus, which relate to the field of voice detection and are used for improving the accuracy of VAD. The voice activity detection method comprises: acquiring N paths of audio data per frame, wherein N is an integer greater than or equal to 2 (S501); for each frame, calculating an autocorrelation coefficient of each path of audio data on a high-frequency sub-band (S502); and for each frame, selecting, according to autocorrelation coefficients of the N paths of audio data, at least one path of audio data in the N paths of audio data and performing VAD on the path of audio data (S503).

Description

Voice activity detection method and apparatus

Technical field
This application relates to the field of voice detection, and in particular to a voice activity detection (VAD) method and device.
Background
There are more and more terminal devices with a voice wake-up function on the market, for example, mobile phones, smart speakers, and smart watches. The terminal device acquires audio data through a microphone (MIC) and continuously performs voice wake-up recognition on the audio data. When the wake-up word spoken by the user is detected, the terminal device switches to a working state and waits for further voice commands from the user. For example, when the user says "Xiaoyi Xiaoyi", the voice wake-up application (APP) of the mobile phone responds and prompts the user to speak further voice commands.

The core algorithm of voice wake-up is voice recognition, and the object of voice recognition is an effective voice signal. If voice wake-up recognition is performed on all input audio data without distinguishing whether a voice signal is included, the recognition effect will be poor and power consumption will increase. For this reason, VAD can be used to find the start point and end point of the voice signal in the input audio data, so as to further extract the characteristics of the voice signal. Therefore, VAD can also be called voice endpoint detection or voice boundary detection.

In order to reduce environmental interference and perform active noise reduction, the terminal device can be equipped with multiple microphones, including a main microphone and a noise reduction microphone. The terminal device generally uses the audio data collected through the main microphone for VAD. When the main microphone is blocked, the energy of the audio data is too low and the voice information is lost, which will affect the accuracy of VAD.
Summary of the invention

The embodiments of the present application provide a VAD method and device, which are used to improve the accuracy of VAD.

In order to achieve the foregoing objectives, the following technical solutions are adopted in the embodiments of the present application:
第一方面,提供了一种语音活动检测VAD方法,包括:按帧获取N路音频数据,其中,N为大于或等于2的整数;针对每一帧,计算每路音频数据在高频子带的自相关系数;针对每一帧,根据N路音频数据的自相关系数,选择对N路音频数据中的至少一路音频数据进行VAD。In the first aspect, a VAD method for voice activity detection is provided, including: acquiring N channels of audio data by frame, where N is an integer greater than or equal to 2; For each frame, according to the autocorrelation coefficient of the N channels of audio data, select to perform VAD on at least one channel of the N channels of audio data.
本申请实施例提供的语音活动检测方法,按帧获取N路音频数据后,针对每一帧计算每路音频数据在高频子带的自相关系数,根据N路音频数据的自相关系数,选择对N路音频数据中的至少一路音频数据进行VAD,以检测每一帧音频数据中包括语音信号。对于每一帧来说,语音信号的自相关系数相对于静音信号(或稳态噪声)更大,从而可以确定该帧音频数据是否可能包括语音信号。选择可能包括语音信号进行VAD,以确定这一帧音频数据是否可能包括语音信号。对更可能包括语音信号的音频数据进行VAD,从而可以提高VAD的准确率。另外,即使有部分麦克风被堵住也可以正常进行VAD。In the voice activity detection method provided by the embodiments of the application, after obtaining N channels of audio data by frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband is calculated for each frame, and the autocorrelation coefficient of each channel of audio data is selected according to the autocorrelation coefficient of the N channels of audio data. VAD is performed on at least one of the N channels of audio data to detect that each frame of audio data includes a voice signal. For each frame, the autocorrelation coefficient of the speech signal is larger than that of the silent signal (or steady-state noise), so that it can be determined whether the frame of audio data may include a speech signal. The selection may include voice signals for VAD to determine whether this frame of audio data may include voice signals. VAD is performed on audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved. In addition, VAD can be performed normally even if some microphones are blocked.
在一种可能的实施方式中,N路音频数据包括第i路音频数据,i为小于等于N的正整数;根据N路音频数据的自相关系数,选择对N路音频数据中的至少一路音频数据进行VAD,包括:如果第i路音频数据的自相关系数大于相关系数阈值,则选择第 i路音频数据进行VAD。语音信号的自相关系数相对于静音信号(或稳态噪声)更大,从而可以确定该帧音频数据是否可能包括语音信号。对更可能包括语音信号的音频数据进行VAD,从而可以提高VAD的准确率。In a possible implementation manner, the N channels of audio data include the i-th channel of audio data, where i is a positive integer less than or equal to N; according to the autocorrelation coefficient of the N channels of audio data, at least one channel of the N channels of audio data is selected to be VAD of the data includes: if the autocorrelation coefficient of the i-th audio data is greater than the correlation coefficient threshold, then the i-th audio data is selected for VAD. The autocorrelation coefficient of the speech signal is larger than that of the silent signal (or steady-state noise), so that it can be determined whether the frame of audio data may include a speech signal. VAD is performed on audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved.
在一种可能的实施方式中,根据N路音频数据的自相关系数,选择对N路音频数据中的至少一路音频数据进行VAD,包括:根据N路音频数据的自相关系数以及N路音频数据在高频子带的能量值,选择对N路音频数据中的至少一路音频数据进行VAD。高频子带的穿透能力较弱,如果麦克风被堵住,则相应的能量值会很低,从而可以确定麦克风是否可能被堵住。选择可能包括语音信号并且对应的麦克风可能未被堵住的音频数据进行VAD,以确定这一帧音频数据是否可能包括语音信号。避免了对被堵住的麦克风的音频数据进行VAD,而是对更可能包括语音信号的音频数据进行VAD,从而可以提高VAD的准确率。In a possible implementation manner, selecting to perform VAD on at least one of the N channels of audio data according to the autocorrelation coefficients of the N channels of audio data includes: according to the autocorrelation coefficients of the N channels of audio data and the N channels of audio data In the energy value of the high frequency subband, VAD is selected to perform VAD on at least one of the N channels of audio data. The penetration ability of the high-frequency subband is weak. If the microphone is blocked, the corresponding energy value will be very low, so it can be determined whether the microphone may be blocked. Select audio data that may include a voice signal and the corresponding microphone may not be blocked for VAD to determine whether this frame of audio data may include a voice signal. It avoids performing VAD on the audio data of the blocked microphone, but performs VAD on the audio data that is more likely to include voice signals, so that the accuracy of VAD can be improved.
In a possible implementation manner, the N channels of audio data include an i-th channel of audio data and a j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N. Selecting, based on the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one of the N channels of audio data for VAD includes: if the autocorrelation coefficient of the i-th channel of audio data is greater than a correlation coefficient threshold, the energy value of the j-th channel of audio data is less than a first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than a second energy threshold, selecting the i-th channel of audio data for VAD. In other words, it is detected that the microphone corresponding to the j-th channel of audio data may be blocked, while the i-th channel of audio data may include a voice signal and its corresponding microphone may not be blocked.
In a possible implementation manner, the method further includes: when the number of frames detected as including a voice signal among M frames meets a condition, determining that the M frames include a voice signal, where M is a positive integer. The M frames of audio data can then be used for voice wake-up, voice detection, voice recognition, and other purposes, which improves accuracy on the one hand and reduces power consumption on the other.
In a possible implementation manner, that the number of frames detected as including a voice signal among the M frames meets the condition includes: at least m1 frames among the M frames are detected as including a voice signal, where m1 is less than or equal to M.
In a possible implementation manner, that the number of frames detected as including a voice signal among the M frames meets the condition includes: at least m2 consecutive frames among the M frames are detected as including a voice signal, where m2 is less than or equal to M.
In a possible implementation manner, the method further includes: when the number of frames detected as including a voice signal among the M frames meets the condition, performing voice wake-up recognition on the N channels of audio data of the M frames. In this case the M frames include a voice signal, so performing voice wake-up recognition at this point improves accuracy and reduces power consumption.
In a possible implementation manner, each channel of audio data is collected by one microphone, and each microphone is one of the microphones with the highest signal-to-noise ratio selected from a plurality of microphones. This can reduce power consumption.
According to a second aspect, a voice activity detection (VAD) apparatus is provided, including: an acquisition module, configured to obtain N channels of audio data frame by frame, where N is an integer greater than or equal to 2; a calculation module, configured to calculate, for each frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband; and a selection module, configured to select, for each frame and based on the autocorrelation coefficients of the N channels of audio data, at least one of the N channels of audio data for VAD.
In a possible implementation manner, the selection module is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, select the i-th channel of audio data for VAD.
In a possible implementation manner, the selection module is specifically configured to: select at least one of the N channels of audio data for VAD based on the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband.
In a possible implementation manner, the N channels of audio data include an i-th channel of audio data and a j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N; the selection module is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, select the i-th channel of audio data for VAD.
In a possible implementation manner, the apparatus further includes a determination module, configured to: when the number of frames detected as including a voice signal among M frames meets a condition, determine that the M frames include a voice signal, where M is a positive integer.
In a possible implementation manner, that the number of frames detected as including a voice signal among the M frames meets the condition includes: at least m1 frames among the M frames are detected as including a voice signal, where m1 is less than or equal to M.
In a possible implementation manner, that the number of frames detected as including a voice signal among the M frames meets the condition includes: at least m2 consecutive frames among the M frames are detected as including a voice signal, where m2 is less than or equal to M.
In a possible implementation manner, the apparatus further includes a voice wake-up module, configured to perform voice wake-up recognition on the N channels of audio data of the M frames when the number of frames detected as including a voice signal among the M frames meets the condition.
In a possible implementation manner, each channel of audio data is collected by one microphone, and each microphone is one of the microphones with the highest signal-to-noise ratio selected from a plurality of microphones.
According to a third aspect, a voice activity detection apparatus is provided, including a processor connected to a memory. The memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to the first aspect or any one of its implementation manners.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that, when run on a computer, causes the computer to perform the method according to the first aspect or any one of its implementation manners.
According to a fifth aspect, a computer program product containing instructions is provided. When the instructions are run on a computer or a processor, the computer or the processor performs the method according to the first aspect or any one of its implementation manners.
For the technical effects of the second to fifth aspects, refer to the first aspect and any one of its implementation manners; details are not repeated here.
Description of the Drawings
FIG. 1 is a schematic diagram of voice wake-up in a black-screen scenario according to an embodiment of this application;
FIG. 2 is a schematic diagram of voice wake-up in a lock-screen scenario according to an embodiment of this application;
FIG. 3 is a schematic structural diagram of a terminal device according to an embodiment of this application;
FIG. 4 is a schematic diagram of a main microphone and a noise-reduction microphone of a terminal device according to an embodiment of this application;
FIG. 5 is a schematic flowchart of a VAD method according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of another terminal device according to an embodiment of this application;
FIG. 7 is a schematic flowchart of another VAD method according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a voice activity detection apparatus according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of another voice activity detection apparatus according to an embodiment of this application.
Detailed Description
As used in this application, the terms "component", "module", "system", and the like are intended to refer to a computer-related entity, which may be hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. As an example, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes, for example according to a signal having one or more data packets (such as data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems by way of the signal across a network such as the Internet).
This application presents various aspects, embodiments, or features around a system that may include multiple devices, components, modules, and the like. It should be appreciated and understood that each system may include additional devices, components, modules, and the like, and/or may not include all the devices, components, modules, and the like discussed in conjunction with the accompanying drawings. In addition, combinations of these solutions may also be used.
In addition, in the embodiments of this application, the word "example" is used to mean serving as an example, illustration, or description. Any embodiment or design solution described as an "example" in this application should not be construed as being more preferable or advantageous than other embodiments or design solutions. Rather, the word "example" is intended to present a concept in a concrete manner.
In the embodiments of this application, the terms "information", "signal", "message", and "channel" may sometimes be used interchangeably; it should be noted that, when the differences are not emphasized, the meanings to be expressed are consistent. Similarly, "of", "relevant", and "corresponding" may sometimes be used interchangeably; when the differences are not emphasized, the meanings to be expressed are consistent.
Voice wake-up means that a terminal device enters a sleep state after being started; when a user speaks a specific wake-up word, the terminal device is woken up, switches to a working state, and waits to receive further voice commands from the user. In this process the user does not need to touch the device and can operate it directly by voice. At the same time, thanks to the voice wake-up mechanism, the terminal device does not need to remain in the working state at all times, which saves energy. Different terminal devices have different wake-up words, and a user who wants to wake up a device needs to speak its specific wake-up word.
Voice wake-up has a wide range of applications, for example, robots, mobile phones, wearable devices, smart homes, and in-vehicle devices. Almost every terminal device with a voice function needs voice wake-up technology as an entry point for human-computer interaction.
For example, taking a mobile phone as the terminal device, when the voice wake-up function is enabled, in the black-screen scenario shown in FIG. 1 or in the lock-screen scenario shown in FIG. 2, if the mobile phone detects that the user speaks the specific wake-up word "Xiaoyi Xiaoyi", the voice wake-up application (APP) of the mobile phone is woken up, the display interface of the voice wake-up APP is displayed, and the user is prompted to speak further voice commands, for example by displaying text or playing a sound such as "Hello, how can I help you?".
The terminal device involved in the embodiments of this application may be a device that has a wireless transceiver function and can cooperate with a network device to provide communication services for users. Specifically, the terminal device may be user equipment (UE), an access terminal, a user unit, a user station, a mobile station, a remote station, a remote terminal, a mobile device, a user terminal, a terminal, a wireless communication apparatus, a user agent, or a user apparatus. For example, the terminal device may be a mobile phone, a smart speaker, a smart watch, a handheld device with a wireless communication function, a computing device or another processing device connected to a wireless modem, a robot, a drone, a smart driving vehicle, a smart home device, a vehicle-mounted device, a medical device, a smart logistics device, a wearable device, or a terminal device in a future 5G network or a network after 5G, which is not limited in the embodiments of this application.
As shown in FIG. 3, the structure of the terminal device is described by taking a mobile phone as an example.
The terminal device 100 may include components such as a radio frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (Wi-Fi) module 170, a processor 180, a Bluetooth module 181, and a power supply 190.
The RF circuit 110 may be used to receive and send signals during information transmission and reception or during a call; it may receive downlink data from a base station and forward it to the processor 180 for processing, and may send uplink data to the base station. Generally, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, and the like.
The memory 120 may be used to store software programs and data. The processor 180 executes the various functions and data processing of the terminal device 100 by running the software programs or data stored in the memory 120. The memory 120 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. The memory 120 stores an operating system that enables the terminal device 100 to run, such as the operating system developed by Apple, the open-source operating system developed by Google, or the operating system developed by Microsoft. In this application, the memory 120 may store the operating system and various application programs, and may also store code for performing the methods in the embodiments of this application.
The input unit 130 (for example, a touchscreen) may be used to receive input digital or character information and to generate signal inputs related to user settings and function control of the terminal device 100. Specifically, the input unit 130 may include a touch screen 131 arranged on the front of the terminal device 100, which can collect the user's touch operations on or near it.
The display unit 140 (that is, the display screen) may be used to display information entered by the user or information provided to the user, as well as graphical user interfaces (GUIs) of the various menus of the terminal device 100. The display unit 140 may include a display screen 141 arranged on the front of the terminal device 100, where the display screen 141 may be configured in the form of a liquid crystal display, light-emitting diodes, or the like. The display unit 140 may be used to display the various graphical user interfaces described in this application. The touch screen 131 may cover the display screen 141, or the touch screen 131 and the display screen 141 may be integrated to implement the input and output functions of the terminal device 100; after integration, they may be referred to as a touch display screen for short.
The terminal device 100 may further include at least one sensor 150, such as an acceleration sensor 155, a light sensor, or a motion sensor. The terminal device 100 may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor.
Wi-Fi is a short-range wireless transmission technology. Through the Wi-Fi module 170, the terminal device 100 can help the user send and receive e-mails, browse web pages, access streaming media, and so on; it provides the user with wireless broadband Internet access.
The processor 180 is the control center of the terminal device 100. It uses various interfaces and lines to connect the parts of the entire terminal, and performs the various functions of the terminal device 100 and processes data by running or executing the software programs stored in the memory 120 and invoking the data stored in the memory 120. In this application, the processor 180 may refer to one or more processors, and the processor 180 may include one or more processing units; the processor 180 may also integrate an application processor and a baseband processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the baseband processor mainly handles wireless communication. It can be understood that the baseband processor may alternatively not be integrated into the processor 180. In this application, the processor 180 may run the operating system, application programs, user interface display and touch response, as well as the communication methods described in the embodiments of this application.
The Bluetooth module 181 is used to exchange information, through the Bluetooth protocol, with other Bluetooth devices that have a Bluetooth module. For example, the terminal device 100 may establish a Bluetooth connection through the Bluetooth module 181 with a wearable electronic device (for example, a smart watch) that also has a Bluetooth module, so as to exchange data.
The terminal device 100 further includes a power supply 190 (such as a battery) that supplies power to the components. The power supply may be logically connected to the processor 180 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
The audio circuit 160, the speaker 161, and the microphone 162 may provide an audio interface between the user and the terminal device 100. The audio circuit 160 may transmit an electrical signal converted from received audio data to the speaker 161, and the speaker 161 converts it into a sound signal for output. Conversely, the microphone 162 converts a collected sound signal into an electrical signal, which the audio circuit 160 receives and converts into audio data; the audio data is then output to the RF circuit 110 to be sent to, for example, another terminal, or output to the memory 120 for further processing.
The terminal device may include multiple microphones. As shown in FIG. 4, taking the terminal device 200 being a mobile phone as an example, the lower end of the terminal device 200 may include at least one microphone 201, which may serve as a main microphone, and the upper end of the terminal device may include at least one microphone 202, which may serve as a noise-reduction microphone.
As described above, when performing VAD on audio data, the prior art generally uses the audio data collected by the main microphone. When the main microphone is blocked, the energy of the collected audio data is too low, which affects the accuracy of VAD.
An embodiment of this application provides a VAD method: from multiple channels of audio data collected by multiple microphones, at least one channel of audio data whose autocorrelation coefficient in the high-frequency subband is high (and whose energy value is high) is selected for VAD, thereby improving the accuracy of VAD. The detection result can further be applied to voice wake-up recognition.
As shown in FIG. 5, the VAD method includes the following steps.
S501: The terminal device obtains N channels of audio data frame by frame.
As shown in FIG. 6, assume that the terminal device includes at least N microphones 601, at least N analog-to-digital converters (ADCs) 602, and a processor 603, where N is an integer greater than or equal to 2. Optionally, a buffer 604 may also be included. The output of each microphone 601 is electrically connected to the input of one ADC 602, and the output of each ADC 602 is electrically connected to the input of a VAD module 6031 in the processor 603. The VAD module 6031 may be a hardware circuit in the processor 603 and may perform this step.
Each channel of audio data may be collected by one microphone, or may be obtained by combining the signals collected by multiple microphones, or may be obtained in another way, for example from another device; this is not limited in this application.
It should be noted that the N microphones may be the N microphones with the highest signal-to-noise ratios selected from a larger set of microphones, which can reduce power consumption.
The analog audio signal collected by each microphone 601 is converted from analog to digital by the corresponding ADC 602 to obtain digital audio data.
Because a voice signal is short-term stationary, the voice signal within a 10-30 ms window can be regarded as stable, so the audio data can be divided into frames and VAD can be performed on each channel of audio data frame by frame. A frame is generally 5-20 ms long, and framing is performed with overlapping segments, that is, the previous frame and the next frame overlap; the overlapping part is called the frame shift, and the ratio of frame shift to frame length is generally 0 to 0.5. For example, assuming a 20 ms frame, if the sampling frequency of the ADC 602 is 8 kHz, each channel of audio data corresponds to 160 sampling points per frame; if the sampling frequency of the ADC 602 is 16 kHz, each channel of audio data corresponds to 320 sampling points per frame. Each of the N microphones 601 collects one channel of audio signal, so N channels of audio data are obtained in total.
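As a non-limiting illustration only, the following sketch shows one way the framing described above could be implemented. Python is used here purely for illustration; the function name, the 20 ms frame length, and the 0.5 overlap ratio are assumptions taken from the example values above, not a required implementation.

```python
import numpy as np

def split_into_frames(samples, sample_rate=16000, frame_ms=20, overlap_ratio=0.5):
    """Split one channel of audio samples into overlapping frames.

    overlap_ratio is the frame-shift-to-frame-length ratio (0 to 0.5 in the text).
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 320 points per frame at 16 kHz
    overlap = int(frame_len * overlap_ratio)          # overlapping part ("frame shift")
    hop = frame_len - overlap                         # step between successive frame starts
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop)]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```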
On the one hand, the N channels of audio data can be sent to the VAD module 6031 for VAD; on the other hand, they can be stored in the buffer 604 to prevent loss of audio data, for example for subsequent voice wake-up recognition. The storage depth of the buffer 604 is determined by the latency of the VAD algorithm: the larger the latency, the deeper the required storage.
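A minimal sketch of such a frame buffer is given below, assuming a latency of 20 frames; the class name, the depth value, and the interface are illustrative assumptions and are not part of the original disclosure.

```python
from collections import deque

VAD_LATENCY_FRAMES = 20   # assumed VAD algorithm latency, in frames (illustrative)

class FrameBuffer:
    """Keeps the most recent frames of all N channels so they are not lost
    before voice wake-up recognition runs."""
    def __init__(self, depth=VAD_LATENCY_FRAMES):
        self.frames = deque(maxlen=depth)

    def push(self, channel_frames):
        # channel_frames: the N per-channel frames for the current frame index
        self.frames.append(channel_frames)

    def dump(self):
        # Hand the buffered frames to the wake-up recognizer
        return list(self.frames)
```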
S502: For each frame, the terminal device calculates the autocorrelation coefficient of each channel of audio data in the high-frequency subband.
This step may be performed by the VAD module 6031.
Assume that the sampling frequency of the ADC 602 is 8 kHz, that is, the analog audio signal yields 8 kHz digital audio data after analog-to-digital conversion and filtering. According to the sampling theorem, the sampling frequency must be at least twice the frequency of the original signal for the original signal to be reconstructed correctly, so 8 kHz digital audio data can represent an analog audio signal with a frequency bandwidth of 0-4 kHz.
Therefore, the audio data can be subband-filtered to obtain the audio data of the high-frequency subband. Still taking a sampling frequency of 8 kHz as an example, the 0-4 kHz audio data is divided into multiple subbands; for example, with four subbands, 0-4 kHz can be divided into 0-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz, where the high-frequency subband refers to 2-4 kHz.
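The disclosure does not name a particular subband filter design. As one hedged illustration only, a band-pass filter could isolate roughly the 2-4 kHz high-frequency subband; the Butterworth design, filter order, and cutoff values below are arbitrary illustrative choices.

```python
from scipy.signal import butter, sosfilt

def high_subband(frame, sample_rate=8000, low_hz=2000, high_hz=3900):
    """Band-pass filter one frame to keep approximately the 2-4 kHz subband.

    A 4th-order Butterworth filter is used purely as an example; the upper
    cutoff is kept just below the 4 kHz Nyquist limit at 8 kHz sampling.
    """
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, frame)
```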
In a possible implementation manner, for one channel of audio data in the n-th frame, the autocorrelation coefficient of that channel in the high-frequency subband for the n-th frame may be calculated from the audio data of two adjacent frames (for example, the n-th frame and the (n-1)-th frame) or of three adjacent frames (for example, the n-th frame, the (n-1)-th frame, and the (n-2)-th frame). The autocorrelation coefficient characterizes the degree of correlation (that is, the similarity) between the audio data at two different moments; when the audio data at the two moments share the same periodic component, the maximum of the autocorrelation reflects that periodic component. Compared with a voice signal, a silent signal (or steady-state noise) has poorer autocorrelation and a relatively low autocorrelation coefficient. When the autocorrelation coefficient of a channel of audio data in the high-frequency subband is large, it can be determined that the channel of audio data may include a voice signal.
It should be noted that wording such as "may" is used here because a single frame of audio data contains only a small amount of data and different frames may differ; multiple frames must be combined to finally determine that a microphone really is not blocked and that the audio data includes a voice signal.
For example, the autocorrelation coefficient r_xx can be obtained by Formula 1 (given as an image in the original publication).
Here, τ denotes the delay, in sampling points, between the compared signals. When τ ranges over 1 to N, Formula 1 gives the autocorrelation coefficient between two adjacent frames of the n-th frame; when τ ranges over 1 to 2N, it gives the autocorrelation coefficient over three adjacent frames of the n-th frame. The meaning of the formula is that the value range of τ is traversed and the maximum of the result of Formula 1 is taken as the autocorrelation coefficient. energy(N) denotes the energy of one frame of one channel of audio data in the high-frequency subband, and is calculated by Formula 2 (given as an image in the original publication),
where N is the number of sampling points in one frame of audio data, and x(i) is the value of the i-th sampling point.
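Because Formulas 1 and 2 appear only as images in the source, their exact expressions cannot be reproduced here. The sketch below is one plausible reading of the surrounding description (frame energy as a sum of squared samples, and the autocorrelation coefficient as a correlation with the delayed signal, normalized by the frame energy and maximized over the lag τ); it is offered as an assumption, not as the patented formulas.

```python
import numpy as np

def frame_energy(x):
    """One reading of Formula 2: energy of one subband-filtered frame
    as the sum of squared samples (x has N sampling points)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x * x))

def autocorr_coeff(frame, prev_frames):
    """One reading of Formula 1: correlate the current subband frame with the
    preceding one or two frames, normalize by the frame energy, and take the
    maximum over the lag tau (tau in 1..N for two frames, 1..2N for three)."""
    x = np.asarray(frame, dtype=float)
    history = np.concatenate([np.asarray(f, dtype=float) for f in prev_frames] + [x])
    n = len(x)
    energy = frame_energy(x)
    if energy == 0.0:
        return 0.0
    max_lag = len(history) - n          # N for two frames, 2N for three
    best = 0.0
    for tau in range(1, max_lag + 1):
        delayed = history[-(n + tau):-tau]      # current frame shifted back by tau samples
        best = max(best, float(np.dot(x, delayed)) / energy)
    return best
```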
S503: For each frame, the terminal device selects, based on the autocorrelation coefficients of the N channels of audio data, at least one of the N channels of audio data for VAD.
Specifically, assume that in a certain frame the N channels of audio data include the i-th channel of audio data, where i is a positive integer less than or equal to N. If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold (for example, 0.6), the i-th channel of audio data is selected for VAD.
Optionally, if a microphone is blocked, the obtained audio data will have a low energy value in the high-frequency subband (for example, 2-4 kHz) because high-frequency signals penetrate obstacles poorly, so whether a microphone may be blocked can be determined from the energy of the audio data in the high-frequency subband. Therefore, for each frame, the terminal device may select at least one of the N channels of audio data for VAD based on the autocorrelation coefficients of the N channels of audio data in the high-frequency subband and the energy values of the N channels of audio data in the high-frequency subband.
Specifically, assume that in a certain frame the N channels of audio data include the i-th channel of audio data and the j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N.
If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold (for example, 0.6), the energy value of the j-th channel of audio data is less than the first energy threshold (for example, -50 dB), and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold (for example, 20 dB), the i-th channel of audio data is selected for VAD.
It should be noted that the above calculation is performed on the N channels of audio data in every frame, and the audio data selected may differ from frame to frame; for example, the first channel may be selected for VAD in one frame, and the second and third channels may be selected for VAD in the next.
If the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, that is, the autocorrelation coefficient of the i-th channel is relatively large, it can be determined that the i-th channel of audio data may include a voice signal.
If the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, it can be determined that the microphone corresponding to the j-th channel of audio data may be blocked and the microphone corresponding to the i-th channel of audio data may not be blocked. If the microphones corresponding to both the j-th channel and the i-th channel of audio data are blocked, or the two channels of audio data include only a silent signal or steady-state noise, the condition that the difference between the energy value of the i-th channel and the energy value of the j-th channel is greater than the second energy threshold is not met.
In summary, it can be determined that, in this frame, the microphone corresponding to the j-th channel of audio data may be blocked, while the i-th channel of audio data may include a voice signal and its corresponding microphone may not be blocked.
For example, assume N = 4, that is, four microphones correspond to four channels of audio data. The energy values of the first to fourth channels of audio data in the high-frequency subband are Rms1, Rms2, Rms3, and Rms4, and the autocorrelation coefficients are Rel1, Rel2, Rel3, and Rel4, respectively. Suppose the energy values satisfy Rms1 < Rms2 < Rms3 < Rms4 and the autocorrelation coefficients satisfy Rel1 < Rel2 < Rel3 < Rel4. If Rel3 and Rel4 are both greater than the correlation coefficient threshold, Rms1 and Rms2 are both less than the first energy threshold, and Rms3-Rms1, Rms3-Rms2, Rms4-Rms1, and Rms4-Rms2 are all greater than the second energy threshold, it can be determined that the microphones corresponding to the first and second channels of audio data may be blocked, while the third and fourth channels of audio data include a voice signal and their corresponding microphones may not be blocked. Here the first and second channels correspond to the j-th channel described above, and the third and fourth channels correspond to the i-th channel described above.
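As an illustrative sketch only, the per-frame channel selection described above could look like the following. The thresholds 0.6, -50 dB, and 20 dB are the example values from the text; the function name and the pairwise scan over channels are assumptions, not a required implementation.

```python
def select_channels_for_vad(rel, rms_db,
                            rel_thresh=0.6, energy_thresh_db=-50.0, diff_thresh_db=20.0):
    """Pick the channels to run VAD on for one frame.

    rel:    high-subband autocorrelation coefficient per channel
    rms_db: high-subband energy value per channel, in dB
    Channel i is selected if its autocorrelation exceeds the threshold and some
    other channel j looks blocked (low energy) and much quieter than channel i.
    """
    selected = []
    for i in range(len(rel)):
        if rel[i] <= rel_thresh:
            continue
        for j in range(len(rel)):
            if j == i:
                continue
            if rms_db[j] < energy_thresh_db and (rms_db[i] - rms_db[j]) > diff_thresh_db:
                selected.append(i)
                break
    return selected
```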
For each frame, the process of performing VAD on at least one channel of audio data includes, but is not limited to: calculating the voice level, calculating the signal-to-noise ratio, calculating the speech level, and calculating the cross-correlation coefficient.
The voice level characterizes the amplitude of the signal; the voice level rms can be obtained by Formula 3:
rms = 20 * log10(energy(N))    (Formula 3)
The signal-to-noise ratio characterizes the proportions of signal and noise in the speech; the signal-to-noise ratio snr can be obtained by Formula 4 (given as an image in the original publication).
The speech level characterizes the confidence level that speech is present; the speech level SpeechLevel can be obtained by Formula 5 (given as an image in the original publication).
The cross-correlation coefficient characterizes the similarity level of the speech; the cross-correlation coefficient r_xy can be obtained by Formula 6 (given as an image in the original publication).
When the voice level, the signal-to-noise ratio, the speech level, and the cross-correlation coefficient all exceed their corresponding thresholds, it can be determined that the audio data of the frame includes a voice signal.
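Formulas 4-6 are available only as images, so only the structure of the per-frame decision can be illustrated here. In the sketch below, the threshold values are stand-ins, the snr, speech_level, and cross_corr inputs are assumed to be computed elsewhere, and only the rms expression follows Formula 3; none of this should be read as the patented formulas.

```python
import math

def frame_has_speech(frame_energy_value, snr, speech_level, cross_corr,
                     rms_thresh=-40.0, snr_thresh=6.0,
                     speech_level_thresh=0.5, cross_corr_thresh=0.5):
    """Per-frame VAD decision: the frame is marked as containing speech only if
    the voice level (Formula 3), the SNR, the speech level, and the
    cross-correlation coefficient all exceed their thresholds.
    Threshold values here are illustrative assumptions."""
    rms = 20.0 * math.log10(frame_energy_value) if frame_energy_value > 0 else float("-inf")
    return (rms > rms_thresh and snr > snr_thresh and
            speech_level > speech_level_thresh and cross_corr > cross_corr_thresh)
```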
In the voice activity detection method provided by the embodiments of this application, after N channels of audio data are obtained frame by frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband is calculated for each frame, and, based on the autocorrelation coefficients of the N channels, at least one of the N channels of audio data is selected for VAD, so as to detect whether each frame of audio data includes a voice signal. Within a frame, a voice signal has a larger autocorrelation coefficient than a silent signal (or steady-state noise), so the autocorrelation coefficient indicates whether a frame of audio data may include a voice signal. The channels that may include a voice signal are selected for VAD to determine whether the frame of audio data includes a voice signal. Because VAD is performed on the audio data that is more likely to include a voice signal, the accuracy of VAD can be improved. In addition, VAD can still be performed normally even if some microphones are blocked.
Optionally, as shown in FIG. 7, the VAD method further includes:
S701: When the number of frames detected as including a voice signal among M frames meets a condition, determine that the M frames include a voice signal.
Here, M is a positive integer, for example, M = 20.
In a possible implementation manner, if at least m1 (for example, 12) frames among the M frames are detected as including a voice signal, it is determined that the M frames include a voice signal, where m1 is less than or equal to M. That is, for one channel of audio data, it is not required that consecutive frames be detected as including a voice signal; it suffices that at least m1 frames among the M frames as a whole are detected as including a voice signal.
In another possible implementation manner, if at least m2 (for example, 8) consecutive frames among the M frames are detected as including a voice signal, it is determined that the M frames include a voice signal, where m2 is less than or equal to M. That is, for one channel of audio data, at least m2 consecutive frames of audio data are detected as including a voice signal.
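A hedged sketch of the two window-level conditions above is given below. M = 20, m1 = 12, and m2 = 8 are the example values from the text; the two conditions are alternative implementations in the text, and the sketch simply accepts either one. The function name is an assumption.

```python
def window_has_speech(per_frame_flags, m1=12, m2=8):
    """per_frame_flags: M booleans, one per frame, from the per-frame VAD.

    Condition 1: at least m1 of the M frames contain speech, or
    Condition 2: at least m2 consecutive frames contain speech.
    """
    if sum(per_frame_flags) >= m1:
        return True
    run = best = 0
    for flag in per_frame_flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best >= m2
```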
Optionally, as shown in FIG. 7, the VAD method further includes:
S702: When the number of frames detected as including a voice signal among the M frames meets the condition, perform voice wake-up recognition on the N channels of audio data of the M frames.
Voice wake-up recognition may be implemented by software executed by the processor 603 shown in FIG. 6. When the number of frames in which at least one channel of audio data among the M frames is detected as including a voice signal meets the condition, the VAD module 6031 may generate an interrupt, thereby triggering the processor 603 to perform voice wake-up recognition on the N channels of audio data of the M frames.
In other words, when it is determined that the M frames of audio data include a voice signal, voice wake-up recognition is performed on the N channels of audio data of the M frames; if the voice wake-up recognition succeeds, the voice wake-up display interface is displayed and the user is prompted to speak further voice commands. When the above condition is not met, voice wake-up recognition does not need to be performed on the audio data, which saves energy and reduces the false alarm rate of voice wake-up.
It can be understood that, in the above embodiments, the methods and/or steps implemented by the terminal device may also be implemented by a component (for example, a chip or a circuit) usable in the terminal device.
An embodiment of this application further provides a voice activity detection apparatus, which is configured to implement the various methods described above. The voice activity detection apparatus may be the terminal device in the foregoing method embodiments, an apparatus including the terminal device, or a chip or functional module in the terminal device.
It can be understood that, to implement the above functions, the voice activity detection apparatus includes corresponding hardware structures and/or software modules for performing the functions. A person skilled in the art should easily realize that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
In the embodiments of this application, the voice activity detection apparatus may be divided into functional modules according to the foregoing method embodiments; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division into modules in the embodiments of this application is schematic and is merely a logical function division; other division manners may be used in actual implementation.
For example, take the voice activity detection apparatus being the terminal device in the foregoing method embodiments. FIG. 8 is a schematic structural diagram of a voice activity detection apparatus 80. The voice activity detection apparatus 80 includes an acquisition module 801, a calculation module 802, and a selection module 803, and optionally further includes a determination module 804 and a voice wake-up module 805. The acquisition module 801 may perform step S501 in FIG. 5 and step S501 in FIG. 7. The calculation module 802 may perform step S502 in FIG. 5 and step S502 in FIG. 7. The selection module 803 may perform step S503 in FIG. 5 and step S503 in FIG. 7. The determination module 804 may perform step S701 in FIG. 7. The voice wake-up module 805 may perform step S702 in FIG. 7.
For example, the acquisition module 801 is configured to obtain N channels of audio data frame by frame, where N is an integer greater than or equal to 2; the calculation module 802 is configured to calculate, for each frame, the autocorrelation coefficient of each channel of audio data in the high-frequency subband; and the selection module 803 is configured to select, for each frame and based on the autocorrelation coefficients of the N channels of audio data, at least one of the N channels of audio data for VAD.
In a possible implementation manner, the selection module 803 is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, select the i-th channel of audio data for VAD.
In a possible implementation manner, the selection module 803 is specifically configured to: select at least one of the N channels of audio data for VAD based on the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband.
In a possible implementation manner, the N channels of audio data include an i-th channel of audio data and a j-th channel of audio data, where i ≠ j, and i and j are positive integers less than or equal to N; the selection module 803 is specifically configured to: if the autocorrelation coefficient of the i-th channel of audio data is greater than the correlation coefficient threshold, the energy value of the j-th channel of audio data is less than the first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than the second energy threshold, select the i-th channel of audio data for VAD.
In a possible implementation manner, the determination module 804 is configured to: when the number of frames detected as including a voice signal among M frames meets a condition, determine that the M frames include a voice signal, where M is a positive integer.
In a possible implementation manner, that the number of frames detected as including a voice signal among the M frames meets the condition includes: at least m1 frames among the M frames are detected as including a voice signal, where m1 is less than or equal to M.
In a possible implementation manner, that the number of frames detected as including a voice signal among the M frames meets the condition includes: at least m2 consecutive frames among the M frames are detected as including a voice signal, where m2 is less than or equal to M.
In a possible implementation manner, the voice wake-up module 805 is configured to: when the number of frames detected as including a voice signal among the M frames meets the condition, perform voice wake-up recognition on the N channels of audio data of the M frames.
In a possible implementation manner, each channel of audio data is collected by one microphone, and each microphone is one of the microphones with the highest signal-to-noise ratio selected from a plurality of microphones.
In this embodiment, the voice activity detection apparatus 80 is presented in a form in which the functional modules are divided in an integrated manner. A "module" here may refer to a specific ASIC, a circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the above functions.
Specifically, the functions/implementation processes of the modules in FIG. 8 may be implemented by a processor in the terminal device invoking computer-executable instructions stored in a memory.
Because the voice activity detection apparatus 80 provided in this embodiment can perform the above methods, for the technical effects it can achieve, refer to the foregoing method embodiments; details are not repeated here.
As shown in FIG. 9, an embodiment of this application further provides a voice activity detection apparatus. The voice activity detection apparatus 90 includes a processor 901 and a memory 902, where the processor 901 is coupled to the memory 902. When the processor 901 executes the computer program or instructions in the memory 902, the corresponding methods in FIG. 5 and FIG. 7 are performed.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when run on a computer or a processor, causes the computer or the processor to perform the corresponding methods in FIG. 5 and FIG. 7.
An embodiment of this application further provides a computer program product containing instructions. When the instructions are run on a computer or a processor, the computer or the processor performs the corresponding methods in FIG. 5 and FIG. 7.
An embodiment of this application provides a chip system. The chip system includes a processor, configured to enable the voice activity detection apparatus to perform the corresponding methods in FIG. 5 and FIG. 7.
In a possible design, the chip system further includes a memory, configured to store necessary program instructions and data. The chip system may consist of a chip or an integrated circuit, or may include a chip and other discrete devices; this is not specifically limited in the embodiments of this application.
The voice activity detection apparatus, chip, computer storage medium, computer program product, and chip system provided in this application are all configured to perform the methods described above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the implementations provided above, which are not repeated here.
The processor involved in the embodiments of this application may be a chip. For example, it may be a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a micro controller unit (MCU), a programmable logic device (PLD), or another integrated chip.
The memory involved in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM). It should be noted that the memories of the systems and methods described herein are intended to include, but are not limited to, these and any other suitable types of memories.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voice activity detection (VAD) method, characterized by comprising:
    acquiring N channels of audio data frame by frame, where N is an integer greater than or equal to 2;
    for each frame, calculating an autocorrelation coefficient of each channel of audio data in a high-frequency subband; and
    for each frame, selecting, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD.
  2. The method according to claim 1, wherein the N channels of audio data comprise an i-th channel of audio data, i being a positive integer less than or equal to N, and the selecting, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD comprises:
    if the autocorrelation coefficient of the i-th channel of audio data is greater than a correlation coefficient threshold, selecting the i-th channel of audio data for VAD.
  3. The method according to claim 1, wherein the selecting, according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD comprises:
    selecting, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD.
  4. The method according to claim 3, wherein the N channels of audio data comprise an i-th channel of audio data and a j-th channel of audio data, i≠j, i and j being positive integers less than or equal to N, and the selecting, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD comprises:
    if the autocorrelation coefficient of the i-th channel of audio data is greater than a correlation coefficient threshold, the energy value of the j-th channel of audio data is less than a first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than a second energy threshold, selecting the i-th channel of audio data for VAD.
  5. The method according to any one of claims 1 to 4, further comprising:
    when the number of frames detected as comprising a voice signal among M frames meets a condition, determining that the M frames comprise a voice signal, M being a positive integer.
  6. The method according to claim 5, wherein the number of frames detected as comprising a voice signal among the M frames meeting the condition comprises:
    at least m1 of the M frames being detected as comprising a voice signal, m1 being less than or equal to M.
  7. The method according to claim 5, wherein the number of frames detected as comprising a voice signal among the M frames meeting the condition comprises:
    at least m2 consecutive frames among the M frames being detected as comprising a voice signal, m2 being less than or equal to M.
  8. The method according to any one of claims 1 to 7, further comprising:
    when the number of frames detected as comprising a voice signal among M frames meets a condition, performing voice wake-up recognition on the N channels of audio data of the M frames.
  9. The method according to any one of claims 1 to 8, wherein each channel of the audio data is collected by one microphone, the microphone being the microphone with the highest signal-to-noise ratio selected from a plurality of microphones.
  10. A voice activity detection (VAD) apparatus, characterized by comprising:
    an acquisition module, configured to acquire N channels of audio data frame by frame, where N is an integer greater than or equal to 2;
    a calculation module, configured to calculate, for each frame, an autocorrelation coefficient of each channel of audio data in a high-frequency subband; and
    a selection module, configured to select, for each frame and according to the autocorrelation coefficients of the N channels of audio data, at least one channel of the N channels of audio data on which to perform VAD.
  11. The apparatus according to claim 10, wherein the selection module is specifically configured to:
    if the autocorrelation coefficient of an i-th channel of audio data is greater than a correlation coefficient threshold, select the i-th channel of audio data for VAD.
  12. The apparatus according to claim 10, wherein the selection module is specifically configured to:
    select, according to the autocorrelation coefficients of the N channels of audio data and the energy values of the N channels of audio data in the high-frequency subband, at least one channel of the N channels of audio data on which to perform VAD.
  13. The apparatus according to claim 12, wherein the N channels of audio data comprise an i-th channel of audio data and a j-th channel of audio data, i≠j, i and j being positive integers less than or equal to N, and the selection module is specifically configured to:
    if the autocorrelation coefficient of the i-th channel of audio data is greater than a correlation coefficient threshold, the energy value of the j-th channel of audio data is less than a first energy threshold, and the difference between the energy value of the i-th channel of audio data and the energy value of the j-th channel of audio data is greater than a second energy threshold, select the i-th channel of audio data for VAD.
  14. The apparatus according to any one of claims 10 to 13, further comprising a determining module, configured to:
    when the number of frames detected as comprising a voice signal among M frames meets a condition, determine that the M frames comprise a voice signal, M being a positive integer.
  15. The apparatus according to claim 14, wherein the number of frames detected as comprising a voice signal among the M frames meeting the condition comprises:
    at least m1 of the M frames being detected as comprising a voice signal, m1 being less than or equal to M.
  16. The apparatus according to claim 14, wherein the number of frames detected as comprising a voice signal among the M frames meeting the condition comprises:
    at least m2 consecutive frames among the M frames being detected as comprising a voice signal, m2 being less than or equal to M.
  17. The apparatus according to any one of claims 10 to 16, further comprising a voice wake-up module, configured to:
    when the number of frames detected as comprising a voice signal among M frames meets a condition, perform voice wake-up recognition on the N channels of audio data of the M frames.
  18. The apparatus according to any one of claims 10 to 17, wherein each channel of the audio data is collected by one microphone, the microphone being the microphone with the highest signal-to-noise ratio selected from a plurality of microphones.
  19. A voice activity detection apparatus, characterized by comprising:
    a memory, configured to store a computer program; and
    a processor connected to the memory, configured to invoke the computer program stored in the memory to cause the voice activity detection apparatus to perform the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, comprising a computer program which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 9.
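
The per-frame features named in claims 1 and 3 can be pictured with a short sketch. This is a minimal illustration only, not the implementation disclosed in this application: the 16 kHz sampling rate, the 2 kHz subband boundary, the FFT-based subband split, and the lag-1 normalized autocorrelation are all assumptions introduced for the example.

```python
import numpy as np

def high_band(frame, fs=16000, cutoff_hz=2000):
    """Keep only the high-frequency subband of one frame via FFT masking.

    The 2 kHz cutoff and the FFT-based split are assumptions for this sketch;
    the application does not fix a particular subband boundary or filter.
    """
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(frame))

def autocorr_coeff(x, lag=1):
    """Normalized autocorrelation coefficient at a small lag (lag=1 is assumed here)."""
    x = x - np.mean(x)
    denom = np.dot(x, x) + 1e-12            # guard against all-zero (silent) frames
    return float(np.dot(x[:-lag], x[lag:]) / denom)

def frame_features(frames_per_channel, fs=16000):
    """For one frame of each of the N channels, return (autocorrelation, energy)
    of the high-frequency subband.

    frames_per_channel: array of shape (N, frame_len), one frame per microphone channel.
    """
    feats = []
    for frame in frames_per_channel:
        hb = high_band(frame, fs)
        feats.append((autocorr_coeff(hb), float(np.mean(hb ** 2))))
    return feats
```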
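Given those per-frame features, the channel gating recited in claims 2 and 4 might look as follows. The threshold values are placeholders, and the pairwise comparison of channel i against every other channel j is one possible reading of the claim-4 condition; neither is taken from the application.

```python
def select_channels_for_vad(feats,
                            corr_thresh=0.5,    # assumed correlation coefficient threshold
                            energy_low=1e-4,    # assumed first energy threshold
                            energy_gap=1e-3):   # assumed second energy threshold
    """Return the indices of channels on which VAD should be run for this frame.

    feats: list of (autocorrelation, energy) tuples, one per channel,
    e.g. as produced by frame_features() in the previous sketch.
    """
    selected = []
    for i, (corr_i, energy_i) in enumerate(feats):
        if corr_i <= corr_thresh:
            continue  # claim-2 style test: autocorrelation must exceed the threshold
        for j, (_, energy_j) in enumerate(feats):
            if j == i:
                continue
            # claim-4 style test: another channel j is quiet in the high-frequency
            # subband and channel i is sufficiently louder than it
            if energy_j < energy_low and (energy_i - energy_j) > energy_gap:
                selected.append(i)
                break
    return selected
```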
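The M-frame smoothing of claims 5 to 8 can likewise be sketched as a vote over per-frame VAD flags before wake-up recognition is triggered. The values of M, m1, and m2 and the `wake_up_recognition` callable are hypothetical; the application leaves the concrete window size, counts, and recognizer open.

```python
def speech_in_window(frame_flags, m1=None, m2=None):
    """Decide whether a window of M frames contains a voice signal.

    frame_flags: list of M booleans, True where per-frame VAD reported speech.
    m1: claim-6 style condition - at least m1 of the M frames are speech frames.
    m2: claim-7 style condition - at least m2 consecutive frames are speech frames.
    Either condition may be used; the example values are assumptions.
    """
    if m1 is not None and sum(frame_flags) >= m1:
        return True
    if m2 is not None:
        run = 0
        for flag in frame_flags:
            run = run + 1 if flag else 0
            if run >= m2:
                return True
    return False

def maybe_wake_up(window_frames, frame_flags, wake_up_recognition, m1=3):
    """Claim-8 style gating (hypothetical wiring): only when the window-level
    decision is positive are the buffered M frames of all N channels handed
    to the caller-supplied wake-up recognizer."""
    if speech_in_window(frame_flags, m1=m1):
        wake_up_recognition(window_frames)
```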
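Claim 9 only requires that each channel come from the microphone with the highest signal-to-noise ratio among several candidates. A trivial way to express that choice is shown below; the SNR estimator is left abstract because the application does not specify one.

```python
def pick_best_microphone(mic_frames, estimate_snr):
    """Return the index of the microphone with the highest estimated SNR.

    mic_frames: sequence of per-microphone audio frames.
    estimate_snr: caller-supplied SNR estimator (not specified by the application).
    """
    snrs = [estimate_snr(frame) for frame in mic_frames]
    return int(max(range(len(snrs)), key=lambda k: snrs[k]))
```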
PCT/CN2020/096392 2020-06-16 2020-06-16 Voice activity detection method and apparatus WO2021253235A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080101920.6A CN115699173A (en) 2020-06-16 2020-06-16 Voice activity detection method and device
PCT/CN2020/096392 WO2021253235A1 (en) 2020-06-16 2020-06-16 Voice activity detection method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/096392 WO2021253235A1 (en) 2020-06-16 2020-06-16 Voice activity detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2021253235A1 true WO2021253235A1 (en) 2021-12-23

Family

ID=79269055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096392 WO2021253235A1 (en) 2020-06-16 2020-06-16 Voice activity detection method and apparatus

Country Status (2)

Country Link
CN (1) CN115699173A (en)
WO (1) WO2021253235A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862685B (en) * 2023-02-27 2023-09-15 全时云商务服务股份有限公司 Real-time voice activity detection method and device and electronic equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101790752A (en) * 2007-09-28 2010-07-28 高通股份有限公司 Multiple microphone voice activity detector
CN102077274A (en) * 2008-06-30 2011-05-25 杜比实验室特许公司 Multi-microphone voice activity detector
CN103456305A (en) * 2013-09-16 2013-12-18 东莞宇龙通信科技有限公司 Terminal and speech processing method based on multiple sound collecting units
KR101711302B1 (en) * 2015-10-26 2017-03-02 한양대학교 산학협력단 Discriminative Weight Training for Dual-Microphone based Voice Activity Detection and Method thereof
CN108039182A (en) * 2017-12-22 2018-05-15 西安烽火电子科技有限责任公司 A kind of voice-activation detecting method
CN108597498A (en) * 2018-04-10 2018-09-28 广州势必可赢网络科技有限公司 A kind of multi-microphone voice acquisition method and device
CN108986833A (en) * 2018-08-21 2018-12-11 广州市保伦电子有限公司 Sound pick-up method, system, electronic equipment and storage medium based on microphone array
CN109360585A (en) * 2018-12-19 2019-02-19 晶晨半导体(上海)股份有限公司 A kind of voice-activation detecting method

Also Published As

Publication number Publication date
CN115699173A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
US10601599B2 (en) Voice command processing in low power devices
US11798531B2 (en) Speech recognition method and apparatus, and method and apparatus for training speech recognition model
US20180332416A1 (en) Utilizing digital microphones for low power keyword detection and noise suppression
CN105869655B (en) Audio devices and speech detection method
US9620116B2 (en) Performing automated voice operations based on sensor data reflecting sound vibration conditions and motion conditions
WO2021179965A1 (en) Information reporting method, access method determination method, terminal and network device
CN107393548B (en) Method and device for processing voice information collected by multiple voice assistant devices
KR20200027554A (en) Speech recognition method and apparatus, and storage medium
WO2021023061A1 (en) Quasi-co-location qcl information determination method, configuration method and related device
CN106782613B (en) Signal detection method and device
CN109672775B (en) Method, device and terminal for adjusting awakening sensitivity
CN106847307B (en) Signal detection method and device
CN111477243B (en) Audio signal processing method and electronic equipment
CN107360318B (en) Voice noise reduction method and device, mobile terminal and computer readable storage medium
WO2021179966A1 (en) Signal transmission method, information indication method, and communication device
CN109243488B (en) Audio detection method, device and storage medium
US9508345B1 (en) Continuous voice sensing
CN110136733B (en) Method and device for dereverberating audio signal
WO2021253235A1 (en) Voice activity detection method and apparatus
US11812281B2 (en) Measuring method, terminal and network side device
WO2018033031A1 (en) Positioning method and device
EP3493200B1 (en) Voice-controllable device and method of voice control
CN106782614B (en) Sound quality detection method and device
CN116935883B (en) Sound source positioning method and device, storage medium and electronic equipment
CN113593619B (en) Method, apparatus, device and medium for recording audio

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20941331

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20941331

Country of ref document: EP

Kind code of ref document: A1