CN113113050A

CN113113050A - Voice activity detection method, electronic equipment and device

Info

Publication number: CN113113050A
Application number: CN202110506143.XA
Authority: CN
Inventors: 何陈; 叶顺舟; 康力; 巴莉芳
Original assignee: Unisoc Chongqing Technology Co Ltd
Current assignee: Unisoc Chongqing Technology Co Ltd
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2021-07-13

Abstract

The application discloses a voice activity detection method, electronic equipment and a device, wherein the method comprises the following steps: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy. The method described in the application is beneficial to improving the accuracy of detecting the voice activity.

Description

Voice activity detection method, electronic equipment and device

Technical Field

The present invention relates to the field of communications, and in particular, to a voice activity detection method, an electronic device, and an apparatus.

Background

Voice Activity Detection (VAD) analyzes characteristics of an audio signal, such as energy, zero-crossing rate, and harmonics, to determine whether the audio signal contains voice. VAD techniques are mainly used to simplify speech processing. For example, in Internet Protocol (IP) telephony applications, silent packets are neither encoded nor transmitted, which saves computation time and bandwidth.

Currently, voice activity detection in commercial products is mainly based on the air conduction signal received by an Air Conduction (AC) microphone. However, the air conduction signal is easily affected by environmental noise, and heavy noise reduces the accuracy of voice activity detection.

Disclosure of Invention

The application provides a voice activity detection method, electronic equipment and a voice activity detection device, which are beneficial to improving the accuracy of voice activity detection.

In a first aspect, the present application provides a method for voice activity detection, including: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

In one possible implementation, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy includes: determining whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first total energy to the second total energy; and if the target energy value is greater than the second threshold and less than or equal to the third threshold, determining that the first sound signal and the second sound signal have speech.

In one possible implementation, if the target energy value is less than or equal to the second threshold, or the target energy value is greater than the third threshold, it is determined that no speech is present in the first sound signal and the second sound signal.

In one possible implementation, it is determined that the first sound signal and the second sound signal have no speech if the second total energy is less than or equal to the first threshold.

In one possible implementation, if the number of consecutive determinations that no speech is present in the received first sound signal and second sound signal exceeds a preset number of times, the first sound signal and the second sound signal are acquired again only after waiting for a preset time.

In a second aspect, the present application provides a voice activity detection apparatus, including an acquiring module and a voice detection module: the acquiring module is used for acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by the air conduction microphone, and the second sound signal is a sound signal received by the bone conduction microphone; the voice detection module is used for: determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

In a third aspect, the present application provides an electronic device comprising an air conduction microphone, a bone conduction microphone, a memory, and at least one processor; the air conduction microphone is used for receiving a first sound signal; the bone conduction microphone is used for receiving a second sound signal; the memory is coupled with the at least one processor and is used for storing computer program code, the computer program code comprising computer instructions; the at least one processor is configured to invoke the computer instructions from the memory to execute the method of the first aspect.

In a fourth aspect, the present application provides a chip configured to: acquire a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determine a first total energy of the first sound signal and a second total energy of the second sound signal; determine whether the second total energy is greater than a first threshold; and if the second total energy is greater than the first threshold, determine whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

In a fifth aspect, the present application provides a module device, which includes an air conduction microphone module, a bone conduction microphone module, a power module, a storage module and a chip module, wherein: the air conduction microphone module is used for receiving a first sound signal; the bone conduction microphone module is used for receiving a second sound signal; the power module is used for providing electric energy for the module equipment; the storage module is used for storing data and instructions; this chip module is used for: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

In a sixth aspect, the present application provides a computer-readable storage medium having computer-readable instructions stored therein, which, when run on a communication apparatus, cause the communication apparatus to perform the method of the first aspect and any one of its possible implementations.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of a voice activity detection system according to an embodiment of the present application;

Fig. 2 is a flowchart of a voice activity detection method provided by an embodiment of the present application;

Fig. 3 is a flowchart of another voice activity detection method provided by an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;

Fig. 5 is a schematic diagram of an apparatus according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a module device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the following embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the listed items.

It should be noted that the terms "first," "second," "third," and the like in the description and claims of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than described or illustrated herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a voice activity detection system according to an embodiment of the present application. The voice activity detection system comprises at least one air conduction microphone, at least one bone conduction microphone and at least one voice activity detection module, and the number of the air conduction microphones, the bone conduction microphones and the voice activity detection modules is not limited in the embodiment of the application. The air conduction microphone and the bone conduction microphone are used for receiving sound signals, the first sound signal received by the air conduction microphone is an air conduction signal, and the second sound signal received by the bone conduction microphone is a bone conduction signal. The voice activity detection module is used for detecting the first sound signal received by the air conduction microphone and the second sound signal received by the bone conduction microphone and determining whether the first sound signal and the second sound signal have voice. The voice activity detection system can be applied to earphones or electronic equipment comprising a bone conduction microphone and an air conduction microphone, and the like.

Referring to fig. 2, fig. 2 is a flowchart of a voice activity detection method according to an embodiment of the present application. The method may be applied to an electronic device or to a chip in the electronic device; fig. 2 is described with the electronic device as the execution subject. The same applies to the voice activity detection methods shown in the other figures of the embodiments of the present application, and this is not repeated hereinafter. The voice activity detection method of the embodiment of the application comprises steps 201 to 204:

201. The electronic device acquires a first sound signal that is a sound signal received by the air conduction microphone and a second sound signal that is a sound signal received by the bone conduction microphone.

In an embodiment of the present application, the electronic device acquires the first sound signal and the second sound signal as follows. The electronic device obtains a first time domain signal of the first sound signal from the air conduction microphone and a second time domain signal of the second sound signal from the bone conduction microphone. After acquiring the first time domain signal and the second time domain signal, the electronic device performs framing processing on both signals. After framing, the electronic device performs time-frequency conversion, converting the first time domain signal into a first frequency domain signal and the second time domain signal into a second frequency domain signal. After time-frequency conversion, the electronic device obtains a discrete Fourier transform function of the first sound signal and a discrete Fourier transform function of the second sound signal, where the discrete Fourier transform function of the first sound signal is S_A(k, m), the discrete Fourier transform function of the second sound signal is S_B(k, m), k is the frequency index, and m is the frame index.
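As a non-limiting illustration, the framing and time-frequency conversion described above may be sketched in Python as follows. The function names, the frame length, the hop size, and the use of a naive discrete Fourier transform without windowing are assumptions made for this sketch; the application does not specify them. The sketch stores the coefficients as S[m][k], which corresponds to S(k, m) in the description.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop):
    """Split a time domain signal into frames of frame_len samples, advancing by hop."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def dft(frame):
    """Naive discrete Fourier transform: returns S(k) for k = 0 .. N-1."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def spectrogram(samples, frame_len=8, hop=8):
    """Return S[m][k], the DFT coefficient at frequency index k of frame m.

    frame_len and hop are hypothetical defaults chosen only for illustration.
    """
    return [dft(frame) for frame in split_into_frames(samples, frame_len, hop)]


# Usage: a cosine with one cycle per 8-sample frame, 16 samples long.
samples = [math.cos(2 * math.pi * t / 8) for t in range(16)]
spec = spectrogram(samples)
```

The same routine would be applied separately to the air conduction signal and to the bone conduction signal to obtain S_A(k, m) and S_B(k, m). A practical implementation would likely use a fast Fourier transform and an analysis window, which this sketch omits.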

202. The electronic device determines a first total energy of the first sound signal and a second total energy of the second sound signal.

In the embodiment of the present application, the first total energy may be calculated by the formula

E_A(m) = \sum_{k} |S_A(k, m)|^2

and the second total energy may be calculated by the formula

E_B(m) = \sum_{k} |S_B(k, m)|^2

where E_A is the first total energy, E_B is the second total energy, S_A(k, m) is the discrete Fourier transform function of the first sound signal, S_B(k, m) is the discrete Fourier transform function of the second sound signal, k is the frequency index, and m is the frame index.
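As a minimal sketch, assuming that the total energy of frame m is the sum over all frequency indices k of the squared magnitude of the discrete Fourier transform coefficients, the per-frame energies might be computed as follows. The function names are hypothetical and are not taken from the application.

```python
def frame_energy(spectrum):
    """Total energy of one frame: sum over k of |S(k, m)|^2.

    spectrum is the list of complex DFT coefficients of a single frame.
    """
    return sum(abs(coefficient) ** 2 for coefficient in spectrum)


def total_energies(spectrogram):
    """Return the total energy of every frame; spectrogram[m][k] holds S(k, m)."""
    return [frame_energy(frame) for frame in spectrogram]


# Usage: two frames with two frequency bins each.
energies = total_energies([[3 + 4j, 1j], [0j, 2 + 0j]])
```

Applying total_energies to the coefficients of the air conduction signal gives E_A(m), and applying it to those of the bone conduction signal gives E_B(m).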

203. The electronic device determines whether the second total energy is greater than a first threshold.

In the embodiment of the application, the electronic device compares the second total energy with the first threshold. If the second total energy is greater than the first threshold, it is preliminarily determined that the second sound signal may contain voice. The second sound signal is the bone conduction signal received by the bone conduction microphone, and the first sound signal is the air conduction signal received by the air conduction microphone. Because the bone conduction microphone does not directly face noise in the air, the bone conduction signal is more robust to noise, whereas the air conduction signal is more easily mixed with noise. A preliminary judgment of whether the first sound signal and the second sound signal contain voice is therefore more accurate when based on the second total energy than when based on the first total energy.

204. If the second total energy is greater than the first threshold, the electronic device determines whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

The specific implementation manner of step 204 may refer to subsequent steps 305 to 307.

Through the method, the electronic equipment carries out voice activity detection based on the first sound signal received by the air conduction microphone and the second sound signal received by the bone conduction microphone, and the bone conduction microphone does not directly face noise in the air, so that the received second sound signal has strong noise robustness, and the problem that the first sound signal received by the air conduction microphone is easily influenced by environmental noise can be effectively avoided. In this way, the accuracy of voice activity detection is advantageously improved.

Referring to fig. 3, fig. 3 is a flowchart illustrating another method for detecting voice activity according to an embodiment of the present application, including steps 301 to 307.

301. The electronic device acquires a first sound signal that is a sound signal received by the air conduction microphone and a second sound signal that is a sound signal received by the bone conduction microphone.

302. The electronic device determines a first total energy of the first sound signal and a second total energy of the second sound signal.

303. The electronic device determines whether the second total energy is greater than a first threshold. If the second total energy is greater than the first threshold, go to step 305; if the second total energy is less than or equal to the first threshold, step 304 is performed.

The specific implementation manners of steps 301 to 303 are the same as those of steps 201 to 203, and are not described herein again in this embodiment of the present application.

304. The electronic device determines that no speech is present in the first sound signal and the second sound signal.

In the embodiment of the application, the bone conduction microphone does not directly face noise and is therefore more robust to noise, so whether the received first sound signal and second sound signal contain voice can be preliminarily judged from the magnitude of the second total energy. If the second total energy is greater than the first threshold, there are two possibilities: voice is present, or a large amount of noise is present, and further judgment is needed. If the second total energy is less than or equal to the first threshold, it is determined that the first sound signal and the second sound signal do not contain voice. Preliminarily screening the first sound signal and the second sound signal based on the second total energy helps improve the accuracy of voice activity detection.

305. The electronic device determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first total energy to the second total energy. If the target energy value is greater than the second threshold and less than or equal to the third threshold, go to step 306; if the target energy value is less than or equal to the second threshold or greater than the third threshold, step 307 is performed.

In the embodiment of the present application, the target energy value is the ratio of the first total energy to the second total energy. Expressed in decibels, the target energy value is:

10 \log_{10} \frac{E_A(m)}{E_B(m)}

where E_A is the first total energy, E_B is the second total energy, and m is the frame index.

In the embodiment of the application, the second threshold is used for eliminating errors caused by self-interference of the bone conduction microphone, and the third threshold represents the noise suppression capability of the bone conduction microphone: the stronger the noise suppression capability of the bone conduction microphone, the larger the third threshold. In practice, the bone conduction microphone does not directly face the noise in the air, so the noise present in the second sound signal is smaller than the noise in the first sound signal, and in the general case the second total energy is slightly smaller than the first total energy. Therefore, the first sound signal and the second sound signal are determined to contain voice only when the target energy value is greater than the second threshold and less than or equal to the third threshold, which improves the accuracy of voice activity detection.
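As an illustrative sketch, the target energy value in decibels and the comparison against the second and third thresholds might be written as follows. The function names and the threshold values used in the usage lines are hypothetical; the application does not disclose concrete threshold values.

```python
import math


def energy_ratio_db(first_total_energy, second_total_energy):
    """Ratio of the first total energy to the second total energy, in decibels.

    Both energies are assumed to be strictly positive.
    """
    return 10.0 * math.log10(first_total_energy / second_total_energy)


def ratio_in_speech_window(ratio_db, second_threshold_db, third_threshold_db):
    """True when second threshold < ratio <= third threshold."""
    return second_threshold_db < ratio_db <= third_threshold_db


# Usage with hypothetical thresholds of 0 dB and 10 dB.
ratio = energy_ratio_db(100.0, 10.0)
in_window = ratio_in_speech_window(ratio, 0.0, 10.0)
```

The lower bound is exclusive and the upper bound is inclusive, matching the wording "greater than the second threshold and less than or equal to the third threshold".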

306. The electronic device determines that speech is present in the first sound signal and the second sound signal.

307. The electronic device determines that no speech is present in the first sound signal and the second sound signal.

In the embodiment of the present application, if the target energy value is greater than the third threshold, this indicates strong noise leakage, and therefore no voice is present in the first sound signal and the second sound signal. For example, if the electronic device is an earphone and strong wind blows on the user's ear while the user is wearing the earphone, wind noise forms on the electronic device and the target energy value exceeds the third threshold; that is, strong noise leakage occurs, and neither the first sound signal nor the second sound signal contains voice. If the target energy value is less than or equal to the second threshold, this indicates that the second sound signal may include interference picked up by the bone conduction microphone itself, such as teeth clicking or other sounds conducted through bone.

Optionally, if the number of consecutive determinations that no voice is present in the first sound signal and the second sound signal exceeds a preset number of times, the electronic device waits for a preset time before acquiring the first sound signal and the second sound signal again. Repeatedly determining that no voice is present suggests that the sound signals received over an extended period do not contain voice, so waiting for the preset time before the next acquisition effectively reduces the load on the electronic device.
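For illustration only, steps 303 to 307 may be combined into a single decision routine such as the following sketch. The enumeration, the function name, and the threshold values in the usage lines are hypothetical, and the sketch assumes that the first threshold is non-negative so that the second total energy is strictly positive whenever the ratio is computed.

```python
import math
from enum import Enum


class Decision(Enum):
    SPEECH = "speech"
    NO_SPEECH_LOW_BONE_ENERGY = "second total energy <= first threshold"
    NO_SPEECH_BONE_SELF_NOISE = "target energy value <= second threshold"
    NO_SPEECH_NOISE_LEAKAGE = "target energy value > third threshold"


def detect_voice(e_a, e_b, first_threshold, second_threshold_db, third_threshold_db):
    """Classify one frame from its first (e_a) and second (e_b) total energies."""
    # Steps 303/304: gate on the bone conduction energy.
    if e_b <= first_threshold:
        return Decision.NO_SPEECH_LOW_BONE_ENERGY
    # Step 305: target energy value in decibels; zero air energy maps to -inf.
    ratio_db = 10.0 * math.log10(e_a / e_b) if e_a > 0 else float("-inf")
    if ratio_db <= second_threshold_db:
        return Decision.NO_SPEECH_BONE_SELF_NOISE  # step 307
    if ratio_db > third_threshold_db:
        return Decision.NO_SPEECH_NOISE_LEAKAGE  # step 307
    return Decision.SPEECH  # step 306


# Usage with hypothetical thresholds: 1.0 (energy), 0 dB, and 12 dB.
result = detect_voice(100.0, 50.0, 1.0, 0.0, 12.0)
```

Returning a distinct value for each no-speech branch is a design choice of this sketch; the method itself only distinguishes between speech and no speech.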
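The optional waiting behavior described above might be tracked with a small counter, as in the following sketch. The class name, the parameter names, and the choice to reset the counter once a wait has been requested are assumptions of this sketch; the application does not state what happens to the count after the preset time elapses.

```python
class BackoffScheduler:
    """Counts consecutive no-voice results and requests a wait when they pile up."""

    def __init__(self, preset_count, preset_wait_seconds):
        self.preset_count = preset_count
        self.preset_wait_seconds = preset_wait_seconds
        self.consecutive_no_voice = 0

    def update(self, voice_detected):
        """Record one detection result; return seconds to wait before the next acquisition."""
        if voice_detected:
            self.consecutive_no_voice = 0
            return 0.0
        self.consecutive_no_voice += 1
        if self.consecutive_no_voice > self.preset_count:
            # Assumption: start counting afresh after a wait has been requested.
            self.consecutive_no_voice = 0
            return self.preset_wait_seconds
        return 0.0


# Usage: wait 0.5 s once more than 2 consecutive frames contain no voice.
scheduler = BackoffScheduler(preset_count=2, preset_wait_seconds=0.5)
waits = [scheduler.update(False) for _ in range(3)]
```

A caller would pass the returned value to its own sleep or timer mechanism before acquiring the next pair of sound signals.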

Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device may be an earphone or other related devices. Included in the electronic device 40 are a processor 401, a memory 402, an air conduction microphone 403, and a bone conduction microphone 404.

The processor 401 may be a Central Processing Unit (CPU), or may be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor, and optionally, the processor 401 may be any conventional processor or the like.

Memory 402 may include both read-only memory and random access memory and provides instructions and data to processor 401. A portion of the memory 402 may also include non-volatile random access memory.

Optionally, the electronic device 40 may further include a device other than the above-described device, such as a communication interface, which is not limited in this embodiment.

Wherein:

a processor 401 for calling program instructions stored in the memory 402.

A memory 402 for storing program instructions.

An air conduction microphone 403 for receiving the first sound signal.

A bone conduction microphone 404 for receiving the second sound signal.

The processor 401 invokes program instructions stored in the memory 402 to cause the electronic device 40 to perform the following operations: acquiring a first sound signal from the air conduction microphone 403, and acquiring a second sound signal from the bone conduction microphone 404, the first sound signal being a sound signal received by the air conduction microphone 403, and the second sound signal being a sound signal received by the bone conduction microphone 404; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

Fig. 5 shows an apparatus 50 provided in this embodiment of the present application, for implementing the functions of the electronic device in fig. 2. The apparatus may be an electronic device or an apparatus for an electronic device. The apparatus for an electronic device may be a chip system or a chip within the electronic device. The chip system may be composed of a chip, or may include a chip and other discrete devices. The apparatus 50 shown in fig. 5 may include an acquiring module 501 and a speech detection module 502, wherein:

the acquiring module 501 is configured to acquire a first sound signal and a second sound signal, where the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; the speech detection module 502 is configured to determine a first total energy of the first sound signal and a second total energy of the second sound signal; the speech detection module 502 is further configured to determine whether the second total energy is greater than a first threshold; the speech detection module 502 is further configured to determine whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy if the second total energy is greater than the first threshold.

In one possible implementation, when the speech detection module 502 determines whether the first and second sound signals have speech based on the first and second total energies, it determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first and second total energies; if the target energy value is greater than the second threshold and less than or equal to the third threshold, it is determined that the first sound signal and the second sound signal have speech.

In a possible implementation manner, the voice detection module 502 is further configured to determine that no voice exists in the first sound signal and the second sound signal if the target energy value is less than or equal to the second threshold or greater than the third threshold.

In a possible implementation manner, the voice detection module 502 is further configured to determine that no voice exists in the first sound signal and the second sound signal if the second total energy is less than or equal to the first threshold.

In a possible implementation manner, the voice detection module 502 is further configured to wait for a preset time before acquiring the first sound signal and the second sound signal if it is continuously determined that the number of times that no voice exists in the received first sound signal and the second sound signal exceeds a preset number of times.

The above-mentioned apparatus may be, for example, a chip or a chip module. Each module included in each apparatus and product described in the above embodiments may be a software module or a hardware module, or may be partly a software module and partly a hardware module. For example, for each apparatus or product applied to or integrated in a chip, each module it includes may be implemented by hardware such as a circuit; alternatively, at least some of the modules may be implemented by a software program running on a processor integrated in the chip, and the remaining modules (if any) may be implemented by hardware such as a circuit. For each apparatus or product applied to or integrated in a chip module, each module it includes may be implemented by hardware such as a circuit, and different modules may be located in the same component (e.g., a chip or a circuit module) or in different components of the chip module; alternatively, at least some of the modules may be implemented by a software program running on a processor integrated in the chip module, and the remaining modules (if any) may be implemented by hardware such as a circuit. For each apparatus or product applied to or integrated in a terminal, each module it includes may be implemented by hardware such as a circuit, and different modules may be located in the same component (e.g., a chip or a circuit module) or in different components of the terminal; alternatively, at least some of the modules may be implemented by a software program running on a processor integrated in the terminal, and the remaining modules (if any) may be implemented by hardware such as a circuit.

The embodiment of the present application further provides a chip, where the chip can perform the relevant steps of the electronic device in the foregoing method embodiment. The chip is used for:

acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

In one possible implementation, when the chip determines whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy, it determines whether a target energy value, which is a ratio of the first total energy to the second total energy, is greater than a second threshold and less than or equal to a third threshold; if the target energy value is greater than the second threshold and less than or equal to the third threshold, it is determined that the first sound signal and the second sound signal have speech.

In a possible implementation manner, the chip is further configured to determine that no speech is present in the first sound signal and the second sound signal if the target energy value is less than or equal to the second threshold value or the target energy value is greater than the third threshold value.

In a possible implementation manner, the chip is further configured to determine that the first sound signal and the second sound signal have no speech if the second total energy is less than or equal to the first threshold.
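The chip's decision logic described above can be sketched as a single function. This is an illustrative sketch, not the applicant's implementation: the function name `has_speech` and its parameter names are hypothetical, the target energy value is read here as the first total energy divided by the second, and a non-negative first threshold is assumed so that the division is always defined.

```python
def has_speech(
    first_total_energy: float,
    second_total_energy: float,
    first_threshold: float,
    second_threshold: float,
    third_threshold: float,
) -> bool:
    """Two-stage decision: bone conduction energy gate, then ratio window."""
    if first_threshold < 0:
        # Assumption of this sketch: energies are non-negative, so a
        # non-negative gate guarantees a positive divisor below.
        raise ValueError("first_threshold must be non-negative")

    # Stage 1: second (bone conduction) total energy must exceed the first threshold.
    if second_total_energy <= first_threshold:
        return False

    # Stage 2: target energy value = first total energy / second total energy.
    target_energy_value = first_total_energy / second_total_energy
    return second_threshold < target_energy_value <= third_threshold
```

With thresholds of 1.0, 0.5 and 3.0, for example, `has_speech(4.0, 2.0, 1.0, 0.5, 3.0)` passes the gate and has a ratio of 2.0 inside the window, whereas `has_speech(10.0, 2.0, 1.0, 0.5, 3.0)` has a ratio of 5.0 and is rejected.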

In a possible implementation manner, the chip is further configured to, if the number of consecutive determinations that no speech exists in the received first sound signal and second sound signal exceeds a preset number of times, wait for a preset time and then acquire the first sound signal and the second sound signal.
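The waiting behaviour can be sketched as a polling loop. Everything named here is hypothetical: `acquire` and `detect` are injected callables that stand in for signal acquisition and for the speech decision, `max_frames` exists only so that the sketch terminates, and resetting the counter after the wait is an assumption that the text does not state.

```python
import time
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]


def run_detection(
    acquire: Callable[[], Tuple[Frame, Frame]],
    detect: Callable[[Frame, Frame], bool],
    preset_count: int,
    preset_wait_s: float,
    max_frames: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Poll both microphones and back off after repeated no-speech results."""
    results: List[bool] = []
    consecutive_no_speech = 0
    for _ in range(max_frames):
        first_signal, second_signal = acquire()
        speech = detect(first_signal, second_signal)
        results.append(speech)
        if speech:
            consecutive_no_speech = 0
            continue
        consecutive_no_speech += 1
        if consecutive_no_speech > preset_count:
            # Wait for the preset time before the next acquisition.
            sleep(preset_wait_s)
            consecutive_no_speech = 0  # assumption: counting restarts after the wait
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a device would more likely use a hardware timer or a low-power sleep state.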

Fig. 6 is a schematic structural diagram of a module device according to an embodiment of the present application. The module device 60 can perform the steps related to the electronic device in the foregoing method embodiments, and the module device 60 includes: a communication module 601, a power module 602, a storage module 603, a chip module 604, an air conduction microphone module 605 and a bone conduction microphone module 606.

The power module 602 is configured to supply power to the module device; the storage module 603 is configured to store data and instructions; the communication module 601 is configured to perform communication within the module device, or between the module device and an external device; the air conduction microphone module 605 is configured to receive a first sound signal; the bone conduction microphone module 606 is configured to receive a second sound signal; the chip module 604 is configured to:

acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by the air conduction microphone module 605, and the second sound signal is a sound signal received by the bone conduction microphone module 606; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
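One way to picture how the microphone modules feed the chip module is the sketch below. It is a hypothetical software model only: the class `ModuleDevice`, its fields and the method `gate_frame` are invented for illustration, the microphones are modelled as callables that return one frame of samples, and total energy is assumed to be the sum of squared samples.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Frame = Sequence[float]


@dataclass
class ModuleDevice:
    """Hypothetical wiring of the two microphone modules to the chip module."""

    air_mic: Callable[[], Frame]   # stands in for air conduction microphone module 605
    bone_mic: Callable[[], Frame]  # stands in for bone conduction microphone module 606
    first_threshold: float

    def gate_frame(self) -> Tuple[float, float, bool]:
        """Acquire one frame per microphone and apply the first-threshold check.

        Returns (first total energy, second total energy, gate passed).
        """
        first_signal = self.air_mic()
        second_signal = self.bone_mic()
        first_total_energy = sum(x * x for x in first_signal)
        second_total_energy = sum(x * x for x in second_signal)
        gate_passed = second_total_energy > self.first_threshold
        return first_total_energy, second_total_energy, gate_passed
```

Only when the gate passes would the two energies be handed on to the speech decision; the power, storage and communication modules are omitted because they play no part in the decision itself.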

In one possible implementation, when determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy, the chip module 604 determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, where the target energy value is the ratio of the first total energy to the second total energy; if the target energy value is greater than the second threshold and less than or equal to the third threshold, the chip module 604 determines that the first sound signal and the second sound signal have speech.

In a possible implementation manner, the chip module 604 is further configured to determine that the first sound signal and the second sound signal do not have speech if the target energy value is less than or equal to the second threshold, or if the target energy value is greater than the third threshold.

In a possible implementation manner, the chip module 604 is further configured to determine that the first sound signal and the second sound signal have no voice if the second total energy is less than or equal to the first threshold.
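The decision rule of the chip module 604 can be written compactly. The symbols are introduced here for illustration and do not appear in the application: $E_1$ and $E_2$ denote the first and second total energies, $T_1$, $T_2$ and $T_3$ denote the first, second and third thresholds, and the target energy value is read as $E_1 / E_2$.

```latex
\[
\mathrm{speech}(E_1, E_2) =
\begin{cases}
1, & E_2 > T_1 \ \text{and}\ T_2 < \dfrac{E_1}{E_2} \le T_3,\\[6pt]
0, & \text{otherwise.}
\end{cases}
\]
```

The "otherwise" branch covers all three no-speech cases: $E_2 \le T_1$, $E_1/E_2 \le T_2$, and $E_1/E_2 > T_3$.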

In a possible implementation manner, the chip module 604 is further configured to, if the number of consecutive determinations that no speech exists in the received first sound signal and second sound signal exceeds a preset number of times, wait for a preset time and then acquire the first sound signal and the second sound signal.

Embodiments of the present application further provide a computer-readable storage medium in which instructions are stored; when the instructions are executed on a processor, the method flow of the above method embodiments is implemented.

Embodiments of the present application further provide a computer program product, where when the computer program product runs on a processor, the method flow of the above method embodiments is implemented.

It is noted that, for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present application is not limited by the order of acts, as some acts may, in accordance with the present application, occur in other orders and/or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

The descriptions of the embodiments provided in the present application may be referred to each other. Each embodiment has its own emphasis, and for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. For convenience and brevity of description, the functions and operations performed by the devices and apparatuses provided in the embodiments of the present application may, for example, refer to the related descriptions of the method embodiments of the present application, and the method embodiments and the device embodiments may also be referred to, combined with, or cited by one another.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of voice activity detection, the method comprising:

acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;

determining a first total energy of the first sound signal and a second total energy of the second sound signal;

determining whether the second total energy is greater than a first threshold;

and if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

2. The method of claim 1, wherein the determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy comprises:

determining whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first total energy and the second total energy;

if the target energy value is greater than the second threshold and less than or equal to the third threshold, determining that the first sound signal and the second sound signal have speech.

3. The method of claim 2, further comprising:

and if the target energy value is smaller than or equal to the second threshold value or the target energy value is larger than the third threshold value, determining that no voice exists in the first sound signal and the second sound signal.

4. The method of claim 1, further comprising:

and if the second total energy is smaller than or equal to the first threshold value, determining that no voice exists in the first sound signal and the second sound signal.

5. The method according to any one of claims 1 to 4, further comprising:

and if the number of times of continuously determining that no voice exists in the received first sound signal and the second sound signal exceeds a preset number of times, waiting for a preset time and then acquiring the first sound signal and the second sound signal.

6. An apparatus for voice activity detection, the apparatus comprising an acquisition module and a voice detection module, wherein:

the acquisition module is configured to acquire a first sound signal and a second sound signal, where the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;

the voice detection module is configured to:

determine a first total energy of the first sound signal and a second total energy of the second sound signal;

determine whether the second total energy is greater than a first threshold;

if the second total energy is greater than the first threshold, determine whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.

7. An electronic device, comprising an air conduction microphone, a bone conduction microphone, a memory, and at least one processor;

the air conduction microphone is used for receiving a first sound signal;

the bone conduction microphone is used for receiving a second sound signal;

the memory is coupled with the at least one processor and is configured to store computer program code, the computer program code comprising computer instructions;

the processor is specifically configured to invoke the computer instructions from the memory to execute the method according to any one of claims 1 to 5.

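8. A chip, wherein the chip is configured to:

acquire a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;

determine a first total energy of the first sound signal and a second total energy of the second sound signal;

determine whether the second total energy is greater than a first threshold;

if the second total energy is greater than the first threshold, determine whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.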

9. A module device, comprising an air conduction microphone module, a bone conduction microphone module, a power module, a storage module and a chip module, wherein:

the air conduction microphone module is used for receiving a first sound signal;

the bone conduction microphone module is used for receiving a second sound signal;

the power supply module is used for providing electric energy for the module equipment;

the storage module is used for storing data and instructions;

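the chip module is used for:

determining a first total energy of the first sound signal and a second total energy of the second sound signal;

determining whether the second total energy is greater than a first threshold;

if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.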

10. A computer readable storage medium having computer readable instructions stored thereon which, when run on a communication device, cause the communication device to perform the method of any of claims 1-5.