CN113113050A - Voice activity detection method, electronic equipment and device - Google Patents

Voice activity detection method, electronic equipment and device

Publication number
CN113113050A
Authority
CN
China
Prior art keywords
sound signal
total energy
threshold
module
conduction microphone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110506143.XA
Other languages
Chinese (zh)
Inventor
何陈
叶顺舟
康力
巴莉芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisoc Chongqing Technology Co Ltd
Original Assignee
Unisoc Chongqing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisoc Chongqing Technology Co Ltd
Priority to CN202110506143.XA
Publication of CN113113050A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Abstract

The application discloses a voice activity detection method, electronic equipment and a device, wherein the method comprises the following steps: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy. The method described in the application is beneficial to improving the accuracy of detecting the voice activity.

Description

Voice activity detection method, electronic equipment and device
Technical Field
The present invention relates to the field of communications, and in particular, to a voice activity detection method, an electronic device, and an apparatus.
Background
Voice Activity Detection (VAD) analyzes characteristics of an audio signal, such as energy, zero-crossing rate, and harmonics, to determine whether the audio signal contains voice. VAD techniques are mainly used to simplify speech processing. For example, in Internet Protocol (IP) telephony applications, silent packets are neither encoded nor transmitted, which saves computation time and bandwidth.
Currently, voice activity detection in commercial products is mainly based on the air conduction signal received by an Air Conduction (AC) microphone. However, the air conduction signal is often affected by environmental noise, and a large amount of noise reduces the accuracy of voice activity detection.
Disclosure of Invention
The application provides a voice activity detection method, electronic equipment and a voice activity detection device, which are beneficial to improving the accuracy of voice activity detection.
In a first aspect, the present application provides a method for voice activity detection, including: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
In one possible implementation manner, the specific implementation manner of determining whether the first sound signal and the second sound signal have voices based on the first total energy and the second total energy is as follows: determining whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first total energy and the second total energy; if the target energy value is greater than the second threshold and less than or equal to the third threshold, it is determined that the first sound signal and the second sound signal have speech.
In one possible implementation, if the target energy value is less than or equal to the second threshold, or the target energy value is greater than the third threshold, it is determined that no speech is present in the first sound signal and the second sound signal.
In one possible implementation, it is determined that the first sound signal and the second sound signal have no speech if the second total energy is less than or equal to the first threshold.
In a possible implementation manner, if the number of consecutive determinations that no voice exists in the received first sound signal and second sound signal exceeds a preset number of times, the first sound signal and the second sound signal are acquired again only after waiting for a preset time.
In a second aspect, the present application provides a voice activity detection apparatus, including an obtaining module and a voice detection module: the acquiring module is used for acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by the air conduction microphone, and the second sound signal is a sound signal received by the bone conduction microphone; the voice detection module is used for: determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
In a third aspect, the present application provides an electronic device comprising an air conduction microphone, a bone conduction microphone, a memory, and at least one processor; the air conduction microphone is used for receiving a first sound signal; the bone conduction microphone is used for receiving a second sound signal; the memory is coupled with the at least one processor and is used for storing computer program code, the computer program code comprising computer instructions; the processor is specifically configured to invoke the computer program code from the memory to execute the method proposed in the first aspect.
In a fourth aspect, the present application proposes a chip, configured to: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
In a fifth aspect, the present application provides a module device, which includes an air conduction microphone module, a bone conduction microphone module, a power module, a storage module and a chip module, wherein: the air conduction microphone module is used for receiving a first sound signal; the bone conduction microphone module is used for receiving a second sound signal; the power module is used for providing electric energy for the module equipment; the storage module is used for storing data and instructions; this chip module is used for: acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
In a sixth aspect, the present application provides a computer-readable storage medium having computer-readable instructions stored therein, which, when run on a communication apparatus, cause the communication apparatus to perform the method of the first aspect and any one of its possible implementations.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a voice activity detection system according to an embodiment of the present application;
Fig. 2 is a flowchart of a voice activity detection method provided by an embodiment of the present application;
Fig. 3 is a flowchart of another voice activity detection method provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
Fig. 5 is a schematic diagram of an apparatus according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a module apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the following embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in the specification of the present application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the listed items.
It should be noted that the terms "first," "second," "third," and the like in the description and claims of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than described or illustrated herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a voice activity detection system according to an embodiment of the present application. The voice activity detection system comprises at least one air conduction microphone, at least one bone conduction microphone and at least one voice activity detection module, and the number of the air conduction microphones, the bone conduction microphones and the voice activity detection modules is not limited in the embodiment of the application. The air conduction microphone and the bone conduction microphone are used for receiving sound signals, the first sound signal received by the air conduction microphone is an air conduction signal, and the second sound signal received by the bone conduction microphone is a bone conduction signal. The voice activity detection module is used for detecting the first sound signal received by the air conduction microphone and the second sound signal received by the bone conduction microphone and determining whether the first sound signal and the second sound signal have voice. The voice activity detection system can be applied to earphones or electronic equipment comprising a bone conduction microphone and an air conduction microphone, and the like.
Referring to fig. 2, fig. 2 is a flowchart of a voice activity detection method according to an embodiment of the present application. The method is applied to an electronic device or a chip in the electronic device; fig. 2 takes the electronic device as the execution subject for illustration. The execution subject of the voice activity detection methods shown in the other figures of the embodiments of the present application is the same and is not described again hereinafter. The voice activity detection method of the embodiment of the application comprises steps 201 to 204:
201. The electronic device acquires a first sound signal that is a sound signal received by the air conduction microphone and a second sound signal that is a sound signal received by the bone conduction microphone.
In an embodiment of the present application, the electronic device acquires the first sound signal and the second sound signal as follows. The electronic device obtains a first time-domain signal of the first sound signal from the air conduction microphone and a second time-domain signal of the second sound signal from the bone conduction microphone, and then performs framing processing on both time-domain signals. After framing, the electronic device performs time-frequency conversion, converting the first time-domain signal into a first frequency-domain signal and the second time-domain signal into a second frequency-domain signal. The time-frequency conversion yields the discrete Fourier transform of each signal: S_A(k, m) for the first sound signal and S_B(k, m) for the second sound signal, where k is the frequency index and m is the frame index.
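The framing and time-frequency conversion described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function names, the frame length, the hop size, and the absence of a window function are all assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a time-domain signal into frames of frame_len samples, advancing by hop."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one frame; returns S(k) for k = 0..N-1."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def stft(samples, frame_len=8, hop=8):
    """Return spectra[m][k], i.e. S(k, m) indexed by frame m, then frequency k."""
    return [dft(f) for f in frame_signal(samples, frame_len, hop)]


# Hypothetical input: a tone that falls exactly on bin 1 of an 8-sample frame,
# two frames long. The same function would be applied to both microphone signals.
tone = [math.cos(2 * math.pi * t / 8) for t in range(16)]
spectra = stft(tone)
```

A practical implementation would typically use an FFT routine and overlapping, windowed frames; those choices are not specified in the text above.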
202. The electronic device determines a first total energy of the first sound signal and a second total energy of the second sound signal.
In the embodiment of the present application, the first total energy may be calculated by the formula
[Equation image BDA0003058464600000051]
and the second total energy may be calculated by the formula
[Equation image BDA0003058464600000052]
where E_A is the first total energy, E_B is the second total energy, S_A(k, m) is the discrete Fourier transform of the first sound signal, S_B(k, m) is the discrete Fourier transform of the second sound signal, k is the frequency index, and m is the frame index.
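The equation images are not reproduced in this text, so the exact form of the energy formulas is unknown. A common definition, assumed here purely for illustration, is the sum over all frequency indices k of the squared magnitude of the frame's spectrum. The helper name and the sample values below are hypothetical.

```python
def total_energy(spectrum):
    """Sum of squared magnitudes over all frequency bins k of one frame's spectrum.

    Assumption: E(m) = sum_k |S(k, m)|^2. The source formula image is unavailable.
    """
    return sum(abs(s) ** 2 for s in spectrum)


# Hypothetical spectra S_A(k, m) and S_B(k, m) for a single frame m.
s_a = [3 + 4j, 1 + 0j, 0 + 2j]
s_b = [1 + 1j, 0 + 0j, 0 + 1j]

e_a = total_energy(s_a)  # 25 + 1 + 4
e_b = total_energy(s_b)  # 2 + 0 + 1
```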
203. The electronic device determines whether the second total energy is greater than a first threshold.
In the embodiment of the application, the electronic device compares the second total energy with the first threshold. If the second total energy is greater than the first threshold, it is preliminarily determined that the second sound signal may contain voice. The second sound signal is the bone conduction signal received by the bone conduction microphone, and the first sound signal is the air conduction signal received by the air conduction microphone. Because the bone conduction microphone does not directly face noise in the air, the bone conduction signal is more robust to noise, whereas the air conduction signal is more easily mixed with noise. A preliminary judgment of whether the first sound signal and the second sound signal contain voice is therefore more accurate when based on the second total energy than when based on the first total energy.
204. If the second total energy is greater than the first threshold, the electronic device determines whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
The specific implementation manner of step 204 may refer to subsequent steps 305 to 307.
With this method, the electronic device performs voice activity detection based on both the first sound signal received by the air conduction microphone and the second sound signal received by the bone conduction microphone. Because the bone conduction microphone does not directly face noise in the air, the second sound signal is robust to noise, which mitigates the problem that the first sound signal received by the air conduction microphone is easily affected by environmental noise. In this way, the accuracy of voice activity detection is improved.
Referring to fig. 3, fig. 3 is a flowchart illustrating another method for detecting voice activity according to an embodiment of the present application, including steps 301 to 307.
301. The electronic device acquires a first sound signal that is a sound signal received by the air conduction microphone and a second sound signal that is a sound signal received by the bone conduction microphone.
302. The electronic device determines a first total energy of the first sound signal and a second total energy of the second sound signal.
303. The electronic device determines whether the second total energy is greater than a first threshold. If the second total energy is greater than the first threshold, go to step 305; if the second total energy is less than or equal to the first threshold, step 304 is performed.
The specific implementation manners of steps 301 to 303 are the same as those of steps 201 to 203, and are not described herein again in this embodiment of the present application.
304. The electronic device determines that no speech is present in the first sound signal and the second sound signal.
In the embodiment of the application, the bone conduction microphone does not directly face noise and is therefore robust to noise, so whether the received first sound signal and second sound signal contain voice can be preliminarily judged from the magnitude of the second total energy. If the second total energy is greater than the first threshold, there are two possibilities: voice is present, or a large amount of noise is present, and a further judgment is needed to tell them apart. If the second total energy is less than or equal to the first threshold, it is determined that the first sound signal and the second sound signal do not contain voice. This preliminary screening based on the second total energy helps improve the accuracy of voice activity detection.
305. The electronic device determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being the ratio of the first total energy to the second total energy. If the target energy value is greater than the second threshold and less than or equal to the third threshold, go to step 306; if the target energy value is less than or equal to the second threshold or greater than the third threshold, go to step 307.
In the embodiment of the present application, the target energy value is the ratio of the first total energy to the second total energy, expressed in decibels as follows:
[Equation image BDA0003058464600000061]
where E_A is the first total energy, E_B is the second total energy, and m is the frame index.
In the embodiment of the application, the second threshold is used to eliminate errors caused by self-interference of the bone conduction microphone, and the third threshold represents the noise suppression capability of the bone conduction microphone: the stronger that capability, the larger the third threshold. In practice, the bone conduction microphone does not directly face noise in the air, so the second sound signal contains less noise than the first sound signal, and the second total energy is generally slightly smaller than the first total energy. The first sound signal and the second sound signal are therefore determined to contain voice only when the target energy value is greater than the second threshold and less than or equal to the third threshold, which helps improve the accuracy of voice activity detection.
306. The electronic device determines that speech is present in the first sound signal and the second sound signal.
307. The electronic device determines that no speech is present in the first sound signal and the second sound signal.
In the embodiment of the present application, a target energy value greater than the third threshold indicates strong noise leakage, so it is determined that no voice exists in the first sound signal and the second sound signal. For example, if the electronic device is an earphone and strong wind blows on the user's ear while the earphone is worn, wind noise forms on the electronic device and the target energy value exceeds the third threshold. A target energy value less than or equal to the second threshold indicates that the second sound signal may include interference picked up by the bone conduction microphone itself, such as teeth colliding or other sounds produced by bone, so it is likewise determined that no voice exists.
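Steps 303 to 307 amount to a two-stage threshold decision, which can be sketched as below. The function name, the threshold values in the example, the decibel conversion, and the handling of a zero first total energy are assumptions made for illustration; the application does not give concrete values.

```python
import math


def detect_voice(e_a, e_b, first_threshold, second_threshold_db, third_threshold_db):
    """Return True if the frame is judged to contain voice.

    e_a, e_b: first (air) and second (bone) total energy of the frame.
    Thresholds are hypothetical tuning parameters; first_threshold is
    assumed to be non-negative.
    """
    # Steps 303/304: gate on the bone conduction energy alone.
    if e_b <= first_threshold:
        return False
    # Assumption: an empty air conduction frame cannot contain voice.
    if e_a <= 0:
        return False
    # Step 305: target energy value as an energy ratio in dB (assumed 10*log10).
    ratio_db = 10.0 * math.log10(e_a / e_b)
    # Steps 306/307: voice only inside the (second, third] interval.
    return second_threshold_db < ratio_db <= third_threshold_db


# Hypothetical thresholds: 1.0 energy units, 0 dB, 20 dB.
T1, T2, T3 = 1.0, 0.0, 20.0

quiet = detect_voice(e_a=50.0, e_b=0.5, first_threshold=T1,
                     second_threshold_db=T2, third_threshold_db=T3)
speech = detect_voice(e_a=50.0, e_b=10.0, first_threshold=T1,
                      second_threshold_db=T2, third_threshold_db=T3)
wind = detect_voice(e_a=10000.0, e_b=10.0, first_threshold=T1,
                    second_threshold_db=T2, third_threshold_db=T3)
tooth_click = detect_voice(e_a=5.0, e_b=10.0, first_threshold=T1,
                           second_threshold_db=T2, third_threshold_db=T3)
```

In this example only the second call falls inside the interval: the first fails the bone conduction gate, the third exceeds the third threshold (as in the wind noise case), and the fourth falls at or below the second threshold (as in the tooth collision case).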
Optionally, if the number of consecutive determinations that no voice exists in the first sound signal and the second sound signal exceeds a preset number of times, the electronic device waits for a preset time before acquiring the first sound signal and the second sound signal again. Repeated no-voice determinations suggest that the sound signals received over a long period may not contain voice, so pausing acquisition for the preset time effectively reduces the load on the electronic device.
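One way to realize this optional waiting behavior is a small counter that tells the acquisition loop how long to pause. The class name, the parameter names, and the choice to restart the counter after each wait are assumptions; the application only states that acquisition resumes after a preset time.

```python
class NoVoiceBackoff:
    """Tracks consecutive no-voice decisions and suggests a pause before
    the next acquisition (hypothetical helper)."""

    def __init__(self, preset_count, preset_wait_s):
        self.preset_count = preset_count    # allowed consecutive no-voice results
        self.preset_wait_s = preset_wait_s  # pause length in seconds
        self.misses = 0

    def next_delay(self, has_voice):
        """Return how many seconds to wait before acquiring the next signals."""
        if has_voice:
            self.misses = 0
            return 0.0
        self.misses += 1
        if self.misses > self.preset_count:
            # Assumption: the counter restarts once a wait has been issued.
            self.misses = 0
            return self.preset_wait_s
        return 0.0


# Hypothetical settings: pause 0.5 s once more than 2 consecutive frames lack voice.
backoff = NoVoiceBackoff(preset_count=2, preset_wait_s=0.5)
delays = [backoff.next_delay(v) for v in [False, False, False, True, False]]
```

A caller would pass each returned value to something like `time.sleep` before reading the microphones again.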
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device may be an earphone or other related devices. Included in the electronic device 40 are a processor 401, a memory 402, an air conduction microphone 403, and a bone conduction microphone 404.
The processor 401 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor; optionally, the processor 401 may be any conventional processor or the like.
Memory 402 may include both read-only memory and random access memory and provides instructions and data to processor 401. A portion of the memory 402 may also include non-volatile random access memory.
Optionally, the electronic device 40 may further include components other than those described above, such as a communication interface, which is not limited in this embodiment.
Wherein:
A processor 401 for calling program instructions stored in the memory 402.
A memory 402 for storing program instructions.
An air conduction microphone 403 for receiving the first sound signal.
A bone conduction microphone 404 for receiving the second sound signal.
The processor 401 invokes program instructions stored in the memory 402 to cause the electronic device 40 to perform the following operations: acquiring a first sound signal from the air conduction microphone 403, and acquiring a second sound signal from the bone conduction microphone 404, the first sound signal being a sound signal received by the air conduction microphone 403, and the second sound signal being a sound signal received by the bone conduction microphone 404; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
Fig. 5 shows an apparatus 50 provided in this embodiment of the present application, for implementing the functions of the electronic device in fig. 2. The apparatus may be an electronic device or an apparatus for an electronic device. The means for the electronic device may be a system of chips or a chip within the electronic device. The chip system may be composed of a chip, or may include a chip and other discrete devices. The apparatus 50 shown in fig. 5 may include an acquisition module 501 and a speech detection module 502, wherein:
the acquiring module 501 is configured to acquire a first sound signal and a second sound signal, where the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; the speech detection module 502 is configured to determine a first total energy of the first sound signal and a second total energy of the second sound signal; the speech detection module 502 is further configured to determine whether the second total energy is greater than a first threshold; the speech detection module 502 is further configured to determine whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy if the second total energy is greater than the first threshold.
In one possible implementation, when the speech detection module 502 determines whether the first and second sound signals have speech based on the first and second total energies, it determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first and second total energies; if the target energy value is greater than the second threshold and less than or equal to the third threshold, it is determined that the first sound signal and the second sound signal have speech.
In a possible implementation manner, the voice detection module 502 is further configured to determine that no voice exists in the first sound signal and the second sound signal if the target energy value is less than or equal to the second threshold or greater than the third threshold.
In a possible implementation manner, the voice detection module 502 is further configured to determine that no voice exists in the first sound signal and the second sound signal if the second total energy is less than or equal to the first threshold.
In a possible implementation manner, the voice detection module 502 is further configured to wait for a preset time before acquiring the first sound signal and the second sound signal if the number of consecutive determinations that no voice exists in the received first sound signal and second sound signal exceeds a preset number of times.
The above-mentioned means may be, for example: a chip, or a chip module. Each module included in each apparatus and product described in the above embodiments may be a software module, a hardware module, or a part of the software module and a part of the hardware module. For example, for each device or product applied to or integrated in a chip, each module included in the device or product may be implemented by hardware such as a circuit, or at least a part of the modules may be implemented by a software program running on a processor integrated in the chip, and the rest (if any) part of the modules may be implemented by hardware such as a circuit; for each device and product applied to or integrated with the chip module, each module included in the device and product may be implemented in a hardware manner such as a circuit, and different modules may be located in the same component (e.g., a chip, a circuit module, etc.) or different components of the chip module, or at least a part of the modules may be implemented in a software program running on a processor integrated within the chip module, and the rest (if any) part of the modules may be implemented in a hardware manner such as a circuit; for each device and product applied to or integrated in the terminal, each module included in the device and product may be implemented by using hardware such as a circuit, different modules may be located in the same component (e.g., a chip, a circuit module, etc.) or different components in the terminal, or at least a part of the modules may be implemented by using a software program running on a processor integrated in the terminal, and the rest (if any) part of the modules may be implemented by using hardware such as a circuit.
The embodiment of the present application further provides a chip, where the chip can perform the relevant steps of the electronic device in the foregoing method embodiment. The chip is used for:
acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
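As an illustrative sketch only, and not as the implementation of the present application, the energy determination and the first-threshold comparison described above might be written as follows. The sketch assumes that the total energy of a frame is the sum of its squared samples, which this passage does not specify; the names total_energy and passes_first_threshold, and the representation of a frame as a sequence of floats, are hypothetical.

```python
from typing import Sequence


def total_energy(frame: Sequence[float]) -> float:
    """Total energy of one frame.

    Assumption: energy is the sum of squared sample amplitudes.
    """
    return sum(sample * sample for sample in frame)


def passes_first_threshold(bone_frame: Sequence[float],
                           first_threshold: float) -> bool:
    """Return True only if the second total energy (bone conduction
    microphone) is strictly greater than the first threshold."""
    return total_energy(bone_frame) > first_threshold
```

The comparison is strict, so a second total energy exactly equal to the first threshold does not pass the gate.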
In one possible implementation, when determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy, the chip determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, where the target energy value is a ratio of the first total energy and the second total energy; if the target energy value is greater than the second threshold and less than or equal to the third threshold, the chip determines that the first sound signal and the second sound signal have speech.
In a possible implementation manner, the chip is further configured to determine that no speech is present in the first sound signal and the second sound signal if the target energy value is less than or equal to the second threshold value or the target energy value is greater than the third threshold value.
In a possible implementation manner, the chip is further configured to determine that the first sound signal and the second sound signal have no speech if the second total energy is less than or equal to the first threshold.
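The decision logic of the preceding implementations can be summarized in a short sketch. It is illustrative only: the function name has_speech and its parameter names are hypothetical, and the direction of the ratio (first total energy divided by second total energy) and the requirement that the first threshold be non-negative are assumptions made so that the example is well defined.

```python
def has_speech(first_total_energy: float,
               second_total_energy: float,
               first_threshold: float,
               second_threshold: float,
               third_threshold: float) -> bool:
    """Decide whether the two sound signals contain speech."""
    if first_threshold < 0:
        # Assumption: a non-negative first threshold guarantees a
        # non-zero divisor below.
        raise ValueError("first_threshold must be non-negative")

    # Gate: the bone conduction energy must exceed the first threshold.
    if second_total_energy <= first_threshold:
        return False

    # Target energy value, assumed here to be first / second.
    target_energy_value = first_total_energy / second_total_energy

    # Speech only if second_threshold < target <= third_threshold.
    return second_threshold < target_energy_value <= third_threshold
```

For example, with a first threshold of 1.0, a second threshold of 0.5 and a third threshold of 3.0, energies of 4.0 (air conduction) and 2.0 (bone conduction) give a target energy value of 2.0, which falls inside the interval and is treated as speech.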
In a possible implementation manner, the chip is further configured to: if the number of consecutive determinations that no voice exists in the received first sound signal and second sound signal exceeds a preset number of times, wait for a preset time and then acquire the first sound signal and the second sound signal.
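The waiting behaviour described above could be tracked with a small counter, sketched below. This is a hypothetical helper, not the implementation of the present application: the class name NoSpeechBackoff and its method next_delay are invented for illustration, and resetting the counter after a wait is an assumption that this passage does not state.

```python
class NoSpeechBackoff:
    """Track consecutive no-voice decisions and report how long to wait
    before the next acquisition."""

    def __init__(self, preset_count: int, preset_wait_s: float) -> None:
        self.preset_count = preset_count
        self.preset_wait_s = preset_wait_s
        self._consecutive_misses = 0

    def next_delay(self, speech_detected: bool) -> float:
        """Return the delay in seconds before acquiring the next signals."""
        if speech_detected:
            # Any detected voice breaks the run of no-voice decisions.
            self._consecutive_misses = 0
            return 0.0

        self._consecutive_misses += 1
        if self._consecutive_misses > self.preset_count:
            # Assumption: the counter restarts after the wait.
            self._consecutive_misses = 0
            return self.preset_wait_s
        return 0.0
```

A caller would pass the returned value to a sleep routine before acquiring the next pair of signals, so that acquisition is paused during long stretches without voice.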
Fig. 6 is a schematic structural diagram of a module device according to an embodiment of the present application. The module device 60 can perform the steps related to the terminal device in the foregoing method embodiments, and the module device 60 includes: a communication module 601, a power module 602, a storage module 603, a chip module 604, an air conduction microphone module 605 and a bone conduction microphone module 606.
The power module 602 is configured to provide power for the module device; the storage module 603 is configured to store data and instructions; the communication module 601 is configured to perform internal communication of the module device, or communication between the module device and an external device; the air conduction microphone module 605 is configured to receive a first sound signal; the bone conduction microphone module 606 is configured to receive a second sound signal; the chip module 604 is configured to:
acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by the air conduction microphone module 605, and the second sound signal is a sound signal received by the bone conduction microphone module 606; determining a first total energy of the first sound signal and a second total energy of the second sound signal; determining whether the second total energy is greater than a first threshold; if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
In one possible implementation, when determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy, the chip module 604 determines whether a target energy value is greater than a second threshold and less than or equal to a third threshold, where the target energy value is a ratio of the first total energy and the second total energy; if the target energy value is greater than the second threshold and less than or equal to the third threshold, the chip module 604 determines that the first sound signal and the second sound signal have speech.
In a possible implementation manner, the chip module 604 is further configured to determine that the first sound signal and the second sound signal do not have speech if the target energy value is less than or equal to the second threshold, or if the target energy value is greater than the third threshold.
In a possible implementation manner, the chip module 604 is further configured to determine that the first sound signal and the second sound signal have no voice if the second total energy is less than or equal to the first threshold.
In a possible implementation manner, the chip module 604 is further configured to: if the number of consecutive determinations that no voice exists in the received first sound signal and second sound signal exceeds a preset number of times, wait for a preset time and then acquire the first sound signal and the second sound signal.
Embodiments of the present application further provide a computer-readable storage medium in which instructions are stored; when the instructions are executed on a processor, the method flow of the above method embodiments is implemented.
Embodiments of the present application further provide a computer program product, where when the computer program product runs on a processor, the method flow of the above method embodiments is implemented.
It is noted that, for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present application is not limited by the order of acts, as some acts may, in accordance with the present application, occur in other orders and/or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
The descriptions of the embodiments provided in the present application may be referred to each other, and each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. For convenience and brevity of description, the functions and operations performed by the devices and apparatuses provided in the embodiments of the present application may, for example, refer to the related descriptions of the method embodiments of the present application, and the method embodiments and the device embodiments may also be referred to, combined with, or cited by one another.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of voice activity detection, the method comprising:
acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;
determining a first total energy of the first sound signal and a second total energy of the second sound signal;
determining whether the second total energy is greater than a first threshold;
if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
2. The method of claim 1, wherein the determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy comprises:
determining whether a target energy value is greater than a second threshold and less than or equal to a third threshold, the target energy value being a ratio of the first total energy and the second total energy;
if the target energy value is greater than the second threshold and less than or equal to the third threshold, determining that the first sound signal and the second sound signal have speech.
3. The method of claim 2, further comprising:
if the target energy value is less than or equal to the second threshold, or the target energy value is greater than the third threshold, determining that the first sound signal and the second sound signal do not have speech.
4. The method of claim 1, further comprising:
if the second total energy is less than or equal to the first threshold, determining that the first sound signal and the second sound signal do not have speech.
5. The method according to any one of claims 1 to 4, further comprising:
if the number of consecutive determinations that no voice exists in the received first sound signal and second sound signal exceeds a preset number of times, waiting for a preset time and then acquiring the first sound signal and the second sound signal.
6. An apparatus for voice activity detection, the apparatus comprising an acquisition module and a voice detection module, wherein:
the acquisition module is configured to acquire a first sound signal and a second sound signal, where the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;
the voice detection module is configured to:
determining a first total energy of the first sound signal and a second total energy of the second sound signal;
determining whether the second total energy is greater than a first threshold;
if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
7. An electronic device, comprising an air conduction microphone, a bone conduction microphone, a memory, and at least one processor;
the air conduction microphone is used for receiving a first sound signal;
the bone conduction microphone is used for receiving a second sound signal;
the memory is coupled with the at least one processor and is configured to store computer program code, the computer program code comprising computer instructions;
the processor is configured to invoke the computer instructions from the memory to execute the method according to any one of claims 1 to 5.
8. A chip, wherein the chip is configured to:
acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;
determining a first total energy of the first sound signal and a second total energy of the second sound signal;
determining whether the second total energy is greater than a first threshold;
if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
9. A module device, wherein the module device comprises an air conduction microphone module, a bone conduction microphone module, a power module, a storage module and a chip module, wherein:
the air conduction microphone module is used for receiving a first sound signal;
the bone conduction microphone module is used for receiving a second sound signal;
the power supply module is used for providing electric energy for the module equipment;
the storage module is used for storing data and instructions;
the chip module is used for:
acquiring a first sound signal and a second sound signal, wherein the first sound signal is a sound signal received by an air conduction microphone, and the second sound signal is a sound signal received by a bone conduction microphone;
determining a first total energy of the first sound signal and a second total energy of the second sound signal;
determining whether the second total energy is greater than a first threshold;
if the second total energy is greater than the first threshold, determining whether the first sound signal and the second sound signal have speech based on the first total energy and the second total energy.
10. A computer readable storage medium having computer readable instructions stored thereon which, when run on a communication device, cause the communication device to perform the method of any of claims 1-5.
CN202110506143.XA 2021-05-10 2021-05-10 Voice activity detection method, electronic equipment and device Pending CN113113050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110506143.XA CN113113050A (en) 2021-05-10 2021-05-10 Voice activity detection method, electronic equipment and device

Publications (1)

Publication Number Publication Date
CN113113050A true CN113113050A (en) 2021-07-13

Family

ID=76721751

Country Status (1)

Country Link
CN (1) CN113113050A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111261181A (en) * 2020-01-15 2020-06-09 成都法兰特科技有限公司 Speech recognition method, noise recognition method, sound pickup device, and telephone communication apparatus
US20200184996A1 (en) * 2018-12-10 2020-06-11 Cirrus Logic International Semiconductor Ltd. Methods and systems for speech detection
CN112017696A (en) * 2020-09-10 2020-12-01 歌尔科技有限公司 Voice activity detection method of earphone, earphone and storage medium
US20200380955A1 (en) * 2019-05-30 2020-12-03 Cirrus Logic International Semiconductor Ltd. Detection of speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210713