CN109346062B - Voice endpoint detection method and device - Google Patents

Voice endpoint detection method and device

Info

Publication number
CN109346062B
CN109346062B (application CN201811587500.4A)
Authority
CN
China
Prior art keywords
signals
sub
frame
noise reduction
frequency point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811587500.4A
Other languages
Chinese (zh)
Other versions
CN109346062A (en)
Inventor
许堃
周强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201811587500.4A priority Critical patent/CN109346062B/en
Publication of CN109346062A publication Critical patent/CN109346062A/en
Application granted granted Critical
Publication of CN109346062B publication Critical patent/CN109346062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • G10L15/05Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephone Function (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a voice endpoint detection method and device. The method comprises: dividing a received audio signal into multiple frames of sub-signals; performing a fast Fourier transform on each frame of sub-signals to obtain a plurality of frequency point signals; performing multi-channel noise reduction on the frequency point signals of each frame of sub-signals, judging whether the received audio signal contains an endpoint based on the multi-channel noise reduction result, and generating a first judgment result; performing single-channel noise reduction on the frequency point signals of any one frame of sub-signals, judging whether that frame of sub-signals contains an endpoint based on the single-channel noise reduction result, and generating a second judgment result; and when both judgment results are positive, determining that the received audio signal contains an endpoint. The array-based processing of the invention is highly sensitive to the beam, so even very weak voice signals can be found and voice endpoints can be detected more accurately.

Description

Voice endpoint detection method and device
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for detecting a speech endpoint.
Background
Several voice endpoint detection technologies already exist on the market, which judge voice endpoints using the zero-crossing rate, short-time energy, deep learning and the like.
Voice endpoint detection means detecting the starting point and the end point of speech. Voice endpoint detection can be divided into single-channel methods and microphone-array-based methods; the single-channel methods include signal-feature-based methods, signal-statistics-based methods and deep learning methods. Signal-feature-based methods use changes in the zero-crossing rate, changes in short-time energy, changes in the autocorrelation of a speech frame, and the like. Signal-statistics-based methods mainly assume a model of the signal and then obtain the speech presence probability by Bayesian or other methods, thereby obtaining the speech state. Deep learning methods mainly label speech as silence or speech, train a network, and at run time extract features from the speech and feed them to the network to obtain the speech state. Microphone-array-based methods use the different time delays with which speech reaches the microphones, combining phase information with an assumed statistical speech model to obtain the speech presence probability.
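For illustration only, the following Python sketch shows what the signal-feature cues mentioned above (zero-crossing rate and short-time energy) look like in code; the frame length, hop size and gating thresholds are arbitrary assumptions, not values taken from this patent or the cited prior art.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes within each frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

# Illustrative use: a crude energy/ZCR gate (thresholds are placeholders).
x = np.random.randn(16000)                 # stand-in for one second of 16 kHz audio
frames = frame_signal(x)
is_speech = (short_time_energy(frames) > 0.1) & (zero_crossing_rate(frames) < 0.5)
```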
With a single channel, traditional voice detection methods, i.e. signal-feature-based and signal-statistics-based methods, easily make wrong judgments under non-stationary noise or interference; deep-learning-based detection performs well but its computation load is very large. With multiple channels, voice detection methods are not yet mature; because they rely on phase information, they are insensitive to energy and easily disturbed.
With a single channel, traditional voice detection methods mostly use only one or two features of the speech signal, and these features are easily masked by various noises (including stationary noise, non-stationary noise, colored noise and directional noise), making them hard to distinguish. Deep-learning-based detection requires recording in advance, labeling and model training, must be adjusted for different scenes, and needs multi-layer neural network computation, so the whole process is complex and the computation load is large.
Disclosure of Invention
In the industry, microphone-array pickup is gradually becoming the standard front end and is mainly used to enhance speech. In industrial products, such as voice speakers or smart rear-view mirrors, voice endpoint detection is mostly performed right after the voice input; the present technique is performed after array enhancement, so the voice information obtained from the array enhancement is fully utilized. At present, no commercial VAD (Voice Activity Detection) using a microphone array has been found.
An embodiment of the present invention provides a method and an apparatus for detecting a voice endpoint, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for detecting a voice endpoint, including:
dividing the received audio signal into a plurality of sub-signals;
performing, for each frame subsignal of the plurality of frames of subsignals: windowing each frame of sub-signals to perform fast Fourier transform to obtain a plurality of frequency point signals;
performing multi-channel noise reduction processing on a plurality of frequency point signals of each frame of sub-signals in the plurality of frames of sub-signals, judging whether the received audio signal contains an endpoint or not based on a multi-channel noise reduction processing result, and generating a first judgment result;
performing single-channel noise reduction processing on a plurality of frequency point signals of any one frame of sub-signals in the plurality of frames of sub-signals, judging whether that frame of sub-signals contains an endpoint based on the single-channel noise reduction processing result, and generating a second judgment result;
and when the first judgment result and the second judgment result are both positive, determining that the received audio signal contains an endpoint.
In a second aspect, an embodiment of the present invention provides a voice endpoint detection apparatus, including:
the audio signal framing module is used for dividing the received audio signal into multi-frame sub-signals;
a signal conversion module, configured to perform, for each frame sub-signal of the multiple frames of sub-signals: windowing each frame of sub-signals to perform fast Fourier transform to obtain a plurality of frequency point signals;
a first determination result generation module, configured to perform multi-channel noise reduction on multiple frequency point signals of each frame of sub-signals in the multiple frames of sub-signals, determine whether the received audio signal includes an endpoint based on a multi-channel noise reduction result, and generate a first determination result;
a second determination result generation module, configured to perform single-channel noise reduction on multiple frequency point signals of any one frame of sub-signals in the multiple frames of sub-signals, determine whether that frame of sub-signals contains an endpoint based on a single-channel noise reduction result, and generate a second determination result;
and the voice endpoint determination module is used for determining that the received audio signal contains an endpoint when the first judgment result and the second judgment result are both positive.
In a third aspect, an embodiment of the present invention provides a storage medium, where one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-described voice endpoint detection methods of the present invention.
In a fourth aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute any one of the voice endpoint detection methods of the present invention.
In a fifth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above-mentioned voice endpoint detection methods.
The embodiment of the invention has the beneficial effects that: the array-based processing is very sensitive to beams, and even very weak voice signals can be found, so that voice endpoints can be detected more accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of an embodiment of a voice endpoint detection method of the present invention;
FIG. 2 is a flowchart of an embodiment of step S30 in FIG. 1;
FIG. 3 is a flowchart of an embodiment of step S40 in FIG. 1;
FIG. 4 is a flowchart of another embodiment of a voice endpoint detection method of the present invention;
FIG. 5 is a schematic block diagram of an embodiment of a voice endpoint detection apparatus of the present invention;
FIG. 6 is a schematic block diagram of a first decision result generation module according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a second determination result generation module according to an embodiment of the present invention;
FIG. 8 is a block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The present method is very sensitive to the beam and effectively filters out the influence of various stationary and non-stationary noises, so the voice endpoint can be obtained effectively; however, it is also sensitive to directional noise. Therefore the single-channel speech presence probability based on the power spectrum is estimated, and speech frames whose probability is below a certain threshold are filtered out, which effectively cancels the influence of directional noise. The speech presence probability is weighted with MFCC (Mel Frequency Cepstrum Coefficient) weights, which is closer to auditory perception and makes the effect more pronounced. The algorithm achieves a good voice endpoint detection result in environments with a low signal-to-noise ratio, and the computation load is small. The fixed enhanced speech can also be sent to the local wake-up and recognition algorithms without adding redundant computation.
As shown in fig. 1, an embodiment of the present invention provides a voice endpoint detection method, including:
s10, dividing the received audio signal into multi-frame sub-signals;
s20, executing the following steps for each frame sub-signal in the multi-frame sub-signals: windowing each frame of sub-signals to perform fast Fourier transform to obtain a plurality of frequency point signals;
s30, performing multi-channel noise reduction processing on a plurality of frequency point signals of each frame of sub-signals in the multi-frame of sub-signals, judging whether the received audio signal contains an end point or not based on a multi-channel noise reduction processing result, and generating a first judgment result;
s40, taking a plurality of frequency point signals of any one frame of sub-signals in the multi-frame of sub-signals to perform single-channel noise reduction processing, judging whether any sub-frame signal contains an end point based on a single-channel noise reduction processing result, and generating a second judgment result;
s50, when the first judgment result and the second judgment result are both positive, determining that the received audio signal contains an endpoint.
In the embodiment of the invention, the array-based processing is very sensitive to beams, and even very weak voice signals can be found, so that voice endpoints can be detected more accurately.
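A minimal Python sketch of this two-branch decision, assuming multi-channel 16 kHz input; the functions multi_channel_decision and single_channel_decision are hypothetical stand-ins for steps S30 and S40, not the patented implementations.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Window each frame (Hann) and FFT it into frequency-point signals (S10, S20)."""
    win = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([np.fft.rfft(win * x[i * hop: i * hop + frame_len])
                     for i in range(n_frames)])

def detect_endpoint(channels, multi_channel_decision, single_channel_decision):
    """channels: list of 1-D arrays, one per microphone.
    Returns True only when both branch decisions are positive (S50)."""
    spectra = [stft_frames(ch) for ch in channels]       # per channel: (frames x bins)
    flag_multi = multi_channel_decision(spectra)         # S30: uses all channels
    flag_single = single_channel_decision(spectra[0])    # S40: any one channel
    return bool(flag_multi and flag_single)

# Example wiring with trivial stand-in deciders (always "speech"):
# detect_endpoint([np.random.randn(16000)] * 4, lambda s: True, lambda s: True)
```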
As shown in fig. 2, in some embodiments, performing multi-channel noise reduction on multiple frequency bin signals of each of the multiple frames of sub-signals, and determining whether the received audio signal includes an endpoint based on a result of the multi-channel noise reduction, and generating the first determination result includes:
s31, sequentially and correspondingly inputting a plurality of frequency point signals of each frame of sub-signals in the multi-frame of sub-signals to a plurality of multi-channel noise reduction processors to obtain respective frequency point signals after multi-channel noise reduction of each frame of sub-signals;
s32, taking an energy maximum value and an energy minimum value from the frequency point signals of each frame of sub-signals after respective multi-path noise reduction to obtain a plurality of energy maximum values and a plurality of energy minimum values;
s33, taking logarithm of the energy maximum values and values, taking logarithm of the energy minimum values, and determining the signal energy corresponding to the received audio signal based on the difference of at least two logarithms;
and S34, generating a first judgment result according to the signal energy and the first set threshold value.
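A numerical sketch of S31 to S34 under one reading of the wording: per-frame beam energies after multi-channel noise reduction are reduced to a log-energy spread and compared with the first set threshold. The array shapes and the small epsilon guard are assumptions.

```python
import numpy as np

def first_decision(beam_energies, threshold):
    """beam_energies: (n_frames, n_beams) array of per-beam frame energies after
    multi-channel noise reduction.  Take the per-frame maximum and minimum over
    the beams (S32), sum each across frames and take logarithms (S33), then
    compare the difference with the first set threshold (S34)."""
    e_max = beam_energies.max(axis=1)      # one energy maximum per frame
    e_min = beam_energies.min(axis=1)      # one energy minimum per frame
    signal_energy = np.log(e_max.sum() + 1e-12) - np.log(e_min.sum() + 1e-12)
    return signal_energy > threshold       # first judgment result
```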
As shown in fig. 3, in some embodiments, performing single-channel noise reduction processing on a plurality of frequency point signals of any one frame of sub-signals in the plurality of frames of sub-signals, judging whether that frame of sub-signals contains an endpoint based on the single-channel noise reduction result, and generating the second determination result includes:
s41, sequentially inputting the frequency point signals in the sub-signal of any frame to a single-channel noise reduction processor to obtain a plurality of frequency point signals after noise reduction;
s42, estimating the existence probability of a plurality of voices of the frequency point signals after noise reduction corresponding to the sub-signals of any frame;
s43, weighting the voice existence probabilities for updating the noise power spectrum; illustratively, the plurality of speech presence probabilities are weighted using the mel weighting coefficients.
S44, determining the voice existence probability of that frame of sub-signals according to the updated noise power spectrum;
and S45, generating a second judgment result according to the voice existence probability of that frame of sub-signals and a second set threshold value. In this embodiment, the single-channel method is sensitive to energy and can suppress both colored and non-stationary noise.
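A rough Python sketch of the single-channel branch S41 to S45. The per-bin speech presence probability used here is a simple a-posteriori-SNR heuristic standing in for whichever estimator the patent actually uses, and mel_weights is assumed to be the normalized Mel weighting vector described in step 2 of the embodiment below.

```python
import numpy as np

def second_decision(frame_spectrum, noise_psd, mel_weights, threshold,
                    alpha=0.9, snr_floor=1.0):
    """frame_spectrum: complex spectrum of one frame after single-channel noise
    reduction; noise_psd: running noise power-spectrum estimate (same length).
    Returns (speech_flag, updated noise_psd)."""
    power = np.abs(frame_spectrum) ** 2
    post_snr = power / (noise_psd + 1e-12)
    spp = post_snr / (post_snr + snr_floor)      # crude per-bin speech presence prob. (S42)
    spp_w = spp * mel_weights                    # perceptual (Mel) weighting (S43)
    # Update the noise PSD only where speech is unlikely, using the weighted SPP.
    noise_psd = np.where(spp_w < 0.5,
                         alpha * noise_psd + (1 - alpha) * power,
                         noise_psd)
    ph = spp_w.mean()                            # frame-level probability (S44 / step 5)
    return ph > threshold, noise_psd             # second judgment result (S45)
```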
As shown in fig. 4, a flowchart of another embodiment of the voice endpoint detection method of the present invention specifically includes the following steps:
step 1, dividing a space into a plurality of subspaces (assumed to be N), wherein each space needs to have certain adjacent overlapping, and performing broadband beam forming design on each space so as to obtain N fixed beam processing modules.
Step 2, calculating Mel frequency cepstrum coefficients for the frequency segments containing speech, and normalizing all coefficients by the maximum value among them, i.e. dividing every coefficient by that maximum, to obtain the Mel weighting coefficients;
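A hedged interpretation of step 2 in Python: build a per-FFT-bin Mel weighting from a standard triangular filterbank and normalize it by its maximum, so every coefficient lies in (0, 1]. The filterbank parameters (number of bands, frequency range) are assumptions, not values from the patent.

```python
import numpy as np

def mel_weights(n_fft=512, fs=16000, n_mels=26, f_lo=100.0, f_hi=7000.0):
    """Per-bin Mel weighting, normalized by its maximum (step 2, as interpreted)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Triangular filter centre frequencies, equally spaced on the Mel scale.
    mel_pts = np.linspace(hz2mel(f_lo), hz2mel(f_hi), n_mels + 2)
    hz_pts = mel2hz(mel_pts)
    bins = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fbank = np.zeros((n_mels, len(bins)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (bins - left) / (center - left)
        down = (right - bins) / (right - center)
        fbank[i] = np.clip(np.minimum(up, down), 0.0, None)
    weights = fbank.sum(axis=0)        # total Mel filter weight hitting each bin
    return weights / weights.max()     # normalize by the maximum coefficient
```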
and 3, framing the array microphone voice signals, windowing, performing Fast Fourier Transform (FFT), and then passing through N fixed beam processing modules to obtain N paths of voice. And the N paths of voice are subjected to a multi-channel noise reduction algorithm respectively to obtain noise-reduced voice signals.
Step 4, selecting the path with the maximum energy Pmax and the path with the minimum energy Pmin among the N paths of voice, and calculating their difference Pd = Pmax - Pmin. Pd is compared with the threshold Ps to obtain the voice presence state Pd_flag. The threshold is updated as Ps = β × Pd + (1 - β) × Ps, with β = 0.1 when speech is present and β = 0.5 when speech is absent. The coefficient β is designed under the assumption of a 10 dB signal-to-noise ratio, and the parameters can be adjusted to the actual environment.
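Step 4 in isolation, as a small sketch; the β values are taken from the text, while the energy inputs and the initial Ps are assumed.

```python
def multi_channel_state(beam_energies, ps):
    """beam_energies: per-beam energies of the current frame (N values).
    Returns (pd_flag, updated ps) following step 4."""
    pd = max(beam_energies) - min(beam_energies)   # Pd = Pmax - Pmin
    pd_flag = 1 if pd > ps else 0                  # compare with adaptive threshold Ps
    beta = 0.1 if pd_flag else 0.5                 # update rate depends on speech state
    ps = beta * pd + (1.0 - beta) * ps             # Ps <- beta*Pd + (1-beta)*Ps
    return pd_flag, ps
```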
Step 5, taking one path of the audio signal collected by a microphone and performing framing, windowing, FFT and single-channel noise reduction; calculating the power spectrum and updating the noise spectrum; estimating the speech presence probability (SPP) of each frequency point from the power spectrum; perceptually weighting the SPP of each frequency point with the Mel weighting coefficients obtained in step 2; then averaging the speech presence probabilities of all frequency points to obtain the speech presence probability Ph of the current frame; and comparing Ph with a preset threshold Ph_s to obtain the single-channel voice state Ph_flag.
Step 6, fusing the two decisions: when both Pd_flag and Ph_flag are 1, speech is present and P_flag is set to 1; otherwise speech is absent and P_flag is set to 0.
in some embodiments, when speech is in a silence segment, a speech segment may be entered only when the P _ flag of consecutive m-frame short frames is 1. When the speech is in a speech segment, the silence segment can be entered only when the P _ flag of consecutive n short frames is 0.
The invention fully exploits the characteristics of the microphone array and fuses them with a single-channel speech presence probability algorithm to obtain a better result. The array-based processing of steps 1, 3 and 4 is very sensitive to the beam, so even very weak speech signals can be found. The single-channel method of step 5 is sensitive to energy and can mask both colored and non-stationary noise. The perceptual weighting of step 2 makes the detection more sensitive to the speech portion at little computational cost, and the fixed enhanced audio can be sent directly to local wake-up or local recognition without extra computation. Compared with deep learning methods, the process is simpler and applicable to more scenes.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 5, an embodiment of the present invention further provides a voice endpoint detection apparatus 500, where the voice endpoint detection apparatus 500 includes:
an audio signal framing module 510 for framing the received audio signal into a plurality of frames of sub-signals;
a signal conversion module 520, configured to perform, for each frame sub-signal of the multiple frames of sub-signals: windowing each frame of sub-signals to perform fast Fourier transform to obtain a plurality of frequency point signals;
a first determination result generating module 530, configured to perform multi-channel noise reduction on multiple frequency point signals of each frame of sub-signals in the multiple frames of sub-signals, determine whether the received audio signal includes an endpoint based on a multi-channel noise reduction result, and generate a first determination result;
a second determination result generating module 540, configured to perform single-channel noise reduction on multiple frequency point signals of any one frame of sub-signals in the multiple frames of sub-signals, determine whether that frame of sub-signals contains an endpoint based on a single-channel noise reduction result, and generate a second determination result;
a voice endpoint determining module 550, configured to determine that the received audio signal includes an endpoint when both the first determination result and the second determination result are affirmative.
As shown in fig. 6, in some embodiments, the first determination result generating module 530 includes:
the first noise reduction processing unit 531 is configured to correspondingly input the multiple frequency point signals of each frame of sub-signals in the multiple frames of sub-signals to multiple multi-channel noise reduction processors in sequence, so as to obtain multiple channels of noise-reduced frequency point signals of each frame of sub-signals;
a selecting unit 532, configured to extract an energy maximum value and an energy minimum value from the frequency point signals after the respective multipath noise reduction of each frame of sub-signals, and obtain multiple energy maximum values and multiple energy minimum values;
a signal energy determining unit 533, configured to take the logarithm of the sum of the plurality of energy maximum values and the logarithm of the sum of the plurality of energy minimum values, and determine the signal energy corresponding to the received audio signal based at least on the difference between the two logarithms;
the first determination result generating module 534 is configured to generate a first determination result according to a magnitude between the signal energy and a first set threshold.
As shown in fig. 7, in some embodiments, the second determination result generating module 540 includes:
the second noise reduction unit 541 is configured to sequentially input the multiple frequency point signals in the arbitrary frame of sub-signals to a single-channel noise reduction processor to obtain multiple noise-reduced frequency point signals;
a probability estimating unit 542, configured to estimate existence probabilities of multiple voices of the noise-reduced frequency point signals corresponding to the arbitrary frame of sub-signals;
a weighting unit 543, configured to weight the speech existence probabilities for updating the noise power spectrum;
a probability determination unit 544, which averages the weighted multiple speech existence probabilities to obtain a speech existence probability of the arbitrary subframe signal;
the second determination result generating unit 545 is configured to generate a second determination result according to a magnitude between the speech existence probability of the arbitrary subframe signal and a second set threshold.
In some embodiments, said weighting said plurality of speech presence probabilities comprises: weighting the plurality of speech presence probabilities using a predetermined Mel weighting factor.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above-described voice endpoint detection methods of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above-mentioned voice endpoint detection methods.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a voice endpoint detection method.
In some embodiments, an embodiment of the present invention further provides a storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement a voice endpoint detection method.
The voice endpoint detection apparatus according to the above embodiment of the present invention may be used to execute the voice endpoint detection method according to the above embodiment of the present invention, and accordingly achieve the technical effect achieved by the voice endpoint detection method according to the above embodiment of the present invention, and will not be described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage portion 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the essence of the above technical solutions, or the part contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to each embodiment or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A voice endpoint detection method, comprising:
dividing a received audio signal into a plurality of frames of sub-signals, wherein the audio signal is a plurality of channels of audio signals collected by a microphone array;
performing, for each frame subsignal of the plurality of frames of subsignals: windowing each frame of sub-signals, and carrying out fast Fourier transform to obtain a plurality of frequency point signals;
performing multi-channel noise reduction processing on a plurality of frequency point signals of each frame of sub-signals in the plurality of frames of sub-signals, judging whether the received audio signal contains an endpoint or not based on a multi-channel noise reduction processing result, and generating a first judgment result;
carrying out single-channel noise reduction processing on a plurality of frequency point signals of any one frame of sub-signals in the plurality of frames of sub-signals, judging whether any one frame of sub-signal contains an end point or not based on a single-channel noise reduction processing result, and generating a second judgment result;
and when the first judgment result and the second judgment result are both positive, determining that the received audio signal contains an endpoint.
2. The method of claim 1, wherein the performing multi-channel noise reduction on the plurality of frequency bin signals of each of the plurality of frames of sub-signals, and determining whether the received audio signal contains an endpoint based on a result of the multi-channel noise reduction, and generating the first determination result comprises:
sequentially and correspondingly inputting a plurality of frequency point signals of each frame of sub-signals in the plurality of frames of sub-signals to a plurality of multi-channel noise reduction processors to obtain respective multi-channel noise-reduced frequency point signals of each frame of sub-signals;
taking the maximum energy value and the minimum energy value from the frequency point signals of each frame of sub-signals after respective multi-path noise reduction to obtain a plurality of maximum energy values and a plurality of minimum energy values;
logarithmically summing the plurality of energy maxima and the plurality of energy minima, and determining a signal energy corresponding to the received audio signal based at least on a difference between the logarithms;
and generating a first judgment result according to the signal energy and a first set threshold value.
3. The method according to claim 1, wherein the taking of multiple frequency point signals of any one frame of sub-signals in the multiple frames of sub-signals is performed with single-channel noise reduction processing, and determining whether the any one frame of sub-signals contains an endpoint based on a single-channel noise reduction processing result, and generating a second determination result includes:
sequentially inputting a plurality of frequency point signals in any frame of sub-signals to a single-channel noise reduction processor to obtain a plurality of frequency point signals after noise reduction;
estimating a plurality of voice existence probabilities of a plurality of noise-reduced frequency point signals corresponding to the arbitrary frame of sub-signals;
weighting the plurality of speech presence probabilities for updating a noise power spectrum;
determining the voice existence probability of the sub-signal of any frame according to the updated noise power spectrum;
and generating a second judgment result according to the voice existence probability of the sub-signal of any frame and a second set threshold value.
4. The method of claim 3, wherein the weighting the plurality of speech presence probabilities comprises: weighting the plurality of speech presence probabilities using a predetermined Mel weighting factor.
5. A voice endpoint detection apparatus comprising:
the audio signal framing module is used for framing the received audio signals into multi-frame sub-signals, and the audio signals are multi-channel audio signals collected by the microphone array;
a signal conversion module, configured to perform, for each frame sub-signal of the multiple frames of sub-signals: windowing each frame of sub-signals, and carrying out fast Fourier transform to obtain a plurality of frequency point signals;
a first determination result generation module, configured to perform multi-channel noise reduction on multiple frequency point signals of each frame of sub-signals in the multiple frames of sub-signals, determine whether the received audio signal includes an endpoint based on a multi-channel noise reduction result, and generate a first determination result;
a second determination result generation module, configured to perform single-channel noise reduction on multiple frequency point signals of any frame of sub-signals in the multiple frames of sub-signals, determine whether any frame of sub-signal contains an endpoint based on a single-channel noise reduction result, and generate a second determination result;
and the voice endpoint determination module is used for determining that the received audio signal contains an endpoint when the first judgment result and the second judgment result are both positive.
6. The apparatus of claim 5, wherein the first determination generation module comprises:
the first noise reduction processing unit is used for correspondingly inputting a plurality of frequency point signals of each frame of sub-signals in the plurality of frames of sub-signals to a plurality of multi-channel noise reduction processors in sequence so as to obtain respective frequency point signals after multi-channel noise reduction of each frame of sub-signal;
the selection unit is used for taking the maximum energy value and the minimum energy value from the frequency point signals of each frame of sub-signals after respective multi-channel noise reduction to obtain a plurality of maximum energy values and a plurality of minimum energy values;
a signal energy determination unit for taking a logarithm of the sum of the plurality of energy maxima, taking a logarithm of the sum of the plurality of energy minima, and determining a signal energy corresponding to the received audio signal based on at least a difference between the two logarithms;
and the first judgment result generation module is used for generating a first judgment result according to the signal energy and the first set threshold value.
7. The apparatus of claim 5, wherein the second determination generation module comprises:
the second noise reduction unit is used for sequentially inputting a plurality of frequency point signals in the sub-signal of any frame into the single-channel noise reduction processor so as to obtain a plurality of frequency point signals after noise reduction;
a probability estimation unit, configured to estimate a plurality of speech existence probabilities of the noise-reduced frequency point signals corresponding to the arbitrary frame of sub-signals;
a weighting unit for weighting the plurality of speech existence probabilities for updating a noise power spectrum;
a probability determining unit, configured to determine a speech existence probability of the sub-signal of any frame according to the updated noise power spectrum;
and the second judgment result generation unit is used for generating a second judgment result according to the voice existence probability of the sub signal of any frame and a second set threshold value.
8. The apparatus of claim 7, wherein the weighting the plurality of speech presence probabilities comprises: weighting the plurality of speech presence probabilities using a predetermined Mel weighting factor.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN201811587500.4A 2018-12-25 2018-12-25 Voice endpoint detection method and device Active CN109346062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811587500.4A CN109346062B (en) 2018-12-25 2018-12-25 Voice endpoint detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811587500.4A CN109346062B (en) 2018-12-25 2018-12-25 Voice endpoint detection method and device

Publications (2)

Publication Number Publication Date
CN109346062A CN109346062A (en) 2019-02-15
CN109346062B true CN109346062B (en) 2021-05-28

Family

ID=65297161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811587500.4A Active CN109346062B (en) 2018-12-25 2018-12-25 Voice endpoint detection method and device

Country Status (1)

Country Link
CN (1) CN109346062B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827858B (en) * 2019-11-26 2022-06-10 思必驰科技股份有限公司 Voice endpoint detection method and system
CN110890104B (en) * 2019-11-26 2022-05-03 思必驰科技股份有限公司 Voice endpoint detection method and system
CN111862951B (en) * 2020-07-23 2024-01-26 海尔优家智能科技(北京)有限公司 Voice endpoint detection method and device, storage medium and electronic equipment
CN112306886A (en) * 2020-11-11 2021-02-02 天津汇商共达科技有限责任公司 Script debugging method and device
CN112863542B (en) * 2021-01-29 2022-10-28 青岛海尔科技有限公司 Voice detection method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009210617A (en) * 2008-02-29 2009-09-17 Internatl Business Mach Corp <Ibm> Voice activity detection system, method and program
CN105575405A (en) * 2014-10-08 2016-05-11 展讯通信(上海)有限公司 Double-microphone voice active detection method and voice acquisition device
CN107331393A (en) * 2017-08-15 2017-11-07 成都启英泰伦科技有限公司 A kind of adaptive voice activity detection approach
WO2017202680A1 (en) * 2016-05-26 2017-11-30 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for voice or sound activity detection for spatial audio
CN107424625A (en) * 2017-06-27 2017-12-01 南京邮电大学 A kind of multicenter voice activity detection approach based on vectorial machine frame
CN108538310A (en) * 2018-03-28 2018-09-14 天津大学 It is a kind of based on it is long when power spectrum signal variation sound end detecting method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009210617A (en) * 2008-02-29 2009-09-17 Internatl Business Mach Corp <Ibm> Voice activity detection system, method and program
CN105575405A (en) * 2014-10-08 2016-05-11 展讯通信(上海)有限公司 Double-microphone voice active detection method and voice acquisition device
WO2017202680A1 (en) * 2016-05-26 2017-11-30 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for voice or sound activity detection for spatial audio
CN107424625A (en) * 2017-06-27 2017-12-01 南京邮电大学 A kind of multicenter voice activity detection approach based on vectorial machine frame
CN107331393A (en) * 2017-08-15 2017-11-07 成都启英泰伦科技有限公司 A kind of adaptive voice activity detection approach
CN108538310A (en) * 2018-03-28 2018-09-14 天津大学 It is a kind of based on it is long when power spectrum signal variation sound end detecting method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-microphone voice activity and single-talk detectors based on steered-response power output entropy;Schwartz, O.等;《2018 IEEE International Conference on the Science of Electrical Engineering in Israel》;20181214;4 *
Research on a voice endpoint detection algorithm based on nonlinear filtering with a microphone array; Jiang Yi; China Master's Theses Full-text Database, Information Science and Technology; 20180315; I136-261 *

Also Published As

Publication number Publication date
CN109346062A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN109346062B (en) Voice endpoint detection method and device
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
CN109473118B (en) Dual-channel speech enhancement method and device
CN106486131B (en) A kind of method and device of speech de-noising
CN108899044B (en) Voice signal processing method and device
CN110648692B (en) Voice endpoint detection method and system
JP6099556B2 (en) Voice identification method and apparatus
CN110473539B (en) Method and device for improving voice awakening performance
EP3289586B1 (en) Impulsive noise suppression
CN110827858B (en) Voice endpoint detection method and system
CN103971680A (en) Method and device for recognizing voices
US11245788B2 (en) Acoustic echo cancellation based sub band domain active speaker detection for audio and video conferencing applications
CN104103278A (en) Real time voice denoising method and device
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN114203163A (en) Audio signal processing method and device
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
CN105224844A (en) Verification method, system and device
CN110890104B (en) Voice endpoint detection method and system
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
Schmidt et al. Reduction of non-stationary noise using a non-negative latent variable decomposition
CN111883147A (en) Audio data processing method and device, computer equipment and storage medium
US20150279373A1 (en) Voice response apparatus, method for voice processing, and recording medium having program stored thereon
CN115101097A (en) Voice signal processing method and device, electronic equipment and storage medium
CN114220430A (en) Multi-sound-zone voice interaction method, device, equipment and storage medium
CN106340310A (en) Speech detection method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant