CN110648692B - Voice endpoint detection method and system - Google Patents


Info

Publication number: CN110648692B (granted publication of application CN110648692A)
Application number: CN201910918858.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: voice, arrival, target, audio signal, signal
Inventors: 沈小正, 周强
Current and original assignee: Sipic Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application filed by Sipic Technology Co Ltd, with priority to CN201910918858.9A

Classifications

    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Abstract

The application discloses a voice endpoint detection method and system, the method comprising: performing speech enhancement on an audio signal collected by a microphone array; determining a plurality of directions of arrival from the speech-enhanced audio signal; judging whether any of the plurality of directions of arrival lies within the target human voice azimuth region, and generating a first judgment result; inputting the speech-enhanced audio signal into a pre-trained deep neural network to judge whether a target speech signal exists, and generating a second judgment result; and determining whether the target speech signal exists according to the first and second judgment results. The embodiment of the application provides a new VAD architecture to solve the problem of VAD being falsely triggered by interfering speech: it considers both the classification of speech versus noise and the direction information that the multi-channel speech signal expresses on the array, so that target speech, interfering speech, and environmental noise are well distinguished.

Description

Voice endpoint detection method and system
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and a system for detecting a speech endpoint.
Background
Far-field voice interaction is increasingly widespread, devices with voice interaction functions are rapidly taking over the market, and voice interaction has greatly improved people's daily lives. A key component in far-field voice interaction is the microphone array, which has great advantages over a single microphone, especially in noisy environments. Most companies in the industry use microphone arrays to obtain a target speech signal with a high signal-to-noise ratio, which is then fed to a speech recognition engine. The VAD approaches mainly used in industry follow two architectures: in the open-source toolkit Kaldi, a DNN model is trained to classify frames as speech or non-speech; in another open-source toolkit, WebRTC, a GMM models the characteristics of noise and speech in each frequency band to make the VAD decision.
The recognition pipeline requires a voice activity detection (VAD) module. Voice activity detection in industry mainly relies on neural network training to distinguish speech from environmental noise, but it cannot adequately resolve misjudgments caused by interfering speech. Speech enhancement with a microphone array only improves the signal-to-interference ratio and signal-to-noise ratio rather than completely eliminating environmental noise and interference; because speech distortion and intelligibility are the primary concerns, industry does not apply aggressive nonlinear algorithms to remove noise and interference completely. As a result, existing neural network VAD performs poorly in noisy environments where multiple people are speaking.
Disclosure of Invention
The embodiment of the present application provides a method and a system for detecting a voice endpoint, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present application provides a method for detecting a voice endpoint, including:
carrying out voice enhancement on an audio signal collected by a microphone array;
determining a plurality of directions of arrival from the speech enhanced audio signal;
judging whether any of the plurality of directions of arrival lies within the target human voice azimuth region, and generating a first judgment result;
inputting the audio signal after the voice enhancement into a pre-trained deep neural network to judge whether a target voice signal exists or not and generating a second judgment result;
and determining whether the target voice signal exists according to the first judgment result and the second judgment result.
In a second aspect, an embodiment of the present application provides a voice endpoint detection system, including:
the signal enhancement module is used for carrying out voice enhancement on the audio signals collected by the microphone array;
the direction-of-arrival determining module is used for determining a plurality of directions of arrival according to the audio signal after the voice enhancement;
the first judgment module is used for judging whether any of the plurality of directions of arrival lies within the target human voice azimuth region, and generating a first judgment result;
the second judgment module is used for inputting the audio signal after the voice enhancement into a pre-trained deep neural network so as to judge whether a target voice signal exists or not and generate a second judgment result;
and the target voice signal determining module is used for determining whether a target voice signal exists according to the first judgment result and the second judgment result.
In a third aspect, embodiments of the present application provide a storage medium, where one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the voice endpoint detection methods described above in the present application.
In a fourth aspect, an electronic device is provided, comprising: the voice endpoint detection system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute any one of the voice endpoint detection methods described above.
In a fifth aspect, the present application further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, which when executed by a computer, cause the computer to execute any one of the above-mentioned voice endpoint detection methods.
The beneficial effect of the embodiments of the application is that a new VAD architecture is provided to solve the problem of VAD being falsely triggered by interfering speech: both the classification of speech and noise and the direction information that the multi-channel speech signal expresses on the array are considered, so that target speech, interfering speech, and environmental noise are well distinguished.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of an embodiment of a voice endpoint detection method of the present application;
FIG. 2 is a flow chart of another embodiment of a voice endpoint detection method of the present application;
FIG. 3 is a flow chart of another embodiment of a voice endpoint detection method of the present application;
FIG. 4 is a functional block diagram of an embodiment of a speech endpoint detection system of the present application;
FIG. 5 is a schematic block diagram of an embodiment of a second determining module in the speech endpoint detection system of the present application;
FIG. 6 is a functional block diagram of an embodiment of a direction of arrival determination module in the speech endpoint detection system of the present application;
FIG. 7 is a functional block diagram of another embodiment of a speech endpoint detection system of the present application;
fig. 8 is a schematic structural diagram of an embodiment of an electronic device of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In the process of implementing the present application, the inventor tried performing speech enhancement aggressive enough to completely remove the interfering speech, but this scheme caused a large degradation in VAD performance. Retraining the VAD with a data adaptation method can improve VAD accuracy to some extent, but it strongly affects recognition performance, because the recognition rate is closely tied to the degree of speech distortion.
VAD in industry mainly relies on the microphone array to perform speech enhancement and suppress interfering speech, on the assumption that the input to the VAD is a clean target speech signal. In reality, target speech with a high signal-to-interference ratio is hard to obtain, especially in high-reverberation, high-noise environments. Many researchers in the industry work on speech enhancement, but VAD methods that discriminate in the spatial domain have received little study.
As shown in fig. 1, an embodiment of the present application provides a voice endpoint detection method, including:
s10, performing voice enhancement on the audio signal collected by the microphone array;
s20, determining a plurality of directions of arrival according to the audio signal after the voice enhancement;
s30, judging whether the direction of arrival exists in the target human voice azimuth area or not, and generating a first judgment result;
s40, inputting the audio signal after the voice enhancement into a pre-trained deep neural network to judge whether a target voice signal exists or not and generating a second judgment result;
and S50, determining whether the target voice signal exists according to the first judgment result and the second judgment result.
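The decision logic of steps S30 to S50 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name, the representation of the azimuth region as a (low, high) pair, and the probability averaging with the 0.35 threshold (taken from the detailed description below) are assumptions.

```python
def fuse_vad(doas, spp_values, target_region, threshold=0.35):
    """Minimal sketch of the joint VAD decision.

    doas: candidate directions of arrival in degrees (output of S20).
    spp_values: per-frame speech presence probabilities from the DNN (S40).
    target_region: (low, high) azimuth bounds of the target speaker.
    """
    low, high = target_region
    beam_vad = any(low <= d <= high for d in doas)           # first judgment (S30)
    dnn_vad = sum(spp_values) / len(spp_values) > threshold  # second judgment (S40)
    return 1 if (beam_vad and dnn_vad) else 0                # joint decision (S50)
```

For example, with one DOA inside the region and high presence probabilities, both branches agree and the output is 1; an interfering speaker whose DOAs all fall outside the region leaves the first judgment at 0 and hence the fused output at 0.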
The embodiment of the present application provides a new VAD architecture to solve the problem of VAD being falsely triggered by interfering speech: the final VAD state is jointly determined by deep-neural-network voice activity detection (DNN_VAD) and beam voice activity detection (BEAM_VAD). Both the classification of speech and noise and the direction information that the multi-channel speech signal expresses on the array are considered, so that target speech, interfering speech, and environmental noise are well distinguished.
DNN_VAD is mature in near-field voice interaction; in far-field voice interaction its performance still needs improvement, and training with far-field speech data helps. However, when the far-field acoustic environment is poor, it remains difficult to obtain satisfactory performance with DNN_VAD alone. The present application improves performance by combining the DNN_VAD and BEAM_VAD methods.
Fig. 2 is a flowchart of another embodiment of the voice endpoint detection method according to the present application, in which the directions of arrival are divided into a plurality of direction groups, and each direction group includes a plurality of directions of arrival. As shown in fig. 2, the determining a plurality of directions of arrival from the speech enhanced audio signal comprises:
s21, determining a plurality of spatial energy spectrums corresponding to the multiframe audio signals in the audio signals after the voice enhancement;
and S22, selecting, in each spatial energy spectrum, the plurality of directions of arrival corresponding to the largest energy values, sorted in descending order, so as to determine the plurality of direction groups.
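The per-frame selection in S21 and S22 amounts to keeping the directions whose spatial-spectrum energy is largest. A sketch, with illustrative function and variable names that are not from the patent:

```python
import numpy as np

def top_doas(spatial_spectrum, angles, k=2):
    """Return the k candidate angles with the largest spatial-spectrum
    energy (the detailed description keeps the top two as DOA1 and DOA2)."""
    order = np.argsort(spatial_spectrum)[::-1]  # indices by descending energy
    return [angles[i] for i in order[:k]]
```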
In the embodiment of the application, BEAM_VAD mainly uses the directivity of the sound source, observed through the microphone array, to judge whether a signal source exists in the target direction; in addition, the continuity of the speech signal is used to keep the BEAM_VAD state continuous and stable, which helps improve the accuracy and reliability of voice endpoint detection.
In some embodiments, the judging whether any of the plurality of directions of arrival lies within the target human voice azimuth region and generating the first judgment result includes:
determining, for each of the plurality of direction groups, whether any direction of arrival lies within the target human voice azimuth region, and generating the first judgment result: if yes, it is judged that the target speech signal exists; otherwise, it is judged that the target speech signal does not exist.
As shown in fig. 3, which is a flowchart of another embodiment of the voice endpoint detection method of the present application, in this embodiment, the inputting the voice-enhanced audio signal into a pre-trained deep neural network to determine whether a target voice signal exists, and generating a second determination result includes:
s41, selecting a frame from the audio signals after the voice enhancement as a target frame audio signal;
s42, selecting a front multiframe audio signal and a rear multiframe audio signal of the target frame audio signal;
s43, inputting the target frame audio signal, the front multi-frame audio signal and the rear multi-frame audio signal to a pre-trained deep neural network to obtain a plurality of corresponding voice existence probability values;
and S44, judging whether the mean of the plurality of speech presence probability values is greater than a set threshold, and generating a second judgment result: if yes, it is judged that the target speech signal exists; otherwise, it is judged that the target speech signal does not exist.
In this embodiment, the pre-trained deep neural network processes multiple frames of the audio signal to obtain the corresponding speech presence probability values, and the probabilities are averaged and compared with a set threshold to judge whether the target speech signal exists. This avoids misjudgments caused by occasional errors in single-frame processing and improves the accuracy and reliability of the judgment.
The application adopts the method of combining DNN_VAD and BEAM_VAD, i.e. it considers both the classification of speech and noise and the direction information that the multi-channel speech signal expresses on the array. The method comprises the following steps:
step 1: the microphone array collects data, and frames and windows are carried out on the original data, wherein the frame length is 32ms, and the frame shift is 16 ms.
Step 2: speech enhancement is performed in the normal direction of the linear microphone array; for example, in subway ticketing equipment this direction is chosen because the user of a screen-mounted device, such as a subway ticket machine, mainly stands in the normal direction of the array. For beamforming, the GSC (Generalized Sidelobe Canceller) framework is mainly adopted in the industry.
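A full GSC is beyond a short sketch, but its fixed branch for the broadside (normal) direction of a linear array reduces to channel averaging, since the steering delays toward broadside are all zero. The following minimal stand-in is an assumption for illustration, not the GSC actually used:

```python
import numpy as np

def broadside_delay_and_sum(channels):
    """Delay-and-sum toward broadside: zero steering delays, so the
    beamformer reduces to the mean over microphone channels. A real GSC
    adds an adaptive blocking/noise-cancelling branch on top of this."""
    return np.mean(np.asarray(channels, dtype=float), axis=0)
```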
Step 3: the speech-enhanced data is used to compute the DOA (Direction of Arrival) with the GCC-PHAT algorithm; the positions corresponding to the two largest spatial-spectrum energy values are generally selected as DOA1 and DOA2.
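The time-delay estimate underlying GCC-PHAT can be sketched in its standard two-channel form. Mapping the lag to an angle additionally requires the array geometry, which the description does not fix, so only the lag estimate is shown:

```python
import numpy as np

def gcc_phat_lag(x, y):
    """Return the lag (in samples) of the GCC-PHAT cross-correlation peak
    between two channels; a negative lag means y is delayed relative to x."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift
```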
Step 4: if at least one of DOA1 and DOA2 falls within the set target human voice azimuth region (the region can be chosen according to the specific product and application scenario; for example, the ticket buyer is assumed to stand in the 90-degree direction), the target-signal presence probability spp1 is set to 1.
Step 5: to increase the robustness of the algorithm, the spp1 values of 8 consecutive frames are examined. If spp1 is 1 for even a single frame, the final output state of BEAM_VAD is judged to be 1: the speech signal is continuous, and to ensure that no speech information is lost the BEAM_VAD output stays at 1. Only if spp1 is 0 for 8 consecutive frames is the final output state of BEAM_VAD judged to be 0.
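The 8-frame hold of step 5 is effectively a hangover scheme: the output is 1 while any of the last 8 frames saw a target-direction source, and drops to 0 only after 8 consecutive inactive frames. A sketch, with illustrative class and method names:

```python
from collections import deque

class BeamVad:
    """BEAM_VAD output with an 8-frame hangover (step 5)."""
    def __init__(self, hangover=8):
        self.recent = deque(maxlen=hangover)  # last `hangover` spp1 values

    def update(self, spp1):
        self.recent.append(spp1)
        # 1 while any recent frame was active; after 8 consecutive zeros
        # the last 1 has dropped out of the buffer and the output falls to 0
        return 1 if any(self.recent) else 0
```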
Step 6: 24-dimensional fbank features are extracted from the single-frame beamformed audio signal.
Step 7: the features are input to a pre-trained DNN, which acts as a classifier and computes the probability spp2 that the current frame is speech.
Step 8: the length of buffer_spp2 is set to 7; the 4th position of buffer_spp2 stores the speech presence probability of the current frame, with 3 frames of look-back and 3 frames of look-ahead. DNN_VAD mainly has to balance latency against speech duration: if buffer_spp2 is too long, the resulting delay degrades the user experience; if it is too short, it cannot represent the speech signal well and the accuracy of the judgment suffers. Through many experiments, with a speech signal frame length of 1024, a buffer_spp2 length of 7 was selected.
Step 9: DNN_VAD makes its decision based on buffer_spp2: the buffer is averaged, and if the mean is greater than a set threshold the output state of DNN_VAD is 1, otherwise it is 0. The threshold is set according to statistics of positive and negative examples after DNN training; based on these statistics, the threshold was determined to be 0.35.
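Steps 8 and 9 can be sketched as follows. The buffer layout (current frame at the 4th of 7 positions, with 3 frames of context on each side) and the 0.35 threshold come from the description; the function name is illustrative.

```python
def dnn_vad_decide(buffer_spp2, threshold=0.35):
    """Average the 7-entry speech presence buffer and threshold it."""
    assert len(buffer_spp2) == 7  # 3 look-back + current frame + 3 look-ahead
    return 1 if sum(buffer_spp2) / len(buffer_spp2) > threshold else 0
```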
Step 10: the DNN_VAD and BEAM_VAD results are fused by taking their logical AND, and the result is output as the final VAD state.
This application is particularly valuable in voice interaction, especially multi-round interaction in far-field noisy environments; for example, it shows great advantages on devices such as subway ticket vending machines and shopping-mall televisions. VAD with spatial information ensures that the voice dialogue is not disturbed by external speech and stays focused on the target speaker, improving the user experience.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 4, an embodiment of the present application further provides a voice endpoint detection system 400, including:
a signal enhancement module 410, configured to perform speech enhancement on an audio signal collected by a microphone array;
a direction-of-arrival determination module 420 for determining a plurality of directions of arrival from the speech enhanced audio signal;
a first judging module 430, configured to judge whether any of the plurality of directions of arrival lies within the target human voice azimuth region, and generate a first judgment result;
the second judgment module 440 is configured to input the audio signal after the speech enhancement to a pre-trained deep neural network to judge whether a target speech signal exists, and generate a second judgment result;
and a target voice signal determining module 450, configured to determine whether a target voice signal exists according to the first determination result and the second determination result.
The embodiment of the present application provides a new VAD architecture to solve the problem of VAD being falsely triggered by interfering speech: the final VAD state is jointly determined by deep-neural-network voice activity detection (DNN_VAD) and beam voice activity detection (BEAM_VAD). Both the classification of speech and noise and the direction information that the multi-channel speech signal expresses on the array are considered, so that target speech, interfering speech, and environmental noise are well distinguished.
In some embodiments, the plurality of directions of arrival may be divided into a plurality of direction groups, each direction group including a plurality of directions of arrival; as shown in fig. 5, in this embodiment, the direction of arrival determining module 420 includes:
a spatial energy spectrum determination unit 421 configured to determine a plurality of spatial energy spectra corresponding to a plurality of frames of audio signals in the speech-enhanced audio signal;
a direction-of-arrival determining unit 422, configured to select, in each of the spatial energy spectra, the plurality of directions of arrival corresponding to the largest energy values, sorted in descending order, so as to determine the plurality of direction groups.
In some embodiments, the judging whether any of the plurality of directions of arrival lies within the target human voice azimuth region and generating the first judgment result includes:
determining, for each of the plurality of direction groups, whether any direction of arrival lies within the target human voice azimuth region, and generating the first judgment result: if yes, it is judged that the target speech signal exists; otherwise, it is judged that the target speech signal does not exist.
As shown in fig. 6, in some embodiments, the second determining module 440 in the voice endpoint detection system of the present application includes:
a first signal frame selection unit 441 configured to select one frame from the audio signals after speech enhancement as a target frame audio signal;
a second signal frame selecting unit 442 for selecting a preceding multi-frame audio signal and a following multi-frame audio signal of the target frame audio signal;
the probability value calculation unit 443 is configured to input the target frame audio signal, the previous multi-frame audio signal, and the subsequent multi-frame audio signal to a pre-trained deep neural network to obtain a plurality of corresponding speech existence probability values;
and the judging unit 444 is configured to judge whether the mean value of the multiple voice existence probability values is greater than a set threshold, and generate a second judgment result, where if yes, it is judged that the target voice signal exists, and otherwise, it is judged that the target voice signal does not exist.
Fig. 7 is a schematic block diagram of another embodiment of the voice endpoint detection system of the present application, in which the voice endpoint detection system includes:
and the data acquisition preprocessing module is used for performing frame windowing processing on the audio data acquired by the microphone array, wherein the frame length is 32ms, and the frame shift is 16 ms.
And the beam forming module is used for performing adaptive filtering on the output data of the data acquisition preprocessing module to obtain enhanced audio, for example, in subway ticket purchasing equipment, speech enhancement is performed in the normal direction of the linear microphone array.
And the DOA estimation module is used for processing the enhanced data with the GCC-PHAT algorithm to compute the DOA; the positions corresponding to the two largest spatial-spectrum energy values are generally selected as DOA1 and DOA2.
And the target signal judging module is used for setting the target-signal presence probability spp1 to 1 when at least one of DOA1 and DOA2 falls within the set target human voice azimuth region.
And the beam voice activity endpoint detection state judgment module is used for judging whether a voice endpoint exists; to increase the robustness of the algorithm, the spp1 values of 8 consecutive frames are examined. If spp1 is 1 for even a single frame, the final output state of BEAM_VAD is judged to be 1: the speech signal is continuous, and to ensure that no speech information is lost the BEAM_VAD output stays at 1. Only if spp1 is 0 for 8 consecutive frames is the final output state of BEAM_VAD judged to be 0.
And the characteristic extraction module is used for extracting 24-dimensional fbank characteristics by adopting the audio signals after the single-frame beam forming.
And the deep neural network decoding module is used for inputting the extracted features into a pre-trained DNN network, and the DNN network is used as a classifier to calculate the existence probability spp2 of the current frame belonging to the voice.
And the state caching module is used for setting the length of buffer_spp2 to 7 and storing the speech presence probability of the current frame at the 4th position of buffer_spp2, with 3 frames of look-back and 3 frames of look-ahead. DNN_VAD mainly has to balance latency against speech duration: if buffer_spp2 is too long, the resulting delay degrades the user experience; if it is too short, it cannot represent the speech signal well and the accuracy of the judgment suffers. Through many experiments, with a speech signal frame length of 1024, a buffer_spp2 length of 7 was selected.
And the deep neural network voice activity endpoint detection state judgment module is used for judging whether a voice endpoint exists. DNN_VAD makes its decision based on buffer_spp2: the buffer is averaged, and if the mean is greater than a set threshold the output state of DNN_VAD is 1, otherwise it is 0. The threshold is set according to statistics of positive and negative examples after DNN training; based on these statistics, the threshold was determined to be 0.35.
And the voice activity endpoint detection joint judgment module is used for fusing the results of DNN_VAD and BEAM_VAD: the final VAD state is output as the logical AND of the DNN_VAD and BEAM_VAD states.
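Putting the two detectors together, a minimal sketch of the DNN_VAD thresholding and the joint AND decision follows, using the threshold of 0.35 stated above; the function names are assumptions.

```python
def dnn_vad_state(buffer_spp2, threshold=0.35):
    """DNN_VAD decision: average the 7 buffered speech presence
    probabilities and compare the mean against the trained threshold."""
    mean_spp = sum(buffer_spp2) / len(buffer_spp2)
    return 1 if mean_spp > threshold else 0

def joint_vad(dnn_state, beam_state):
    """Final VAD: logical AND, so speech is declared only when both the
    DNN-based and the beam/DOA-based detectors agree."""
    return dnn_state & beam_state
```

The AND fusion suppresses false alarms from either branch alone: directional interference rejected by the DNN, and diffuse speech-like noise rejected by the spatial check.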
In some embodiments, the present application provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the voice endpoint detection methods described above in the present application.
In some embodiments, the present application further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above-mentioned voice endpoint detection methods.
In some embodiments, the present application further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a voice endpoint detection method.
In some embodiments, the present application further provides a storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the voice endpoint detection method.
The voice endpoint detection system according to the embodiments of the present application may be configured to execute the voice endpoint detection method according to the embodiments of the present application, and accordingly achieves the technical effects achieved by implementing that method; details are not repeated here. In the embodiments of the present application, the relevant functional modules may be implemented by a hardware processor.
Fig. 8 is a schematic hardware structure diagram of an electronic device for performing the voice endpoint detection method according to another embodiment of the present application. As shown in Fig. 8, the electronic device includes one or more processors 810 and a memory 820, with one processor 810 taken as an example in Fig. 8.
The apparatus for performing the voice endpoint detection method may further include: an input device 830 and an output device 840.
The processor 810, the memory 820, the input device 830, and the output device 840 may be connected by a bus or other means; connection by a bus is taken as an example in Fig. 8.
The memory 820, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the voice endpoint detection method in the embodiments of the present application. The processor 810 executes various functional applications and data processing of the server by executing nonvolatile software programs, instructions and modules stored in the memory 820, so as to implement the voice endpoint detection method of the above-mentioned method embodiment.
The memory 820 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice endpoint detection apparatus, and the like. Further, the memory 820 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 820 may optionally include memory located remotely from processor 810, which may be connected to a voice endpoint detection apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may receive input numeric or character information and generate signals related to user settings and function control of the voice endpoint detection device. The output device 840 may include a display device such as a display screen.
The one or more modules are stored in the memory 820 and, when executed by the one or more processors 810, perform the voice endpoint detection method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and are primarily targeted at providing voice and data communication. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, electronic book readers, smart toys, and portable car navigation devices.
(4) Servers, which have an architecture similar to that of a general-purpose computer but, because they must provide highly reliable services, have higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the parts of the above technical solutions that in essence contribute over the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A voice endpoint detection method, comprising:
carrying out voice enhancement on an audio signal collected by a microphone array;
determining a plurality of directions of arrival from the speech enhanced audio signal;
judging whether any direction of arrival among the plurality of directions of arrival falls within a target human voice azimuth area, and generating a first judgment result, wherein the target human voice azimuth area is set according to the specific product and application scenario;
inputting the audio signal after the voice enhancement into a pre-trained deep neural network to judge whether a target voice signal exists or not and generating a second judgment result;
and determining whether the target voice signal exists according to the first judgment result and the second judgment result.
2. The method of claim 1, wherein the plurality of directions of arrival are divisible into a plurality of direction groups, each direction group including a plurality of directions of arrival;
the determining a plurality of directions of arrival from the speech enhanced audio signal comprises:
determining a plurality of spatial energy spectrums corresponding to a plurality of frames of audio signals in the audio signals after the voice enhancement;
and selecting, in each spatial energy spectrum, a plurality of directions of arrival corresponding to the energy values sorted in descending order, so as to determine the plurality of direction groups.
3. The method according to claim 2, wherein the judging whether any direction of arrival among the plurality of directions of arrival falls within the target human voice azimuth area, and generating a first judgment result comprises:
and respectively determining whether the direction of arrival exists in the target human voice azimuth area in the plurality of direction groups, and generating a first judgment result, wherein if yes, the target voice signal is judged to exist, and otherwise, the target voice signal is judged not to exist.
4. The method of claim 1, wherein the inputting the speech enhanced audio signal to a pre-trained deep neural network to determine whether a target speech signal is present and generating a second determination result comprises:
selecting a frame from the audio signals after the voice enhancement as a target frame audio signal;
selecting a plurality of frames of audio signals preceding the target frame audio signal and a plurality of frames of audio signals following it;
inputting the target frame audio signal, the preceding multi-frame audio signals and the following multi-frame audio signals to a pre-trained deep neural network to obtain a plurality of corresponding voice existence probability values;
and judging whether the mean value of the probability values of the existence of the multiple voices is greater than a set threshold value or not, and generating a second judgment result, wherein if yes, the existence of the target voice signal is judged, and otherwise, the nonexistence of the target voice signal is judged.
5. A voice endpoint detection system comprising:
the signal enhancement module is used for carrying out voice enhancement on the audio signals collected by the microphone array;
the direction-of-arrival determining module is used for determining a plurality of directions of arrival according to the audio signal after the voice enhancement;
the first judgment module is used for judging whether the direction of arrival exists in the target human voice azimuth area or not in the plurality of directions of arrival and generating a first judgment result; setting the target human voice azimuth area according to specific products and application scenes;
the second judgment module is used for inputting the audio signal after the voice enhancement into a pre-trained deep neural network so as to judge whether a target voice signal exists or not and generate a second judgment result;
and the target voice signal determining module is used for determining whether a target voice signal exists according to the first judgment result and the second judgment result.
6. The system of claim 5, wherein the plurality of directions of arrival are divisible into a plurality of direction groups, each direction group including a plurality of directions of arrival;
the direction of arrival determination module comprises:
a spatial energy spectrum determination unit configured to determine a plurality of spatial energy spectra corresponding to a plurality of frames of audio signals in the speech-enhanced audio signal;
and the direction-of-arrival determining unit is used for selecting a plurality of directions of arrival corresponding to a plurality of energy values in each spatial energy spectrum, wherein the energy values are ranked from large to small, so as to determine the plurality of direction groups.
7. The system according to claim 6, wherein the determining whether any direction of arrival among the plurality of directions of arrival falls within the target human voice azimuth area, and generating a first judgment result comprises:
and respectively determining whether the direction of arrival exists in the target human voice azimuth area in the plurality of direction groups, and generating a first judgment result, wherein if yes, the target voice signal is judged to exist, and otherwise, the target voice signal is judged not to exist.
8. The system of claim 5, wherein the second determination module comprises:
a first signal frame selection unit for selecting a frame from the voice-enhanced audio signal as a target frame audio signal;
a second signal frame selection unit configured to select a preceding multiframe audio signal and a following multiframe audio signal of the target frame audio signal;
the probability value calculation unit is used for respectively inputting the target frame audio signal, the front multi-frame audio signal and the rear multi-frame audio signal into a pre-trained deep neural network to obtain a plurality of corresponding voice existence probability values;
and the judging unit is used for judging whether the mean value of the probability values of the existence of the multiple voices is greater than a set threshold value or not and generating a second judgment result, wherein if the mean value of the probability values of the existence of the multiple voices is greater than the set threshold value, the target voice signal is judged to exist, and if the mean value of the probability values of the existence of the multiple voices is not greater than the set threshold value, the target voice signal is judged to not exist.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN201910918858.9A 2019-09-26 2019-09-26 Voice endpoint detection method and system Active CN110648692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918858.9A CN110648692B (en) 2019-09-26 2019-09-26 Voice endpoint detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918858.9A CN110648692B (en) 2019-09-26 2019-09-26 Voice endpoint detection method and system

Publications (2)

Publication Number Publication Date
CN110648692A CN110648692A (en) 2020-01-03
CN110648692B true CN110648692B (en) 2022-04-12

Family

ID=68992902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918858.9A Active CN110648692B (en) 2019-09-26 2019-09-26 Voice endpoint detection method and system

Country Status (1)

Country Link
CN (1) CN110648692B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750461B (en) * 2020-02-26 2023-08-01 腾讯科技(深圳)有限公司 Voice communication optimization method and device, electronic equipment and readable storage medium
CN111653287A (en) * 2020-06-04 2020-09-11 重庆邮电大学 Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient
CN111863036B (en) * 2020-07-20 2022-03-01 北京百度网讯科技有限公司 Voice detection method and device
CN111883160B (en) * 2020-08-07 2024-04-16 上海茂声智能科技有限公司 Directional microphone array pickup noise reduction method and device
CN112735482B (en) * 2020-12-04 2024-02-13 珠海亿智电子科技有限公司 Endpoint detection method and system based on joint deep neural network
CN113066500B (en) * 2021-03-30 2023-05-23 联想(北京)有限公司 Sound collection method, device and equipment and storage medium
CN113113001A (en) * 2021-04-20 2021-07-13 深圳市友杰智新科技有限公司 Human voice activation detection method and device, computer equipment and storage medium
CN113628638A (en) * 2021-07-30 2021-11-09 深圳海翼智新科技有限公司 Audio processing method, device, equipment and storage medium
CN114554353B (en) * 2022-02-24 2024-01-16 北京小米移动软件有限公司 Audio processing method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103137139A (en) * 2008-06-30 2013-06-05 杜比实验室特许公司 Multi-microphone voice activity detector
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN105532017A (en) * 2013-03-12 2016-04-27 谷歌技术控股有限责任公司 Apparatus and method for beamforming to obtain voice and noise signals
CN107527630A (en) * 2017-09-22 2017-12-29 百度在线网络技术(北京)有限公司 Sound end detecting method, device and computer equipment
US10134425B1 (en) * 2015-06-29 2018-11-20 Amazon Technologies, Inc. Direction-based speech endpointing
CN108962226A (en) * 2018-07-18 2018-12-07 百度在线网络技术(北京)有限公司 Method and apparatus for detecting the endpoint of voice
CN110047519A (en) * 2019-04-16 2019-07-23 广州大学 A kind of sound end detecting method, device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2379148A (en) * 2001-08-21 2003-02-26 Mitel Knowledge Corp Voice activity detection


Also Published As

Publication number Publication date
CN110648692A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
CN110648692B (en) Voice endpoint detection method and system
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN108899044B (en) Voice signal processing method and device
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
CN108922553B (en) Direction-of-arrival estimation method and system for sound box equipment
CN110473539B (en) Method and device for improving voice awakening performance
CN110956957B (en) Training method and system of speech enhancement model
US9008329B1 (en) Noise reduction using multi-feature cluster tracker
US7383178B2 (en) System and method for speech processing using independent component analysis under stability constraints
CN109473118B (en) Dual-channel speech enhancement method and device
CN109346062B (en) Voice endpoint detection method and device
CN103151039A (en) Speaker age identification method based on SVM (Support Vector Machine)
WO2013040414A1 (en) Mobile device context information using speech detection
CN110827858B (en) Voice endpoint detection method and system
CN110400572B (en) Audio enhancement method and system
CN110600059A (en) Acoustic event detection method and device, electronic equipment and storage medium
CN112562742B (en) Voice processing method and device
CN111816216A (en) Voice activity detection method and device
CN110890104B (en) Voice endpoint detection method and system
CN114120984A (en) Voice interaction method, electronic device and storage medium
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN112614506B (en) Voice activation detection method and device
CN114333912B (en) Voice activation detection method, device, electronic equipment and storage medium
CN112466305B (en) Voice control method and device of water dispenser
CN110838307B (en) Voice message processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant