WO2020253073A1

WO2020253073A1 - Speech endpoint detection method, apparatus and device, and storage medium

Info

Publication number: WO2020253073A1
Application number: PCT/CN2019/118699
Authority: WO
Inventors: 魏韬; 马骏; 王少军
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-06-17
Filing date: 2019-11-15
Publication date: 2020-12-24
Also published as: CN110335593A

Abstract

A speech endpoint detection method, comprising the following steps: obtaining input speech to be detected and a preset speech frame detection model (S10); performing framing processing on the input speech to obtain multiple speech frames having time sequences (S20); sequentially inputting speech frames of the input speech into the speech frame detection model for detection, and outputting first detection results corresponding to the speech frames (S30); sequentially performing harmonic energy detection on the speech frames of the input speech to obtain second detection results corresponding to the speech frames (S40); determining frame types corresponding to the speech frames on the basis of the first detection results and the second detection results (S50); and determining a speech starting endpoint and a speech ending endpoint of the input speech on the basis of the frame types corresponding to the speech frames (S60). The method improves the accuracy of speech endpoint detection.

Description

Voice endpoint detection method, device, equipment and storage medium

This application claims priority to Chinese patent application No. 201910521084.6, filed with the Chinese Patent Office on June 17, 2019 and entitled "Voice Endpoint Detection Method, Apparatus, Equipment and Storage Medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a voice endpoint detection method, device, equipment and storage medium.

Background technique

Existing voice recognition technology often requires voice endpoint detection, that is, detecting the start and end positions of the voice. Current voice endpoint detection algorithms are usually only suitable for voice recognition in relatively quiet scenes. They work well for relatively stable noise (such as white noise or sirens) but poorly in noisy environments (such as public places where many people are speaking). The reason is that the noise in such situations also has the characteristics of speech, so it is difficult to accurately distinguish noise from speech, which leads to a low speech recognition rate.

Summary of the invention

The main purpose of this application is to provide a voice endpoint detection method, apparatus, device, and storage medium, aiming to solve the technical problem of poor voice recognition accuracy caused by poor voice endpoint detection.

To achieve the foregoing objective, the present application provides a voice endpoint detection method, which includes the following steps:

Obtain the input voice to be detected and the preset voice frame detection model;

Framing the input voice to obtain multiple voice frames with time sequence;

Sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

Sequentially perform harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;

Determine a frame type corresponding to each speech frame based on the first detection result and the second detection result, the frame type including valid speech frames and noise frames;

Based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.

Optionally, the voice frame detection model includes: a voice model and a noise model; before the step of acquiring the input voice to be detected and a preset voice frame detection model, it further includes:

Take normal speech data as training samples, and use the preset first machine learning algorithm for training to construct a speech model for detecting valid speech frames;

Using real environmental noise as a training sample, a preset second machine learning algorithm is used for training, and a noise model is constructed for use in detecting noise frames.

Optionally, the sequentially inputting each voice frame of the input voice into the voice frame detection model for detection, and outputting the first detection result corresponding to each voice frame includes:

Sequentially input each voice frame of the input voice into the voice model for detection, and output the first probability value of each voice frame as a valid voice frame;

Sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value of each voice frame as a noise frame;

Based on the first probability value and the second probability value, the first detection result corresponding to each speech frame is output, wherein if the first probability value of the speech frame being a valid speech frame is greater than the second probability value of it being a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.

Optionally, the performing harmonic energy detection on each voice frame of the input voice in sequence to obtain the second detection result corresponding to each voice frame includes:

Sequentially extracting the short-term speech energy in the time domain of the i-th speech frame of the input speech;

Determine whether the short-term speech energy corresponding to the i-th speech frame is greater than the preset short-term speech energy;

If yes, it is determined that the i-th speech frame is a valid speech frame, otherwise it is a noise frame.

Optionally, the calculation formula of the short-term speech energy is as follows:

Here, M(i) represents the short-term speech energy of the i-th speech frame; x(n) represents the time-domain signal of the speech waveform; w(n) represents the window function; y_i(n) represents the i-th frame of the speech signal obtained after framing with w(n); b represents the frame shift length; n = 1, 2, ..., L; i = 1, 2, ..., f_n; L represents the frame length; and f_n represents the total number of frames after framing.

Optionally, the determining the frame category corresponding to each speech frame based on the first detection result and the second detection result includes:

If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a valid voice frame;

If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a noise frame, determining that the frame category corresponding to the voice frame is a noise frame;

If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a noise frame;

If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a noise frame, it is determined that the frame category corresponding to the voice frame is a noise frame.

Optionally, the determining the voice start endpoint and the voice end endpoint of the input voice based on the frame category corresponding to each voice frame includes:

In a preset detection window, determine whether the frame type corresponding to each voice frame in the detection window meets a preset voice endpoint determination condition;

If it is satisfied, it is determined that the voice start endpoint or the voice end endpoint of the input voice is located in the current detection window;

Wherein, the voice endpoint determination condition includes: if the ratio of valid voice frames in the current detection window exceeds the preset first ratio, it is determined that the voice start endpoint of the input voice exists in the current detection window; if the ratio of valid voice frames in the current detection window is lower than the preset second ratio, it is determined that the voice end endpoint of the input voice exists in the current detection window.

Further, in order to achieve the above objective, the present application also provides a voice endpoint detection device, the voice endpoint detection device includes:

The acquisition module is used to acquire the input voice to be detected and the preset voice frame detection model;

The framing module is used to perform framing processing on the input voice to obtain multiple voice frames with time sequence;

The first detection module is configured to sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

The second detection module is configured to perform harmonic energy detection on each voice frame of the input voice in sequence to obtain a second detection result corresponding to each voice frame;

A frame type determining module, configured to determine a frame type corresponding to each speech frame based on the first detection result and the second detection result, the frame type including valid speech frames and noise frames;

The voice endpoint determination module is used to determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.

Further, in order to achieve the above objective, the present application also provides a voice endpoint detection device, which includes a memory, a processor, and a voice endpoint detection program stored in the memory and executable on the processor; when the voice endpoint detection program is executed by the processor, the steps of the voice endpoint detection method described in any one of the above are implemented.

Further, in order to achieve the above objective, the present application also provides a computer-readable storage medium having a voice endpoint detection program stored thereon; when the voice endpoint detection program is executed by a processor, the steps of the voice endpoint detection method described in any one of the above are implemented.

This application uses a preset speech frame detection model and a harmonic energy detection method to detect each speech frame of the input speech, and then combines the two detection results to determine whether each speech frame is a valid speech frame or a noise frame; finally, based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined. This application integrates multiple detection algorithms, which can improve the accuracy of voice endpoint detection to a certain extent. Because the voice endpoint is determined according to the frame category corresponding to each voice frame, the method can adapt to various voice recognition scenarios and improve the speech recognition accuracy.

Description of the drawings

FIG. 1 is a schematic structural diagram of an operating environment of a voice endpoint detection device involved in a solution of an embodiment of the application;

FIG. 2 is a schematic flowchart of an embodiment of a voice endpoint detection method according to this application;

FIG. 3 is a detailed flowchart of an embodiment of step S30 in FIG. 2;

FIG. 4 is a detailed flowchart of an embodiment of step S40 in FIG. 2;

FIG. 5 is a detailed flowchart of an embodiment of step S60 in FIG. 2;

FIG. 6 is a schematic diagram of functional modules of an embodiment of voice endpoint detection according to this application.

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described herein are only used to explain the application, and not used to limit the application.

This application provides a voice endpoint detection device.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an operating environment of a voice endpoint detection device involved in a solution in an embodiment of this application.

As shown in FIG. 1, the voice endpoint detection device includes: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Among them, the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the foregoing processor 1001.

Those skilled in the art can understand that the hardware structure of the voice endpoint detection device shown in FIG. 1 does not constitute a limitation on the voice endpoint detection device, which may include more or fewer components than shown in the figure, a combination of certain components, or a different arrangement of components.

As shown in FIG. 1, the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a voice endpoint detection program. Among them, the operating system is a program that manages and controls the voice endpoint detection equipment and software resources, and supports the operation of the voice endpoint detection program and other software and/or programs.

In the hardware structure of the voice endpoint detection device shown in FIG. 1, the network interface 1004 is mainly used to access the network; the user interface 1003 is mainly used to detect and confirm instructions and edit instructions. The processor 1001 may be used to call the voice endpoint detection program stored in the memory 1005, and execute the operations of the following embodiments of the voice endpoint detection method.

Based on the foregoing hardware structure of the voice endpoint detection device, various embodiments of the voice endpoint detection method of the present application are proposed.

Referring to Fig. 2, Fig. 2 is a schematic flowchart of an embodiment of a voice endpoint detection method according to the present application. In this embodiment, the voice endpoint detection method includes the following steps:

Step S10, acquiring the input voice to be detected and a preset voice frame detection model;

In this embodiment, the input voice is not limited, and it may be voice in a quiet environment or voice in various noisy environments. At the same time, in order to improve the accuracy of voice endpoint detection, this embodiment pre-trains a voice frame detection model, and detects the input voice through the voice frame detection model.

Step S20: Perform framing processing on the input voice to obtain multiple voice frames with time series;

A voice signal is usually non-stationary at the macro level but stationary at the micro level, that is, it has short-term stationarity (the voice signal can be considered approximately unchanged within 10-30 ms). Therefore, in voice signal processing, the signal needs to be framed in order to reduce the influence of its overall non-stationarity and time variation. That is, the voice signal is divided into short segments for processing, and each short segment is called a frame.
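
The framing of step S20 can be sketched as follows. This is a minimal illustration, not the implementation of this application: the function name `split_into_frames`, the 16 kHz sampling rate, the 25 ms frame length and the 10 ms frame shift are assumptions chosen for the example, and trailing samples that do not fill a whole frame are simply dropped.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float],
                      frame_len: int,
                      frame_shift: int) -> List[List[float]]:
    """Split a signal into (possibly overlapping) frames, kept in time order.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch; zero-padding would also work).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += frame_shift
    return frames


# Hypothetical settings: 16 kHz audio, 25 ms frames (400 samples),
# 10 ms frame shift (160 samples).
signal = [0.0] * 16000  # one second of silence as a stand-in for input voice
frames = split_into_frames(signal, frame_len=400, frame_shift=160)
print(len(frames))
```

A frame shift smaller than the frame length makes consecutive frames overlap, which is a common choice but is not required by the text above.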

Step S30, sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

In this embodiment, it is considered that existing voice endpoint detection has difficulty accurately distinguishing normal voice from noise in complex scenarios, mainly for the following two reasons. On the one hand, the scenarios to which existing voice endpoint detection algorithms apply are relatively limited: the detection effect is good for relatively stable noise (such as white noise or sirens) but poor in noisy environments (such as public places where many people are speaking). On the other hand, existing voice endpoint detection algorithms usually detect from only a single dimension, which is prone to misjudgment.

Therefore, this embodiment preferably adopts multiple methods to perform endpoint detection on the input voice. Since multiple detection methods are used to detect multiple dimensions, the advantages of multiple detection algorithms can be combined to make the detection result more accurate.

In this embodiment, a pre-trained voice frame detection model is used to detect each voice frame of the input voice, and the first detection result corresponding to each voice frame is output, for example, the probability that a given voice frame is a valid voice frame (that is, human speech) and the probability that it is a noise frame.

Step S40, performing harmonic energy detection on each voice frame of the input voice in turn, to obtain a second detection result corresponding to each voice frame;

In this embodiment, in addition to the detection based on the model dimension, each voice frame of the input speech is also detected based on the harmonic energy dimension. The voice signal is a harmonic signal with energy characteristics, and the harmonic energy can be measured by the magnitude of the harmonic amplitude. If the harmonic energy is high, the harmonic amplitude is large, and if the harmonic energy is low, the harmonic amplitude is small.

Therefore, in this implementation, the harmonic energy of each voice frame is detected to distinguish between valid voice frames and noise frames. Harmonic energy detection can quickly distinguish voice and noise in a quiet environment, but for a noisy environment, the accuracy of detection is reduced due to noise interference.

Step S50: Determine a frame category corresponding to each voice frame based on the first detection result and the second detection result, where the frame category includes valid voice frames and noise frames;

In this embodiment, because multiple detection algorithms are used, each voice frame of the input voice has multiple detection results, each indicating, for example, that the currently detected voice frame is a valid voice frame or a noise frame. The different detection results of the same speech frame may all be the same, may all be different, or may be partly the same and partly different. In this embodiment, a comprehensive analysis is performed by combining the first detection result of the model dimension and the second detection result of the harmonic energy dimension, and the frame type corresponding to each voice frame is then determined. A speech frame in this embodiment has not only speech features but also speech energy features, so the comprehensive judgment result obtained from multi-dimensional detection is credible.

Optionally, in an embodiment, the following rules are specifically used to determine the frame category corresponding to the voice frame:

A. If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a valid voice frame, determine that the frame category corresponding to the voice frame is a valid voice frame;

B. If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a noise frame, determine that the frame category corresponding to the voice frame is a noise frame;

C. If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a valid voice frame, determine that the frame category corresponding to the voice frame is a noise frame;

D. If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a noise frame, then it is determined that the frame category corresponding to the voice frame is a noise frame.

In this optional embodiment, when multiple detection models and detection algorithms are used for voice frame detection, the frame category corresponding to a voice frame is determined to be a valid voice frame if and only if the detection results are consistent and all indicate a valid voice frame; otherwise, the frame category corresponding to the voice frame is determined to be a noise frame.
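
Rules A to D amount to a logical AND of the two detection results. The following sketch shows this under stated assumptions: the names `FrameType`, `fuse` and `fuse_all` are hypothetical and are introduced only for illustration.

```python
from enum import Enum
from typing import List, Sequence


class FrameType(Enum):
    SPEECH = "valid speech frame"
    NOISE = "noise frame"


def fuse(first: FrameType, second: FrameType) -> FrameType:
    """Rules A-D: a frame is valid speech only if both detectors agree on speech."""
    if first is FrameType.SPEECH and second is FrameType.SPEECH:
        return FrameType.SPEECH
    return FrameType.NOISE


def fuse_all(first_results: Sequence[FrameType],
             second_results: Sequence[FrameType]) -> List[FrameType]:
    """Apply the fusion rule frame by frame, preserving time order."""
    if len(first_results) != len(second_results):
        raise ValueError("both detectors must report one result per frame")
    return [fuse(a, b) for a, b in zip(first_results, second_results)]


S, N = FrameType.SPEECH, FrameType.NOISE
print(fuse_all([S, S, N, N], [S, N, S, N]))
```

Requiring agreement trades some missed speech frames for fewer false alarms, which matches the stated aim of reducing misjudgment in noisy environments.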

Step S60: Determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.

Generally, in a quiet environment, the voice start endpoint corresponds to a valid voice frame, while the voice end endpoint corresponds to a noise frame (or silence). However, in a noisy environment, existing methods cannot be used to determine voice endpoints because of the interference of external environmental noise. This embodiment specifically determines the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame. For example, if multiple consecutive voice frames are valid voice frames, it is determined that the voice start endpoint currently exists, and if multiple consecutive voice frames are noise frames, it is determined that the voice end endpoint currently exists.

In this embodiment, a preset voice frame detection model and a harmonic energy detection method are used to detect each voice frame of the input voice, and then the two detection results are combined to determine whether each voice frame is a valid voice frame or a noise frame; finally, based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined. This embodiment integrates multiple detection algorithms, which can improve the accuracy of voice endpoint detection to a certain extent. Because the voice endpoint is determined according to the frame category corresponding to each voice frame, the method can adapt to various voice recognition scenarios and improve the accuracy of speech recognition.

Further, in an embodiment of the voice endpoint detection method of the present application, using multiple voice frame detection models to perform model-dimensional voice frame detection specifically includes:

(1) Voice model

In this embodiment, before performing voice endpoint detection, a voice model is constructed. Specifically, normal speech data is used as a training sample, and a preset first machine learning algorithm is used for training to construct a speech model for detecting valid speech frames.

In this embodiment, a preset machine learning algorithm is trained on pre-collected normal voice data to construct the voice model. For example, a deep learning algorithm, a long short-term memory network model, or another machine learning algorithm is used to build a model; the voice features of the normal voice data are extracted and input into the model for training, and a voice model that can detect valid voice frames is thereby constructed.

(2) Noise model

In this embodiment, before performing voice endpoint detection, a noise model is constructed. Specifically, real environmental noise is used as a training sample, and a preset second machine learning algorithm is used for training to construct a noise model for detecting noise frames.

In this embodiment, a preset machine learning algorithm is trained on pre-collected stable noise data and unstable noise data to construct the noise model. For example, a deep learning algorithm, a long short-term memory network model, or another machine learning algorithm is used to build a model; the features of the noise data are extracted and input into the model for training, and a noise model that can detect noise frames is thereby constructed.

Referring to FIG. 3, FIG. 3 is a detailed flowchart of an embodiment of step S30 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S30 further includes:

Step S301: sequentially input each voice frame of the input voice into the voice model for detection, and output a first probability value that each voice frame is a valid voice frame;

In this embodiment, based on the timing of each voice frame in the input voice, each voice frame is sequentially input into the trained voice model for detection, and the probability value that each voice frame is a valid voice frame is output.

Step S302: sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value for each voice frame as a noise frame;

In this embodiment, based on the timing of each voice frame in the input voice, each voice frame is sequentially input into the trained noise model for detection, and the probability value of each voice frame being a noise frame is output.

Step S303: based on the first probability value and the second probability value, output a first detection result corresponding to each speech frame, wherein if the first probability value of the speech frame being a valid speech frame is greater than the second probability value of it being a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.

In this embodiment, the same speech frame is input into two different models for speech frame recognition, thereby obtaining the probability value of the speech frame being a valid speech frame and the probability value of the speech frame being a noise frame. If the probability value of the speech frame being a valid speech frame is greater than the probability value of it being a noise frame, the speech frame is determined to be a valid speech frame; if the probability value of the speech frame being a noise frame is greater than the probability value of it being a valid speech frame, the speech frame is determined to be a noise frame.

For example, suppose the input speech contains three speech frames a, b, and c, each of which is input into the speech model and the noise model for detection. The probability values output by the speech model are 70%, 50%, and 80% in order, and the probability values output by the noise model are 45%, 80%, and 25% in order. It is finally determined that speech frame a is a valid speech frame, speech frame b is a noise frame, and speech frame c is a valid speech frame.
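
The comparison of step S303 can be reproduced on the three example frames as follows. The function name `first_detection` is hypothetical, and the probability values are hard-coded from the example above rather than produced by trained models.

```python
from typing import List, Sequence


def first_detection(speech_probs: Sequence[float],
                    noise_probs: Sequence[float]) -> List[bool]:
    """Return True for a valid speech frame, False for a noise frame.

    A frame is valid speech only when the speech-model probability is
    strictly greater than the noise-model probability (step S303);
    ties therefore fall into the "otherwise" branch and count as noise.
    """
    if len(speech_probs) != len(noise_probs):
        raise ValueError("both models must score every frame")
    return [p_s > p_n for p_s, p_n in zip(speech_probs, noise_probs)]


# Frames a, b, c from the example above.
speech_model_output = [0.70, 0.50, 0.80]
noise_model_output = [0.45, 0.80, 0.25]
print(first_detection(speech_model_output, noise_model_output))
# prints [True, False, True]: a and c are valid speech frames, b is a noise frame
```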

Referring to FIG. 4, FIG. 4 is a detailed flowchart of an embodiment of step S40 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S40 further includes:

Step S401, extracting the short-term speech energy in the time domain of the i-th speech frame of the input speech in sequence;

Step S402, judging whether the short-term speech energy corresponding to the i-th speech frame is greater than the preset short-term speech energy;

Step S403, if yes, determine that the i-th speech frame is a valid speech frame, otherwise it is a noise frame.

Short-term speech energy refers to the speech energy of audio signals in a relatively short time. The short time here usually refers to one frame of speech, that is, the speech energy within one frame is called short-term energy. In the same speech, the energy of the speech frame is usually much higher than that of the noise. Therefore, the short-term speech energy can be used to distinguish between effective speech frames and noise frames. In this embodiment, the calculation method for calculating the short-term speech energy is not limited.

Optionally, in an embodiment, the calculation formula of the short-term speech energy is as follows:

Here, M(i) represents the short-term speech energy of the i-th speech frame; x(n) represents the time-domain signal of the speech waveform; w(n) represents the window function; y_i(n) represents the i-th frame of the speech signal obtained after framing with w(n); b represents the frame shift length; n = 1, 2, ..., L; i = 1, 2, ..., f_n; L represents the frame length; and f_n represents the total number of frames after framing.
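
A conventional short-time energy definition that is consistent with the symbols listed above can be written as follows. It is given here as an assumption for illustration only; the formula of this application may differ, for example by summing absolute values |y_i(n)| instead of squares.

```latex
y_i(n) = w(n)\, x\big((i-1)\,b + n\big), \qquad 1 \le n \le L,\quad 1 \le i \le f_n

M(i) = \sum_{n=1}^{L} y_i(n)^2
```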

In this embodiment, after the short-term speech energy of a speech frame is calculated, it is determined whether the short-term speech energy of the speech frame exceeds the preset short-term speech energy threshold; if so, the speech frame is determined to be a valid speech frame, and otherwise it is determined to be a noise frame.
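
Steps S401 to S403 can be sketched as follows. This is a minimal illustration under stated assumptions: the energy is taken as the sum of squared samples of an already windowed frame, and the names `short_term_energy` and `energy_detection` as well as the threshold value 0.1 are hypothetical.

```python
from typing import List, Sequence


def short_term_energy(frame: Sequence[float]) -> float:
    """Sum of squared samples of one (already windowed) frame.

    Assumption of this sketch: energy is defined as a sum of squares.
    """
    return sum(sample * sample for sample in frame)


def energy_detection(frames: Sequence[Sequence[float]],
                     threshold: float) -> List[bool]:
    """Second detection result: True for valid speech, False for noise."""
    return [short_term_energy(frame) > threshold for frame in frames]


# Three toy frames: low-level noise, a loud frame, and digital silence.
frames = [
    [0.01, -0.02, 0.01],
    [0.5, -0.6, 0.4],
    [0.0, 0.0, 0.0],
]
print(energy_detection(frames, threshold=0.1))
# prints [False, True, False]
```

In practice the threshold would have to be tuned to the recording level, since absolute energy depends on microphone gain.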

This embodiment detects the input speech signal from the perspective of the short-term speech energy of the speech frame, thereby determining the frame category corresponding to each frame of the input speech. Because the short-term speech energy detection method is simple to compute and has a high recognition accuracy, the efficiency of voice endpoint detection for the input voice can be greatly improved.

Referring to FIG. 5, FIG. 5 is a detailed flowchart of an embodiment of step S60 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S60 further includes:

Step S601: In a preset detection window, determine whether the frame type corresponding to each voice frame in the detection window meets a preset voice endpoint determination condition;

In step S602, if it is satisfied, it is determined that the voice start endpoint or the voice end endpoint of the input voice is located in the current detection window.

Considering that judging the voice endpoint based on whether a single voice frame is a valid voice frame or a noise frame is prone to misjudgment, this embodiment combines a detection window with a frame ratio to determine the voice endpoint.

In this embodiment, the detection window specifically includes: a voice start endpoint detection window and a voice end endpoint detection window. Among them, the size of the detection window used for the voice start endpoint judgment is different from that used for the voice end endpoint judgment. Generally, the detection window used by the voice start endpoint is smaller than the detection window used by the voice end endpoint. Specific settings and adjustments can be made according to actual needs.

In this embodiment, the voice endpoint determination conditions specifically include:

A. Voice start endpoint determination condition: if the ratio of valid voice frames in the current detection window exceeds the preset first ratio, it is determined that the voice start endpoint of the input voice exists in the current detection window;

B. Voice end endpoint determination condition: if the ratio of valid voice frames in the current detection window is lower than the preset second ratio, it is determined that the voice end endpoint of the input voice exists in the current detection window.

For example, when performing voice start endpoint detection, a voice start endpoint detection window is set in advance, for example with a size of 20 frames. The number of valid voice frames in the detection window is then counted, and finally it is judged whether the ratio of valid speech frames to the total number of frames in the window exceeds the preset ratio (such as 60%). If so, it is determined that there is a speech start endpoint in the current detection window.

When detecting the voice end endpoint, a voice end endpoint detection window is set in advance (for example, with a size of 50 frames). The number of noise frames in the detection window is then counted, and finally it is judged whether the ratio of valid voice frames (the frames in the window that are not noise frames) to the total number of frames in the window is lower than a preset ratio (for example, 10%). If so, it is determined that a voice end endpoint exists in the current detection window.
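As a minimal, non-limiting sketch, the two determination conditions and the example values above (20 frames and 60% for the start, 50 frames and 10% for the end) can be expressed as follows. The function names `valid_ratio` and `find_endpoints` are hypothetical. The sketch also assumes that the window slides one frame at a time, that the search for the end endpoint begins after the window in which the start endpoint was found, and that each endpoint is reported as the index of the first frame of the matching window; the embodiment itself only states that the endpoint is located in that window.

```python
from typing import List, Optional, Tuple


def valid_ratio(labels: List[bool]) -> float:
    """Fraction of frames in a window that are valid voice frames (True)."""
    return sum(labels) / len(labels)


def find_endpoints(
    labels: List[bool],
    start_win: int = 20,       # example start-window size from the text
    start_ratio: float = 0.6,  # example "first ratio" from the text
    end_win: int = 50,         # example end-window size from the text
    end_ratio: float = 0.1,    # example "second ratio" from the text
) -> Tuple[Optional[int], Optional[int]]:
    """Return the first frame index of the window containing the start
    endpoint and of the window containing the end endpoint (or None)."""
    start = None
    # Condition A: ratio of valid voice frames exceeds the first ratio.
    for i in range(0, len(labels) - start_win + 1):
        if valid_ratio(labels[i:i + start_win]) > start_ratio:
            start = i
            break
    if start is None:
        return None, None

    end = None
    # Condition B: ratio of valid voice frames is lower than the second ratio.
    # Assumption: only windows after the start window are examined.
    for j in range(start + start_win, len(labels) - end_win + 1):
        if valid_ratio(labels[j:j + end_win]) < end_ratio:
            end = j
            break
    return start, end


if __name__ == "__main__":
    # 30 noise frames, 40 valid voice frames, 80 noise frames.
    labels = [False] * 30 + [True] * 40 + [False] * 80
    print(find_endpoints(labels))
```

With this toy label sequence the start window begins at frame 23 (13 of 20 frames are valid) and the end window begins at frame 66 (4 of 50 frames are valid). Using a larger window for the end endpoint makes the end decision less sensitive to short pauses inside an utterance.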

This application also provides a voice endpoint detection device.

Referring to FIG. 6, FIG. 6 is a schematic diagram of the functional modules of an embodiment of the voice endpoint detection device of this application. In this embodiment, the voice endpoint detection device includes:

The obtaining module 10 is used to obtain the input voice to be detected and the preset voice frame detection model;

The framing module 20 is configured to perform framing processing on the input voice to obtain multiple voice frames with time series;

The first detection module 30 is configured to sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

The second detection module 40 is configured to perform harmonic energy detection on each voice frame of the input voice in sequence to obtain a second detection result corresponding to each voice frame;

The frame type determining module 50 is configured to determine the frame type corresponding to each speech frame based on the first detection result and the second detection result, where the frame type includes valid speech frames and noise frames;

The voice endpoint determining module 60 is configured to determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
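The cooperation of the modules listed above can be sketched, purely for illustration, as a small class that wires the stages together. The class name `VoiceEndpointDetector` and its field names are hypothetical; each stage is passed in as a plain function so that any concrete framing, model, energy, or window logic can be substituted. The combination rule used for module 50 (a frame is a valid voice frame only when both detection results say so) follows the frame category rule stated in the claims.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]
Endpoints = Tuple[Optional[int], Optional[int]]


@dataclass
class VoiceEndpointDetector:
    """Hypothetical wiring of the functional modules 20 to 60."""
    frame: Callable[[Sequence[float]], List[Frame]]    # framing module 20
    model_detect: Callable[[Frame], bool]              # first detection module 30
    energy_detect: Callable[[Frame], bool]             # second detection module 40
    find_endpoints: Callable[[List[bool]], Endpoints]  # endpoint determining module 60

    def run(self, signal: Sequence[float]) -> Endpoints:
        # The input voice (obtaining module 10) is supplied by the caller.
        frames = self.frame(signal)
        first = [self.model_detect(f) for f in frames]
        second = [self.energy_detect(f) for f in frames]
        # Frame type determining module 50: valid only if both results agree.
        categories = [a and b for a, b in zip(first, second)]
        return self.find_endpoints(categories)


if __name__ == "__main__":
    # Toy stand-ins for every stage; none of them is the patented logic.
    detector = VoiceEndpointDetector(
        frame=lambda s: [s[i:i + 2] for i in range(0, len(s), 2)],
        model_detect=lambda f: max(f) > 0,
        energy_detect=lambda f: sum(v * v for v in f) > 1,
        find_endpoints=lambda c: (c.index(True), len(c) - 1 - c[::-1].index(True))
        if True in c else (None, None),
    )
    print(detector.run([0, 0, 2, 2, 0.5, 0.5, 3, 3, 0, 0]))
```

Passing the stages as callables keeps the sketch independent of any particular model or threshold.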

The embodiments of the voice endpoint detection device correspond to the embodiments of the voice endpoint detection method of the present application described above, so their content is not repeated here.

The present application also provides a computer-readable storage medium, where the computer-readable storage medium may be volatile or non-volatile, which is not specifically limited by the present application.

In this embodiment, a voice endpoint detection program is stored on the computer-readable storage medium, and when the voice endpoint detection program is executed by a processor, the steps of the voice endpoint detection method described in any of the above embodiments are implemented. For the method implemented when the voice endpoint detection program is executed by the processor, refer to the embodiments of the voice endpoint detection method of the present application; it is not repeated here.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific embodiments described above, which are only illustrative and not restrictive. Inspired by this application, those skilled in the art can devise many other forms without departing from the purpose of this application and the scope of protection of the claims. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect use in other related technical fields, falls within the protection scope of this application.

Claims

1. A voice endpoint detection method, comprising the following steps:

Obtain the input voice to be detected and the preset voice frame detection model;

Framing the input voice to obtain multiple voice frames with time sequence;

Sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

Sequentially perform harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;

Determine a frame category corresponding to each speech frame based on the first detection result and the second detection result, where the frame category includes valid speech frames and noise frames;

Based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.
2. The voice endpoint detection method according to claim 1, wherein the voice frame detection model comprises: a voice model and a noise model; before the step of obtaining the input voice to be detected and a preset voice frame detection model, the method further comprises:

Take normal speech data as training samples, and use the preset first machine learning algorithm for training to construct a speech model for detecting valid speech frames;

Using real environmental noise as a training sample, a preset second machine learning algorithm is used for training, and a noise model is constructed for use in detecting noise frames.
3. The voice endpoint detection method according to claim 2, wherein said sequentially inputting each voice frame of said input voice into said voice frame detection model for detection, and outputting a first detection result corresponding to each voice frame comprises:

Sequentially input each voice frame of the input voice into the voice model for detection, and output the first probability value of each voice frame as a valid voice frame;

Sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value of each voice frame as a noise frame;

Based on the first probability value and the second probability value, output the first detection result corresponding to each voice frame, wherein if the first probability value of the voice frame being a valid voice frame is greater than the second probability value of its being a noise frame, the voice frame is determined to be a valid voice frame; otherwise it is determined to be a noise frame.
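The comparison of the two model outputs can be sketched as follows, by way of illustration only. The claims leave the first and second machine learning algorithms open, so the class `Gaussian1D` below is merely a hypothetical stand-in: a one-dimensional Gaussian fitted to a single scalar feature per frame, whose density value plays the role of the "probability value". The function name `first_detection` is likewise hypothetical.

```python
import math
from statistics import fmean, pvariance
from typing import List, Sequence


class Gaussian1D:
    """Toy stand-in for a trained model over one scalar feature per frame."""

    def __init__(self, samples: Sequence[float]) -> None:
        self.mean = fmean(samples)
        # Small floor keeps the variance strictly positive.
        self.var = pvariance(samples, self.mean) + 1e-9

    def pdf(self, x: float) -> float:
        """Gaussian density at x, used here in place of a probability value."""
        norm = math.sqrt(2 * math.pi * self.var)
        return math.exp(-((x - self.mean) ** 2) / (2 * self.var)) / norm


def first_detection(
    features: Sequence[float], voice_model: Gaussian1D, noise_model: Gaussian1D
) -> List[bool]:
    """True (valid voice frame) where the voice model scores the frame
    higher than the noise model; False (noise frame) otherwise."""
    return [voice_model.pdf(x) > noise_model.pdf(x) for x in features]


if __name__ == "__main__":
    voice = Gaussian1D([4.0, 5.0, 6.0])  # hypothetical "normal speech" features
    noise = Gaussian1D([0.0, 1.0, 2.0])  # hypothetical "environmental noise" features
    print(first_detection([0.5, 5.2, 3.1, 2.9], voice, noise))
```

Because the two toy models have equal variance, the decision boundary lies midway between their means, at a feature value of 3.0.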
4. The voice endpoint detection method according to claim 1, wherein said sequentially performing harmonic energy detection on each voice frame of said input voice to obtain a second detection result corresponding to each voice frame comprises:

Sequentially extracting the short-term speech energy in the time domain of the i-th speech frame of the input speech;

Determine whether the short-term speech energy corresponding to the i-th speech frame is greater than the preset short-term speech energy;

If yes, it is determined that the i-th speech frame is a valid speech frame, otherwise it is a noise frame.
5. The voice endpoint detection method according to claim 4, wherein the calculation formula of the short-term voice energy is as follows:

Where M(i) represents the short-term speech energy of the i-th speech frame; x(n) represents the time-domain signal of the speech waveform; w(n) represents the window function; y_i(n) represents the i-th frame of the speech signal obtained after framing with w(n); b represents the frame shift length; n = 1, 2, ..., L; i = 1, 2, ..., f_n; L represents the frame length; and f_n represents the total number of frames after framing.
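The formula itself is not shown in this text. Purely as an assumption consistent with the symbol definitions above, a common textbook form is y_i(n) = w(n) * x((i - 1) * b + n) for the framed signal and M(i) = sum over n = 1..L of y_i(n)^2 for the energy; a sum of absolute values |y_i(n)| is another common variant, and the text does not settle which one is intended. The sketch below implements the sum-of-squares assumption together with the threshold comparison of the preceding claim. The function names and the default Hamming window are hypothetical choices.

```python
import math
from typing import List, Optional, Sequence


def hamming(L: int) -> List[float]:
    """Hamming window of length L (an assumed choice for w(n))."""
    if L == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]


def short_term_energy(
    x: Sequence[float], L: int, b: int, window: Optional[Sequence[float]] = None
) -> List[float]:
    """Assumed form: M(i) = sum_n (w(n) * x((i-1)*b + n))**2 for each frame."""
    w = list(window) if window is not None else hamming(L)
    if len(w) != L:
        raise ValueError("window length must equal the frame length L")
    # Total number of complete frames f_n for frame length L and shift b.
    f_n = 0 if len(x) < L else (len(x) - L) // b + 1
    energies = []
    for i in range(f_n):
        start = i * b  # zero-based equivalent of (i - 1) * b with one-based i
        energies.append(sum((w[n] * x[start + n]) ** 2 for n in range(L)))
    return energies


def second_detection(energies: Sequence[float], threshold: float) -> List[bool]:
    """True (valid voice frame) where the energy exceeds the preset value."""
    return [e > threshold for e in energies]


if __name__ == "__main__":
    x = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
    m = short_term_energy(x, L=4, b=4, window=[1, 1, 1, 1])
    print(m, second_detection(m, threshold=1.0))
```

In the usage example a rectangular window is passed explicitly so that the energies are easy to check by hand.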
6. The voice endpoint detection method according to claim 3, wherein said determining the frame category corresponding to each voice frame based on the first detection result and the second detection result comprises:

If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a valid voice frame;

If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a noise frame, determining that the frame category corresponding to the voice frame is a noise frame;

If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a noise frame;

If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a noise frame, it is determined that the frame category corresponding to the voice frame is a noise frame.
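The four cases above reduce to a logical AND of the two detection results: a frame is categorized as a valid voice frame only when both results say so. A minimal sketch, with the hypothetical function name `fuse` and the string labels "valid" and "noise" chosen only for illustration:

```python
from typing import List, Sequence


def fuse(first: Sequence[bool], second: Sequence[bool]) -> List[str]:
    """Combine per-frame results (True = valid voice frame) into categories."""
    if len(first) != len(second):
        raise ValueError("both detection results must cover the same frames")
    return ["valid" if a and b else "noise" for a, b in zip(first, second)]


if __name__ == "__main__":
    # One frame for each of the four cases, in the order listed above.
    print(fuse([True, True, False, False], [True, False, True, False]))
```

Requiring agreement of both detectors trades some sensitivity for fewer noise frames being mistaken for valid voice frames.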
7. The voice endpoint detection method according to claim 1, wherein the determining the voice start endpoint and the voice end endpoint of the input voice based on the frame category corresponding to each voice frame comprises:

In the preset detection window, determine whether the frame category corresponding to each voice frame in the detection window meets the preset voice endpoint determination condition;

If it is satisfied, it is determined that the voice start endpoint or the voice end endpoint of the input voice is located in the current detection window;

Wherein, the voice endpoint determination condition includes: if the ratio of valid voice frames in the current detection window exceeds the preset first ratio, determining that the voice start endpoint of the input voice exists in the current detection window; if the ratio of valid voice frames in the current detection window is lower than the preset second ratio, determining that the voice end endpoint of the input voice exists in the current detection window.
8. A voice endpoint detection device, comprising:

The acquisition module is used to acquire the input voice to be detected and the preset voice frame detection model;

The framing module is used to perform framing processing on the input voice to obtain multiple voice frames with time sequence;

The first detection module is configured to sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

The second detection module is configured to perform harmonic energy detection on each voice frame of the input voice in sequence to obtain a second detection result corresponding to each voice frame;

A frame type determining module, configured to determine a frame type corresponding to each speech frame based on the first detection result and the second detection result, the frame type including valid speech frames and noise frames;

The voice endpoint determination module is used to determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
9. The voice endpoint detection device according to claim 8, wherein the voice frame detection model comprises: a voice model and a noise model, and the voice endpoint detection device further comprises:

The voice model training module is used to use normal voice data as training samples and use the preset first machine learning algorithm for training to construct a voice model for use in detecting valid voice frames;

The noise model training module is used to take real environmental noise as a training sample and use a preset second machine learning algorithm for training to construct a noise model for use in detecting noise frames.
10. The voice endpoint detection device according to claim 9, wherein the first detection module is specifically configured to:

Sequentially input each voice frame of the input voice into the voice model for detection, and output the first probability value of each voice frame as a valid voice frame;

Sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value of each voice frame as a noise frame;

Based on the first probability value and the second probability value, output the first detection result corresponding to each voice frame, wherein if the first probability value of the voice frame being a valid voice frame is greater than the second probability value of its being a noise frame, the voice frame is determined to be a valid voice frame; otherwise it is determined to be a noise frame.
11. The voice endpoint detection device according to claim 8, wherein the second detection module is specifically configured to:

Sequentially extracting the short-term speech energy in the time domain of the i-th speech frame of the input speech;

Determine whether the short-term speech energy corresponding to the i-th speech frame is greater than the preset short-term speech energy;

If yes, it is determined that the i-th speech frame is a valid speech frame, otherwise it is a noise frame.
12. The voice endpoint detection device according to claim 11, wherein the calculation formula of the short-term voice energy is as follows:

Where M(i) represents the short-term speech energy of the i-th speech frame; x(n) represents the time-domain signal of the speech waveform; w(n) represents the window function; y_i(n) represents the i-th frame of the speech signal obtained after framing with w(n); b represents the frame shift length; n = 1, 2, ..., L; i = 1, 2, ..., f_n; L represents the frame length; and f_n represents the total number of frames after framing.
13. The voice endpoint detection device according to claim 10, wherein the frame type determining module is specifically configured to:

If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a valid voice frame;

If the first detection result is that the voice frame is a valid voice frame, and the second detection result is that the voice frame is a noise frame, determining that the frame category corresponding to the voice frame is a noise frame;

If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a noise frame;

If the first detection result is that the voice frame is a noise frame, and the second detection result is that the voice frame is a noise frame, it is determined that the frame category corresponding to the voice frame is a noise frame.
14. The voice endpoint detection device according to claim 8, wherein the voice endpoint determination module is specifically configured to:

In a preset detection window, determine whether the frame type corresponding to each voice frame in the detection window meets a preset voice endpoint determination condition;

If it is satisfied, it is determined that the voice start endpoint or the voice end endpoint of the input voice is located in the current detection window;

Wherein, the voice endpoint determination condition includes: if the ratio of valid voice frames in the current detection window exceeds the preset first ratio, determining that the voice start endpoint of the input voice exists in the current detection window; if the ratio of valid voice frames in the current detection window is lower than the preset second ratio, determining that the voice end endpoint of the input voice exists in the current detection window.
15. A voice endpoint detection device, comprising a memory, a processor, and a voice endpoint detection program stored on the memory and executable on the processor, wherein the following steps of the voice endpoint detection method are implemented when the voice endpoint detection program is executed by the processor:

Obtain the input voice to be detected and the preset voice frame detection model;

Framing the input voice to obtain multiple voice frames with time sequence;

Sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

Sequentially perform harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;

Determine a frame category corresponding to each speech frame based on the first detection result and the second detection result, where the frame category includes valid speech frames and noise frames;

Based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.
16. The voice endpoint detection device according to claim 15, wherein the voice frame detection model includes: a voice model and a noise model; when the voice endpoint detection program is executed by the processor, the following steps of the voice endpoint detection method are also implemented:

Take normal speech data as training samples, and use the preset first machine learning algorithm for training to construct a speech model for detecting valid speech frames;

Using real environmental noise as a training sample, a preset second machine learning algorithm is used for training, and a noise model is constructed for use in detecting noise frames.
17. The voice endpoint detection device according to claim 16, wherein, when the voice endpoint detection program is executed by the processor to implement the step of sequentially inputting each voice frame of the input voice into the voice frame detection model for detection and outputting the first detection result corresponding to each voice frame, the following steps are further implemented:

Sequentially input each voice frame of the input voice into the voice model for detection, and output the first probability value of each voice frame as a valid voice frame;

Sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value of each voice frame as a noise frame;

Based on the first probability value and the second probability value, output the first detection result corresponding to each voice frame, wherein if the first probability value of the voice frame being a valid voice frame is greater than the second probability value of its being a noise frame, the voice frame is determined to be a valid voice frame; otherwise it is determined to be a noise frame.
18. A computer-readable storage medium having a voice endpoint detection program stored thereon, wherein when the voice endpoint detection program is executed by a processor, the following steps of the voice endpoint detection method are implemented:

Obtain the input voice to be detected and the preset voice frame detection model;

Framing the input voice to obtain multiple voice frames with time sequence;

Sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;

Sequentially perform harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;

Determine a frame category corresponding to each speech frame based on the first detection result and the second detection result, where the frame category includes valid speech frames and noise frames;

Based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.
19. The computer-readable storage medium of claim 18, wherein the voice frame detection model includes: a voice model and a noise model; when the voice endpoint detection program is executed by the processor, the following steps of the voice endpoint detection method are also implemented:

Take normal speech data as training samples, and use the preset first machine learning algorithm for training to construct a speech model for detecting valid speech frames;

Using real environmental noise as a training sample, a preset second machine learning algorithm is used for training, and a noise model is constructed for use in detecting noise frames.
20. The computer-readable storage medium of claim 19, wherein, when the voice endpoint detection program is executed by the processor to implement the step of sequentially inputting each voice frame of the input voice into the voice frame detection model for detection and outputting the first detection result corresponding to each voice frame, the following steps are further implemented:

Sequentially input each voice frame of the input voice into the voice model for detection, and output the first probability value of each voice frame as a valid voice frame;

Sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value of each voice frame as a noise frame;

Based on the first probability value and the second probability value, output the first detection result corresponding to each voice frame, wherein if the first probability value of the voice frame being a valid voice frame is greater than the second probability value of its being a noise frame, the voice frame is determined to be a valid voice frame; otherwise it is determined to be a noise frame.