CN107331386B - Audio signal endpoint detection method and device, processing system and computer equipment

Info

Publication number: CN107331386B
Application number: CN201710493677.7A
Authority: CN
Inventors: 余世经; 朱频频
Original assignee: Shanghai Xiaoi Robot Technology Co Ltd
Current assignee: Shanghai Xiaoi Robot Technology Co Ltd
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2020-07-21
Anticipated expiration: 2037-06-26
Also published as: CN107331386A

Abstract

The invention discloses an audio signal endpoint detection method, an audio signal endpoint detection device, an audio signal processing system and computer equipment. The method comprises the following steps: determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum; and determining the sum of the endpoint detection mark values of the detection result identification array based on the endpoint detection mark values recorded in the detection result identification array with the set length, and determining the endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values. By using the method, the high accuracy of the endpoint detection can be still maintained under the condition of low signal-to-noise ratio, the accuracy of the endpoint detection in the technical scheme is not influenced by the change of the noise environment, and the robustness of the voice endpoint detection along with the change of the noise environment is better enhanced.

Description

Audio signal endpoint detection method and device, processing system and computer equipment

Technical Field

The present invention relates to the field of audio signal processing, and in particular to an audio signal endpoint detection method, an audio signal endpoint detection device, an audio signal processing system, and a computer device.

Background

Voice Activity Detection (VAD) is an important stage in audio signal processing tasks such as audio coding, audio recognition and audio enhancement. It is usually used as a preprocessing module that divides an input audio signal into speech segments and non-speech segments, which are then processed differently to achieve the intended effect of the audio signal processing.

In general, VAD performance is sensitive to environmental noise: the lower the signal-to-noise ratio, the less accurate the voice endpoint detection. For VAD algorithms commonly used in engineering, such as the "double-threshold" algorithm based on short-time energy and zero-crossing rate, performance drops markedly as the signal-to-noise ratio falls, and the algorithm largely loses its practical value at low signal-to-noise ratios (below 5 dB). Such algorithms also lack robustness to changes in the acoustic scene, such as noise intensity and noise type, so their parameters often have to be retrained and adjusted whenever the environment changes.

ITU-T G.729 Annex B also provides a VAD algorithm. It exploits the short-time stationarity of audio signals by dividing the audio signal to be detected into frames (every 10 ms to 30 ms of data forms one frame, within which the signal can be regarded as stationary) and returns, for each frame, a detection result stating whether it is a speech frame or a non-speech frame. The algorithm has two steps. Step one extracts characteristic parameters of the frame, including the full-band energy E_f and the low-band energy E_l, and compares these characteristic parameters with their respective thresholds to make a preliminary VAD decision I_vd. Step two smooths the preliminary decision to obtain a smoothed decision S_vd. The smoothing makes the switching between speech frames and non-speech frames more natural and reduces, to some extent, the loss of useful speech information. The drawback of the G.729 Annex B VAD algorithm is that it is not robust enough to changes in the noise environment, and its detection accuracy drops markedly at low signal-to-noise ratios.

Disclosure of Invention

The embodiment of the invention provides an audio signal endpoint detection method, an audio signal endpoint detection device, a processing system and computer equipment, which can better enhance the robustness of voice endpoint detection along with the change of a noise environment, thereby improving the accuracy of voice signal detection in an audio signal.

In a first aspect, an embodiment of the present invention provides an endpoint detection method for an audio signal, including:

determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum;

determining the sum of the endpoint detection mark values of the detection result identification array based on all the endpoint detection mark values recorded in the detection result identification array with set length, wherein the endpoint detection mark value is a voice endpoint mark value or a non-voice endpoint mark value;

and determining an endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.

In a second aspect, an embodiment of the present invention provides an apparatus for detecting an endpoint of an audio signal, including:

the judgment parameter determining module is used for determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum;

the endpoint mark determining module is used for determining the sum of the endpoint detection mark values of the detection result identification array based on all the endpoint detection mark values recorded in the detection result identification array with the set length, wherein the endpoint detection mark value is a voice endpoint mark value or a non-voice endpoint mark value;

and the detection result determining module is used for determining the endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.

In a third aspect, an embodiment of the present invention further provides an audio signal processing system, where the audio signal processing system includes the apparatus for detecting an endpoint of an audio signal provided in the embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention also provides a computer device, where the computer device includes:

the audio signal processing system provided by the embodiment of the invention;

one or more processors;

a storage device for storing one or more programs in the audio signal processing system,

the one or more programs are executed by the one or more processors, so that the one or more processors implement the method for detecting an endpoint of an audio signal provided by the embodiment of the present invention.

In a fifth aspect, an embodiment of the present invention provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the endpoint detection method of an audio signal provided by the embodiment of the present invention.

The embodiments of the invention provide an audio signal endpoint detection method and device, an audio signal processing system and computer equipment. The method first determines an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of that signal and a predetermined noise power spectrum; it then determines the sum of the endpoint detection mark values recorded in a detection result identification array of set length; finally, it determines the endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values. Because the judgment parameter is determined from the power spectrum of the audio signal and the noise power spectrum alone, the scheme does not depend on syllable characteristics of the audio signal. Compared with existing schemes, it therefore maintains high endpoint detection accuracy at low signal-to-noise ratios, its accuracy is not affected by changes in the noise environment, and the robustness of voice endpoint detection to such changes is improved. The scheme is also simple to implement, easy to integrate into various embedded audio processing systems, and widely applicable in practice.

Drawings

Fig. 1 is a flowchart of an audio signal endpoint detection method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of an audio signal endpoint detection method according to a second embodiment of the present invention;

Fig. 3 is a flowchart of an audio signal endpoint detection method according to a third embodiment of the present invention;

Fig. 4a is a flowchart of a preferred implementation of the audio signal endpoint detection method according to a fourth embodiment of the present invention;

Fig. 4b is a diagram of a segment of a clean audio signal file according to an embodiment of the present invention;

Fig. 4c is a schematic diagram of the voice signal detected in a noisy audio signal file by the VAD algorithm of G.729 Annex B in a 5 dB white noise environment;

Fig. 4d is a schematic diagram of the voice signal detected in a noisy audio signal file by the technical solution of the present invention in a 5 dB white noise environment;

Fig. 4e is a schematic diagram of the voice signal detected in a noisy audio signal file by the VAD algorithm of G.729 Annex B in a 10 dB speaker environment;

Fig. 4f is a schematic diagram of the voice signal detected in a noisy audio signal file by the technical solution of the present invention in a 10 dB speaker environment;

Fig. 4g is a schematic diagram of the voice signal detected in a noisy audio signal file by the VAD algorithm of G.729 Annex B in a 15 dB in-vehicle environment;

Fig. 4h is a schematic diagram of the voice signal detected in a noisy audio signal file by the technical solution of the present invention in a 15 dB in-vehicle environment;

Fig. 5 is a block diagram of an apparatus for detecting an endpoint of an audio signal according to a fifth embodiment of the present invention;

Fig. 6 is a structural diagram of a computer device according to a sixth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of an end point detection method for an audio signal according to an embodiment of the present invention, where the method is applied to a device with an audio signal processing function for performing end point detection on the audio signal before processing the audio signal, and the method may be executed by an end point detection apparatus for an audio signal, where the apparatus may be implemented by software and/or hardware and is generally integrated in an audio signal processing system with an audio processing function, and the audio signal processing system may be disposed on a computer device.

It should be noted that the audio signal processing system can specifically perform signal processing operations such as audio coding, audio recognition, audio enhancement, and the like on an audio signal; the computer equipment can be an electronic product with a communication function, such as a mobile phone, a tablet computer, a notebook computer and the like, and can also be an electronic product with an audio interaction function, such as an intelligent voice assistant, an intelligent home, a voice navigator and the like.

As shown in fig. 1, an embodiment of the present invention provides an endpoint detection method for an audio signal, including the following operations:

S101, determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum.

In this embodiment, the audio signal to be detected can be understood as the currently acquired audio signal on which endpoint detection is to be performed. This embodiment preferably handles the audio signal to be detected in units of frames and performs the endpoint detection operation on each frame to determine whether that frame is a speech signal or a non-speech signal. Because speech has short-term stationarity, a speech signal can generally be regarded as stationary over a period of 10 ms to 30 ms; therefore, to ensure the accuracy of endpoint detection, the duration of a unit frame in this embodiment is preferably in the range of 10 ms to 30 ms.
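The framing described above can be sketched as follows. This is only an illustrative sketch: the function name `split_into_frames`, the non-overlapping framing, and the 8 kHz sample rate in the usage lines are assumptions and are not specified by this embodiment.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    if not 10 <= frame_ms <= 30:
        raise ValueError("frame duration should stay within 10 ms to 30 ms")
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len           # a trailing partial frame is dropped
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


# Usage: 1000 samples at an assumed 8 kHz rate, 20 ms frames (160 samples each).
frames = split_into_frames(list(range(1000)), sample_rate_hz=8000, frame_ms=20)
print(len(frames), len(frames[0]))  # 6 160
```

Dropping the trailing partial frame is a simplification chosen for the sketch; a real system might instead pad it or carry it over to the next buffer.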

In this embodiment, the power spectrum of the audio signal to be detected may be determined based on a frequency amplitude value of the audio signal to be detected in a frequency domain; if the current audio signal to be detected is the first frame audio signal to be detected, the currently adopted noise power spectrum is a pre-initialized noise power spectrum; otherwise, the currently adopted noise power spectrum can be specifically updated and determined when the endpoint detection is performed on the previous frame of audio signal to be detected.

In addition, the endpoint detection decision parameter in this embodiment may be specifically regarded as a decision parameter required for performing endpoint detection on the audio signal to be detected. It can be understood that, in this step, the endpoint detection decision parameter required for endpoint detection is specifically determined by the power spectrum of the audio signal to be detected and the noise power spectrum acquired in advance.

Further, the audio signal to be detected is an audio signal input in real time or a pre-recorded audio signal; correspondingly, when the audio signal to be detected is a pre-recorded audio signal, performing initialization calculation of a noise power spectrum based on the pre-recorded previous M frames of audio signals, wherein M is a set constant value.

In this embodiment, the audio signal to be detected may be an audio signal input to an audio input device in real time, or an audio signal pre-recorded and stored in a computer device.

Specifically, when the audio signal to be detected is input in real time, the audio signal to be detected is generally picked up directly through the audio input device in the computer device, and at this time, the audio signal to be detected may be buffered in the set buffer area, so that the audio signal to be detected is subsequently obtained from the set buffer area in units of frames and endpoint detection is performed.

When the audio signal to be detected is pre-recorded, the resulting audio file is usually stored in advance under a preset storage path of the computer device. The audio signal can then be read directly from that path in units of frames, and the initial value of the noise power spectrum can be determined from the first M frames of the pre-recorded audio file. In this embodiment, M is preferably an integer in [5, 10].

It should be noted that, in this embodiment, the initial value of the noise power spectrum may be calculated according to the following formula:

$$\lambda_{NE0} = \frac{1}{M} \sum_{n=1}^{M} \left| \mathrm{FFT}(x(n)) \right|^2$$

where λ_NE0 represents the initial value of the noise power spectrum set before endpoint detection; n represents the frame number of the required signal, with 1 ≤ n ≤ M; M represents the number of signal frames required for calculating the initial value of the noise power spectrum; and |FFT(x(n))|² represents the power spectrum of the n-th frame of the required signal.

It is understood that the M frames required for initializing the noise power spectrum may be the first M frames in the set buffer or the first M frames of the pre-recorded audio file; which of the two is used depends on how the audio signal to be detected is obtained.
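As a minimal sketch of this initialization, the code below assumes that the initial noise power spectrum is the bin-by-bin average of the power spectra of the first M frames, which is what the variable definitions above suggest. The function names are hypothetical, and a naive DFT from the standard library stands in for an FFT routine.

```python
import cmath


def power_spectrum(frame):
    """Return |DFT(frame)|^2 for each frequency bin k = 0..K-1 (naive O(K^2) DFT)."""
    K = len(frame)
    spectrum = []
    for k in range(K):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / K) for t in range(K))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def initial_noise_spectrum(frames, M):
    """Average the power spectra of the first M frames, bin by bin (assumed formula)."""
    if not 0 < M <= len(frames):
        raise ValueError("M must be between 1 and the number of available frames")
    spectra = [power_spectrum(f) for f in frames[:M]]
    K = len(spectra[0])
    return [sum(s[k] for s in spectra) / M for k in range(K)]


# Usage with two tiny 4-sample frames (real frames would be much longer).
noise0 = initial_noise_spectrum([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], M=2)
print([round(v, 6) for v in noise0])  # [8.5, 0.5, 0.5, 0.5]
```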

S102, determining the sum of the end point detection mark values of the detection result identification array based on the end point detection mark values recorded in the detection result identification array with the set length.

In this embodiment, the detection result identification array is used to store the endpoint detection flag values of previously detected audio signals, and the maximum number of stored flag values is limited by the set length of the array. Generally, the set length may be chosen freely based on experience; in this embodiment it is preferably equal to the number of frames used to determine the initial value of the noise power spectrum.

In this embodiment, the endpoint detection flag value may specifically be understood as a flag value corresponding to an endpoint detection result after performing endpoint detection on an audio signal, where the endpoint detection flag value is a voice endpoint flag value or a non-voice endpoint flag value. It is to be understood that when the end point detection result is determined to be a voice signal, the end point flag value may be assigned as a voice end point flag value; meanwhile, when the endpoint detection result is determined to be a non-voice signal, the endpoint flag value may be assigned as a non-voice endpoint flag value. In this step, the endpoint detection flag values in the detection result flag array may be obtained, and the sum of the endpoint detection flag values corresponding to the detection result flag array may be determined accordingly.

It should be noted that, when the audio signal to be detected is the first frame to be detected, the element values (endpoint detection flag values) in the detection result identification array may be initialized; for example, each element may preferably be initially set to the non-speech endpoint flag value. If the current audio signal to be detected is not the first frame, the endpoint detection flag values recorded in the array are those obtained by updating it according to the detection result of the previous frame.

When the audio signal to be detected is input in real time, this embodiment can directly treat the first input frame as the first frame to be detected. When the audio signal to be detected is pre-recorded, this embodiment may regard the first M frames used for calculating the initial value of the noise power spectrum as noise, skip endpoint detection on those M frames, and treat frame M+1 as the first frame of the audio signal to be detected.

S103, determining an endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.
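One way to picture the detection result identification array is as a fixed-length buffer of flags. The sketch below makes several assumptions that the text does not state: the speech flag is 1 and the non-speech flag is 0, the oldest flag is discarded when a new one is recorded (first in, first out), and all names are hypothetical.

```python
from collections import deque

SPEECH_FLAG = 1      # assumed value of the voice endpoint flag
NON_SPEECH_FLAG = 0  # assumed value of the non-voice endpoint flag


def make_flag_array(length):
    """Create the fixed-length array with every element set to the non-speech flag."""
    return deque([NON_SPEECH_FLAG] * length, maxlen=length)


def record_result(flags, is_speech):
    """Append the newest flag; the deque drops the oldest one automatically."""
    flags.append(SPEECH_FLAG if is_speech else NON_SPEECH_FLAG)


def flag_sum(flags):
    """Sum of all endpoint detection flag values currently stored."""
    return sum(flags)


# Usage: a length-5 array after two speech frames and one non-speech frame.
flags = make_flag_array(5)
record_result(flags, True)
record_result(flags, True)
record_result(flags, False)
print(list(flags), flag_sum(flags))  # [0, 0, 1, 1, 0] 2
```

With 0/1 flags the sum simply counts how many of the most recent frames were judged to be speech, which is why it can serve as a short-term history for the final decision.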

In this embodiment, the endpoint detection result may specifically be a voice signal or a non-voice signal. Specifically, the step may determine whether the frame of audio signal to be detected is a speech signal or a non-speech signal according to a comparison result between the endpoint detection decision parameter and the set corresponding threshold value and a comparison result between the sum of the endpoint detection flag values and the set corresponding threshold value.

With the audio signal endpoint detection method provided by this embodiment of the invention, the judgment parameter for endpoint detection is determined from the power spectrum of the audio signal and the noise power spectrum alone, so the scheme does not depend on syllable characteristics of the audio signal. Compared with existing schemes, it therefore maintains high endpoint detection accuracy at low signal-to-noise ratios, its accuracy is not affected by changes in the noise environment, and the robustness of voice endpoint detection to such changes is improved. The scheme is also simple to implement, easy to integrate into various embedded audio processing systems, and widely applicable in practice.

Example two

Fig. 2 is a schematic flowchart of an audio signal endpoint detection method according to a second embodiment of the present invention, which is optimized on the basis of the above embodiment. In this embodiment, determining the endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum is further refined as: determining a posterior signal-to-noise ratio and a prior signal-to-noise ratio of the audio signal to be detected based on the power spectrum of the audio signal to be detected and the predetermined noise power spectrum; and determining the endpoint detection judgment parameter of the audio signal to be detected according to a set judgment parameter formula, the posterior signal-to-noise ratio and the prior signal-to-noise ratio.

Further, the embodiment further optimizes and adds: and updating and storing a predetermined noise power spectrum according to the detection result of the audio signal to be detected so as to be used for determining an endpoint detection judgment parameter of the audio signal to be detected of the next frame.

As shown in fig. 2, a method for detecting an endpoint of an audio signal according to a second embodiment of the present invention specifically includes the following operations:

It should be noted that S201 and S202 in this embodiment embody the process of determining the endpoint detection judgment parameter.

S201, determining a posterior signal-to-noise ratio and a prior signal-to-noise ratio of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum.

This step first determines the posterior signal-to-noise ratio and the prior signal-to-noise ratio used for endpoint detection of the audio signal to be detected. Specifically, it obtains the power spectrum of the audio signal to be detected and the predetermined noise power spectrum, and then derives the required posterior and prior signal-to-noise ratios from them.

Further, the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the audio signal to be detected are determined according to a posterior signal-to-noise ratio formula and a prior signal-to-noise ratio formula.

The posterior signal-to-noise ratio formula is expressed as:

$$\gamma(n,k) = \frac{\left| \mathrm{FFT}(x(n)) \right|^2}{\lambda_{NE}(n-1,k)}$$

The prior signal-to-noise ratio formula is expressed as:

$$\xi(n,k) = \alpha\,\xi(n-1,k) + (1-\alpha)\max\left(\gamma(n,k)-1,\ 0\right)$$

where k denotes the frequency index in the frequency domain and is any integer from 0 to K-1; K denotes the length of the frequency domain; n denotes the frame number of the current audio signal to be detected; x(n) denotes the n-th frame of the audio signal to be detected in the time domain; FFT(x(n)) denotes the n-th frame of the audio signal to be detected in the frequency domain; γ(n, k) denotes the posterior signal-to-noise ratio of the n-th frame; |FFT(x(n))|² denotes the power spectrum of the n-th frame; λ_NE(n-1, k) denotes the noise power spectrum corresponding to the noise signal in the (n-1)-th frame; ξ(n, k) denotes the prior signal-to-noise ratio of the n-th frame; α is any constant between 0 and 1; and ξ(n-1, k) denotes the prior signal-to-noise ratio of the (n-1)-th frame.

In this embodiment, if the current audio signal to be detected is the first frame (i.e., n = 1), the noise power spectrum used to calculate the posterior signal-to-noise ratio is the initial value of the noise power spectrum. Since frame n-1 is the frame preceding frame n, when the audio signal to be detected is not the first frame, the current posterior signal-to-noise ratio is calculated with the noise power spectrum that was updated according to the endpoint detection result of the previous frame.

It should be noted that when n = 1, the initial value of ξ(n-1, k) is set to 0.98, which is a preferred empirical value.
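The two signal-to-noise ratios can be sketched per frequency bin as follows. The sketch assumes that the posterior signal-to-noise ratio is the frame power spectrum divided by the noise power spectrum of the previous frame, as the variable definitions suggest. The function names, the small floor that guards against division by zero, and the value α = 0.5 in the usage lines are illustrative assumptions.

```python
def posterior_snr(frame_power, prev_noise_power, floor=1e-12):
    """gamma(n, k): frame power over the previous noise power, per bin.

    The floor is an assumption added to avoid division by zero.
    """
    return [p / max(n, floor) for p, n in zip(frame_power, prev_noise_power)]


def prior_snr(gamma, prev_xi, alpha):
    """xi(n, k) = alpha * xi(n-1, k) + (1 - alpha) * max(gamma(n, k) - 1, 0)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return [alpha * x + (1.0 - alpha) * max(g - 1.0, 0.0)
            for g, x in zip(gamma, prev_xi)]


# Usage for a first frame with two bins; xi(n-1, k) starts at 0.98 as stated above.
gamma = posterior_snr([4.0, 1.0], [2.0, 2.0])
xi = prior_snr(gamma, [0.98, 0.98], alpha=0.5)
print(gamma, [round(v, 6) for v in xi])  # [2.0, 0.5] [0.99, 0.49]
```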

S202, determining an endpoint detection judgment parameter of the audio signal to be detected according to a set judgment parameter formula, the posterior signal-to-noise ratio and the prior signal-to-noise ratio.

The step can determine the endpoint detection judgment parameters needed by the endpoint detection based on the determined posterior signal-to-noise ratio and the prior signal-to-noise ratio and the set judgment parameter formula. Further, the present embodiment may determine the endpoint detection decision parameter of the audio signal to be detected according to the following decision parameter formula, where the decision parameter formula is expressed as:

In this step, the variables appearing in the decision parameter formula have the same meanings as in S201 and are not described again here; the required endpoint detection decision parameter is determined through the decision parameter formula.
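The decision parameter formula itself is not reproduced in this text. Purely to illustrate how a single per-frame statistic can be formed from the posterior and prior signal-to-noise ratios, the sketch below swaps in the log-likelihood-ratio statistic commonly used in statistical-model-based VAD, averaged over the K frequency bins. This is an assumed stand-in and not necessarily the formula of this embodiment; the function name is hypothetical.

```python
import math


def llr_decision_parameter(gamma, xi):
    """Average over bins of gamma*xi/(1+xi) - log(1+xi) (assumed stand-in statistic)."""
    K = len(gamma)
    total = 0.0
    for g, x in zip(gamma, xi):
        total += g * x / (1.0 + x) - math.log(1.0 + x)
    return total / K


# Usage: one bin with gamma = 2 and xi = 1 gives 1 - ln(2).
value = llr_decision_parameter([2.0], [1.0])
print(round(value, 4))  # 0.3069
```

A statistic of this kind grows when the frame power clearly exceeds the noise estimate and stays near zero when the prior signal-to-noise ratio is near zero, which is the behavior a threshold comparison needs.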

S203, determining the sum of the end point detection mark values of the detection result identification array based on the end point detection mark values recorded in the detection result identification array with the set length.

Specifically, each endpoint detection flag value recorded in the detection result identifier array may be specifically determined by updating an endpoint detection result of a previous frame of audio signal to be detected, where the endpoint detection flag value may be a speech endpoint flag value or a non-speech endpoint flag value, and if there is no previous frame of audio signal to be detected, each endpoint detection flag value in the detection result identifier array may be considered to be an initial value (generally, all the endpoint detection flag values are initially set to be non-speech endpoint flag values).

S204, determining an endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.

For example, the step may compare the endpoint detection decision parameter with a plurality of set threshold values, so as to obtain a first comparison result, and may compare the sum of the endpoint detection flag values with a plurality of set threshold values, so as to finally obtain a second comparison result, and finally determine whether the audio signal to be detected is a speech signal or a non-speech signal according to the first comparison result and the second comparison result.

S205, updating and storing the predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected, for use in determining the endpoint detection judgment parameter of the next frame of the audio signal to be detected.

In this embodiment, after the endpoint detection result of the audio signal to be detected is determined, the noise power spectrum required for endpoint detection may be further updated according to the endpoint detection result and stored, and the updated noise power spectrum may be used to determine the endpoint detection decision parameter of the audio signal to be detected in the next frame.

Further, the updating and storing the predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected includes: if the audio signal to be detected is determined to be a voice signal, keeping the noise power spectrum unchanged; otherwise, updating and storing the noise power spectrum according to the following noise power spectrum updating formula, wherein the noise power spectrum updating formula is expressed as:

$$\lambda_{NE}(n,k) = \mu\,\lambda_{NE}(n-1,k) + (1-\mu)\left| \mathrm{FFT}(x(n)) \right|^2$$

where k represents the frequency index in the frequency domain; λ_NE(n, k) represents the noise power spectrum corresponding to the noise signal in the n-th frame of the audio signal to be detected; λ_NE(n-1, k) represents the noise power spectrum corresponding to the noise signal in the (n-1)-th frame; and |FFT(x(n))|² represents the power spectrum of the n-th frame of the audio signal to be detected.

In this embodiment, when the audio signal to be detected is determined to be a speech signal, it may be considered that no noise signal or only a small amount of noise signal exists in the current audio signal to be detected, and at this time, the noise power spectrum does not need to be updated, and the previously determined noise power spectrum may be used. When the audio signal to be detected is determined to be a non-voice signal, it can be considered that more noise signals exist in the current audio signal to be detected, and the noise power spectrum can be updated according to the predetermined noise power spectrum and the noise power spectrum updating formula. It should be noted that, the value range of μ in the noise power spectrum update formula is (0,1), and the value range of μ in this embodiment may be preferably (0.9,1), and in addition, the specific value may be set manually according to a historical empirical value, or a corresponding value is determined according to a specific use scenario.
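The update rule can be sketched as below. The function name is hypothetical, and the default μ = 0.95 is simply one illustrative value inside the preferred range (0.9, 1).

```python
def update_noise_spectrum(prev_noise, frame_power, is_speech, mu=0.95):
    """Return the noise power spectrum to store after detecting the current frame.

    Speech frame: keep the previous estimate unchanged.
    Non-speech frame: lambda(n, k) = mu * lambda(n-1, k) + (1 - mu) * |FFT(x(n))|^2.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    if is_speech:
        return list(prev_noise)
    return [mu * n + (1.0 - mu) * p for n, p in zip(prev_noise, frame_power)]


# Usage with two frequency bins.
kept = update_noise_spectrum([2.0, 2.0], [4.0, 0.0], is_speech=True)
updated = update_noise_spectrum([2.0, 2.0], [4.0, 0.0], is_speech=False)
print(kept, [round(v, 6) for v in updated])  # [2.0, 2.0] [2.1, 1.9]
```

A value of μ close to 1 makes the estimate change slowly, so a single misclassified frame has only a small effect on the stored noise power spectrum.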

It should be noted that the initialization of the noise power spectrum in this embodiment amounts to a pre-estimation of the noise power spectrum, so the noise power spectrum obtained by updating from that pre-estimated initial value is itself still an estimate. Accordingly, both the prior signal-to-noise ratio and the posterior signal-to-noise ratio determined from the noise power spectrum in this embodiment are estimated values, and the accuracy with which the initial value of the noise power spectrum is estimated affects the accuracy of endpoint detection for the audio signal to be detected.

The audio signal endpoint detection method provided by the second embodiment of the present invention details the process of determining the endpoint detection decision parameter and adds the operation of updating the noise power spectrum. With this method, the endpoint detection decision parameter can be determined based on the power spectrum of the audio signal and the noise power spectrum, and the endpoint detection does not depend on syllable characteristics of the audio signal, so that the accuracy of endpoint detection is not affected by changes in the noise environment and the robustness of voice endpoint detection to such changes is enhanced. Meanwhile, the technical scheme is simple to implement, is easier to integrate into various embedded audio processing systems, has a wide application range and has good practicability in practical application.

EXAMPLE III

Fig. 3 is a schematic flow chart of an audio signal endpoint detection method according to a third embodiment of the present invention. The third embodiment is optimized based on the first embodiment or the second embodiment. In this embodiment, determining the endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values is further embodied as: if the sum of the endpoint detection mark values is greater than or equal to a first set threshold value and the last endpoint detection mark value in the detection result identification array is a voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and the first set threshold value; if the sum of the endpoint detection mark values is less than or equal to a second set threshold value and the last endpoint detection mark value in the detection result identification array is a non-voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and the second set threshold value; if the sum of the endpoint detection mark values is greater than or equal to the first set threshold value and the last endpoint detection mark value in the detection result identification array is a non-voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and a third set threshold value; if the sum of the endpoint detection mark values is less than or equal to the second set threshold value and the last endpoint detection mark value in the detection result identification array is a voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and the third set threshold value; if the sum of the endpoint detection mark values is smaller than the first set threshold value and larger than the second set threshold value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and the third set threshold value; wherein, for the thresholds compared with the sum of the endpoint detection mark values, the first set threshold is greater than the second set threshold; and, for the thresholds compared with the endpoint detection judgment parameter, the first set threshold is smaller than the third set threshold, and the third set threshold is smaller than the second set threshold.

In addition, after the determining the end point detection result of the audio signal to be detected, the present embodiment further optimizes and increases: and updating the end point detection mark value recorded in the detection result identification array so as to be used for carrying out end point detection on the audio signal to be detected of the next frame.

As shown in fig. 3, a method for detecting an endpoint of an audio signal according to a third embodiment of the present invention specifically includes the following operations:

S301, determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum.

For example, in this step, the prior signal-to-noise ratio and the posterior signal-to-noise ratio of the audio signal to be detected may be determined according to the power spectrum of the audio signal to be detected and the predetermined noise power spectrum, and then the endpoint detection determination parameter of the audio signal to be detected may be determined according to the set determination parameter formula, the prior signal-to-noise ratio and the posterior signal-to-noise ratio.

S302, based on the detection result identification array with the set length, the sum of the end point detection mark values of the detection result identification array is determined.

For example, the present embodiment may determine the setting length of the detection result identification array as a fixed value, which may be equal to the number of frames of the signal required for determining the initial value of the noise power spectrum, so that the detection result identification array may record end point detection flag values of the set length, where the end point detection flag values are voice end point flag values or non-voice end point flag values. In this step, the sum of the final end point detection mark values can be determined according to the obtained end point detection mark values.

In addition, when the endpoint detection of the audio signal is initially performed, an initial value may be set in advance for each endpoint detection flag value recorded in the detection result identification array, and each endpoint detection flag value may be updated subsequently according to the endpoint detection result of each iteration.

It should be noted that, in this embodiment, the determination process of the endpoint detection result is specifically described in S303 to S309.

S303, judging whether the sum of the endpoint detection mark values is greater than or equal to a first set threshold value, if so, executing S304; if not, S305 is executed.

This step first determines the comparison result between the sum of the endpoint detection flag values determined in S302 and the first set threshold, and if the sum of the endpoint detection flag values is greater than or equal to the first set threshold, the operation of S304 may be performed, otherwise, the operation of S305 may be performed.

S304, judging whether the last endpoint detection mark value in the detection result identification array is a voice endpoint mark value, if so, executing S307; if not, go to S309.

In this step, after it is determined based on the determination of S303 that the sum of the endpoint detection flag values is greater than or equal to the first set threshold, the last endpoint detection flag value in the detection result identifier array is further determined, and when the last endpoint detection flag value is the voice endpoint flag value, the operation of S307 is executed, otherwise, the operation of S309 is executed if the last endpoint detection flag value is equal to the non-voice endpoint flag value.

S305, judging whether the sum of the endpoint detection mark values is less than or equal to a second set threshold value, if so, executing S306; if not, go to S309.

In this step, after it is determined based on the determination in S303 that the sum of the end point detection flag values is smaller than the first set threshold, the magnitude relationship between the sum of the end point detection flag values and the second set threshold is further determined, and when the sum of the end point detection flag values is smaller than or equal to the second set threshold, the operation in S306 is performed, otherwise, the operation in S309 is performed when the sum of the end point detection flag values is larger than the second set threshold and smaller than the first set threshold.

It should be noted that the second set threshold in this step is smaller than the first set threshold in S303, and the specific values of the second set threshold and the first set threshold may be set manually according to a historical experience value, or corresponding values are determined according to a specific use scenario. For example, the second set threshold and the first set threshold may be selected in a certain relationship with the set length of the detection result identifier array and the set value of the endpoint detection flag value, for example, when the set length of the detection result identifier array is 6 and the endpoint detection flag value is set to only 0 or 1, the first set threshold may be determined to be 4, and the second set threshold may be determined to be 2, it should be noted that the above example is only one way of selecting specific values, and the specific values are not limited to be selected in other ways.

S306, judging whether the last endpoint detection mark value in the detection result identification array is a voice endpoint mark value, if not, executing S308; if yes, go to S309.

In this step, after it is determined in S305 that the sum of the endpoint detection flag values is less than or equal to the second set threshold, the last endpoint detection flag value in the detection result identification array is further examined. When the last endpoint detection flag value is not the voice endpoint flag value (i.e., is the non-voice endpoint flag value), the operation of S308 is executed; otherwise, the last endpoint detection flag value is the voice endpoint flag value, and the operation of S309 is executed.

S307, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and a first set threshold value.

In this embodiment, the operation of this step may be performed when the determination states of S303 and S304 are both affirmative, that is, the sum of the endpoint detection flag values is greater than or equal to the first set threshold value and the last endpoint detection flag value is the voice endpoint flag value, and at this time, the endpoint detection result of the audio signal to be detected may be finally determined according to the comparison result of the endpoint detection determination parameter and the first set threshold value.

Further, the determining an endpoint detection result of the audio signal to be detected according to the comparison result between the endpoint detection determination parameter and the first set threshold value includes: when the endpoint detection judgment parameter is greater than or equal to the first set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value; otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.

In this embodiment, the endpoint detection flag value specifically represents an endpoint detection result of the audio signal to be detected, for example, a speech signal corresponds to the speech endpoint flag value, and a non-speech signal corresponds to the non-speech endpoint flag value. It is understood that the specific values of the voice endpoint flag value and the non-voice endpoint flag value in this embodiment may be constant values set at will, and this embodiment preferably sets the voice endpoint flag value to 1, and simultaneously preferably sets the non-voice endpoint flag value to 0.

S308, determining an endpoint detection result of the audio signal to be detected according to the comparison result of the endpoint detection judgment parameter and a second set threshold value.

In this embodiment, the operation of this step may be performed when the determination status of S305 is positive, and the determination status of S306 is negative, that is, the sum of the endpoint detection flag values is less than or equal to the second set threshold value, and the last endpoint detection flag value is a non-voice endpoint flag value, at this time, the endpoint detection result of the audio signal to be detected may be finally determined according to the comparison result of the endpoint detection determination parameter and the second set threshold value.

Further, the determining an endpoint detection result of the audio signal to be detected according to the comparison result between the endpoint detection determination parameter and the second set threshold value includes: when the endpoint detection judgment parameter is greater than or equal to the second set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value; otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.

S309, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and a third set threshold value.

In the present embodiment, the operation of this step may be performed when the determination state of S303 is affirmative, and the determination state of S304 is negative, that is, the sum of the endpoint detection flag values is greater than or equal to the first set threshold value, and the last endpoint detection flag value is a non-voice endpoint flag value; meanwhile, the operation of this step may also be performed when the determination state of S303 is negative, and the determination state of S305 is negative, that is, the sum of the end point detection flag values is greater than the second set threshold value and less than the first set threshold value; further, the operation of this step may be performed when the determination state of S305 is affirmative, and the determination state of S306 is also affirmative, that is, the sum of the end point detection flag values is less than or equal to the second set threshold value, and the last end point detection flag value is a voice end point flag value. When any one of the above situations occurs, the endpoint detection result of the audio signal to be detected can be finally determined according to the comparison result of the endpoint detection judgment parameter and the third set threshold value.

Further, the determining an endpoint detection result of the audio signal to be detected according to the comparison result between the endpoint detection determination parameter and a third set threshold value includes: when the endpoint detection judgment parameter is greater than or equal to the third set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value; otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.

It should be noted that, in the present embodiment, the first set threshold, the second set threshold and the third set threshold against which the endpoint detection judgment parameter is compared in S307 to S309 may each be regarded as a constant, wherein the first set threshold is smaller than the third set threshold, and the third set threshold is smaller than the second set threshold. In this embodiment, the value range of the first set threshold is preferably (0.05, 0.1), the value range of the third set threshold is preferably (0.1, 0.3), and the value range of the second set threshold is preferably (0.5, 5.0). These values are distinct from the values (for example, 4 and 2) with which the sum of the endpoint detection mark values is compared in S303 and S305. In addition, in this embodiment, specific values of the first set threshold, the second set threshold, and the third set threshold may be set manually according to historical empirical values, or corresponding values may be determined according to a specific usage scenario.
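The branching of S303 to S309 can be summarized as a small Python sketch. All names here (`select_threshold`, `detect`, `sum_high`, `sum_low`, `thr_first`, `thr_second`, `thr_third`) are hypothetical, and the default values are assumptions: 4 and 2 are taken from the example given for an array of length 6 with flag values 0 and 1, while 0.08, 1.0 and 0.2 are arbitrary picks inside the preferred ranges (0.05, 0.1), (0.5, 5.0) and (0.1, 0.3).

```python
SPEECH, NON_SPEECH = 1, 0  # preferred flag values in this embodiment


def select_threshold(flags, sum_high=4, sum_low=2,
                     thr_first=0.08, thr_second=1.0, thr_third=0.2):
    """Pick the threshold the decision parameter is compared with.

    flags is the detection result identification array, oldest entry
    first, so flags[-1] is the last (most recent) flag value.
    """
    total = sum(flags)
    last = flags[-1]
    if total >= sum_high and last == SPEECH:
        return thr_first    # S303 yes, S304 yes -> S307
    if total <= sum_low and last == NON_SPEECH:
        return thr_second   # S305 yes, S306 no -> S308
    return thr_third        # every remaining case -> S309


def detect(param, flags, **thresholds):
    """Return SPEECH if the decision parameter reaches the selected threshold."""
    return SPEECH if param >= select_threshold(flags, **thresholds) else NON_SPEECH
```

Read this way, a history dominated by speech lowers the bar for the current frame to be judged speech, a history dominated by non-speech raises it, and mixed histories fall back to the intermediate value.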

S310, updating and storing the predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected, so as to determine an endpoint detection judgment parameter of the audio signal to be detected of the next frame.

For example, the step may update the noise power spectrum based on the update condition of the noise power spectrum and the update formula of the noise power spectrum according to the endpoint detection result.

S311, updating the endpoint detection mark value recorded in the detection result identification array to be used for performing endpoint detection on the audio signal to be detected of the next frame.

In this embodiment, after determining the endpoint detection result of the audio signal to be detected, the endpoint detection flag values in the detection result identification array may be updated.

Further, the endpoint detection flag value in the detection result identification array is updated according to the following flag value update formula, where the flag value update formula is expressed as:

$$\mathrm{value}[m] = \begin{cases} \mathrm{value}[m+1], & 1 \le m < L \\ \mathrm{val\_decision}, & m = L \end{cases}$$

wherein m represents the array element number of the detection result identification array; L represents the set length of the detection result identification array; value[m] represents the mth endpoint detection mark value recorded in the detection result identification array; and val_decision is the endpoint detection mark value corresponding to the audio signal to be detected.

In this embodiment, the detection result identification array records the endpoint detection flag values of the set-length run of historical frames immediately preceding the current audio signal to be detected, and each endpoint detection flag value recorded in the array may be updated in real time according to the endpoint detection result of the current audio signal to be detected.

Specifically, the 1st endpoint detection flag value in the detection result identification array may be denoted value[1]. Based on the above flag value update formula, since 1 is smaller than the set length L, value[1] is updated to the value[2] currently recorded in the detection result identification array; value[2] to value[L-1] are then updated in sequence in the same manner as value[1]; and finally value[L] is updated to the endpoint detection flag value corresponding to the endpoint detection result of the current audio signal to be detected.

This update ensures that the detection result identification array always records the endpoint detection flag values of the set-length run of historical frames immediately preceding the audio signal to be detected.
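The shift-and-append behaviour described for the detection result identification array can be illustrated as follows. This is a sketch under assumptions: the function name `update_flags` is hypothetical, and the array is represented as a Python list with the oldest entry first (index 0 corresponds to value[1]).

```python
def update_flags(flags, val_decision):
    """Shift the history left by one entry and append the newest flag.

    flags[0] plays the role of value[1] and flags[-1] of value[L];
    the returned list has the same length L as the input.
    """
    return flags[1:] + [val_decision]


# Usage: a history of set length 6, initialised to the non-speech flag 0.
history = [0] * 6
history = update_flags(history, 1)   # current frame judged to be speech
history = update_flags(history, 0)   # next frame judged to be non-speech
```

In Python the same fixed-length history could also be kept in a `collections.deque` created with `maxlen=L`, whose `append` discards the oldest entry automatically.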

The third embodiment of the invention provides an endpoint detection method for audio signals, which specifically expresses the implementation process of endpoint detection and optimizes and increases the updating process of element values in a detection result identification array. Compared with the prior art, the method can still keep the high accuracy of the endpoint detection under the condition of low signal-to-noise ratio, and the change of the noise environment does not influence the accuracy of the endpoint detection in the technical scheme, thereby better enhancing the robustness of the voice endpoint detection along with the change of the noise environment; meanwhile, the technical scheme is simple and convenient to realize, is easier to integrate into various embedded audio processing systems, has a wide application range and has better practicability in practical application.

EXAMPLE IV

Fig. 4a is a flowchart illustrating a method for detecting an endpoint of an audio signal according to a fourth embodiment of the present invention. In order to verify that the audio signal endpoint detection method provided by the embodiment of the invention has the characteristics of strong robustness and high endpoint detection accuracy, the embodiment performs endpoint detection based on the provided audio signal endpoint detection method in three different noise environments.

Specifically, the three different noise environments are white noise (White), babble noise (Babble), and in-vehicle noise (Vehicle), and three signal-to-noise ratios are set for each noise environment: 5 dB, 10 dB and 15 dB. In order to evaluate the endpoint detection effect of the endpoint detection method provided by the present invention, in this embodiment a clean audio signal file (i.e., an audio signal file without noise) with a duration of 15 seconds is recorded first, in which the portion containing a speech signal accounts for about 60% and the remaining about 40% is silence, and the frame length of a single frame is set to 10 ms. The clean audio signal file is then framed using this frame length, and each frame is manually labeled as a speech frame or a non-speech frame. Fig. 4b is a schematic diagram of a section of the clean audio signal file according to the fourth embodiment of the present invention; as can be seen from fig. 4b, the portions with signal fluctuation are speech signal portions and the portions without signal fluctuation are silence portions. Next, the clean audio signal file is played in each of the three noise environments and recorded to form audio signal files containing noise. Finally, endpoint detection is performed on the three audio signal files containing noise based on the provided audio signal endpoint detection method.
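The framing step can be sketched as below. The sampling rate is not stated in this embodiment, so the default of 8000 Hz is purely an assumption, as are the function name `split_into_frames` and the choice of non-overlapping frames with any trailing partial frame dropped.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames.

    sample_rate_hz is an assumed value; frame_ms=10 follows the frame
    length used in this embodiment. A trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]
```

Independently of the sampling rate, a 15-second file cut into 10 ms frames yields 1500 frames, which is the number of manual speech or non-speech labels needed as a reference.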

As shown in fig. 4a, the method for detecting an endpoint of an audio signal according to the above embodiment of the present invention performs endpoint detection on three audio signal files (collectively referred to as noise-containing audio signal files in this embodiment) containing noise, specifically including the following operations:

S401, obtaining the audio signal to be detected of the current frame, and determining the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the audio signal to be detected according to the power spectrum of the audio signal to be detected of the frame and a predetermined noise power spectrum.

In this embodiment, a noisy audio signal file may be divided into frames, the noise power spectrum may be initialized based on the first M frames of the noisy audio signal, and the (M+1)th frame of the noisy audio signal may be regarded as the first frame of the audio signal to be detected. Illustratively, in this step, the posterior signal-to-noise ratio and the prior signal-to-noise ratio required for endpoint detection are obtained through the set calculation formulas for the posterior signal-to-noise ratio and the prior signal-to-noise ratio, according to the power spectrum of the audio signal to be detected and the predetermined (initially determined or updated) noise power spectrum.

S402, determining an endpoint detection judgment parameter of the audio signal to be detected according to a judgment parameter formula, a posterior signal-to-noise ratio and a prior signal-to-noise ratio.

S403, acquiring each endpoint detection mark value in the set length detection result identification array, and determining the sum of the endpoint detection mark values.

For example, assuming that the set length is 6, the detection result identification array records the endpoint detection flag values of the 6 frames of historical audio signals to be detected that precede the current audio signal to be detected, in the same order as those historical frames were input. If there is no historical audio signal to be detected before the current audio signal to be detected, the endpoint detection flag values in the detection result identification array may be initialized to 0 (assuming that the non-voice endpoint flag value is 0).

S404, determining an endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.

This step may specifically determine the end point detection result of the audio signal to be detected according to the determination operation described in the third embodiment.

S405, if the audio signal to be detected is determined to be a voice signal, keeping the noise power spectrum unchanged; and otherwise, updating and storing the noise power spectrum according to a noise power spectrum updating formula so as to be used for determining an endpoint detection judgment parameter of the audio signal to be detected of the next frame.

This step specifically implements the update of the noise power spectrum.

S406, updating the endpoint detection mark value in the detection result identification array according to the mark value updating formula so as to perform endpoint detection on the audio signal to be detected of the next frame.

The step specifically realizes the updating of the endpoint detection mark value in the detection result identification array.

S407, if the audio signal to be detected of the current frame is not the last frame of the noisy audio signal file, determining the next frame as the current frame, and returning to execute S401; otherwise, the end point detection operation is ended.

This step specifically determines the termination condition of the iterative endpoint detection operation.
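The iteration of S401 to S407 can be sketched as a single loop. This is an illustration under stated assumptions, not the implementation of this embodiment: the function and parameter names are hypothetical, the threshold defaults are example values within the ranges given in the third embodiment, and the decision parameter is supplied by the caller through `decision_param` because its formula is defined elsewhere. The `toy_param` helper is only a stand-in (mean ratio of frame power to noise power, minus one) so that the sketch runs; it is not the judgment parameter formula of the invention.

```python
def run_endpoint_detection(frame_psds, noise_psd, decision_param, length=6,
                           mu=0.95, sum_high=4, sum_low=2,
                           thr_first=0.08, thr_second=1.0, thr_third=0.2):
    """Return one 0/1 flag per frame (1 = speech, 0 = non-speech)."""
    flags = [0] * length              # history initialised to non-speech
    noise = list(noise_psd)           # initial noise power spectrum
    results = []
    for psd in frame_psds:            # S407: loop until the last frame
        param = decision_param(psd, noise)          # S401 and S402
        total, last = sum(flags), flags[-1]         # S403
        if total >= sum_high and last == 1:         # S404: pick threshold
            thr = thr_first
        elif total <= sum_low and last == 0:
            thr = thr_second
        else:
            thr = thr_third
        decision = 1 if param >= thr else 0
        if decision == 0:                           # S405: update on non-speech
            noise = [mu * n + (1.0 - mu) * p for n, p in zip(noise, psd)]
        flags = flags[1:] + [decision]              # S406: shift history
        results.append(decision)
    return results


def toy_param(psd, noise):
    """Placeholder decision parameter (assumption, not the patent's formula)."""
    return sum(p / n for p, n in zip(psd, noise)) / len(psd) - 1.0
```

Each frame is classified with the noise estimate and flag history produced by the frames before it, which is why S405 and S406 run after the decision in S404 and before the next frame is fetched.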

It should be noted that, in this embodiment, in order to assess the endpoint detection effect of the provided endpoint detection method, the existing VAD algorithm based on G.729 Annex B is also used to perform endpoint detection on the noisy audio signal files formed in the three noise environments. In addition, the present embodiment uses two parameters for comparing endpoint detection results: the positive detection rate P_d (the probability of detecting a speech frame as a speech frame) and the false detection rate P_f (the probability of detecting a non-speech frame as a speech frame). The higher the positive detection rate P_d and the lower the false detection rate P_f, the higher the detection accuracy; these parameters are used to compare the detection effect of the scheme of the present invention with that of the existing scheme.
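Given the manual per-frame labels and the labels produced by a detector, the two comparison parameters can be computed as in the following sketch. The function name `detection_rates` and the 0/1 label encoding (1 for a speech frame, 0 for a non-speech frame) are assumptions made for the example.

```python
def detection_rates(reference, detected):
    """Return (P_d, P_f) for two equal-length sequences of 0/1 frame labels.

    P_d: fraction of reference speech frames detected as speech.
    P_f: fraction of reference non-speech frames detected as speech.
    A rate is NaN when its denominator is zero.
    """
    if len(reference) != len(detected):
        raise ValueError("label sequences must have the same length")
    speech_total = sum(1 for r in reference if r == 1)
    non_speech_total = len(reference) - speech_total
    hits = sum(1 for r, d in zip(reference, detected) if r == 1 and d == 1)
    false_alarms = sum(1 for r, d in zip(reference, detected)
                       if r == 0 and d == 1)
    p_d = hits / speech_total if speech_total else float("nan")
    p_f = false_alarms / non_speech_total if non_speech_total else float("nan")
    return p_d, p_f
```

For instance, with reference labels `[1, 1, 1, 0, 0]` and detected labels `[1, 1, 0, 1, 0]`, two of the three speech frames are found and one of the two non-speech frames is wrongly flagged, giving P_d = 2/3 and P_f = 1/2.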

TABLE 1 Comparison of the detection results of the technical solution of the present invention with those of the VAD algorithm of G.729 Annex B

Specifically, Table 1 shows a comparison between the detection results of the scheme proposed by the present invention and those of the VAD algorithm of G.729 Annex B under different noise environments. As can be seen from Table 1, compared with the VAD algorithm of G.729 Annex B, the endpoint detection method provided by the invention significantly improves detection accuracy under the condition of a low signal-to-noise ratio (5 dB); meanwhile, under different signal-to-noise ratios and different noise environments, the endpoint detection method provided by the invention maintains relatively high endpoint detection accuracy, so that it has strong robustness to changes in the noise environment.

It should be noted that this embodiment further provides endpoint detection effect diagrams for the different endpoint detection methods. Specifically, fig. 4c is a schematic diagram of detecting a voice signal in a noisy audio signal file based on the VAD algorithm of G.729 Annex B in a 5 dB white noise environment; fig. 4d is a schematic diagram of detecting a voice signal in a noisy audio signal file based on the technical solution of the present invention in a 5 dB white noise environment; fig. 4e is a schematic diagram of detecting a voice signal in a noisy audio signal file based on the VAD algorithm of G.729 Annex B in a 10 dB babble noise environment; fig. 4f is a schematic diagram of detecting a voice signal in a noisy audio signal file based on the technical solution of the present invention in a 10 dB babble noise environment; fig. 4g is a schematic diagram of detecting a voice signal in a noisy audio signal file based on the VAD algorithm of G.729 Annex B in a 15 dB in-vehicle noise environment; and fig. 4h is a schematic diagram of detecting a voice signal in a noisy audio signal file based on the technical solution of the present invention in a 15 dB in-vehicle noise environment.

It can be understood that the voice signals detected in the noisy audio signal files are marked in fig. 4c to 4h. Comparing the detection results of fig. 4c to 4h with the voice signals in the clean audio signal file shown in fig. 4b, it can be found that the detection results of the endpoint detection method provided by the embodiment of the present invention are more accurate than those of the VAD algorithm of G.729 Annex B, which further illustrates that the endpoint detection method provided by the embodiment of the present invention has higher detection accuracy and stronger robustness.

EXAMPLE V

Fig. 5 is a block diagram of an audio signal endpoint detection apparatus according to a fifth embodiment of the present invention, where the apparatus is suitable for use in a device with an audio signal processing function to perform endpoint detection on an audio signal before processing the audio signal, and the apparatus may be implemented by software and/or hardware and is generally integrated in an audio signal processing system with an audio processing function, and the audio signal processing system is disposed on a computer device. As shown in fig. 5, the apparatus includes: a decision parameter determining module 51, an end point marker determining module 52 and a detection result determining module 53.

The decision parameter determining module 51 is configured to determine an endpoint detection decision parameter of the audio signal to be detected based on a power spectrum of the audio signal to be detected and a predetermined noise power spectrum;

an endpoint flag determining module 52, configured to determine, based on each endpoint detection flag value recorded in a detection result flag array of a set length, a sum of endpoint detection flag values of the detection result flag array, where the endpoint detection flag value is a voice endpoint flag value or a non-voice endpoint flag value;

and a detection result determining module 53, configured to determine an endpoint detection result of the audio signal to be detected according to the endpoint detection decision parameter and the sum of the endpoint detection flag values.

In this embodiment, the apparatus first determines, through the decision parameter determining module 51, an endpoint detection decision parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum; then, through the endpoint mark determining module 52, it determines the sum of the endpoint detection mark values of the detection result identification array based on the endpoint detection mark values recorded in the detection result identification array of the set length; and finally, through the detection result determining module 53, it determines an endpoint detection result of the audio signal to be detected according to the endpoint detection decision parameter and the sum of the endpoint detection mark values.

According to the audio signal endpoint detection device provided by the fifth embodiment of the invention, the decision parameter for endpoint detection is determined from the power spectrum of the audio signal and the noise power spectrum alone, and endpoint detection is carried out on that basis; the technical scheme does not depend on the syllable characteristics of the audio signal. Compared with existing schemes, it can therefore maintain high endpoint detection accuracy at low signal-to-noise ratios, its accuracy is not affected by changes in the noise environment, and the robustness of voice endpoint detection to a changing noise environment is enhanced. Meanwhile, the technical scheme is simple to implement, is easy to integrate into various embedded audio processing systems, has a wide application range, and is practical in real applications.

On the basis of the above optimization, the decision parameter determining module 51 includes:

the signal-to-noise ratio determining unit is used for determining a posterior signal-to-noise ratio and a prior signal-to-noise ratio of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum;

and the judgment parameter acquisition unit is used for determining the endpoint detection judgment parameters of the audio signal to be detected according to a set judgment parameter formula and the posterior signal-to-noise ratio and the prior signal-to-noise ratio.

Further, the signal-to-noise ratio determining unit is specifically configured to determine a posterior signal-to-noise ratio and a prior signal-to-noise ratio of the audio signal to be detected according to the following posterior signal-to-noise ratio formula and the prior signal-to-noise ratio formula, where the posterior signal-to-noise ratio formula is expressed as:

the prior signal-to-noise ratio formula is expressed as: ξ(n,k) = αξ(n-1,k) + (1-α)max(γ(n,k)-1, 0);

the decision parameter obtaining unit is specifically configured to determine an endpoint detection decision parameter of the audio signal to be detected according to the following decision parameter formula, where the decision parameter formula is expressed as:

wherein k represents the frequency number of the frequency domain and is any integer from 0 to K-1, K represents the length of the frequency domain, n represents the frame number of the current audio signal to be detected, x(n) represents the nth frame of the audio signal to be detected in the time domain, FFT(x(n)) represents the nth frame of the audio signal to be detected in the frequency domain, γ(n,k) represents the posterior signal-to-noise ratio of the nth frame of the audio signal to be detected, |FFT(x(n))|² represents the power spectrum of the nth frame of the audio signal to be detected, λ_NE(n-1,k) represents the noise power spectrum corresponding to the noise signal in the (n-1)th frame of the audio signal to be detected, ξ(n,k) represents the prior signal-to-noise ratio of the nth frame of the audio signal to be detected, α is any constant between 0 and 1, and ξ(n-1,k) represents the prior signal-to-noise ratio of the (n-1)th frame of the audio signal to be detected.
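The two signal-to-noise ratios can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the posterior signal-to-noise ratio is assumed to be the conventional ratio of the frame power spectrum |FFT(x(n))|² to the stored noise power spectrum λ_NE(n-1,k), which matches the symbols defined above but is not spelled out in this text. The function names, the small `floor` guard against division by zero, and the value α = 0.9 are likewise illustrative assumptions.

```python
import cmath


def power_spectrum(frame):
    """Return |FFT(x(n))|^2 for one time-domain frame, using a naive DFT."""
    K = len(frame)
    spectrum = []
    for k in range(K):
        acc = 0j
        for t, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * cmath.pi * k * t / K)
        spectrum.append(abs(acc) ** 2)
    return spectrum


def posterior_snr(power, noise_power, floor=1e-12):
    """ASSUMPTION: gamma(n,k) = |FFT(x(n))|^2 / lambda_NE(n-1,k)."""
    return [p / max(nz, floor) for p, nz in zip(power, noise_power)]


def prior_snr(prev_xi, gamma, alpha=0.9):
    """xi(n,k) = alpha*xi(n-1,k) + (1-alpha)*max(gamma(n,k)-1, 0)."""
    return [alpha * x + (1.0 - alpha) * max(g - 1.0, 0.0)
            for x, g in zip(prev_xi, gamma)]


# Example: a unit impulse has a flat power spectrum of 1.0 in every bin.
power = power_spectrum([1.0, 0.0, 0.0, 0.0])
gamma = posterior_snr(power, [0.5, 0.5, 0.5, 0.5])
xi = prior_snr([0.0, 0.0, 0.0, 0.0], gamma)
```

A production system would normally replace the naive DFT with an FFT routine; the quadratic loop is used here only to keep the sketch free of external dependencies.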

Further, the device preferably also comprises a power spectrum updating module 54, configured to update and store the predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected, for use in determining the endpoint detection decision parameter of the next frame of the audio signal to be detected.

On the basis of the above optimization, the power spectrum update module 54 is specifically configured to:

when the audio signal to be detected is determined to be a voice signal, keeping the noise power spectrum unchanged; otherwise, updating and storing the noise power spectrum according to the following noise power spectrum updating formula, wherein the noise power spectrum updating formula is expressed as: λ_NE(n,k) = μλ_NE(n-1,k) + (1-μ)|FFT(x(n))|², wherein k represents the frequency number of the frequency domain; λ_NE(n,k) represents the noise power spectrum corresponding to the noise signal in the nth frame of the audio signal to be detected; λ_NE(n-1,k) represents the noise power spectrum corresponding to the noise signal in the (n-1)th frame of the audio signal to be detected; |FFT(x(n))|² represents the power spectrum of the nth frame of the audio signal to be detected.
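The update rule above amounts to a recursive average that is frozen whenever speech is detected. A minimal sketch, assuming μ is a smoothing constant between 0 and 1 (the value 0.95 and the function name are illustrative, not taken from the text):

```python
def update_noise_power(noise_power, power, is_speech, mu=0.95):
    """Return lambda_NE(n,k) given lambda_NE(n-1,k) and |FFT(x(n))|^2.

    If the frame was classified as speech, the stored noise power
    spectrum is kept unchanged; otherwise each bin is smoothed as
    mu * lambda_NE(n-1,k) + (1 - mu) * |FFT(x(n))|^2.
    """
    if is_speech:
        return list(noise_power)
    return [mu * nz + (1.0 - mu) * p for nz, p in zip(noise_power, power)]


# A speech frame leaves the estimate untouched; a noise frame moves it
# toward the current power spectrum.
kept = update_noise_power([1.0, 2.0], [3.0, 4.0], is_speech=True)
moved = update_noise_power([1.0, 2.0], [3.0, 4.0], is_speech=False, mu=0.5)
```

A value of μ close to 1 makes the noise estimate change slowly, which suits stationary noise; a smaller μ tracks changing noise faster at the cost of a noisier estimate.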

Further, the detection result determining module 53 includes:

a first determining unit, configured to determine an endpoint detection result of the audio signal to be detected according to a comparison result between the endpoint detection decision parameter and a first set threshold value when the sum of the endpoint detection flag values is greater than or equal to the first set threshold value and a last endpoint detection flag value in the detection result identifier array is a voice endpoint flag value;

a second determining unit, configured to determine an endpoint detection result of the audio signal to be detected according to a comparison result between the endpoint detection decision parameter and a second set threshold value when the sum of the endpoint detection flag values is less than or equal to the second set threshold value and a last endpoint detection flag value in the detection result identifier array is a non-voice endpoint flag value;

a third determining unit, configured to determine an endpoint detection result of the audio signal to be detected according to a comparison result between the endpoint detection decision parameter and a third set threshold value when the sum of the endpoint detection flag values is greater than or equal to the first set threshold value and a last endpoint detection flag value in the detection result identifier array is a non-voice endpoint flag value;

a fourth determining unit, configured to determine an endpoint detection result of the audio signal to be detected according to a comparison result between the endpoint detection decision parameter and a third set threshold value when the sum of the endpoint detection flag values is less than or equal to the second set threshold value and a last endpoint detection flag value in the detection result identifier array is a voice endpoint flag value;

a fifth determining unit, configured to determine an endpoint detection result of the audio signal to be detected according to a comparison result between the endpoint detection determining parameter and a third set threshold value when the sum of the endpoint detection flag values is smaller than the first set threshold value and larger than the second set threshold value;

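wherein, of the thresholds compared with the sum of the endpoint detection flag values, the first set threshold is greater than the second set threshold; and, of the thresholds compared with the endpoint detection decision parameter, the first set threshold value is smaller than the third set threshold value, and the third set threshold value is smaller than the second set threshold value.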

On the basis of the foregoing embodiment, the first determining unit is specifically configured to:

when the endpoint detection judgment parameter is greater than or equal to the first set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value; otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.

Meanwhile, the second determining unit is specifically configured to:

when the endpoint detection judgment parameter is greater than or equal to the second set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value; otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.

Further, the third determining unit is specifically configured to:

when the endpoint detection judgment parameter is greater than or equal to the third set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value; otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.
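The five determining units can be read as a threshold-selection step followed by a single comparison. The sketch below is a hedged illustration under several assumptions: the voice and non-voice flag values are encoded as 1 and 0, the thresholds applied to the flag sum (`sum_hi`, `sum_lo`) are separate quantities from the thresholds applied to the decision parameter (`thr1`, `thr2`, `thr3`), and all names are hypothetical.

```python
SPEECH, NON_SPEECH = 1, 0  # assumed flag encoding


def select_decision_threshold(flags, sum_hi, sum_lo, thr1, thr2, thr3):
    """Pick the threshold the decision parameter is compared against."""
    total = sum(flags)
    last = flags[-1]
    if total >= sum_hi and last == SPEECH:
        return thr1  # first determining unit
    if total <= sum_lo and last == NON_SPEECH:
        return thr2  # second determining unit
    return thr3      # third, fourth and fifth determining units


def detect(decision_param, flags, sum_hi, sum_lo, thr1, thr2, thr3):
    """Return the flag value recorded for the current frame."""
    threshold = select_decision_threshold(
        flags, sum_hi, sum_lo, thr1, thr2, thr3)
    return SPEECH if decision_param >= threshold else NON_SPEECH


# Illustrative numbers only: the same decision parameter is accepted as
# speech after a run of speech frames and rejected after a run of silence.
after_speech = detect(0.3, [1, 1, 1, 1, 1], 4, 1, 0.2, 0.8, 0.5)
after_silence = detect(0.3, [0, 0, 0, 0, 0], 4, 1, 0.2, 0.8, 0.5)
```

Selecting the threshold from recent history gives the detector hysteresis: it is harder to leave a state that the last several frames agree on.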

In addition, the device preferably also comprises a mark value updating module 55, configured to update the endpoint detection mark values recorded in the detection result identifier array after the endpoint detection result of the audio signal to be detected is determined, for use in endpoint detection of the next frame of the audio signal to be detected.

On the basis of the above optimization, the marker value updating module 55 is specifically configured to update the endpoint detection marker value in the detection result identifier array according to the following marker value updating formula, where the marker value updating formula is expressed as:
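A minimal sketch of one way such an update could work, assuming the detection result identifier array behaves as a fixed-length first-in, first-out buffer: each stored value moves one position toward the front, and the newest decision, here called `val_decision`, is written to the last position. This shift rule, the 0/1 flag encoding, and the function names are illustrative assumptions rather than the formula of the embodiment.

```python
from collections import deque


def make_flag_array(length, initial=0):
    """Create a detection result array of fixed length L."""
    return deque([initial] * length, maxlen=length)


def update_flags(flags, val_decision):
    """Drop the oldest flag and append the newest one.

    Equivalent to value[m] = value[m + 1] for m < L - 1, followed by
    value[L - 1] = val_decision (assumed update rule).
    """
    flags.append(val_decision)  # a deque with maxlen discards the oldest item
    return flags


flags = make_flag_array(3)
for decision in (1, 1, 0):
    update_flags(flags, decision)
```

With this structure the sum of the array is always the number of voice frames among the last L decisions.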

On the basis of the embodiment, the audio signal to be detected in the device is an audio signal input in real time or a pre-recorded audio signal;

correspondingly, when the audio signal to be detected is a pre-recorded audio signal, the noise power spectrum is initialized based on the first M frames of the pre-recorded audio signal, wherein M is a set constant value.
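One plausible initialization, sketched under the assumption that the first M frames contain only noise and that their power spectra are simply averaged bin by bin (the averaging rule and the function name are assumptions, since the text only states that the first M frames are used):

```python
def init_noise_power(power_spectra):
    """Average the power spectra of the first M frames, bin by bin.

    power_spectra is a list of M lists, each holding |FFT(x(n))|^2
    for one frame.
    """
    if not power_spectra:
        raise ValueError("at least one frame is required")
    M = len(power_spectra)
    K = len(power_spectra[0])
    return [sum(frame[k] for frame in power_spectra) / M for k in range(K)]


initial_noise = init_noise_power([[1.0, 2.0], [3.0, 4.0]])
```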

Meanwhile, the embodiment of the invention also provides an audio signal processing system, which comprises the audio signal endpoint detection device provided by the embodiment of the invention.

It can be understood that the audio signal processing system may be specifically configured to perform signal processing operations such as audio encoding, audio recognition, and audio enhancement on an audio signal. In the audio signal processing system provided in the embodiment of the present invention, the integrated audio signal endpoint detection apparatus can preprocess the audio signal before these operations, that is, perform endpoint detection on it. The input audio signal can thus be divided into voice sections and non-voice sections that are processed differently, which helps the audio signal processing system reduce the code rate, improve the voice recognition rate, improve the signal-to-noise ratio of the signal, and so on.
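The division into voice and non-voice sections can be illustrated by grouping consecutive per-frame flags into runs. This is a sketch assuming one 0/1 flag per frame; the function name and the tuple layout are hypothetical.

```python
from itertools import groupby


def flags_to_segments(flags):
    """Group per-frame flags into (start_frame, end_frame, flag) runs.

    end_frame is exclusive, so a segment covers frames start..end-1.
    """
    segments = []
    start = 0
    for flag, run in groupby(flags):
        length = len(list(run))
        segments.append((start, start + length, flag))
        start += length
    return segments


segments = flags_to_segments([0, 0, 1, 1, 1, 0])
```

A downstream encoder or recognizer could then process only the runs whose flag marks voice and skip or compress the rest.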

EXAMPLE six

Fig. 6 is a block diagram of a computer device according to a sixth embodiment of the present invention. As shown in fig. 6, the computer device integrates the audio signal processing system 60 of the foregoing embodiment and further includes a processor 61 and a storage device 62. The computer device may have one or more processors 61; one processor 61 is taken as an example in fig. 6. The processor 61 and the storage device 62 may be connected by a bus or by other means; a bus connection is taken as an example in fig. 6. In addition, the audio signal processing system may be installed in the computer device directly in the form of software, or may be embedded and integrated in the computer device in the form of a chip; in the latter case, the audio signal processing system may be connected to the processor 61 and the storage device 62 through a bus or another connection.

It can be understood that the computer device may be an electronic product with a call function, such as a mobile phone, a tablet computer, a notebook computer, and the like, and may also be an electronic product with an audio interaction function, such as an intelligent voice assistant, an intelligent home, a voice navigator, and the like.

The storage device 62 is a computer-readable storage medium and can be used for storing one or more programs, which may be software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the audio signal endpoint detection method in the embodiment of the present invention (for example, the decision parameter determining module 51, the endpoint flag determining module 52, and the detection result determining module 53 in the audio signal endpoint detection device shown in fig. 5); it can also be used for storing one or more programs of the audio signal processing system 60. The processor 61 executes the various functional applications and data processing of the device, that is, implements the audio signal processing method in the above-described method embodiment, by running the software programs, instructions, and modules stored in the storage device 62.

The storage device 62 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the storage device 62 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the storage device 62 may further include memory located remotely from the processor 61, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

And, when one or more programs included in the above-mentioned computer apparatus are executed by the one or more processors 61, the programs perform the following operations:

determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum; determining the sum of the endpoint detection mark values of the detection result identification array based on all the endpoint detection mark values recorded in the detection result identification array of set length, wherein each endpoint detection mark value is a voice endpoint mark value or a non-voice endpoint mark value; and determining an endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.

Furthermore, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the audio signal endpoint detection method described in embodiment one, embodiment two, embodiment three, or embodiment four, the method including: determining an endpoint detection judgment parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum; determining the sum of the endpoint detection mark values of the detection result identification array based on all the endpoint detection mark values recorded in the detection result identification array of set length, wherein each endpoint detection mark value is a voice endpoint mark value or a non-voice endpoint mark value; and determining an endpoint detection result of the audio signal to be detected according to the endpoint detection judgment parameter and the sum of the endpoint detection mark values.

Based on this understanding, the technical solutions of the present invention can be embodied in the form of a software product stored on a medium such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, the software product including instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the foregoing embodiment, each included unit and each included module are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of endpoint detection of an audio signal, comprising:

2. The method according to claim 1, wherein determining the endpoint detection decision parameter of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum comprises:

determining a posterior signal-to-noise ratio and a prior signal-to-noise ratio of the audio signal to be detected based on the power spectrum of the audio signal to be detected and a predetermined noise power spectrum;

and determining an endpoint detection judgment parameter of the audio signal to be detected according to a set judgment parameter formula and the posterior signal-to-noise ratio and the prior signal-to-noise ratio.

3. The method according to claim 2, characterized in that the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the audio signal to be detected are determined according to the posterior signal-to-noise ratio formula and the prior signal-to-noise ratio formula, respectively,

the posterior signal-to-noise ratio formula is expressed as:

the prior signal-to-noise ratio formula is expressed as:

ξ(n,k) = αξ(n-1,k) + (1-α)max(γ(n,k)-1, 0);

determining an endpoint detection decision parameter of the audio signal to be detected according to the following decision parameter formula, wherein the decision parameter formula is expressed as:

4. The method of claim 1, further comprising:

and updating and storing a predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected so as to be used for determining an endpoint detection judgment parameter of the audio signal to be detected of the next frame.

5. The method according to claim 4, wherein the updating and saving the predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected comprises:

if the audio signal to be detected is determined to be a voice signal, keeping the noise power spectrum unchanged; otherwise, updating and storing the noise power spectrum according to the following noise power spectrum updating formula, wherein the noise power spectrum updating formula is expressed as:

λ_NE(n,k) = μλ_NE(n-1,k) + (1-μ)|FFT(x(n))|²,

wherein k represents the frequency number of the frequency domain; λ_NE(n,k) represents the noise power spectrum corresponding to the noise signal in the nth frame of the audio signal to be detected; λ_NE(n-1,k) represents the noise power spectrum corresponding to the noise signal in the (n-1)th frame of the audio signal to be detected; |FFT(x(n))|² represents the power spectrum of the nth frame of the audio signal to be detected.

6. The method according to claim 1, wherein determining the endpoint detection result of the audio signal to be detected according to the endpoint detection decision parameter and the sum of the endpoint detection flag values comprises:

if the sum of the endpoint detection mark values is greater than or equal to a first set threshold value and the last endpoint detection mark value in the detection result identification array is a voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and the first set threshold value;

if the sum of the endpoint detection mark values is less than or equal to a second set threshold value and the last endpoint detection mark value in the detection result identification array is a non-voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and the second set threshold value;

if the sum of the endpoint detection mark values is greater than or equal to the first set threshold value and the last endpoint detection mark value in the detection result identification array is a non-voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and a third set threshold value;

if the sum of the endpoint detection mark values is less than or equal to the second set threshold value and the last endpoint detection mark value in the detection result identification array is a voice endpoint mark value, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and a third set threshold value;

if the sum of the endpoint detection mark values is smaller than the first set threshold and larger than the second set threshold, determining an endpoint detection result of the audio signal to be detected according to a comparison result of the endpoint detection judgment parameter and a third set threshold;

7. The method according to claim 6, wherein determining the end point detection result of the audio signal to be detected according to the comparison result between the end point detection decision parameter and the first set threshold value comprises:

when the endpoint detection judgment parameter is greater than or equal to the first set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value;

otherwise, determining that the audio signal to be detected is a non-voice signal, and recording the endpoint detection mark value of the audio signal to be detected as a non-voice endpoint mark value.

8. The method according to claim 6, wherein determining the end point detection result of the audio signal to be detected according to the comparison result between the end point detection decision parameter and a second set threshold value comprises:

when the endpoint detection judgment parameter is greater than or equal to the second set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value;

9. The method according to claim 6, wherein determining the endpoint detection result of the audio signal to be detected according to the comparison result between the endpoint detection decision parameter and a third set threshold value comprises:

when the endpoint detection judgment parameter is greater than or equal to the third set threshold value, determining that the audio signal to be detected is a voice signal, and recording an endpoint detection marking value of the audio signal to be detected as a voice endpoint marking value;

10. The method according to claim 1, further comprising, after said determining an end point detection result of the audio signal to be detected:

and updating the end point detection mark value recorded in the detection result identification array so as to be used for carrying out end point detection on the audio signal to be detected of the next frame.

11. The method of claim 10, wherein the endpoint detection flag values in the detection result identification array are updated according to a flag value update formula,

the flag value update formula is expressed as:

wherein m represents the array element number of the detection result identification array; L represents the set length of the detection result identification array; value[m] represents the mth endpoint detection mark value recorded in the detection result identification array; val_decision is the endpoint detection mark value corresponding to the audio signal to be detected; and value[m+1] represents the (m+1)th endpoint detection mark value recorded in the detection result identification array.

12. The method according to any one of claims 1 to 11, wherein the audio signal to be detected is a real-time input audio signal or a pre-recorded audio signal;

13. An apparatus for detecting an end point of an audio signal, comprising:

14. The apparatus of claim 13, wherein the decision parameter determining module comprises:

15. The apparatus according to claim 14, wherein the SNR determination unit is specifically configured to determine a posterior SNR and a prior SNR of the audio signal to be detected according to a posterior SNR formula and a prior SNR formula, respectively,

the posterior signal-to-noise ratio formula is expressed as:

the prior signal-to-noise ratio formula is expressed as:

ξ(n,k) = αξ(n-1,k) + (1-α)max(γ(n,k)-1, 0);

16. The apparatus of claim 13, further comprising:

and the power spectrum updating module is used for updating and storing a predetermined noise power spectrum according to the endpoint detection result of the audio signal to be detected so as to determine an endpoint detection judgment parameter of the audio signal to be detected of the next frame.

17. The apparatus of claim 16, wherein the power spectrum update module is specifically configured to:

when the audio signal to be detected is determined to be a voice signal, keeping the noise power spectrum unchanged; otherwise, updating and storing the noise power spectrum according to the following noise power spectrum updating formula, wherein the noise power spectrum updating formula is expressed as:

λ_NE(n,k) = μλ_NE(n-1,k) + (1-μ)|FFT(x(n))|²,

18. The apparatus of claim 13, wherein the detection result determining module comprises:

19. The apparatus according to claim 18, wherein the first determining unit is specifically configured to:

20. The apparatus according to claim 18, wherein the second determining unit is specifically configured to:

21. The apparatus according to claim 18, wherein the third determining unit is specifically configured to:

22. The apparatus of claim 13, further comprising:

and the mark value updating module is used for updating the end point detection mark value recorded in the detection result identification array after the end point detection result of the audio signal to be detected is determined, so as to be used for carrying out end point detection on the audio signal to be detected of the next frame.

23. The apparatus according to claim 22, wherein the tag value updating module is specifically configured to update the endpoint detection tag value in the detection result identifier array according to a tag value updating formula,

the flag value update formula is expressed as:

wherein m represents the array element number of the detection result identification array; L represents the set length of the detection result identification array; value[m] represents the mth endpoint detection mark value recorded in the detection result identification array; val_decision is the endpoint detection mark value corresponding to the audio signal to be detected; and value[m+1] represents the (m+1)th endpoint detection mark value recorded in the detection result identification array.

24. The apparatus according to any one of claims 13-23, wherein the audio signal to be detected is a real-time input audio signal or a pre-recorded audio signal;

25. An audio signal processing system, characterized in that the audio signal processing system comprises the audio signal endpoint detection apparatus according to any one of claims 13-24.

26. A computer device, characterized in that the computer device comprises:

the audio signal processing system of claim 25;

one or more processors;

the one or more programs being executable by the one or more processors to cause the one or more processors to implement the method of endpoint detection of an audio signal of any of claims 1-12.

27. A storage medium containing computer-executable instructions for performing the method of endpoint detection of an audio signal of any of claims 1-12 when executed by a computer processor.