CN108198547B - Voice endpoint detection method and device, computer equipment and storage medium - Google Patents

Voice endpoint detection method and device, computer equipment and storage medium

Info

Publication number
CN108198547B
CN108198547B (application CN201810048223.3A)
Authority
CN
China
Prior art keywords
voice
noise
feature vector
acoustic
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810048223.3A
Other languages
Chinese (zh)
Other versions
CN108198547A (en)
Inventor
黄石磊
刘轶
王昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN201810048223.3A priority Critical patent/CN108198547B/en
Publication of CN108198547A publication Critical patent/CN108198547A/en
Application granted granted Critical
Publication of CN108198547B publication Critical patent/CN108198547B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032 Quantisation or dequantisation of spectral components
    • G10L 19/038 Vector quantisation, e.g. TwinVQ audio
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a voice endpoint detection method, a voice endpoint detection device, computer equipment and a storage medium. The method comprises the following steps: acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise; converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag; analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal. The method can effectively improve the accuracy of voice endpoint detection.

Description

Voice endpoint detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a voice endpoint detection method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of voice technology, voice endpoint detection plays a very important role in voice recognition. Voice endpoint detection detects the start point and the end point of the voice portion within a continuous segment of noisy speech, so that the voice can be effectively recognized.
Traditional voice endpoint detection methods fall into two categories. The first extracts features from each signal segment according to the differences between the time-domain and frequency-domain characteristics of voice and noise signals, and compares the features of each segment with a set threshold value to perform voice endpoint detection. However, this approach is only suitable for detection under stationary noise conditions; its noise robustness is poor and it is difficult to distinguish pure speech from noise, resulting in low accuracy of voice endpoint detection. The second is based on neural networks and performs endpoint detection on the voice signals using a trained model. However, the input vectors of most models only contain the features of the noisy speech, so the noise robustness is poor and the accuracy of voice endpoint detection is low. Therefore, how to effectively improve the accuracy of voice endpoint detection has become a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice endpoint detection method, apparatus, computer device and storage medium capable of effectively improving accuracy of voice endpoint detection.
A voice endpoint detection method, comprising:
acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, before the extracting the acoustic feature and the spectral feature corresponding to the noisy speech signal, the method further includes:
converting the voice signal with noise into a voice frequency spectrum with noise;
and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, before the extracting the acoustic feature and the spectral feature corresponding to the noisy speech signal, the method further includes:
converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise;
carrying out dynamic noise estimation on the voice frequency spectrum with the noise according to the voice amplitude spectrum with the noise to obtain a noise amplitude spectrum;
estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum;
and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the converting the acoustic features and the spectral features includes:
extracting a preset number of frames before and after the current frame in the acoustic features and the frequency spectrum features;
calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame;
and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the step of obtaining the classifier further comprises:
acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier;
obtaining a first verification set, wherein the first verification set comprises a plurality of first voice data;
inputting a plurality of first voice data into a classifier to obtain class probabilities corresponding to the plurality of first voice data;
screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels;
training by using the verification set added with the class label and the training set to obtain a verification classifier;
obtaining a second verification set, wherein the second verification set comprises a plurality of second voice data;
inputting a plurality of second voice data into a verification classifier to obtain class probabilities corresponding to the plurality of second voice data;
and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
In one embodiment, the step of classifying the acoustic feature vector and the spectral feature vector by the classifier comprises:
taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector;
when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the spectrum feature vector;
and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
A voice endpoint detection apparatus comprising:
the extraction module is used for acquiring a voice signal with noise and extracting acoustic features and spectral features corresponding to the voice signal with noise;
the conversion module is used for converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
the classification module is used for acquiring a classifier, inputting the acoustic feature vector and the spectral feature vector into the classifier, and obtaining an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
the analysis module is used for analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the conversion module is further configured to extract a preset number of frames before and after a current frame in the acoustic features and the spectral features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of:
acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
The voice endpoint detection method, the voice endpoint detection device, the computer equipment and the storage medium acquire the noisy voice signal and extract the acoustic features and spectral features corresponding to it, and convert the acoustic features and spectral features to obtain the corresponding acoustic feature vectors and spectral feature vectors. A classifier is acquired, and the acoustic feature vectors and spectral feature vectors are input into the classifier to obtain acoustic feature vectors with added voice tags and spectral feature vectors with added voice tags, so that the acoustic feature vectors and spectral feature vectors can be effectively classified and voice and non-voice can be effectively identified. The acoustic feature vectors with added voice tags and the spectral feature vectors with added voice tags are analyzed to obtain the corresponding voice signal, and the starting point and ending point corresponding to the voice signal are determined according to the time sequence of the voice signal, so that the start and end points of the noisy voice signal can be accurately identified and the accuracy of voice endpoint detection can be effectively improved.
Drawings
FIG. 1 is a flow diagram of a method for voice endpoint detection in one embodiment;
FIG. 2 is a diagram of the internal structure of the speech endpoint detection apparatus in one embodiment;
FIG. 3 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. It will be understood that the terms "first", "second" and the like may be used herein to describe various elements, but these elements are not limited by these terms; the terms are only used to distinguish one element from another.
In one embodiment, as shown in fig. 1, a method for detecting a voice endpoint is provided, which is described by taking the method as an example for being applied to a terminal, and includes the following steps:
step 102, acquiring a voice signal with noise, and extracting acoustic features and spectral features corresponding to the voice signal with noise.
Generally, an actually collected voice signal contains noise of a certain intensity. When the noise intensity is high, the performance of voice applications is noticeably affected, for example through reduced voice recognition efficiency and lower endpoint detection accuracy.
The terminal can acquire the voice input by the user through the voice input device. The terminal can be a terminal such as a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like, and the terminal further includes a voice input device, for example, a device such as a microphone having a voice recording function. The voice input by the user and acquired by the terminal is usually a noisy voice signal containing noise, and the noisy voice signal may be a noisy voice signal such as a call voice, a recorded audio, a voice instruction and the like input by the user. And after the terminal acquires the voice signal with the noise, extracting the acoustic characteristic and the spectral characteristic corresponding to the voice signal with the noise. The acoustic features may include feature information of unvoiced sound, voiced sound, vowel sound, consonant sound, and the like of the noisy speech signal. The spectral characteristics may include the vibration frequency and vibration amplitude of the noisy speech signal and characteristic information such as loudness and timbre of the noisy speech signal.
Specifically, after the terminal acquires the voice signal with noise, the voice signal with noise is windowed and framed. For example, a hanning window may be used to divide the noisy speech signal into a plurality of frames that are 10-30ms (milliseconds) long, and the frame shift may be 10ms, so that the noisy speech signal may be divided into a plurality of frames of noisy speech signals. And after windowing and framing the voice signal with noise by the terminal, carrying out fast Fourier transform on the voice signal with noise after windowing and framing, thereby obtaining the frequency spectrum of the voice signal with noise. The terminal can extract the acoustic features and the spectrum features corresponding to the voice signals with noise according to the frequency spectrum of the voice with noise.
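For illustration only, a minimal sketch of the windowing, framing and fast Fourier transform step described above is given below; it assumes NumPy and a 16 kHz sampling rate (which the text does not specify), with 25 ms frames and a 10 ms frame shift.

```python
import numpy as np

def frame_and_fft(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a noisy speech signal into Hanning-windowed frames and return the
    per-frame magnitude spectrum. The frame length and shift follow the
    10-30 ms / 10 ms values above; the 16 kHz sample rate is an assumption."""
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # samples per frame shift
    window = np.hanning(frame_len)                    # Hanning window
    num_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    spectra = np.empty((num_frames, frame_len // 2 + 1))
    for i in range(num_frames):
        frame = signal[i * frame_shift:i * frame_shift + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))       # magnitude spectrum of this frame
    return spectra                                    # shape: (num_frames, num_bins)
```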
And 104, converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.
After the terminal extracts the acoustic features and the spectrum features corresponding to the voice signals with noise, the acoustic features and the spectrum features corresponding to the extracted voice signals with noise are converted, the acoustic features are converted into corresponding acoustic feature vectors, and the spectrum features are converted into corresponding spectrum feature vectors.
And 106, acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag.
The terminal obtains a classifier that has been trained before voice endpoint detection is performed. By adding voice tags and non-voice tags, the classifier can divide the input acoustic feature vectors and spectral feature vectors into a voice class and a non-voice class. The terminal inputs the acoustic feature vectors and spectral feature vectors corresponding to the noisy voice into the classifier, and the classifier classifies them. When an input acoustic feature vector or spectral feature vector belongs to the voice category, a voice tag is added to it; when it belongs to the non-voice category, a non-voice tag is added to it, so that voice and non-voice can be accurately identified. After the terminal classifies the acoustic feature vectors and spectral feature vectors with the classifier, the acoustic feature vectors with added voice tags and the spectral feature vectors with added voice tags are obtained.
Further, the terminal takes the acoustic feature vector and the spectral feature vector as input of the classifier, and can also obtain decision values corresponding to the acoustic feature vector and the spectral feature vector. The terminal can add voice tags or non-voice tags to the acoustic feature vectors and the spectral feature vectors according to the obtained decision values. Therefore, the acoustic feature vectors and the spectral feature vectors are accurately classified.
And step 108, analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain the voice signal added with the voice tag.
And step 110, determining a starting point and an ending point corresponding to the voice signal according to the voice tag and the time sequence of the voice signal.
After the terminal classifies the acoustic feature vectors and the spectral feature vectors, the acoustic feature vectors to which the voice tags are added and the spectral feature vectors to which the voice tags are added need to be analyzed. Specifically, the terminal analyzes the acoustic feature vector added with the voice tag and the spectrum feature vector added with the voice tag to obtain the acoustic feature added with the voice tag and a spectrum corresponding to the spectrum feature. And the terminal converts the acoustic characteristics added with the voice tag and the frequency spectrum corresponding to the spectral characteristics into corresponding voice signals according to the time sequence of the voice signals with the noise, so that the corresponding voice signals can be obtained through analysis.
The noisy speech signal has a timing sequence, and the timing sequence of the speech signal after the addition of the voice tag still corresponds to the timing sequence of the noisy speech signal. The terminal analyzes the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag into corresponding voice signals added with the voice tag, so that the terminal can determine a starting point and an ending point corresponding to the voice signals with noise according to the voice tag and the time sequence of the voice signals.
For example, after the terminal classifies the input acoustic feature vectors and spectral feature vectors with the classifier, the obtained decision value may be a value between 0 and 1. When the obtained decision value is 1, the terminal adds a voice tag to the acoustic feature vector or the spectral feature vector; when the obtained decision value is 0, the terminal adds a non-voice tag to the acoustic feature vector or the spectral feature vector. The acoustic feature vectors and spectral feature vectors can thereby be accurately classified. After the terminal analyzes the acoustic feature vectors with added voice tags and the spectral feature vectors with added voice tags, the voice signal with added voice tags can be obtained. According to the time sequence of the voice signal with added voice tags, the frame in which a voice tag appears for the first time is the starting point of the noisy voice signal, and the frame in which a voice tag appears for the last time is its ending point. Further, the start point of the voice signal may also be determined from the jump of the decision value from 0 to 1, and the end point from the jump of the decision value from 1 to 0. Therefore, the starting point and ending point corresponding to the noisy speech signal can be accurately determined.
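For illustration, the following sketch shows how per-frame decision values can be turned into a start point and an end point of the kind described above; it assumes NumPy, and the 0.5 threshold is an illustrative assumption, since the example above uses decision values of exactly 0 and 1.

```python
import numpy as np

def find_endpoints(decision_values, threshold=0.5):
    """Turn per-frame decision values into a start frame and an end frame.
    A frame counts as speech when its decision value exceeds the threshold;
    the 0.5 threshold is illustrative, since the example above uses exactly 0 and 1."""
    speech_frames = np.flatnonzero(np.asarray(decision_values) > threshold)
    if speech_frames.size == 0:
        return None, None                 # no speech-tagged frame found
    start = speech_frames[0]              # first speech-tagged frame: starting point
    end = speech_frames[-1]               # last speech-tagged frame: ending point
    return int(start), int(end)           # frame indices; time follows from the frame shift
```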
In this embodiment, after the terminal acquires the voice signal with noise, the terminal extracts the acoustic feature and the spectral feature corresponding to the voice signal with noise, and converts the acoustic feature and the spectral feature to obtain a corresponding acoustic feature vector and a corresponding spectral feature vector. The acoustic feature vector and the spectral feature vector are input to the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag, so that the acoustic feature vector and the spectral feature vector can be effectively classified, and voice and non-voice can be effectively identified. And the terminal analyzes the acoustic characteristic vector added with the voice tag and the frequency spectrum characteristic vector added with the voice tag to obtain a corresponding voice signal. The terminal determines the starting point and the ending point corresponding to the voice signal according to the time sequence of the voice signal, so that the starting point and the ending point of the voice signal with noise can be accurately identified, and the accuracy of voice endpoint detection can be effectively improved.
In one embodiment, before extracting the acoustic features and the spectral features corresponding to the noisy speech signal, the method further includes: converting the voice signal with noise into voice frequency spectrum with noise; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the noisy speech frequency spectrum to obtain acoustic characteristics corresponding to the noisy speech signal.
In phonetics, speech features can be classified into acoustic features such as vowels, consonants, unvoiced sounds, voiced sounds, and silence. After the terminal acquires the noisy voice signal, windowing and framing are carried out on it. For example, a Hanning window may be used to divide the noisy speech signal into frames 10-30 ms (milliseconds) long, with a frame shift of 10 ms, so that the noisy speech signal is split into multiple frames of noisy speech signal. After windowing and framing, the terminal performs a fast Fourier transform on the windowed and framed noisy voice signal, thereby obtaining the frequency spectrum of the noisy voice signal.
Further, the terminal can perform time domain analysis and/or frequency domain analysis and/or transform domain analysis on the noisy speech frequency spectrum, so that acoustic characteristics corresponding to the noisy speech signal can be obtained.
For example, the terminal may extract the acoustic features corresponding to the noisy voice signal using MFCCs (Mel-Frequency Cepstral Coefficients). After windowing and framing the noisy voice signal, the terminal converts it into the frequency spectrum of the noisy voice signal. The terminal transforms the frequency spectrum of the noisy voice signal into a noisy voice cepstrum, performs cepstral analysis on the noisy voice cepstrum, and applies a discrete cosine transform to it to obtain the acoustic features of each frame, so that effective acoustic features of the noisy voice can be obtained.
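As a minimal sketch of MFCC-style acoustic feature extraction of this kind, the following assumes the librosa library, a 16 kHz signal and 13 coefficients, none of which is prescribed by the text.

```python
import librosa

def extract_mfcc(noisy_signal, sample_rate=16000, n_mfcc=13):
    """Extract MFCC acoustic features, one row of coefficients per frame.
    librosa, the 16 kHz rate and the 13 coefficients are illustrative assumptions;
    n_fft=400 and hop_length=160 correspond to a 25 ms window and 10 ms shift at 16 kHz."""
    return librosa.feature.mfcc(y=noisy_signal, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160).T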
In one embodiment, before extracting the acoustic features and the spectral features corresponding to the noisy speech signal, the method further includes: converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
After the terminal acquires the noisy voice signal, windowing and framing are carried out on it. For example, a Hanning window may be used to divide the noisy speech signal into frames 10-30 ms (milliseconds) long, with a frame shift of 10 ms, so that the noisy speech signal is split into multiple frames of noisy speech signal. After windowing and framing, the terminal performs a fast Fourier transform on the windowed and framed noisy voice signal, thereby obtaining the frequency spectrum of the noisy voice signal. The frequency spectrum of the noisy speech signal may be an energy magnitude spectrum of the noisy speech after the fast Fourier transform.
Further, the terminal can calculate a noisy speech amplitude spectrum and a noisy speech phase spectrum by using the noisy speech frequency spectrum. And the terminal carries out dynamic noise estimation on the voice frequency spectrum with the noise according to the voice amplitude spectrum with the noise and the voice phase spectrum with the noise. Specifically, the terminal may perform dynamic noise estimation on the noisy speech spectrum by using an improved minimum controlled recursive average algorithm, so that a noise magnitude spectrum may be obtained. And the terminal estimates the voice amplitude spectrum of the voice signal according to the voice amplitude spectrum with the noise, the voice phase spectrum with the noise and the noise amplitude spectrum. For example, the terminal may estimate the speech magnitude spectrum of the speech signal using a log magnitude spectrum minimum mean square error estimation method.
The terminal generates the frequency spectrum characteristic corresponding to the voice signal with noise by using the estimated voice amplitude spectrum with noise, the estimated noise amplitude spectrum and the estimated voice amplitude spectrum of the pure voice signal, so that the terminal can effectively extract the frequency spectrum characteristic corresponding to the voice signal with noise.
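For illustration, the following sketch shows how the per-frame spectral feature could be assembled from the noisy, noise and estimated clean-speech magnitude spectra as described above. The simple recursive minimum tracker and spectral subtraction used here are simplified stand-ins for the improved minima controlled recursive averaging and log magnitude spectrum minimum mean square error estimators named in the text.

```python
import numpy as np

def spectral_features(noisy_mag, alpha=0.96):
    """Assemble a per-frame spectral feature from the noisy-speech magnitude spectrum.
    noisy_mag: array of shape (num_frames, num_bins).
    The recursive-minimum noise tracker and the spectral subtraction below are
    simplified stand-ins for the IMCRA and log-MMSE estimators named in the text."""
    noise_mag = np.zeros_like(noisy_mag)
    noise_est = noisy_mag[0].copy()                    # initialise from the first frame
    for t in range(noisy_mag.shape[0]):
        # track downwards immediately, upwards only slowly (crude noise estimate)
        noise_est = np.where(noisy_mag[t] < noise_est, noisy_mag[t],
                             alpha * noise_est + (1 - alpha) * noisy_mag[t])
        noise_mag[t] = noise_est
    clean_mag = np.maximum(noisy_mag - noise_mag, 1e-10)  # crude clean-speech estimate
    # per-frame feature: noisy, noise and clean magnitude spectra stacked together
    return np.concatenate([noisy_mag, noise_mag, clean_mag], axis=1)
```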
In one embodiment, converting the acoustic features and the spectral features comprises: extracting a preset number of frames before and after a current frame in the acoustic features and the spectral features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
After the terminal acquires the noisy voice signal, windowing and framing are carried out on it, so that the noisy voice signal is split into multiple frames of noisy voice signal. After windowing and framing, the terminal performs a fast Fourier transform on the windowed and framed noisy voice signal, thereby obtaining the frequency spectrum of the noisy voice signal. The terminal can then extract the acoustic features and spectral features corresponding to the noisy voice signal from the noisy speech spectrum.
And after the terminal extracts the acoustic features and the spectral features corresponding to the voice signals with the noise, converting the acoustic features and the spectral features into acoustic feature vectors and spectral feature vectors. The terminal extracts a preset number of frames before and after the current frame in the acoustic feature vector and the spectral feature vector. The terminal calculates a mean vector or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame, so that the acoustic feature and the frequency spectrum feature can be smoothed to obtain a smoothed acoustic feature vector and a smoothed frequency spectrum feature vector.
For example, the terminal may take the five frames before and the five frames after the current frame of the acoustic or spectral features, for a total of 11 frames of the noisy speech spectrum including the current frame. By averaging these 11 frames, the mean vector of the current frame can be obtained. In particular, the terminal may use a filter bank in which each filter has a triangular shape, the triangular window representing the filtering window; the triangular filters may have equal bandwidth over the noisy speech spectrum. The terminal can calculate the mean vector of the current frame using this filter bank, so that the noisy speech spectrum can be smoothed and the smoothed acoustic feature vector and spectral feature vector can be obtained.
And after the terminal smoothes the frequency spectrum of the voice with noise, calculating a logarithmic domain for the smoothed acoustic feature vector and the smoothed frequency spectrum feature vector to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector. Specifically, the terminal may calculate log energy of the acoustic feature and the spectral feature output by each filter, and may thereby obtain a log domain of the acoustic feature vector and a log domain of the spectral feature vector, so that the converted acoustic feature vector and the spectral feature vector can be effectively obtained.
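A minimal sketch of the context smoothing and log-domain conversion described above is given below; it assumes NumPy, and the handling of edge frames by clipping the context window is an assumption not fixed by the text.

```python
import numpy as np

def smooth_and_log(features, context=5, eps=1e-10):
    """Smooth each frame over its surrounding frames and convert to the log domain.
    features: array of shape (num_frames, dim); context=5 gives the 11-frame
    window from the example above. Edge frames clip the window (an assumption)."""
    num_frames = features.shape[0]
    smoothed = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        lo = max(0, t - context)
        hi = min(num_frames, t + context + 1)
        smoothed[t] = features[lo:hi].mean(axis=0)    # mean vector for the current frame
    return np.log(np.maximum(smoothed, eps))          # log-domain feature vectors
```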
In one embodiment, the step of obtaining the classifier further comprises: acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into a classifier to obtain class probabilities corresponding to the first voice data; screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set and the training set added with the class labels to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into a verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
Before acquiring the classifier, a large amount of noisy speech data, which may be noisy speech data acquired by the terminal from a database or noisy speech data acquired by the terminal from the internet, needs to be used to train the classifier. When the classifier is trained, firstly, noisy speech data is labeled manually, and the classifier is obtained by training the artificially labeled noisy speech data.
Specifically, after extracting the acoustic features and the spectral features corresponding to the noisy speech data, the terminal converts the acoustic features and the spectral features into corresponding acoustic feature vectors and spectral feature vectors. The staff can label the acoustic feature vector and the spectral feature vector according to the category comparison table, and add a voice tag or a non-voice tag to each frame of voice signals with noise. And the terminal acquires the voice data with noise after the staff marks the voice data with noise according to the category comparison table.
The terminal combines the labeled acoustic feature vectors and spectral feature vectors and inputs them into the input layer of a bidirectional LSTM (Long Short-Term Memory) neural network. The nonlinear hidden layer in the LSTM neural network can learn new features from the input vector, and the category of the input vector is calculated through an activation function. Specifically, each LSTM unit contains three gates: a forgetting gate, a candidate gate, and an output gate. The specific calculation formula may be:
f_t = σ(W_f · h_{t-1} + U_f · x_t + b_f)

where σ denotes the activation function, W_f denotes the forgetting gate weight matrix, U_f is the weight matrix between the input layer and the hidden layer of the forgetting gate, and b_f denotes the bias of the forgetting gate. The output h_{t-1} of the previous layer is linearly combined with the current input x_t, and the activation function then compresses the output value to between 0 and 1. The closer the output value is to 1, the more information the memory retains; conversely, the closer it is to 0, the less information the memory holds.
The candidate gate calculates the current input unit state, and the specific formula may be:

C_t = tanh(W_c · h_{t-1} + U_c · x_t + b_c)

where C_t represents the cell state of the current input; the tanh activation function scales the output value to between -1 and 1.
The output gate controls the amount of memory information used for the next-layer network update, and the formula may be expressed as:

O_t = σ(W_o · h_{t-1} + U_o · x_t + b_o)

where O_t indicates the amount of remembered information for the next-level network update.
The final output is calculated by the LSTM unit, and the formula may be expressed as:

h_t = O_t × tanh(C_t)

The final acoustic feature vector or spectral feature vector is obtained by forward and backward calculation, and the formula may be expressed as:

h_i = [h_i^f ; h_i^b]

where h_i^f is the forward output vector, h_i^b is the backward output vector, and h_i finally gives the acoustic feature vectors or spectral feature vectors to which class labels are assigned.
Further, the output layer of the LSTM may calculate the value of the output unit C_i according to a preset decision function. The value of the output unit C_i may be between 0 and 1, with 1 representing the speech class and 0 representing the non-speech class.
The terminal calculates the probability that each acoustic feature and spectral feature belong to the voice category and the non-voice category in the category comparison table by using the plurality of acoustic feature vectors and the spectral feature vectors marked with the voice category labels, extracts the category with the maximum probability value of the acoustic feature vectors and the spectral feature vectors in the category comparison table, and adds the voice category label corresponding to the category with the maximum probability value to the acoustic feature vectors or the spectral feature vectors.
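A rough sketch of a bidirectional LSTM frame classifier of the kind described above is given below; it assumes PyTorch, which the text does not mention, and the hidden size and sigmoid output layer are illustrative assumptions, the only fixed point being a per-frame decision value between 0 (non-speech) and 1 (speech).

```python
import torch
import torch.nn as nn

class BLSTMFrameClassifier(nn.Module):
    """Bidirectional LSTM that outputs a per-frame decision value between 0 and 1.
    The hidden size and the sigmoid output layer are illustrative assumptions."""
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 1)   # forward and backward outputs concatenated

    def forward(self, x):                          # x: (batch, num_frames, input_dim)
        h, _ = self.blstm(x)                       # h: (batch, num_frames, 2 * hidden_dim)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # per-frame value in (0, 1)
```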
The terminal trains an initial classifier using the noisy voice data to which the voice category labels have been added. The terminal then acquires a first verification set comprising a plurality of first voice data, inputs the first voice data into the classifier to obtain the class probabilities corresponding to the first voice data, and screens these class probabilities. The staff add voice category labels to the selected first voice data through the terminal; the terminal obtains the first voice data with the added labels and generates a verification set with voice category labels from them. The terminal then trains a verification classifier using this labeled verification set together with the noisy voice data. Next, the terminal acquires a second verification set comprising a plurality of second voice data, and inputs the second voice data into the verification classifier to obtain the corresponding class probabilities. The terminal screens out the second voice data whose class probability falls within a preset range, labels the screened second voice data, and trains again with the labeled second voice data and the labeled noisy voice data to obtain a new classifier. Training continues in this way until the probability values of a preset number of acoustic feature vectors or spectral feature vectors in all verification sets lie within the preset probability range, at which point training stops and the required classifier is obtained. A classifier with high accuracy can thus be obtained, so that the acoustic feature vectors and spectral feature vectors can be accurately classified and voice and non-voice can be accurately identified.
In one embodiment, the step of classifying the acoustic feature vector and the spectral feature vector using a classifier comprises: taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the frequency spectrum feature vector; and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
And after the terminal acquires the voice signal with the noise, extracting the acoustic characteristic and the spectral characteristic corresponding to the voice signal with the noise. And the terminal converts the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors. And after the terminal acquires the classifier, inputting the acoustic feature vector and the spectral feature vector into the classifier. After the classifier classifies the input acoustic feature vector and the input spectral feature vector, the decision values corresponding to the acoustic feature vector and the spectral feature vector can be obtained. And when the obtained decision value is a preset first threshold value, the terminal adds a voice tag to the acoustic characteristic vector or the frequency spectrum characteristic vector. Wherein the first threshold may be a range of values. And when the obtained decision value is a preset second threshold value, the terminal adds a non-voice label to the acoustic characteristic vector or the frequency spectrum characteristic vector. By accurately classifying the acoustic feature vectors and the spectral feature vectors by using the classifier, the voice signals and the non-voice signals in the noisy voice signals can be accurately identified.
For example, the resulting decision value may be a value between 0 and 1. The preset first threshold may be 1, and the preset second threshold may be 0. And when the obtained decision value is 1, the terminal adds a voice tag to the acoustic feature vector or the frequency spectrum feature vector. And when the obtained decision value is 0, the terminal adds a non-voice label to the acoustic feature vector or the frequency spectrum feature vector. Thereby, the acoustic feature vector and the spectral feature vector can be accurately classified.
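For illustration, a small sketch of the threshold-based tagging rule described above follows; the threshold values 1 and 0 follow the example in the text, and treating frames whose decision value falls between the two thresholds as undecided is an assumption.

```python
def add_tags(decision_values, first_threshold=1.0, second_threshold=0.0):
    """Tag each frame according to its decision value. The threshold values 1 and 0
    follow the example above; treating in-between values as undecided is an assumption."""
    tags = []
    for value in decision_values:
        if value >= first_threshold:
            tags.append("speech")        # decision value at the first threshold: voice tag
        elif value <= second_threshold:
            tags.append("non-speech")    # decision value at the second threshold: non-voice tag
        else:
            tags.append("undecided")
    return tags
```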
In one embodiment, as shown in fig. 2, there is provided a voice endpoint detection apparatus, comprising an extraction module 202, a conversion module 204, a classification module 206, and a parsing module 208, wherein:
the extracting module 202 is configured to obtain a noisy speech signal and extract an acoustic feature and a spectral feature corresponding to the noisy speech signal.
The conversion module 204 is configured to convert the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.
The classification module 206 is configured to obtain a classifier, and input the acoustic feature vector and the spectral feature vector to the classifier to obtain an acoustic feature vector to which a voice tag is added and a spectral feature vector to which a voice tag is added.
The analysis module 208 is configured to analyze the acoustic feature vector with the voice tag added and the spectrum feature vector with the voice tag added to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, the extracting module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum, and calculate a noisy speech magnitude spectrum according to the noisy speech spectrum; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the conversion module 204 is further configured to extract a preset number of frames before and after the current frame in the acoustic feature and the spectral feature; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the device further comprises a training module, configured to acquire noisy speech data to which the speech category label is added, and train the noisy speech data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into an initial classifier to obtain class probabilities corresponding to the first voice data; screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set added with the class label and the noisy voice data added with the voice class label to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into a verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
In one embodiment, the classification module 206 is further configured to use the acoustic feature vector and the spectral feature vector as inputs of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the frequency spectrum feature vector; and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 3. For example, the computer device may be a terminal, and the terminal may be, but is not limited to, various devices having a function of inputting voice, such as a smart phone, a tablet computer, a notebook computer, a personal computer, and a portable wearable device. The computer device includes a processor, a memory, a network interface, and a voice input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of voice endpoint detection. The voice input device of the computer equipment can comprise a microphone, and can also comprise an external earphone and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is a block diagram of only a portion of the architecture related to the present application and does not constitute a limitation on the computer device to which the present application is applied; a particular computer device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring a voice signal with noise, and extracting acoustic features and spectrum features corresponding to the voice signal with noise; converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain the acoustic feature vector added with the voice tag and the spectral feature vector added with the voice tag; analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converting the voice signal with noise into a voice frequency spectrum with noise; and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise; carrying out dynamic noise estimation on the noisy speech frequency spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating a voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum; and generating the frequency spectrum characteristics corresponding to the voice signal with the noise by using the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting a preset number of frames before and after the current frame in the acoustic features and the frequency spectrum features; calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier; acquiring a first verification set, wherein the first verification set comprises a plurality of first voice data; inputting the first voice data into a classifier to obtain class probabilities corresponding to the first voice data; screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels; training by using the verification set and the training set added with the class labels to obtain a verification classifier; acquiring a second verification set, wherein the second verification set comprises a plurality of second voice data; inputting the second voice data into a verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
In one embodiment, the processor, when executing the computer program, further implements the following steps: taking the acoustic feature vector and the spectral feature vector as input to the classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value equals a first threshold value, adding a voice tag to the acoustic feature vector or the spectral feature vector; and when the decision value equals a second threshold value, adding a non-voice tag to the acoustic feature vector or the spectral feature vector.
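One possible reading of this step is sketched below: the classifier's decision value for each feature vector is compared with the first and second threshold values, here taken to be the class outputs 1 and 0, and a voice or non-voice tag is attached accordingly. This interpretation of the thresholds is an assumption, as is the use of the scikit-learn classifier from the previous sketch.

```python
def add_speech_tags(clf, feature_vectors, first_value=1, second_value=0):
    decisions = clf.predict(feature_vectors)     # one decision value per vector
    tags = []
    for d in decisions:
        if d == first_value:
            tags.append("speech")                # add a voice tag
        elif d == second_value:
            tags.append("non-speech")            # add a non-voice tag
        else:
            tags.append(None)                    # case not covered by the embodiment
    return tags
```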
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps: acquiring a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal; converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector with an added voice tag and a spectral feature vector with an added voice tag; analyzing the voice-tagged acoustic feature vector and spectral feature vector to obtain the corresponding voice signal; and determining a starting point and an ending point of the voice signal according to the time sequence of the voice signal.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features corresponding to the noisy speech signal from the noisy speech amplitude spectrum, the noise amplitude spectrum and the speech amplitude spectrum.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: extracting, in the acoustic features and the spectral features, a preset number of frames before and after the current frame; calculating a mean vector and/or a variance vector corresponding to the current frame from the preset number of frames before and after the current frame; and performing logarithmic-domain conversion on the acoustic features and the spectral features for which the mean vector and/or the variance vector of the current frame have been calculated, to obtain converted acoustic feature vectors and spectral feature vectors.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring noisy speech data to which speech category labels have been added, and training on the noisy speech data to obtain an initial classifier; acquiring a first verification set, the first verification set comprising a plurality of first voice data; inputting the first voice data into the initial classifier to obtain class probabilities corresponding to the first voice data; screening the class probabilities corresponding to the plurality of first voice data, and adding class labels to the selected first voice data to obtain a class-labelled verification set; training with the class-labelled verification set and the labelled noisy speech data to obtain a verification classifier; acquiring a second verification set, the second verification set comprising a plurality of second voice data; inputting the second voice data into the verification classifier to obtain class probabilities corresponding to the second voice data; and when the class probabilities corresponding to the second voice data reach a preset probability value, obtaining the required classifier.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: taking the acoustic feature vector and the spectral feature vector as input to the classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector; when the decision value equals a first threshold value, adding a voice tag to the acoustic feature vector or the spectral feature vector; and when the decision value equals a second threshold value, adding a non-voice tag to the acoustic feature vector or the spectral feature vector.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A voice endpoint detection method, comprising:
acquiring a voice signal with noise, and extracting acoustic characteristics corresponding to the voice signal with noise;
extracting a voice amplitude spectrum with noise, a noise amplitude spectrum and a voice amplitude spectrum of the voice signal with noise;
generating a frequency spectrum characteristic corresponding to the voice signal with the noise according to the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum;
converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal;
and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
2. The method according to claim 1, further comprising, before said extracting the corresponding acoustic feature and spectral feature of the noisy speech signal:
converting the voice signal with noise into a voice frequency spectrum with noise;
and carrying out time domain analysis and/or frequency domain analysis and/or transform domain analysis on the voice frequency spectrum with the noise to obtain acoustic characteristics corresponding to the voice signal with the noise.
3. The method according to claim 1, wherein said extracting a noisy speech magnitude spectrum, a noise magnitude spectrum and a speech magnitude spectrum of said noisy speech signal comprises:
converting the voice signal with noise into a voice frequency spectrum with noise, and calculating a voice amplitude spectrum with noise according to the voice frequency spectrum with noise;
carrying out dynamic noise estimation on the voice frequency spectrum with the noise according to the voice amplitude spectrum with the noise to obtain a noise amplitude spectrum;
and estimating the voice amplitude spectrum of the pure voice signal according to the voice amplitude spectrum with the noise and the noise amplitude spectrum.
4. The method of claim 1, wherein the converting the acoustic and spectral features comprises:
extracting a preset number of frames before and after the current frame in the acoustic features and the frequency spectrum features;
calculating a mean vector and/or a variance vector corresponding to the current frame by using a preset number of frames before and after the current frame;
and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
5. The method of claim 1, wherein the step of obtaining a classifier further comprises, prior to:
acquiring noisy voice data added with a voice category label, and training the noisy voice data to obtain an initial classifier;
obtaining a first verification set, wherein the first verification set comprises a plurality of first voice data;
inputting a plurality of first voice data into the initial classifier to obtain class probabilities corresponding to the plurality of first voice data;
screening the category probabilities corresponding to the plurality of first voice data, and adding category labels to the selected first voice data to obtain a verification set added with the category labels;
training by using the verification set added with the class label and the noisy voice data added with the voice class label to obtain a verification classifier;
obtaining a second verification set, wherein the second verification set comprises a plurality of second voice data;
inputting a plurality of second voice data into a verification classifier to obtain class probabilities corresponding to the plurality of second voice data;
and when the class probability corresponding to the second voice data reaches a preset probability value, obtaining the required classifier.
6. The method of any one of claims 1 to 5, wherein the step of classifying the acoustic and spectral feature vectors using the classifier comprises:
taking the acoustic feature vector and the spectral feature vector as input of a classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector;
when the decision value is a first threshold value, adding a voice tag to the acoustic feature vector or the spectrum feature vector;
and when the decision value is a second threshold value, adding a non-voice label to the acoustic feature vector or the spectrum feature vector.
7. A voice endpoint detection apparatus comprising:
the extraction module is used for acquiring a voice signal with noise and extracting acoustic characteristics corresponding to the voice signal with noise; extracting a voice amplitude spectrum with noise, a noise amplitude spectrum and a voice amplitude spectrum of the voice signal with noise; generating a frequency spectrum characteristic corresponding to the voice signal with the noise according to the voice amplitude spectrum with the noise, the noise amplitude spectrum and the voice amplitude spectrum;
the conversion module is used for converting the acoustic features and the spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
the classification module is used for acquiring a classifier, inputting the acoustic feature vector and the spectral feature vector into the classifier, and obtaining an acoustic feature vector added with a voice tag and a spectral feature vector added with the voice tag;
the analysis module is used for analyzing the acoustic feature vector added with the voice tag and the frequency spectrum feature vector added with the voice tag to obtain a corresponding voice signal; and determining a starting point and an ending point corresponding to the voice signal according to the time sequence of the voice signal.
8. The apparatus of claim 7, wherein the converting module is further configured to extract a preset number of frames before and after a current frame in the acoustic feature and the spectral feature; calculating a mean vector and/or a variance vector of the current frame by using a preset number of frames before and after the current frame; and carrying out logarithmic domain conversion on the acoustic feature and the frequency spectrum feature after the mean vector and/or the variance vector corresponding to the current frame are calculated to obtain a converted acoustic feature vector and a converted frequency spectrum feature vector.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201810048223.3A 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium Active CN108198547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108198547A CN108198547A (en) 2018-06-22
CN108198547B true CN108198547B (en) 2020-10-23

Family

ID=62589616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810048223.3A Active CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108198547B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN110752973B (en) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN109036471B (en) * 2018-08-20 2020-06-30 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device
CN110070884B (en) * 2019-02-28 2022-03-15 北京字节跳动网络技术有限公司 Audio starting point detection method and device
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110265032A (en) * 2019-06-05 2019-09-20 平安科技(深圳)有限公司 Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110415704A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium are put down in court's trial
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN110808061B (en) * 2019-11-11 2022-03-15 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110910906A (en) * 2019-11-12 2020-03-24 国网山东省电力公司临沂供电公司 Audio endpoint detection and noise reduction method based on power intranet
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN111626061A (en) * 2020-05-27 2020-09-04 深圳前海微众银行股份有限公司 Conference record generation method, device, equipment and readable storage medium
CN111916060B (en) * 2020-08-12 2022-03-01 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113744725B (en) * 2021-08-19 2024-07-05 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN114974258B (en) * 2022-07-27 2022-12-16 深圳市北科瑞声科技股份有限公司 Speaker separation method, device, equipment and storage medium based on voice processing
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Method, device, equipment and medium for training and detecting voice activity detection model

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040064314A1 (en) * 2002-09-27 2004-04-01 Aubert Nicolas De Saint Methods and apparatus for speech end-point detection
KR100745976B1 (en) * 2005-01-12 2007-08-06 삼성전자주식회사 Method and apparatus for classifying voice and non-voice using sound model
JP4950930B2 (en) * 2008-04-03 2012-06-13 株式会社東芝 Apparatus, method and program for determining voice / non-voice
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN101599269B (en) * 2009-07-02 2011-07-20 中国农业大学 Phonetic end point detection method and device therefor
CN103489454B (en) * 2013-09-22 2016-01-20 浙江大学 Based on the sound end detecting method of wave configuration feature cluster
CN103730124A (en) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 Noise robustness endpoint detection method based on likelihood ratio test
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
US10121471B2 (en) * 2015-06-29 2018-11-06 Amazon Technologies, Inc. Language model speech endpointing
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN107393526B (en) * 2017-07-19 2024-01-02 腾讯科技(深圳)有限公司 Voice silence detection method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN108198547A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
CN108877775B (en) Voice data processing method and device, computer equipment and storage medium
US9792897B1 (en) Phoneme-expert assisted speech recognition and re-synthesis
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
WO2020029404A1 (en) Speech processing method and device, computer device and readable storage medium
Deshwal et al. Feature extraction methods in language identification: a survey
Ali et al. Automatic speech recognition technique for Bangla words
Vestman et al. Speaker recognition from whispered speech: A tutorial survey and an application of time-varying linear prediction
CN111145782B (en) Overlapped speech recognition method, device, computer equipment and storage medium
Hibare et al. Feature extraction techniques in speech processing: a survey
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Yusnita et al. Malaysian English accents identification using LPC and formant analysis
Ananthi et al. SVM and HMM modeling techniques for speech recognition using LPCC and MFCC features
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Zou et al. Improved voice activity detection based on support vector machine with high separable speech feature vectors
Nidhyananthan et al. Language and text-independent speaker identification system using GMM
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
Rahman et al. Dynamic time warping assisted svm classifier for bangla speech recognition
Priyadarshani et al. Dynamic time warping based speech recognition for isolated Sinhala words
Nivetha A survey on speech feature extraction and classification techniques
Unnibhavi et al. LPC based speech recognition for Kannada vowels
Islam et al. A Novel Approach for Text-Independent Speaker Identification Using Artificial Neural Network
Daqrouq et al. Arabic vowels recognition based on wavelet average framing linear prediction coding and neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant