WO2019232848A1

WO2019232848A1 - Voice distinguishing method and device, computer device and storage medium

Info

Publication number: WO2019232848A1
Application number: PCT/CN2018/094200
Authority: WO
Inventors: 涂宏
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-06-04
Filing date: 2018-07-03
Publication date: 2019-12-12
Also published as: CN108922561A

Abstract

The present application discloses a method and device for voice distinguishing, a computer device and a storage medium. The method comprises: acquiring original test voice data, preprocessing the original test voice data to obtain preprocessed voice data; performing endpoint detection process to the preprocessed voice data to obtain voice data to be tested; extracting characteristics of the voice data to be tested to obtain voice characteristics to be tested; inputting the voice characteristics to be tested into a pre-trained convolution depth confidence network model for recognition to obtain a voice distinguishing result. The method improves the accuracy of voice distinguishing, and makes the obtained voice distinguishing result be more accurate.

Description

Speech distinguishing method, device, computer equipment and storage medium

This application is based on a Chinese invention patent application filed on June 4, 2018 with the application number 201810561695.9, entitled "Voice distinguishing method, device, computer equipment and storage medium", and claims its priority.

Technical field

The present application relates to the technical field of speech recognition, and in particular, to a method, a device, a computer device, and a storage medium for speech discrimination.

Background technique

The voice data generally includes a target voice and an interfering voice. The target voice refers to a voice part in which the voiceprint continuously changes significantly. The interfering speech can be the part of the speech data that is not pronounced due to silence (ie, the mute section), or it can be the environmental noise part (ie, the noise section). Speech discrimination refers to mute filtering of the input speech, and only retains speech data (ie, target speech) that is more meaningful for recognition. Currently, endpoint detection technology is mainly used to distinguish speech data. This method of speech discrimination, when the target voice is mixed with noise, the greater the noise, the more difficult it is to distinguish the speech, and the more inaccurate the endpoint detection recognition result is. Therefore, when the endpoint detection technology is used for speech discrimination, the recognition result of the speech discrimination is easily affected by external factors, making the speech discrimination result inaccurate.

Summary of the Invention

The embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination to solve the problem of inaccurate speech discrimination results.

An embodiment of the present application provides a method for distinguishing speech, including:

Acquiring raw test voice data, preprocessing the raw test voice data, and obtaining preprocessed voice data;

Performing endpoint detection processing on the pre-processed voice data to obtain voice data to be tested;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested;

The speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.

An embodiment of the present application provides a voice distinguishing device, including:

A raw test voice data processing module, configured to obtain raw test voice data, preprocess the raw test voice data, and obtain preprocessed voice data;

A voice data acquisition module for testing to perform endpoint detection processing on the pre-processed voice data to acquire voice data to be tested;

A voice feature acquisition module to be tested, which is used to extract features from the voice data to be tested to acquire voice features to be tested;

A speech discrimination result acquisition module is configured to input the speech feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a speech discrimination result.

An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. The processor implements the computer-readable instructions to implement The following steps:

The embodiments of the present application provide one or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more Multiple processors implement the following steps:

Details of one or more embodiments of the present application are set forth in the accompanying drawings and description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments of the application will be briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without paying creative labor.

FIG. 1 is an application scenario diagram of a speech discrimination method according to an embodiment of the present application; FIG.

FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application; FIG.

FIG. 3 is a specific flowchart of step S10 in FIG. 2;

4 is a specific flowchart of step S20 in FIG. 3;

FIG. 5 is a specific flowchart of step S30 in FIG. 2;

FIG. 6 is another flowchart of a speech discrimination method according to an embodiment of the present application; FIG.

FIG. 7 is a specific flowchart of step S403 in FIG. 6;

FIG. 8 is a specific flowchart of step S40 in FIG. 2;

FIG. 9 is a schematic diagram of a speech distinguishing device according to an embodiment of the present application; FIG.

FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.

Detailed ways

In the following, the technical solutions in the embodiments of the present application will be clearly and completely described with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of this application.

The speech discrimination method provided in the embodiment of the present application can be used in the application environment shown in FIG. 1. The terminal device sends the collected original test voice data to the corresponding server through the network. After the server connected to the terminal device obtains the original test voice data, it first performs endpoint detection processing on the original test voice data. Get the voice data to be tested. Then, feature extraction is performed on the acquired voice data to be tested to acquire the voice features to be tested. Finally, the speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and the speech discrimination result is obtained to achieve the purpose of distinguishing the target speech and the interference speech in the speech data. The terminal device is a device that can perform human-computer interaction with a user, including, but not limited to, various personal computers, notebook computers, smart phones, and tablet computers. The server can be implemented by an independent server or a server cluster composed of multiple servers.

In an embodiment, as shown in FIG. 2, a voice discrimination method is provided. The voice discrimination method includes the following steps:

S10: Obtain original test voice data, preprocess the original test voice data, and obtain preprocessed voice data.

The original test voice data refers to the voice data of the speaker collected by the terminal device. The original test voice data includes a target voice and an interference voice, where the target voice refers to a voice part in which the voiceprint continuously changes significantly; correspondingly, the interference voice refers to a voice part other than the target voice in the voice data. Specifically, the interfering speech includes a silent section and a noise section, where the silent section refers to a portion of the speech data that is not pronounced due to silence, such as the speaker will think and breathe during the speaking process, and the speaker is thinking and breathing There is no sound from time to time, so the voice part is a silent section. The noise segment refers to the environmental noise part of the voice data, such as the sound of the opening and closing of doors and windows, the collision of objects, etc., can be considered as the noise segment.

Specifically, the terminal device obtains a piece of original test voice data through a sound acquisition module (such as a recording module), and the original test voice data is a piece of voice data including a target voice and an interference voice that needs to be distinguished. After obtaining the original test voice data, the original test voice data is preprocessed to obtain preprocessed voice data. Preprocessed voice data refers to the voice data obtained by preprocessing the original test voice data.

The preprocessing in this embodiment specifically includes: performing pre-emphasis, framing, and windowing processing on the original test voice data. The pre-emphasis processing formula s ′ _n = s _n -a * s _{n-1 is} used to pre-emphasize the original test voice data to eliminate the effect of the vocal cords and lips of the speaker on the speaker's voice, and improve the speaker's voice. Frequency resolution. Where, s' _n for the amplitude of the speech signal at time n after the pre-emphasis treatment, s _n is the amplitude of the speech signal at time n, s _n-1 is the amplitude of the speech signal time n-1, a is the coefficient of pre-emphasis. Then, frame processing is performed on the original test voice data after the pre-emphasis processing. When framing, discontinuities will appear at the start and end points of each frame of voice data. The more frames there are, the greater the error from the original test voice data. In order to maintain the frequency characteristics of each frame of voice data, a windowing process is also required. Preprocess the original test voice data, obtain preprocessed voice data, and provide a data source for the subsequent steps to perform differentiated processing of the original test voice data.

S20: Perform endpoint detection processing on the pre-processed voice data to obtain the voice data to be tested.

Among them, the endpoint detection processing is a processing method for determining the start point and the end point of a target voice from a piece of voice data. Interference voice will inevitably exist in a piece of voice data. Therefore, after the terminal device obtains the original test voice data and preprocesses it, it is necessary to perform preliminary detection processing on the acquired preprocessed voice data to remove the interfering voice and retain the remaining Voice data, and the remaining voice data is used as the voice data to be tested. The voice data to be tested will include the target voice as well as the interfering voice that has not been accurately removed.

Specifically, after obtaining the pre-processed voice data, the short-term energy feature value and the short-time zero-crossing rate corresponding to the pre-processed voice data are obtained. The short-term energy characteristic value refers to an energy value corresponding to one frame of speech in the speech data at any moment. The short-term zero-crossing rate refers to the number of intersections of the speech signal corresponding to the speech data and the horizontal axis (zero level). In this embodiment, the server performs endpoint detection on the pre-processed voice data, which can reduce the processing time for voice discrimination and improve the quality of voice discrimination processing.

Understandably, performing endpoint detection processing on the preprocessed voice data can initially remove the voice data corresponding to the mute section and the noise section, and the removal effect is not very good. In order to more accurately remove the mute section and the noise section in the preprocessed voice data After obtaining the voice data to be tested, steps S30 and S40 need to be performed to obtain a more accurate target voice.

S30: Perform feature extraction on the speech data to be tested to obtain the speech features to be tested.

The voice features to be tested include, but are not limited to, spectrum features, sound quality features, and voiceprint features. The spectrum characteristic is to distinguish different voice data, such as target voice and interference voice, according to the frequency of sound vibration. The voice quality feature and voiceprint feature are used to identify the speaker corresponding to the voice data to be tested according to the voiceprint and the voice color characteristics of the voice. Since the voice discrimination is used to distinguish the target voice and the interfering voice in the voice data, only the spectrum characteristics of the voice data to be tested need to be acquired to complete the voice discrimination. Among them, the frequency spectrum is an abbreviation of frequency spectral density, and the frequency spectrum characteristic is a parameter reflecting the frequency spectral density.

S40: The speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.

A Convolutional Deep Belief Networks (CDBN) model is a neural network model that is pre-trained to distinguish target speech from interfering speech in the speech data to be tested. The speech discrimination result refers to a recognition result that is recognized by a convolutional deep belief network model to distinguish a target voice from an interference voice in the voice data to be tested. The pre-trained convolutional deep belief network model is used to recognize the test voice data and obtain the voice recognition probability value. The speech recognition probability value is compared with a preset probability value, and the speech data to be tested corresponding to the speech recognition probability value greater than or equal to the preset probability value is the target speech, and the speech to be tested corresponding to the speech recognition probability value smaller than the preset probability value The data is interference speech. That is, in this embodiment, a target voice with a higher recognition probability is retained, and an interference voice with a lower recognition probability is removed. Using the convolutional deep belief network model to identify the test voice data can improve the recognition accuracy and make the speech discrimination result more accurate.

The speech discrimination method provided in this embodiment performs endpoint detection processing on the pre-processed speech data and obtains the speech data to be tested, which can reduce the processing time of speech discrimination and improve the quality of speech processing. Then perform feature extraction on the test voice data, input the test voice features into a pre-trained convolutional deep belief network model for recognition, and obtain the voice discrimination result, which improves the accuracy of the voice discrimination and makes the obtained voice discrimination result more accurate.

In an embodiment, as shown in FIG. 3, in step S10, the original test voice data is pre-processed to obtain pre-processed voice data, which specifically includes the following steps:

S11: Perform pre-emphasis processing on the original test voice data. The formula for the pre-emphasis processing is s ′ _n = s _n -a * s _n-1 , where s ′ _n is the amplitude of the voice signal at time n after the pre-emphasis processing. s _n is the amplitude of the speech signal at time n, s _n-1 is the amplitude of the speech signal at time n-1, and a is the pre-emphasis coefficient.

Specifically, in order to eliminate the influence of a speaker's vocal cords and lips on the speaker's speech and improve the high-frequency resolution of the speaker's speech, the formula s ′ _n = s _n -a * s _n-1 needs to be applied to the original test speech data. Pre-emphasis processing. The amplitude of the speech signal is the amplitude of the speech expressed by the speech data in the time domain, a is the pre-emphasis coefficient, 0.9 <a <1.0, and generally, a is better than 0.97.

S12: Perform frame processing on the original test voice data after pre-emphasis processing to obtain framed voice data.

The voice signal corresponding to the pre-emphasized voice data is a non-stationary signal, but the voice signal has short-term stability. Among them, the short-term stationarity refers to the stable nature of the speech signal in a short time range (such as 10ms-30ms). Therefore, after obtaining the pre-emphasized voice data, frame processing is required to divide the pre-emphasized voice data into frame-by-frame speech data to obtain framed speech data. The framed voice data refers to a corresponding voice segment in a short time range, and the segmented voice segment is called a frame. In general, in order to maintain the continuity of two consecutive frames of voice data during frame division, there may be an overlapping portion in the voice data of two adjacent frames. The overlapping portion is 1/2 of the frame length. The overlapping portion is called Frame shift.

S13: Perform windowing on the framed speech data to obtain pre-processed speech data. The windowing formula is

And s ″ _n = w _n * s ′ _n , where w _n is the Hamming window at time n, N is the window length of Hamming window, s ′ _n is the signal amplitude in the time domain at time n, and s ″ _n is n Signal amplitude in time domain after windowing at time.

After frame processing, discontinuities will appear at the beginning and end points of each frame of voice data. The more frames there are, the greater the error from the original test voice data. In order to maintain the frequency characteristics of each frame of voice data, it is also necessary to perform windowing on the framed voice data. In this embodiment, the Hamming window is used to perform windowing processing on the voice data. Specifically, the Hamming window function is first used.

Perform windowing, and then use the formula s ″ _n = w _n * s ′ _{n to} obtain the windowed signal amplitude.

Steps S11-S13: By pre-emphasizing, framing, and windowing the original test voice data, pre-processed voice data with high resolution, good stability, and small errors from the original test voice data can be obtained, which improves the subsequent pass endpoints. Detection and processing to obtain the efficiency of the voice data to be tested and ensure the quality of the voice data to be tested.

In an embodiment, as shown in FIG. 3, in step S20, endpoint detection processing is performed on the preprocessed voice data to obtain the voice data to be tested, which specifically includes the following steps:

S21: Use the short-term energy feature value calculation formula to process the pre-processed voice data, obtain the short-term energy feature value corresponding to the pre-processed voice data, and remove the pre-processed voice data whose short-term energy feature value is less than the first threshold, and obtain The first test speech data, the short-term energy characteristic value calculation formula is

Among them, N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of preprocessed voice data in the time domain.

The first threshold is a preset threshold that distinguishes a mute segment in an interference voice from a target voice based on a short-term energy feature value. Specifically, a short-term energy characteristic value calculation formula is adopted

Process the pre-processed speech data to obtain the corresponding short-term energy feature value, where N is the number of frames in the pre-processed speech data, N≥2, and s (n) is the n-th frame of pre-processed speech data in the time domain , E is the short-term energy characteristic value of the preprocessed speech data.

In this embodiment, the short-term energy characteristic value is obtained, the short-term energy characteristic value is compared with a first threshold value, the pre-processed voice data whose short-term energy feature value is less than the first threshold value is removed, and the remaining pre-processed voice data is obtained, Use the remaining pre-processed speech data as the first test speech data. Understandably, the first test voice data is the voice data after the silence segment in the preprocessed voice data is excluded for the first time.

S22: Use the short-term zero-crossing rate calculation formula to process the pre-processed voice data, obtain the short-term zero-cross rate corresponding to the pre-processed voice data, and remove the pre-processed voice data whose short-term zero-cross rate is less than the second threshold to obtain The second test speech data, the short-term zero-crossing rate calculation formula is

Among them, N is the number of frames in the pre-processed voice data, N≥2, and s (n) is the signal amplitude of the n-th voice data in the time domain.

The second threshold is a preset threshold that distinguishes a mute segment in an interfering voice from a target voice based on a short-time zero-crossing rate. Specifically, a short-term zero-crossing rate calculation formula is used

Process the pre-processed voice data to obtain the corresponding short-term zero-crossing rate, where N is the number of frames in the pre-processed voice data, N≥2, and s (n) is the signal of the n-th voice data in the time domain. Amplitude, ZCR is the short-term zero-crossing rate of pre-processed speech data. In this embodiment, a short-term zero-crossing rate is obtained, the short-term zero-crossing rate is compared with a second threshold, the pre-processed voice data whose short-term zero-cross rate is less than the second threshold is removed, and the remaining pre-processed voice data is obtained, Use the remaining pre-processed speech data as the second test speech data. Understandably, obtaining the second test voice data is the voice data obtained after excluding the mute segment in the preprocessed voice data for the second time.

For example, in the endpoint detection process, two thresholds are set in advance, namely a first threshold T1 and a second threshold T2, where the first threshold T1 is a threshold corresponding to a short-term energy characteristic value and the second threshold T2 is a short-term zero crossing Rate corresponding threshold. In this embodiment, the first threshold T1 is set to 10 and the second threshold T2 is set to 15. If the short-term energy feature value of the pre-processed voice data is less than 10, the pre-processed voice data corresponding to the short-term energy feature value is muted. If the short-term energy feature value of the pre-processed speech data is not less than 10, the pre-processed voice data corresponding to the short-term energy feature value is not a mute segment and needs to be retained. If the short-term zero-crossing rate of the pre-processed voice data is less than 10, the pre-processed voice data corresponding to the short-time zero-crossing rate is a mute segment and needs to be removed; if the short-term zero-crossing rate of the pre-processed voice data is not less than 10, then The pre-processed voice data corresponding to the short-term zero-crossing rate is not a mute segment and needs to be retained.

S23: Perform noise reduction processing on the first test voice data and the second test voice data to obtain voice data to be tested.

Specifically, after acquiring the first test voice data and the second test voice data from which the mute segment is removed, the preprocessed voice data in which the first test voice data and the second test voice data coexist is obtained as common voice data, and then the common voice is Data denoising processing to obtain voice data to be tested. The noise reduction processing performed on the first test voice data and the second test voice data refers to removing noise segments in the first test voice data and the second test voice data. The noise section includes, but is not limited to, sounds generated when a door or a window is opened or an object collides.

Further, the common voice data is denoised, and obtaining the voice data to be tested specifically includes the following steps: (1) obtaining the voice signal energy of the common voice data, and determining at least one maximum value and a minimum value corresponding to the voice signal energy. (2) Obtain the change time between the adjacent maximum and minimum values. (3) If the mutation time is less than a preset minimum time threshold, it indicates that the voice signal energy in the common voice data has mutated in a short time, and the common voice data corresponding to the mutation time is a noise band, so it is necessary to change this Part of the noise band is removed to obtain the voice data to be tested. The minimum time threshold is a preset time value and is used to determine a noise segment in the common voice data.

In steps S21-S23, the first test voice data and the second test voice data are obtained by obtaining the short-term energy feature value and the short-time zero-crossing rate of the pre-processed voice data, and comparing them with the first threshold and the second threshold, respectively. , The pre-processed speech data corresponding to the mute segment can be excluded. Then, noise reduction processing is performed on the first test voice data and the second test voice data, the voice data to be tested corresponding to the target voice can be retained, and the amount of data that needs to be processed when feature extraction is performed on the voice data to be tested.

In an embodiment, the voice data to be tested is pre-processed, framed, and windowed on the original test voice data, and then acquired after endpoint detection, so that the voice data to be tested includes multiple frames of single-frame voice. The data enables subsequent feature extraction of the voice data to be tested, and can specifically perform feature extraction for each frame of single-frame voice data in the voice data to be tested.

In an embodiment, as shown in FIG. 5, in step S30, feature extraction is performed on the voice data to be tested to obtain the voice features to be tested, which specifically includes the following steps:

S31: Perform fast Fourier transform processing on a single frame of voice data to obtain a power spectrum of the voice data to be tested.

Get each frame of single-frame voice data in the voice data to be tested, using the formula

Fast Fourier Transformation (FFT) is performed to obtain the frequency spectrum of the speech data to be tested. formula

Where 1≤k≤N, N is the number of frames in the voice data to be tested, s (k) is the signal amplitude in the frequency domain, and s (n) is the signal amplitude of the nth frame of voice data in the time domain, j Is a negative unit. After obtaining the spectrum of the voice data to be tested, use the formula for the spectrum

Perform power spectrum calculation to obtain the power spectrum of the single frame of voice data in the voice data to be tested. formula

Among them, 1≤k≤N, N is the number of frames in the voice data to be tested, s (k) is the signal amplitude in the frequency domain, and P (k) is the power spectrum of the voice data to be tested. Obtaining the power spectrum facilitates obtaining the Mel spectrum in step S32.

S32: Use a Mel filter bank to perform dimension reduction processing on the power spectrum to obtain a Mel spectrum.

Because the human auditory perception system can simulate a complex non-linear system, the power spectrum obtained based on step S31 cannot well show the non-linear characteristics of the speech data. Therefore, it is also necessary to use a Mel filter bank to reduce the dimensionality of the frequency spectrum. The frequency spectrum of the acquired voice data to be tested is closer to the frequency perceived by the human ear. Among them, the Mel filter bank is composed of multiple overlapping triangular band-pass filters. The triangular band-pass filter carries three frequencies: a lower limit frequency, a cutoff frequency, and a center frequency. The center frequencies of these triangular bandpass filters are equidistant on the Mel scale, which grows linearly before 1000HZ and increases logarithmically after 1000HZ. Conversion relationship between Mel spectrum and power spectrum:

Among them, n represents the number of triangular band-pass filters, w _n is a conversion coefficient, l _n is a lower limit frequency, h _n is a cutoff frequency, P (k) is a power spectrum, and k is k- _th frame voice data.

S33: Perform cepstrum analysis on the Mel spectrum to obtain the voice characteristics to be tested.

Among them, cepstrum refers to the inverse Fourier transform of the Fourier transform spectrum of a signal after logarithmic operation. Since the general Fourier spectrum is a complex spectrum, the cepstrum is also called complex cepstrum. .

Specifically, after obtaining the Mel spectrum, take a logarithm X = log mel _(n) of the Mel spectrum, and then perform a discrete cosine transform (DCT) on X to obtain a Mel frequency cepstrum coefficient (MFCC) The Melc Frequency Cepstral Coefficient (MFCC) is the speech feature to be tested. Among them, the formula of discrete Fourier change is

i = 1, 2, 3 ... N. c _i represents the ith Mel frequency cepstrum coefficient, n represents the number of Mel frequency cepstrum coefficients, which is related to the number of Mel filters, if the number of Mel filters is 13, the Mel frequency The number of cepstrum coefficients can also be 13.

Further, in order to facilitate observation and better reflect the characteristics of the speech signal corresponding to the speech data to be tested, after obtaining the Mel Frequency Cepstrum Coefficient (MFCC), the MFCC needs to be normalized. Wherein the step of normalizing the specific treatment is: averaging all c _i, c _i and then subtracting the average of each of the processed values acquired each c _i corresponding normalized. The normalized value corresponding to c _i is the Mel Frequency Cepstrum Coefficient (MFCC) of the voice data to be tested, that is, the voice characteristics of the voice data to be tested.

In an embodiment, as shown in FIG. 6, before the step of inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition in step S40, the voice discrimination method further includes: pre-training the convolution Deep Belief Network Model.

The pre-trained convolutional deep belief network model includes the following steps:

S401: Acquire voice data to be trained. The voice data to be trained includes standard training voice data and interference training voice data.

The speech data to be trained refers to speech data used to train the convolutional deep belief network model. The speech data in the speech data to be trained includes standard training speech data and interference training speech data. Among them, the standard training speech data refers to pure speech data that does not include the mute segment and the noise segment; the interference training speech data refers to speech data that includes the mute segment and the noise segment. The to-be-trained voice data can be obtained from a pre-differentiated voice database that stores standard training voice data and interference training voice data, or from an open source voice training set. The to-be-trained voice data obtained in this embodiment is voice data that is distinguished in advance, and the ratio of standard training voice data to interference training voice data is 1: 1, which is convenient based on the obtained standard training voice data and interference training voice data. Perform model training on the Convolutional Deep Confidence Network (CDBN) model to improve training efficiency and avoid overfitting.

S402: The standard training speech data and the interference training speech data are input into the convolutional deep confidence network model in the same proportion for model training to obtain the original convolution-limited Boltzmann machine.

The Convolutional Deep Confidence Network (CDBN) model is composed of multiple convolution-limited Boltzmann machines (CRBM). Therefore, the standard training speech data and interference training speech data are input to the convolutional deep confidence network model in the same proportion. In training, we should train each convolution-limited Boltzmann machine (CRBM) in the Convolutional Deep Confidence Network (CDBN) model.

Specifically, the number of CRBMs in the CDBN is n, and the CRBM is divided into two layers. The upper layer is a hidden layer h, which is used to extract the voice data to be trained (the ratio of standard training voice data to interference training voice data is Training voice data); the lower layer is the visible layer v, which is used to input the training voice data to be trained. The hidden layer and the visible layer include multiple hidden units and multiple visual units. Assume that the speech data in the hidden unit and the speech features in the visual unit are binary variables v _i ∈ {0, 1}, h _j ∈ {0, 1}, where v _i represents the i-th in the visible layer binary state variable v, h the hidden layer state _j j th h of binary variables. The number of visible units is n, and the number of hidden units is m. The standard training speech data and interference training speech data are input into the convolutional deep confidence network model for training in the same proportion. The specific steps include the following steps: First, the built-in energy formula of CRBM is used.

OK (v, h). When the parameters (v, h) are determined, the corresponding probability distribution formula is obtained

Where z (θ) is the normalization factor,

Then, based on the correlation formula p (h _j = 1 | v) = σ (b _j + w _ij * _v v) (1), p (v _i = 1 | h) = σ (a _i + w _ji * _f h ) (2) and

Train the training speech features, adjust the bias parameters of the visible layer and the hidden layer and the weights between them to obtain the original convolution-limited Boltzmann machine. Where θ = {w _ij , a _i , b _j }, a _i is the bias parameter of the visible layer, b _j is the bias parameter of the hidden layer, and w _ij is the i-th visible unit and the j-th hidden The weight on the connection line of the unit, w _ji is the weight on the connection line of the jth hidden unit and the i-th visible unit, w _ji = w _ij , σ represents the sigmoid activation function, * _v represents the effective convolution, and * f Full convolution symbols, v and h represent the states of the visible and hidden layers, respectively.

S403: Perform stack processing on the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.

After obtaining the original convolution-limited Boltzmann machine, the original convolution-limited Boltzmann machine is stacked, that is, the output data of the first effective convolution-limited Boltzmann machine is used as the second original convolution Limit the input data of the Boltzmann machine, and use the output data of the second valid convolution-limited Boltzmann machine as the input data of the third original convolution-limited Boltzmann machine, and so on. Convolution-limited Boltzmann machines generate a convolutional deep belief network model.

The already distinguished standard training speech data and interference training speech data are input into a convolutional deep belief network model, and the relevant formula (the relevant formula in step S402) in the convolution-limited Boltzmann machine (CRBM) is used to convolute The bias parameters and weights in the convolutional deep belief network model are iteratively updated to obtain the original convolution-limited Boltzmann machine. Then the original convolution-limited Boltzmann machine is stacked to obtain a convolutional deep confidence network model, so that the convolutional deep confidence network model is continuously updated, and the recognition accuracy of the convolutional deep confidence network model is improved.

In an embodiment, as shown in FIG. 7, step S403: stacking the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model, which specifically includes the following steps:

S4031: Perform a maximum probability pooling process and a sparse regularization process on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine.

Specifically, when the convolutional deep belief network model is stacked on the original convolution-limited Boltzmann machine, overfitting and overlap may occur. Among them, overfitting means that in the process of using the convolutional deep confidence network model to identify the speech data to be tested, if the input speech data to be used is the speech data to be used in training the model, the recognition accuracy is very high. When the input voice data to be tested is non-training voice data, the recognition accuracy is very low. Overlap refers to the case where adjacent original convolutions limit Boltzmann's chance of overlapping. Therefore, when the original convolution-restricted Boltzmann machine is used to stack the convolutional depth-confidence network model, the original convolution-restricted Boltzmann machine also needs to be processed with maximum probability pooling and sparse regularization to avoid the original convolution. The product restricts the Boltzmann machine from overfitting and overlapping. Among them, the maximum probability pooling process is a processing operation performed to prevent the occurrence of overlap; the sparse regularization process is a processing operation performed to prevent the occurrence of overfitting. Probabilistic maximum pooling processing and sparse regularization processing on the original convolution-limited Boltzmann machine can effectively reduce the processing amount of the stacking process, while improving the recognition accuracy of the convolution-limited Boltzmann machine.

S4032: Perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.

After the maximum probability pooling process and sparse regularization process, the acquired effective convolution-limited Boltzmann machine is stacked to obtain a convolutional deep confidence network model. In this embodiment, the acquired adaptive environment model of the convolutional deep confidence network model is more complete, which can avoid over-fitting and overlapping, which makes it more accurate to recognize any speech data to be tested.

In an embodiment, as shown in FIG. 8, step S40 is to input a voice feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a voice discrimination result, which specifically includes the following steps:

S41: The speech features to be tested are input into a pre-trained convolutional deep confidence network model for recognition, and a speech recognition probability value is obtained.

The speech features to be tested are input to a pre-trained convolutional deep confidence network model for recognition. According to the recognition process of the convolutional deep confidence network model, the speech features to be tested output a probability value, and the probability value is obtained. Speech recognition probability value.

Further, when the speech features to be tested are input into a pre-trained convolutional deep confidence network model, in order to reduce the calculation amount of the convolutional deep confidence network model, and to improve the accuracy of identifying the speech features to be tested, the convolution depth The belief network model divides the voice to be tested before recognition, and divides the single frame of voice data in the voice data to be tested into at least two voice segments by the same number for recognition. The convolutional deep belief network model recognizes the speech features to be tested corresponding to each speech segment, and obtains the speech recognition probability value of each speech segment. Then, the average value of the speech recognition probability values of at least two speech segments is calculated, and the obtained average value is the speech recognition probability value corresponding to the speech data to be tested. The voice segment refers to a segment containing multiple single-frame voice data.

S42: Acquire a speech discrimination result based on the speech recognition probability value.

After obtaining the speech recognition probability value, the convolutional deep confidence network model will compare the speech recognition probability value of each group based on a preset preset probability value. Speech fragments that are less than the preset probability value are interfering speech, and are greater than or equal to Let the speech segment with probability value be the target speech. Further, after obtaining the speech recognition probability value, the convolutional deep belief network model will remove speech fragments with recognition probability values less than a preset probability value, and only retain speech fragments with recognition probability values greater than a preset probability value. Therefore, the voice data to be tested only retains the voice data to be tested corresponding to the target voice.

Based on a preset probability value, the target voice and interference voice in the voice data to be tested are judged, and the voice data to be tested corresponding to the interference voice is removed, and the voice data to be tested corresponding to the target voice is retained, thereby achieving the target in the voice data to be tested. Voice and interfering voice features.

Perform pre-emphasis, framing, and windowing on the original test voice data to obtain pre-processed voice data, and then perform endpoint detection processing on the pre-processed voice data through short-term energy feature values and short-time zero-crossing rates to obtain the test data The voice data can initially remove the test voice data corresponding to the interference voice, and effectively reduce the processing time of the convolutional deep confidence network model to identify the test voice data. Perform feature extraction on the test voice data, obtain the test voice features, and input the test voice features into a pre-trained convolutional deep belief network model for recognition, obtain the voice discrimination result, and improve the accuracy of the voice discrimination, so that The obtained speech discrimination results are more accurate.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

In one embodiment, a voice distinguishing device is provided, and the voice distinguishing device corresponds to the voice distinguishing method in the above embodiment. As shown in FIG. 9, the voice discrimination device includes an original test voice data processing module 10, a voice data acquisition module 20 to be tested, a voice feature acquisition module 30 to be tested, and a voice discrimination result acquisition module 40. The implementation functions of the original test voice data processing module 10, the voice data acquisition module 20 to be tested, the voice feature acquisition module 30 to be tested, and the voice discrimination result acquisition module 40 correspond to the steps corresponding to the voice discrimination method in the above embodiment. In order to avoid redundant description, this embodiment is not detailed one by one.

The raw test voice data processing module 10 is configured to obtain raw test voice data, preprocess the raw test voice data, and obtain preprocessed voice data.

The voice data to be tested module 20 is configured to perform endpoint detection processing on the preprocessed voice data to obtain voice data to be tested.

The feature-to-be-tested voice acquisition module 30 is configured to perform feature extraction on the to-be-tested voice data to obtain the feature to be tested.

The speech discrimination result acquisition module 40 is configured to input a voice feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a speech discrimination result.

Specifically, the original test voice data processing module 10 includes a first processing unit 11 and a second processing unit 12.

The first processing unit 11 is configured to perform pre-emphasis processing on the original test voice data, and a formula for the pre-emphasis processing is s ′ _n = s _n -a * s _n-1 , where s ′ _n is n after the pre-emphasis processing. speech signal amplitude timing, s _n is the amplitude of the speech signal at time n, s _n-1 is the amplitude of the speech signal time n-1, a is the coefficient of pre-emphasis.

The second processing unit 12 is configured to perform frame processing on the original test voice data after the pre-emphasis processing to obtain framed voice data.

The third processing unit 13 is configured to perform window processing on the framed voice data to obtain pre-processed voice data. The formula of the window processing is

Specifically, the voice data acquisition module 20 includes a first test voice data acquisition unit 21, a second test voice data acquisition unit 22, and a voice data acquisition unit 23.

The first test voice data obtaining unit 21 is configured to process the pre-processed voice data by using a short-term energy feature value calculation formula, obtain a short-term energy feature value corresponding to the pre-processed voice data, and make the short-term energy feature value smaller than the first The threshold pre-processed voice data is removed to obtain the first test voice data. The short-term energy feature value calculation formula is

The second test voice data obtaining unit 22 is configured to process the preprocessed voice data by using a short-term zero-crossing rate calculation formula, obtain a short-time zero-crossing rate corresponding to the pre-processed voice data, and make the short-time zero-crossing rate smaller than the second The threshold preprocessed voice data is removed to obtain the second test voice data. The short-term zero-crossing rate calculation formula is

The voice data to be tested unit 23 is configured to perform noise reduction processing on the first test voice data and the second test voice data to obtain the voice data to be tested.

Specifically, the voice data to be tested includes single-frame voice data.

The speech feature acquisition module 30 includes a power spectrum acquisition unit 31, a Mel spectrum acquisition unit 32, and a speech feature acquisition unit 33.

The power spectrum obtaining unit 31 is configured to perform fast Fourier transform processing on single frame of voice data to obtain a power spectrum of the voice data to be tested.

The Mel spectrum acquisition unit 32 is configured to perform a dimensionality reduction process on the power spectrum by using a Mel filter bank to obtain a Mel spectrum.

The speech feature acquiring unit 33 is configured to perform cepstrum analysis on the Mel spectrum to acquire the speech feature to be tested.

Specifically, the speech discrimination device is further used for pre-training a convolutional deep belief network model.

The speech discrimination device further includes a to-be-trained speech data acquisition unit 401, a model training unit 402, and a model acquisition unit 403.

The voice data to be trained unit 401 is configured to acquire voice data to be trained, and the voice data to be trained includes standard training voice data and interference training voice data.

A model training unit 402 is configured to input standard training speech data and interference training speech data into a convolutional deep belief network model in an equal proportion for model training, and obtain an original convolution-limited Boltzmann machine.

A model obtaining unit 403 is configured to perform stack processing on the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.

Specifically, the model acquisition unit 403 includes a pooling and regular processing unit 4031 and a stack processing unit 4032.

The pooling and regular processing unit 4031 is configured to perform a maximum probability pooling process and a sparse regularization process on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine.

A stacking processing unit 4032 is configured to perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.

Specifically, the speech discrimination result acquisition module 40 includes a speech recognition probability value acquisition unit 41 and a speech discrimination result acquisition unit 42.

The speech recognition probability value obtaining unit 41 is configured to input a speech feature to be tested into a pre-trained convolutional deep confidence network model for recognition, and obtain a speech recognition probability value.

The speech discrimination result acquisition unit 42 is configured to acquire a speech discrimination result based on a speech recognition probability value.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium. The database of the computer equipment is used to store data obtained or generated during the method of speech discrimination. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions are executed by a processor to implement a method of speech discrimination.

In one embodiment, a computer device is provided, which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When the processor executes the computer-readable instructions, the processor implements the following steps: obtaining the original test Voice data, preprocess the original test voice data to obtain preprocessed voice data; perform endpoint detection processing on the preprocessed voice data to obtain the voice data to be tested; perform feature extraction on the test voice data to obtain the voice features to be tested; The test speech features are input into a pre-trained convolutional deep belief network model for recognition, and the speech discrimination results are obtained.

In an embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: pre-emphasis processing is performed on the original test voice data, and a formula for the pre-emphasis processing is s ′ _n = s _n -a * s _n-1 , where , s' _n for the speech signal amplitude at time n after processing pre-emphasis, s _n is the speech signal amplitude at time n, s _n-1 is the speech signal amplitude n-1 time point, a is pre-emphasis coefficient; preemphasis The processed original test voice data is subjected to frame processing to obtain framed voice data; windowed processing is performed on the framed voice data to obtain preprocessed voice data, and the formula for windowing processing is

In an embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: the short-term energy feature value calculation formula is used to process the pre-processed voice data, and the short-term energy feature value corresponding to the pre-processed voice data is obtained, and The pre-processed voice data whose short-term energy feature value is less than the first threshold is removed to obtain the first test voice data. The short-term energy feature value calculation formula is

Among them, N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of preprocessed voice data in the time domain; the short-term zero-crossing rate calculation formula is used to preprocess the voice data Perform processing to obtain the short-term zero-crossing rate corresponding to the pre-processed voice data, and remove the pre-processed voice data whose short-term zero-cross rate is less than the second threshold to obtain the second test voice data. The short-term zero-crossing rate calculation formula is

Among them, N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the n-th frame voice data in the time domain; denoising the first test voice data and the second test voice data Process and obtain the voice data to be tested.

In an embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: fast Fourier transform processing is performed on the single frame of voice data to obtain the power spectrum of the voice data to be tested; Perform dimensionality reduction processing to obtain the Mel spectrum; perform cepstrum analysis on the Mel spectrum to obtain the speech features to be tested.

In an embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: pre-training the convolutional deep belief network model. Specifically, pre-training the convolutional deep belief network model includes: acquiring voice data to be trained, the voice data to be trained includes standard training voice data and interference training voice data; and inputting the standard training voice data and the interference training voice data to an equal proportion of The model is trained in the convolutional deep confidence network model to obtain the original convolution-limited Boltzmann machine; the original convolution-limited Boltzmann machine is stacked to obtain the convolutional deep confidence network model.

In an embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: performing a probabilistic maximum pooling process and a sparse regularization process on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann The stacking process is performed on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.

In one embodiment, when the processor executes the computer-readable instructions, the following steps are further implemented: inputting the voice features to be tested into a pre-trained convolutional deep confidence network model for recognition, and obtaining a voice recognition probability value; based on the voice recognition probability Value to get the speech discrimination result.

In one embodiment, one or more non-readable storage media storing computer-readable instructions are provided, and the computer-readable instructions, when executed by one or more processors, cause the one or more The processor implements the following steps: obtaining the original test voice data, preprocessing the original test voice data, and obtaining the preprocessed voice data; performing endpoint detection processing on the preprocessed voice data, obtaining the voice data to be tested; and performing feature extraction on the test voice data To obtain the voice features to be tested; input the voice features to be tested into a pre-trained convolutional deep confidence network model for recognition, and obtain the voice discrimination result.

In an embodiment, when executed by the processor, the following steps are also implemented: pre-emphasis processing is performed on the original test voice data, and the formula for the pre-emphasis processing is s ′ _n = s _n -a * s _n-1 , where s ′ _n is the amplitude of the speech signal at time n after the pre-emphasis processing, s _n is the amplitude of the speech signal at time n, s _n-1 is the amplitude of the speech signal at n-1, and a is the pre-emphasis coefficient The original test voice data is framed to obtain framed voice data; the framed voice data is windowed to obtain preprocessed voice data. The formula for windowing is

In an embodiment, when the computer program is executed by the processor, the following steps are further implemented: the short-term energy feature value calculation formula is used to process the pre-processed voice data, and the short-term energy feature value corresponding to the pre-processed voice data is obtained, and The pre-processed speech data whose energy characteristic value is less than the first threshold is removed to obtain the first test speech data. The calculation formula of the short-term energy characteristic value is

In an embodiment, when the computer-readable instructions are executed by the processor, the following steps are further implemented: performing fast Fourier transform processing on the single frame of speech data to obtain the power spectrum of the speech data to be tested; The spectrum is subjected to dimensionality reduction processing to obtain the Mel spectrum; the cepstrum analysis is performed on the Mel spectrum to obtain the voice characteristics to be tested.

In one embodiment, when the computer-readable instructions are executed by the processor, the following steps are further implemented: pre-training the convolutional deep belief network model. Specifically, pre-training the convolutional deep belief network model includes: acquiring voice data to be trained, the voice data to be trained includes standard training voice data and interference training voice data; and inputting the standard training voice data and the interference training voice data to an equal proportion of The model is trained in the convolutional deep confidence network model to obtain the original convolution-limited Boltzmann machine; the original convolution-limited Boltzmann machine is stacked to obtain the convolutional deep confidence network model.

In an embodiment, when the computer-readable instructions are executed by the processor, the following steps are further implemented: performing a probabilistic maximum pooling process and a sparse regularization process on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann Man machine; stacking processing is performed on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.

In an embodiment, when the computer-readable instructions are executed by the processor, the following steps are also implemented: inputting the voice features to be tested into a pre-trained convolutional deep confidence network model for recognition, obtaining a voice recognition probability value; based on voice recognition The probability value obtains the result of speech discrimination.

A person of ordinary skill in the art may understand that the implementation of all or part of the processes in the methods of the foregoing embodiments may be performed by computer-readable instructions to instruct related hardware. The computer-readable instructions may be stored on a computer device. In the volatile storage medium, when the computer-readable instructions are executed, the computer-readable instructions may include the processes of the embodiments of the methods described above. Wherein, any reference to the storage, storage, database, or other media used in the embodiments provided in this application may include non-volatile and / or volatile storage. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), dual data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for the convenience and brevity of the description, only the above-mentioned division of functional units and modules is used as an example. In practical applications, the above functions can be assigned by different functional units, Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.

The above-mentioned embodiments are only used to describe the technical solution of the present application, but not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still implement the foregoing implementations. The technical solutions described in the examples are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of this application.

Claims

A method for distinguishing speech, comprising:

Acquiring raw test voice data, preprocessing the raw test voice data, and obtaining preprocessed voice data;

Performing endpoint detection processing on the pre-processed voice data to obtain voice data to be tested;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested;

The speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
The method of claim 1, wherein the preprocessing the original test voice data to obtain the preprocessed voice data comprises:

Perform pre-emphasis processing on the original test voice data, and the formula of the pre-emphasis processing is s ' n = s n -a * s n-1 , where s' n is a voice signal at time n after the pre-emphasis processing amplitude, s n is the amplitude of the speech signal at time n, s n-1 is the amplitude of the speech signal time n-1, a is the coefficient of pre-emphasis;

Frame processing the original test voice data after pre-emphasis processing to obtain framed voice data;

Perform windowing on the framed speech data to obtain pre-processed speech data, and the formula for windowing processing is
And s ″ n = w n * s ′ n , where w n is the Hamming window at time n, N is the window length of Hamming window, s ′ n is the signal amplitude in the time domain at time n, and s ″ n is n Signal amplitude in time domain after windowing at time.
The speech discrimination method according to claim 2, wherein the performing endpoint detection processing on the pre-processed speech data to obtain the speech data to be tested comprises:

A short-term energy feature value calculation formula is used to process the pre-processed voice data, obtain a short-term energy feature value corresponding to the pre-processed voice data, and reduce the pre-processed voice with the short-term energy feature value less than a first threshold Data is removed to obtain the first test voice data. The short-term energy characteristic value calculation formula is
Among them, N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of preprocessed voice data in the time domain;

A short-term zero-crossing rate calculation formula is used to process the pre-processed voice data, obtain a short-term zero-cross rate corresponding to the pre-processed voice data, and reduce the short-term zero-cross rate to a pre-processed voice that is less than a second threshold Data is removed to obtain the second test voice data. The short-term zero-crossing rate calculation formula is
Where N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of voice data in the time domain;

De-noise processing is performed on the first test voice data and the second test voice data to obtain voice data to be tested.
The method of claim 1, wherein the voice data to be tested includes a single frame of voice data;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested includes:

Performing fast Fourier transform processing on the single-frame voice data to obtain a power spectrum of the voice data to be tested;

Using a Mel filter bank to perform dimension reduction processing on the power spectrum to obtain a Mel spectrum;

Cepstral analysis is performed on the Mel spectrum to obtain the voice characteristics to be tested.
The speech discrimination method according to claim 1, wherein before the step of inputting the speech features to be tested into a pre-trained convolutional deep belief network model for recognition, the speech discrimination method further comprises: Including: pre-trained convolutional deep belief network models;

The pre-trained convolutional deep belief network model includes:

Acquiring voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data;

Input the standard training speech data and the interference training speech data into a convolutional deep confidence network model in the same proportion for model training to obtain the original convolution-limited Boltzmann machine;

Stack the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
The speech discrimination method according to claim 5, wherein the stacking processing of the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model comprises:

Performing probabilistic maximum pooling processing and sparse regularization processing on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine;

Perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
The speech discrimination method according to claim 1, wherein the inputting the speech feature to be tested into a pre-trained convolutional deep belief network model for recognition and obtaining a speech discrimination result comprises:

Inputting the speech features to be tested into a pre-trained convolutional deep belief network model for recognition, and obtaining a speech recognition probability value;

The speech discrimination result is obtained based on the speech recognition probability value.
A voice distinguishing device, comprising:

A raw test voice data processing module, configured to obtain raw test voice data, preprocess the raw test voice data, and obtain preprocessed voice data;

A voice data acquisition module for testing to perform endpoint detection processing on the pre-processed voice data to acquire voice data to be tested;

A voice feature acquisition module to be tested, which is used to extract features from the voice data to be tested to acquire voice features to be tested;

A speech discrimination result acquisition module is configured to input the speech feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a speech discrimination result.
A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and is characterized in that the processor implements the computer-readable instructions as follows step:

Acquiring raw test voice data, preprocessing the raw test voice data, and obtaining preprocessed voice data;

Performing endpoint detection processing on the pre-processed voice data to obtain voice data to be tested;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested;

The speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
The computer device according to claim 9, wherein the preprocessing the original test voice data to obtain the preprocessed voice data comprises:

Perform pre-emphasis processing on the original test voice data, and the formula of the pre-emphasis processing is s ' n = s n -a * s n-1 , where s' n is a voice signal at time n after the pre-emphasis processing amplitude, s n is the amplitude of the speech signal at time n, s n-1 is the amplitude of the speech signal time n-1, a is the coefficient of pre-emphasis;

Frame processing the original test voice data after pre-emphasis processing to obtain framed voice data;

Perform windowing on the framed speech data to obtain pre-processed speech data, and the formula for windowing processing is
And s ″ n = w n * s ′ n , where w n is the Hamming window at time n, N is the window length of Hamming window, s ′ n is the signal amplitude in the time domain at time n, and s ″ n is n Signal amplitude in time domain after windowing at time.
The computer device according to claim 10, wherein performing endpoint detection processing on the pre-processed voice data to obtain voice data to be tested comprises:

A short-term energy feature value calculation formula is used to process the pre-processed voice data, obtain a short-term energy feature value corresponding to the pre-processed voice data, and reduce the pre-processed voice with the short-term energy feature value less than a first threshold Data is removed to obtain the first test voice data. The short-term energy characteristic value calculation formula is
Among them, N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of preprocessed voice data in the time domain;

A short-term zero-crossing rate calculation formula is used to process the pre-processed voice data, obtain a short-term zero-cross rate corresponding to the pre-processed voice data, and reduce the short-term zero-cross rate to a pre-processed voice with a second threshold value Data is removed to obtain the second test voice data. The short-term zero-crossing rate calculation formula is
Where N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of voice data in the time domain;

De-noise processing is performed on the first test voice data and the second test voice data to obtain voice data to be tested.
The computer device according to claim 9, wherein the voice data to be tested includes single-frame voice data;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested includes:

Performing fast Fourier transform processing on the single-frame voice data to obtain a power spectrum of the voice data to be tested;

Using a Mel filter bank to perform dimension reduction processing on the power spectrum to obtain a Mel spectrum;

Cepstral analysis is performed on the Mel spectrum to obtain the voice characteristics to be tested.
The computer device according to claim 9, wherein before the step of inputting the speech feature to be tested into a pre-trained convolutional deep confidence network model for recognition, the processor executes the The computer-readable instructions also implement the following steps: pre-train the convolutional deep belief network model;

The pre-trained convolutional deep belief network model includes:

Acquiring voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data;

Input the standard training speech data and the interference training speech data into a convolutional deep confidence network model in the same proportion for model training to obtain the original convolution-limited Boltzmann machine;

Stack the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
The computer device according to claim 13, wherein the stacking processing of the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model comprises:

Performing probabilistic maximum pooling processing and sparse regularization processing on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine;

Perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
One or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors implement The following steps:

Acquiring raw test voice data, preprocessing the raw test voice data, and obtaining preprocessed voice data;

Performing endpoint detection processing on the pre-processed voice data to obtain voice data to be tested;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested;

The speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
The non-volatile readable storage medium according to claim 15, wherein the preprocessing the original test voice data to obtain the preprocessed voice data comprises:

Perform pre-emphasis processing on the original test voice data, and the formula of the pre-emphasis processing is s ' n = s n -a * s n-1 , where s' n is a voice signal at time n after the pre-emphasis processing amplitude, s n is the amplitude of the speech signal at time n, s n-1 is the amplitude of the speech signal time n-1, a is the coefficient of pre-emphasis;

Frame processing the original test voice data after pre-emphasis processing to obtain framed voice data;

Perform windowing on the framed speech data to obtain pre-processed speech data, and the formula for windowing processing is
And s ″ n = w n * s ′ n , where w n is the Hamming window at time n, N is the window length of Hamming window, s ′ n is the signal amplitude in the time domain at time n, and s ″ n is n Signal amplitude in time domain after windowing at time.
The non-volatile readable storage medium of claim 16, wherein the performing endpoint detection processing on the pre-processed voice data to obtain voice data to be tested comprises:

A short-term energy feature value calculation formula is used to process the pre-processed voice data, obtain a short-term energy feature value corresponding to the pre-processed voice data, and reduce the pre-processed voice with the short-term energy feature value less than a first threshold Data is removed to obtain the first test voice data. The short-term energy characteristic value calculation formula is
Among them, N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of preprocessed voice data in the time domain;

A short-term zero-crossing rate calculation formula is used to process the pre-processed voice data, obtain a short-term zero-cross rate corresponding to the pre-processed voice data, and reduce the short-term zero-cross rate to a pre-processed voice that is less than a second threshold Data is removed to obtain the second test voice data. The short-term zero-crossing rate calculation formula is
Where N is the number of frames in the preprocessed voice data, N≥2, and s (n) is the signal amplitude of the nth frame of voice data in the time domain;

De-noise processing is performed on the first test voice data and the second test voice data to obtain voice data to be tested.
The non-volatile readable storage medium according to claim 15, wherein the voice data to be tested comprises a single frame of voice data;

Performing feature extraction on the voice data to be tested to obtain voice features to be tested includes:

Performing fast Fourier transform processing on the single-frame voice data to obtain a power spectrum of the voice data to be tested;

Using a Mel filter bank to perform dimension reduction processing on the power spectrum to obtain a Mel spectrum;

Cepstral analysis is performed on the Mel spectrum to obtain the voice characteristics to be tested.
The non-volatile readable storage medium according to claim 15, wherein before the step of inputting the speech feature to be tested into a pre-trained convolutional deep confidence network model for identification, When the computer-readable instructions are executed by one or more processors, the one or more processors further implement the following steps: pre-train a convolutional deep belief network model;

The pre-trained convolutional deep belief network model includes:

Acquiring voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data;

Input the standard training speech data and the interference training speech data into a convolutional deep confidence network model in the same proportion for model training to obtain the original convolution-limited Boltzmann machine;

Stack the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
The non-volatile readable storage medium according to claim 19, wherein the stacking processing of the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model comprises:

Performing probabilistic maximum pooling processing and sparse regularization processing on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine;

Perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.