WO2019232848A1 - Voice distinguishing method and device, computer device and storage medium - Google Patents

Voice distinguishing method and device, computer device and storage medium

Info

Publication number
WO2019232848A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
tested
speech
data
voice
Prior art date
Application number
PCT/CN2018/094200
Other languages
French (fr)
Chinese (zh)
Inventor
涂宏
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019232848A1 publication Critical patent/WO2019232848A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • the present application relates to the technical field of speech recognition, and in particular, to a method, a device, a computer device, and a storage medium for speech discrimination.
  • the voice data generally includes a target voice and an interfering voice.
  • the target voice refers to a voice part in which the voiceprint continuously changes significantly.
  • the interfering speech can be the part of the speech data that is not pronounced due to silence (ie, the mute section), or it can be the environmental noise part (ie, the noise section).
  • Speech discrimination refers to mute filtering of the input speech, and only retains speech data (ie, target speech) that is more meaningful for recognition.
  • Currently, endpoint detection technology is mainly used to distinguish speech data. With this approach, when the target voice is mixed with noise, the louder the noise, the harder speech discrimination becomes and the less accurate the endpoint detection result is. Therefore, when endpoint detection is used for speech discrimination, the recognition result is easily affected by external factors, making the speech discrimination result inaccurate.
  • the embodiments of the present application provide a method, an apparatus, a computer device, and a storage medium for speech discrimination to solve the problem of inaccurate speech discrimination results.
  • An embodiment of the present application provides a method for distinguishing speech, including:
  • the speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
  • An embodiment of the present application provides a voice distinguishing device, including:
  • a raw test voice data processing module configured to obtain raw test voice data, preprocess the raw test voice data, and obtain preprocessed voice data
  • a voice data acquisition module for testing to perform endpoint detection processing on the pre-processed voice data to acquire voice data to be tested;
  • a voice feature acquisition module to be tested which is used to extract features from the voice data to be tested to acquire voice features to be tested;
  • a speech discrimination result acquisition module is configured to input the speech feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a speech discrimination result.
  • An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented:
  • the speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
  • the embodiments of the present application provide one or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors implement the following steps:
  • the speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
  • FIG. 1 is an application scenario diagram of a speech discrimination method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a speech discrimination method according to an embodiment of the present application.
  • FIG. 3 is a specific flowchart of step S10 in FIG. 2;
  • FIG. 4 is a specific flowchart of step S20 in FIG. 2;
  • FIG. 5 is a specific flowchart of step S30 in FIG. 2;
  • FIG. 6 is another flowchart of a speech discrimination method according to an embodiment of the present application.
  • FIG. 7 is a specific flowchart of step S403 in FIG. 6;
  • FIG. 8 is a specific flowchart of step S40 in FIG. 2;
  • FIG. 9 is a schematic diagram of a speech distinguishing device according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
  • the speech discrimination method provided in the embodiment of the present application can be used in the application environment shown in FIG. 1.
  • the terminal device sends the collected original test voice data to the corresponding server through the network.
  • after the server connected to the terminal device obtains the original test voice data, it first performs endpoint detection processing on the original test voice data to obtain the voice data to be tested.
  • feature extraction is performed on the acquired voice data to be tested to acquire the voice features to be tested.
  • the speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and the speech discrimination result is obtained to achieve the purpose of distinguishing the target speech and the interference speech in the speech data.
  • the terminal device is a device that can perform human-computer interaction with a user, including, but not limited to, various personal computers, notebook computers, smart phones, and tablet computers.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a voice discrimination method includes the following steps:
  • the original test voice data refers to the voice data of the speaker collected by the terminal device.
  • the original test voice data includes a target voice and an interference voice, where the target voice refers to a voice part in which the voiceprint continuously changes significantly; correspondingly, the interference voice refers to a voice part other than the target voice in the voice data.
  • the interfering speech includes a silent section and a noise section, where the silent section refers to a portion of the speech data in which nothing is pronounced, for example the pauses produced while the speaker thinks or breathes during speaking; since no sound is made at those moments, that part of the voice is a silent section.
  • the noise segment refers to the environmental noise part of the voice data; sounds such as doors and windows opening and closing or objects colliding can be considered noise segments.
  • the terminal device obtains a piece of original test voice data through a sound acquisition module (such as a recording module), and the original test voice data is a piece of voice data including a target voice and an interference voice that needs to be distinguished.
  • the original test voice data is preprocessed to obtain preprocessed voice data.
  • Preprocessed voice data refers to the voice data obtained by preprocessing the original test voice data.
  • the preprocessing in this embodiment specifically includes: performing pre-emphasis, framing, and windowing processing on the original test voice data.
  • The pre-emphasis formula is s'_n = s_n − a·s_(n−1), where s'_n is the amplitude of the speech signal at time n after pre-emphasis processing, s_n is the amplitude of the speech signal at time n, s_(n−1) is the amplitude of the speech signal at time n−1, and a is the pre-emphasis coefficient.
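  • As a minimal illustrative sketch (not part of the patent text), the pre-emphasis formula above can be implemented as follows; the function name and the default coefficient a = 0.97 are assumptions, and numpy is assumed available:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, a: float = 0.97) -> np.ndarray:
    """Apply s'_n = s_n - a * s_(n-1) to a 1-D speech signal."""
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```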
  • S20 Perform endpoint detection processing on the pre-processed voice data to obtain the voice data to be tested.
  • the endpoint detection processing is a processing method for determining the start point and the end point of a target voice from a piece of voice data.
  • Interference voice will inevitably exist in a piece of voice data. Therefore, after the terminal device obtains the original test voice data and preprocesses it, preliminary detection processing must be performed on the acquired preprocessed voice data to remove the interfering voice, and the remaining voice data is used as the voice data to be tested.
  • the voice data to be tested will include the target voice as well as the interfering voice that has not been accurately removed.
  • the short-term energy feature value and the short-time zero-crossing rate corresponding to the pre-processed voice data are obtained.
  • the short-term energy characteristic value refers to an energy value corresponding to one frame of speech in the speech data at any moment.
  • the short-term zero-crossing rate refers to the number of intersections of the speech signal corresponding to the speech data and the horizontal axis (zero level).
  • the server performs endpoint detection on the pre-processed voice data, which can reduce the processing time for voice discrimination and improve the quality of voice discrimination processing.
  • performing endpoint detection processing on the preprocessed voice data can initially remove the voice data corresponding to the mute section and the noise section, but the removal is not completely accurate.
  • steps S30 and S40 need to be performed to obtain a more accurate target voice.
  • S30 Perform feature extraction on the speech data to be tested to obtain the speech features to be tested.
  • the voice features to be tested include, but are not limited to, spectrum features, sound quality features, and voiceprint features.
  • the spectrum feature distinguishes different voice data, such as target voice and interference voice, according to the frequency of sound vibration.
  • the voice quality feature and voiceprint feature are used to identify the speaker corresponding to the voice data to be tested according to the voiceprint and timbre characteristics of the voice. Since speech discrimination only needs to distinguish the target voice from the interfering voice in the voice data, acquiring the spectrum features of the voice data to be tested is sufficient to complete the speech discrimination.
  • the frequency spectrum is an abbreviation of frequency spectral density
  • the frequency spectrum characteristic is a parameter reflecting the frequency spectral density.
  • S40 The speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition, and a speech discrimination result is obtained.
  • a Convolutional Deep Belief Networks (CDBN) model is a neural network model that is pre-trained to distinguish target speech from interfering speech in the speech data to be tested.
  • the speech discrimination result refers to a recognition result that is recognized by a convolutional deep belief network model to distinguish a target voice from an interference voice in the voice data to be tested.
  • the pre-trained convolutional deep belief network model is used to recognize the test voice data and obtain the voice recognition probability value.
  • the speech recognition probability value is compared with a preset probability value: the voice data to be tested whose speech recognition probability value is greater than or equal to the preset probability value is the target speech, and the voice data to be tested whose speech recognition probability value is smaller than the preset probability value is interference speech.
  • a target voice with a higher recognition probability is retained, and an interference voice with a lower recognition probability is removed.
  • Using the convolutional deep belief network model to identify the test voice data can improve the recognition accuracy and make the speech discrimination result more accurate.
  • the speech discrimination method provided in this embodiment performs endpoint detection processing on the preprocessed speech data to obtain the speech data to be tested, which reduces the processing time of speech discrimination and improves the quality of speech processing. Feature extraction is then performed on the voice data to be tested, and the voice features to be tested are input into a pre-trained convolutional deep belief network model for recognition to obtain the speech discrimination result, which improves the accuracy of speech discrimination and makes the obtained result more accurate.
  • step S10 the original test voice data is pre-processed to obtain pre-processed voice data, which specifically includes the following steps:
  • S11 Perform pre-emphasis processing on the original test voice data using the formula s'_n = s_n − a·s_(n−1), where s_n is the amplitude of the speech signal at time n, s_(n−1) is the amplitude of the speech signal at time n−1, and a is the pre-emphasis coefficient.
  • The amplitude of the speech signal is the amplitude of the speech expressed by the voice data in the time domain; a is the pre-emphasis coefficient with 0.9 < a < 1.0, and a = 0.97 generally works well.
  • S12 Perform frame processing on the original test voice data after pre-emphasis processing to obtain framed voice data.
  • the voice signal corresponding to the pre-emphasized voice data is a non-stationary signal, but the voice signal has short-term stability.
  • the short-term stationarity refers to the stable nature of the speech signal in a short time range (such as 10ms-30ms). Therefore, after obtaining the pre-emphasized voice data, frame processing is required to divide the pre-emphasized voice data into frame-by-frame speech data to obtain framed speech data.
  • the framed voice data refers to a corresponding voice segment in a short time range, and the segmented voice segment is called a frame.
  • adjacent frames overlap, and the overlapping portion is 1/2 of the frame length.
  • this overlapping portion is called the frame shift.
  • S13 Perform windowing on the framed speech data to obtain pre-processed speech data.
  • Steps S11-S13: by pre-emphasizing, framing, and windowing the original test voice data, preprocessed voice data with high resolution, good stability, and small deviation from the original test voice data can be obtained, which improves the efficiency of subsequently obtaining the voice data to be tested through endpoint detection processing and ensures the quality of the voice data to be tested.
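  • A minimal framing-and-windowing sketch under common assumptions (16 kHz audio, 25 ms frames, a frame shift of half the frame length, a Hamming window); the function name and defaults are illustrative, not taken from the patent:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400,
                     frame_shift: int = 200) -> np.ndarray:
    """Split a 1-D signal into overlapping frames (shift = 1/2 frame length)
    and apply a Hamming window to every frame."""
    assert len(signal) >= frame_len, "signal shorter than one frame"
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * np.hamming(frame_len)  # window broadcast over rows
```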
  • step S20 endpoint detection processing is performed on the preprocessed voice data to obtain the voice data to be tested, which specifically includes the following steps:
  • S21 Use the short-term energy feature value calculation formula to process the preprocessed voice data, obtain the short-term energy feature value corresponding to the preprocessed voice data, and remove the preprocessed voice data whose short-term energy feature value is less than the first threshold to obtain the first test voice data. The short-term energy feature value calculation formula is E = Σ_(n=1)^N s(n)², where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of preprocessed voice data in the time domain.
  • the first threshold is a preset threshold that distinguishes a mute segment in an interference voice from a target voice based on a short-term energy feature value.
  • the short-term energy feature value calculation formula E = Σ_(n=1)^N s(n)² is used to process the preprocessed speech data and obtain the corresponding short-term energy feature value, where N is the number of frames in the preprocessed speech data, N ≥ 2, s(n) is the signal amplitude of the n-th frame of preprocessed speech data in the time domain, and E is the short-term energy feature value of the preprocessed speech data.
  • after the short-term energy feature value is obtained, it is compared with the first threshold; the preprocessed voice data whose short-term energy feature value is less than the first threshold is removed, and the remaining preprocessed voice data is used as the first test voice data.
  • the first test voice data is the voice data after the silence segment in the preprocessed voice data is excluded for the first time.
  • S22 Use the short-time zero-crossing rate calculation formula to process the preprocessed voice data, obtain the short-time zero-crossing rate corresponding to the preprocessed voice data, and remove the preprocessed voice data whose short-time zero-crossing rate is less than the second threshold to obtain the second test voice data. The short-time zero-crossing rate calculation formula is ZCR = (1/2) Σ_(n=2)^N |sgn(s(n)) − sgn(s(n−1))|, where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of voice data in the time domain.
  • the second threshold is a preset threshold that distinguishes a mute segment in an interfering voice from a target voice based on a short-time zero-crossing rate.
  • the short-time zero-crossing rate calculation formula ZCR = (1/2) Σ_(n=2)^N |sgn(s(n)) − sgn(s(n−1))| is used to process the preprocessed voice data and obtain the corresponding short-time zero-crossing rate, where N is the number of frames in the preprocessed voice data, N ≥ 2, s(n) is the signal amplitude of the n-th frame of voice data in the time domain, and ZCR is the short-time zero-crossing rate of the preprocessed speech data.
  • after the short-time zero-crossing rate is obtained, it is compared with the second threshold; the preprocessed voice data whose short-time zero-crossing rate is less than the second threshold is removed, and the remaining preprocessed voice data is used as the second test voice data.
  • the second test voice data is the voice data obtained after the mute segments in the preprocessed voice data are excluded for the second time.
  • two thresholds are set in advance, namely a first threshold T1 and a second threshold T2, where the first threshold T1 is a threshold corresponding to a short-term energy characteristic value and the second threshold T2 is a short-term zero crossing Rate corresponding threshold.
  • For example, the first threshold T1 is set to 10 and the second threshold T2 is set to 15. If the short-term energy feature value of the preprocessed voice data is less than 10, the corresponding preprocessed voice data is a mute segment and needs to be removed; if it is not less than 10, the corresponding preprocessed voice data is not a mute segment and needs to be retained. Likewise, if the short-time zero-crossing rate of the preprocessed voice data is less than 15, the corresponding preprocessed voice data is a mute segment and needs to be removed; if it is not less than 15, the corresponding preprocessed voice data is not a mute segment and needs to be retained.
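  • For illustration, the short-term energy and short-time zero-crossing computations above can be sketched as follows; keeping only frames that pass both thresholds approximates intersecting the first and second test voice data as described later in step S23. The thresholds follow the T1 = 10, T2 = 15 example; this is a hedged sketch, not the patent's exact procedure:

```python
import numpy as np

def short_term_energy(frame: np.ndarray) -> float:
    """E = sum of s(n)^2 over the frame."""
    return float(np.sum(frame ** 2))

def short_time_zcr(frame: np.ndarray) -> float:
    """Half the summed magnitude of sign changes, i.e. the number of
    zero crossings within the frame."""
    return 0.5 * float(np.sum(np.abs(np.diff(np.sign(frame)))))

def drop_silence(frames: np.ndarray, t1: float = 10.0,
                 t2: float = 15.0) -> np.ndarray:
    """Keep frames whose energy >= T1 and zero-crossing rate >= T2."""
    kept = [f for f in frames
            if short_term_energy(f) >= t1 and short_time_zcr(f) >= t2]
    return np.stack(kept) if kept else np.empty((0, frames.shape[1]))
```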
  • S23 Perform noise reduction processing on the first test voice data and the second test voice data to obtain voice data to be tested.
  • Specifically, the preprocessed voice data present in both the first test voice data and the second test voice data is obtained as common voice data, and noise reduction processing is then performed on the common voice data to obtain the voice data to be tested.
  • the noise reduction processing performed on the first test voice data and the second test voice data refers to removing noise segments in the first test voice data and the second test voice data.
  • the noise section includes, but is not limited to, sounds generated when a door or a window is opened or an object collides.
  • Denoising the common voice data to obtain the voice data to be tested specifically includes the following steps: (1) obtain the voice signal energy of the common voice data and determine at least one maximum value and one minimum value of the voice signal energy; (2) obtain the mutation time between adjacent maximum and minimum values; (3) if the mutation time is less than a preset minimum time threshold, the voice signal energy in the common voice data has mutated within a short time, and the common voice data corresponding to that mutation time is a noise band, so this part of the noise band is removed to obtain the voice data to be tested.
  • the minimum time threshold is a preset time value and is used to determine a noise segment in the common voice data.
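  • A rough sketch of this noise-band removal under stated assumptions: per-frame energy stands in for the voice signal energy, and a frame-count threshold stands in for the minimum time threshold; all names and defaults are illustrative:

```python
import numpy as np

def remove_noise_bands(frames: np.ndarray, min_frames: int = 3) -> np.ndarray:
    """Drop frames lying between adjacent energy extrema (a maximum and a
    minimum) that are closer together than `min_frames` frames."""
    energy = np.sum(frames ** 2, axis=1)            # voice signal energy per frame
    slope = np.sign(np.diff(energy))
    extrema = np.where(np.diff(slope) != 0)[0] + 1  # slope sign changes
    keep = np.ones(len(frames), dtype=bool)
    for a, b in zip(extrema[:-1], extrema[1:]):
        if b - a < min_frames:                      # abrupt change -> noise band
            keep[a:b + 1] = False
    return frames[keep]
```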
  • the first test voice data and the second test voice data are obtained by obtaining the short-term energy feature value and the short-time zero-crossing rate of the pre-processed voice data, and comparing them with the first threshold and the second threshold, respectively.
  • the pre-processed speech data corresponding to the mute segment can be excluded.
  • after noise reduction processing is performed on the first test voice data and the second test voice data, the voice data to be tested corresponding to the target voice can be retained, and the amount of data that needs to be processed when feature extraction is performed on the voice data to be tested is reduced.
  • the voice data to be tested is acquired by pre-emphasizing, framing, and windowing the original test voice data and then performing endpoint detection, so the voice data to be tested consists of multiple frames of single-frame voice data. This enables subsequent feature extraction to be performed on each frame of single-frame voice data in the voice data to be tested.
  • step S30 feature extraction is performed on the voice data to be tested to obtain the voice features to be tested, which specifically includes the following steps:
  • S31 Perform fast Fourier transform (FFT) processing on each single frame of voice data to obtain the power spectrum of the voice data to be tested. The power spectrum is P(k) = |s(k)|²/N, where N is the number of frames in the voice data to be tested, s(k) is the signal amplitude in the frequency domain, and P(k) is the power spectrum of the voice data to be tested.
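  • A short sketch of the power spectrum computation, assuming numpy's real FFT and an illustrative FFT size of 512:

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """P(k) = |FFT(s)|^2 / N for each windowed frame (one-sided spectrum)."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # FFT along the last axis
    return (np.abs(spectrum) ** 2) / n_fft
```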
  • S32 Use a Mel filter bank to perform dimension reduction processing on the power spectrum to obtain a Mel spectrum.
  • because the human auditory perception system behaves like a complex nonlinear system, the power spectrum obtained in step S31 cannot well reflect the nonlinear characteristics of the speech data; therefore, a Mel filter bank is also needed to reduce the dimensionality of the power spectrum.
  • in this way, the spectrum of the acquired voice data to be tested is closer to the frequencies perceived by the human ear.
  • the Mel filter bank is composed of multiple overlapping triangular band-pass filters.
  • each triangular band-pass filter is characterized by three frequencies: a lower limit frequency, a cutoff frequency, and a center frequency.
  • the center frequencies of these triangular band-pass filters are equally spaced on the Mel scale, which grows linearly below 1000 Hz and logarithmically above 1000 Hz.
  • The Mel spectrum is obtained by weighting the power spectrum with each triangular band-pass filter over its pass band, m_n = Σ_k w_n(k)·P(k) for k in [l_n, h_n], where n indexes the triangular band-pass filters, w_n is the conversion coefficient of the n-th filter, l_n is its lower limit frequency, h_n is its cutoff frequency, P(k) is the power spectrum, and k denotes the k-th frame of voice data.
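  • A sketch of a standard Mel filter bank of the kind described above (triangular band-pass filters whose center frequencies are equally spaced on the Mel scale); the filter count, FFT size, and sample rate are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int = 13, n_fft: int = 512,
                   sample_rate: int = 16000) -> np.ndarray:
    """Triangular filters; row n covers [l_n, h_n] and peaks at its center."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for n in range(1, n_filters + 1):
        l, c, h = bins[n - 1], bins[n], bins[n + 1]  # lower, center, cutoff
        fbank[n - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[n - 1, c:h] = (h - np.arange(c, h)) / max(h - c, 1)  # falling edge
    return fbank

# Mel spectrum per frame: weight the power spectrum by each filter.
# mel_spec = power_spectrum(frames) @ mel_filterbank().T
```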
  • cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the general Fourier spectrum is complex-valued, the cepstrum is also called the complex cepstrum.
  • Cepstrum analysis is then performed on the Mel spectrum: the logarithm of the Mel spectrum is taken and a discrete cosine transform (DCT) is applied to obtain the Mel frequency cepstrum coefficients (MFCC): c_i = Σ_(j=1)^n log(m_j)·cos(π·i·(j − 0.5)/n), i = 1, 2, 3, ..., N, where c_i is the i-th Mel frequency cepstrum coefficient, m_j is the j-th Mel spectrum value, and n is the number of Mel frequency cepstrum coefficients, which is related to the number of Mel filters; if the number of Mel filters is 13, the number of Mel frequency cepstrum coefficients can also be 13.
  • the MFCCs then need to be normalized.
  • the specific normalization step is: compute the average of all c_i, and then subtract this average from each c_i to obtain the normalized value corresponding to each c_i.
  • the normalized value corresponding to c_i is the Mel frequency cepstrum coefficient (MFCC) of the voice data to be tested, that is, the voice feature of the voice data to be tested.
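  • The log-DCT cepstral analysis and normalization above can be sketched as follows; the DCT basis is the standard type-II form, and interpreting the normalization as subtracting each coefficient's average over all frames is an assumption:

```python
import numpy as np

def mfcc(mel_spec: np.ndarray) -> np.ndarray:
    """Log of the Mel spectrum, type-II DCT, then mean normalization."""
    log_mel = np.log(mel_spec + 1e-10)                 # avoid log(0)
    n = log_mel.shape[1]                               # one MFCC per filter
    basis = np.cos(np.pi * np.outer(np.arange(n),
                                    2 * np.arange(n) + 1) / (2 * n))
    coeffs = log_mel @ basis.T                         # c_i per frame
    return coeffs - coeffs.mean(axis=0)                # subtract average c_i
```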
  • before the step in S40 of inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition, the voice discrimination method further includes pre-training the convolutional deep belief network model.
  • the pre-trained convolutional deep belief network model includes the following steps:
  • S401 Acquire voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data.
  • the speech data to be trained refers to speech data used to train the convolutional deep belief network model.
  • the speech data in the speech data to be trained includes standard training speech data and interference training speech data.
  • the standard training speech data refers to pure speech data that does not include the mute segment and the noise segment
  • the interference training speech data refers to speech data that includes the mute segment and the noise segment.
  • the to-be-trained voice data can be obtained from a pre-differentiated voice database that stores standard training voice data and interference training voice data, or from an open source voice training set.
  • the to-be-trained voice data obtained in this embodiment is voice data that has been distinguished in advance, and the ratio of standard training voice data to interference training voice data is 1:1, which facilitates subsequent model training based on the obtained standard training voice data and interference training voice data.
  • CDBN: Convolutional Deep Belief Network.
  • S402 The standard training speech data and the interference training speech data are input into the convolutional deep confidence network model in the same proportion for model training to obtain the original convolution-limited Boltzmann machine.
  • the Convolutional Deep Belief Network (CDBN) model is composed of multiple convolutional restricted Boltzmann machines (CRBM); therefore, when the standard training speech data and interference training speech data are input into the convolutional deep belief network model in equal proportion for training, each convolutional restricted Boltzmann machine (CRBM) in the model should be trained.
  • CRBM: convolutional restricted Boltzmann machine.
  • the CDBN is composed of a number of stacked CRBMs, and each CRBM is divided into two layers. The upper layer is the hidden layer h, which is used to extract features from the input voice data to be trained (voice data in which the ratio of standard training voice data to interference training voice data is 1:1); the lower layer is the visible layer v, which is used to input the voice data to be trained.
  • the hidden layer and the visible layer include multiple hidden units and multiple visual units.
  • the states of the hidden units and the visible units are binary variables: v_i ∈ {0, 1} and h_j ∈ {0, 1}, where v_i denotes the binary state of the i-th unit of the visible layer v, and h_j denotes the binary state of the j-th unit of the hidden layer h.
  • the number of visible units is n, and the number of hidden units is m.
  • the standard training speech data and interference training speech data are input into the convolutional deep confidence network model for training in the same proportion.
  • The specific training steps are as follows. First, the energy function built into the CRBM is used to determine E(v, h): E(v, h) = −Σ_(i,j) w_ij·v_i·h_j − Σ_i a_i·v_i − Σ_j b_j·h_j, where θ = {w_ij, a_i, b_j} are the model parameters: a_i is the bias parameter of the visible layer, b_j is the bias parameter of the hidden layer, and w_ij is the weight on the connection between the i-th visible unit and the j-th hidden unit (the weight w_ji on the connection between the j-th hidden unit and the i-th visible unit satisfies w_ji = w_ij). Based on this energy function, the activation probabilities of the hidden and visible units are P(h_j = 1 | v) = σ(b_j + (w * v)_j) and P(v_i = 1 | h) = σ(a_i + (w̃ * h)_i), where σ represents the sigmoid activation function, * v represents the effective convolution, w̃ is the flipped weight array, and v and h represent the states of the visible and hidden layers, respectively.
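  • As an illustrative sketch (not the patent's exact formulation), the energy function and unit sampling can be written as follows for the non-convolutional special case of the restricted Boltzmann machine; the convolutional variant replaces the matrix products with the effective convolution. All names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_energy(v, h, w, a, b):
    """E(v, h) = -sum_ij w_ij v_i h_j - sum_i a_i v_i - sum_j b_j h_j."""
    return -(v @ w @ h) - (a @ v) - (b @ h)

def sample_hidden(v, w, b, rng=None):
    """P(h_j = 1 | v) = sigmoid(b_j + (w^T v)_j); draw a binary sample."""
    if rng is None:
        rng = np.random.default_rng()
    p = sigmoid(b + v @ w)
    return (rng.random(p.shape) < p).astype(float), p
```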
  • S403 Perform stack processing on the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
  • After the original convolutional restricted Boltzmann machine is obtained, it is stacked: the output data of the first effective convolutional restricted Boltzmann machine is used as the input data of the second original convolutional restricted Boltzmann machine, the output data of the second effective convolutional restricted Boltzmann machine is used as the input data of the third original convolutional restricted Boltzmann machine, and so on. The stacked convolutional restricted Boltzmann machines generate a convolutional deep belief network model.
  • the already-distinguished standard training speech data and interference training speech data are input into the convolutional deep belief network model, and the relevant formulas of the convolutional restricted Boltzmann machine (CRBM) given in step S402 are used to iteratively update the bias parameters and weights in the model to obtain the original convolutional restricted Boltzmann machine.
  • the original convolutional restricted Boltzmann machines are then stacked to obtain the convolutional deep belief network model, so that the model is continuously updated and its recognition accuracy is improved.
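  • A minimal sketch of the greedy layer-wise stacking described above, assuming a hypothetical train_crbm helper that trains one layer and returns an object with a transform method; both names are illustrative, not part of the patent:

```python
def stack_cdbn(data, layer_sizes, train_crbm):
    """Greedy layer-wise stacking: each layer's hidden representation
    becomes the visible input of the next layer."""
    layers, x = [], data
    for size in layer_sizes:
        crbm = train_crbm(x, size)   # hypothetical: trains one CRBM layer
        layers.append(crbm)
        x = crbm.transform(x)        # hypothetical: hidden activations
    return layers
```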
  • step S403 stacking the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model, which specifically includes the following steps:
  • S4031 Perform a maximum probability pooling process and a sparse regularization process on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine.
  • overfitting means that, when the convolutional deep belief network model is used to recognize speech data, the recognition accuracy is very high if the input speech data is the data used to train the model, but very low when the input voice data to be tested is non-training voice data.
  • Overlap refers to the case where adjacent original convolutional restricted Boltzmann machines overlap one another.
  • therefore, when the original convolutional restricted Boltzmann machine is used to stack the convolutional deep belief network model, it also needs to be processed with maximum probability pooling and sparse regularization to prevent the original convolutional restricted Boltzmann machine from overfitting and overlapping.
  • the maximum probability pooling process is a processing operation performed to prevent the occurrence of overlap
  • the sparse regularization process is a processing operation performed to prevent the occurrence of overfitting.
  • Probabilistic maximum pooling processing and sparse regularization processing on the original convolution-limited Boltzmann machine can effectively reduce the processing amount of the stacking process, while improving the recognition accuracy of the convolution-limited Boltzmann machine.
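  • For illustration, a sketch of probabilistic max pooling in the style of Lee et al. (2009), which the maximum probability pooling described here resembles: within each pooling block, hidden units compete through a softmax that includes an explicit "all off" state. An assumption-laden sketch, not the patent's exact procedure:

```python
import numpy as np

def prob_max_pool(inputs: np.ndarray, block: int = 2):
    """For pre-activations I_k grouped into blocks:
      P(h_k = 1)      = exp(I_k) / (1 + sum_block exp(I_k'))
      P(pool unit on) = 1 - 1 / (1 + sum_block exp(I_k'))"""
    n = len(inputs) - len(inputs) % block        # drop any ragged tail
    e = np.exp(inputs[:n].reshape(-1, block))    # note: no overflow guard
    denom = 1.0 + e.sum(axis=1, keepdims=True)
    p_hidden = e / denom                         # per-unit "on" probability
    p_pool_on = 1.0 - 1.0 / denom[:, 0]          # per-block "on" probability
    return p_hidden, p_pool_on
```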
  • S4032 Perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
  • the acquired effective convolution-limited Boltzmann machine is stacked to obtain a convolutional deep confidence network model.
  • in this way, the acquired convolutional deep belief network model adapts to its environment more completely and avoids overfitting and overlapping, which makes its recognition of any voice data to be tested more accurate.
  • step S40 is to input a voice feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a voice discrimination result, which specifically includes the following steps:
  • S41 The speech features to be tested are input into a pre-trained convolutional deep confidence network model for recognition, and a speech recognition probability value is obtained.
  • the speech features to be tested are input into the pre-trained convolutional deep belief network model for recognition; through the recognition process of the model, the speech features to be tested yield a probability value, which is the speech recognition probability value.
  • specifically, before recognition, the convolutional deep belief network model divides the voice data to be tested, splitting the single-frame voice data in the voice data to be tested into at least two voice segments of equal size for recognition.
  • the convolutional deep belief network model recognizes the speech features to be tested corresponding to each speech segment, and obtains the speech recognition probability value of each speech segment. Then, the average value of the speech recognition probability values of at least two speech segments is calculated, and the obtained average value is the speech recognition probability value corresponding to the speech data to be tested.
  • the voice segment refers to a segment containing multiple single-frame voice data.
  • After the speech recognition probability values are obtained, the convolutional deep belief network model compares each segment's speech recognition probability value with the preset probability value: speech segments whose probability value is less than the preset probability value are interfering speech, and those whose probability value is greater than or equal to the preset probability value are the target speech. Further, the model removes the speech segments whose recognition probability values are less than the preset probability value and retains only those whose values are greater than or equal to it, so that only the voice data to be tested corresponding to the target voice remains.
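  • A small sketch of the averaging and thresholding just described; the preset probability value of 0.5 is an assumption for illustration:

```python
import numpy as np

def classify_segments(segment_probs, preset: float = 0.5):
    """Average per-segment recognition probabilities and flag segments
    at or above the preset value as target speech."""
    probs = np.asarray(segment_probs, dtype=float)
    overall = probs.mean()            # probability for the whole utterance
    is_target = probs >= preset       # True -> target, False -> interference
    return overall, is_target
```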
  • in this way, the target voice and the interference voice in the voice data to be tested are judged, the voice data corresponding to the interference voice is removed, and the voice data corresponding to the target voice is retained, thereby distinguishing the target voice from the interfering voice in the voice data to be tested.
  • performing endpoint detection on the voice data can initially remove the test voice data corresponding to the interference voice and effectively reduce the time the convolutional deep belief network model needs to recognize the voice data to be tested.
  • Feature extraction is then performed on the voice data to be tested to obtain the voice features to be tested, and the features are input into the pre-trained convolutional deep belief network model for recognition to obtain the speech discrimination result, which improves the accuracy of speech discrimination and makes the obtained result more accurate.
  • a voice distinguishing device is provided, and the voice distinguishing device corresponds to the voice distinguishing method in the above embodiment.
  • the voice discrimination device includes an original test voice data processing module 10, a voice data acquisition module 20 to be tested, a voice feature acquisition module 30 to be tested, and a voice discrimination result acquisition module 40.
  • the functions implemented by the original test voice data processing module 10, the voice data acquisition module 20 to be tested, the voice feature acquisition module 30 to be tested, and the speech discrimination result acquisition module 40 correspond one-to-one to the steps of the speech discrimination method in the above embodiment; to avoid redundancy, they are not described in detail here.
  • the raw test voice data processing module 10 is configured to obtain raw test voice data, preprocess the raw test voice data, and obtain preprocessed voice data.
  • the voice data to be tested module 20 is configured to perform endpoint detection processing on the preprocessed voice data to obtain voice data to be tested.
  • the voice feature acquisition module 30 to be tested is configured to perform feature extraction on the voice data to be tested to obtain the voice features to be tested.
  • the speech discrimination result acquisition module 40 is configured to input a voice feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a speech discrimination result.
  • the original test voice data processing module 10 includes a first processing unit 11, a second processing unit 12, and a third processing unit 13. The first processing unit 11 is configured to perform pre-emphasis processing on the original test voice data using s'_n = s_n − a·s_(n−1), where s_n is the amplitude of the speech signal at time n, s_(n−1) is the amplitude of the speech signal at time n−1, and a is the pre-emphasis coefficient.
  • the second processing unit 12 is configured to perform frame processing on the original test voice data after the pre-emphasis processing to obtain framed voice data.
  • the third processing unit 13 is configured to perform window processing on the framed voice data to obtain pre-processed voice data.
  • the voice data acquisition module 20 includes a first test voice data acquisition unit 21, a second test voice data acquisition unit 22, and a voice data acquisition unit 23.
  • the first test voice data obtaining unit 21 is configured to process the preprocessed voice data using the short-term energy feature value calculation formula, obtain the short-term energy feature value corresponding to the preprocessed voice data, and remove the preprocessed voice data whose short-term energy feature value is less than the first threshold to obtain the first test voice data. The short-term energy feature value calculation formula is E = Σ_(n=1)^N s(n)², where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of preprocessed voice data in the time domain.
  • the second test voice data obtaining unit 22 is configured to process the preprocessed voice data using the short-time zero-crossing rate calculation formula, obtain the short-time zero-crossing rate corresponding to the preprocessed voice data, and remove the preprocessed voice data whose short-time zero-crossing rate is less than the second threshold to obtain the second test voice data. The short-time zero-crossing rate calculation formula is ZCR = (1/2) Σ_(n=2)^N |sgn(s(n)) − sgn(s(n−1))|, where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of voice data in the time domain.
  • the voice data to be tested acquisition unit 23 is configured to perform noise reduction processing on the first test voice data and the second test voice data to obtain the voice data to be tested.
  • the voice data to be tested includes single-frame voice data.
  • the speech feature acquisition module 30 includes a power spectrum acquisition unit 31, a Mel spectrum acquisition unit 32, and a speech feature acquisition unit 33.
  • the power spectrum obtaining unit 31 is configured to perform fast Fourier transform processing on single frame of voice data to obtain a power spectrum of the voice data to be tested.
  • the Mel spectrum acquisition unit 32 is configured to perform a dimensionality reduction process on the power spectrum by using a Mel filter bank to obtain a Mel spectrum.
  • the speech feature acquiring unit 33 is configured to perform cepstrum analysis on the Mel spectrum to acquire the speech feature to be tested.
  • the speech discrimination device is further used for pre-training a convolutional deep belief network model.
  • the speech discrimination device further includes a to-be-trained speech data acquisition unit 401, a model training unit 402, and a model acquisition unit 403.
  • the to-be-trained voice data acquisition unit 401 is configured to acquire voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data.
  • a model training unit 402 is configured to input standard training speech data and interference training speech data into a convolutional deep belief network model in an equal proportion for model training, and obtain an original convolution-limited Boltzmann machine.
  • a model obtaining unit 403 is configured to perform stack processing on the original convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
  • the model acquisition unit 403 includes a pooling and regular processing unit 4031 and a stack processing unit 4032.
  • the pooling and regular processing unit 4031 is configured to perform a maximum probability pooling process and a sparse regularization process on the original convolution-limited Boltzmann machine to obtain an effective convolution-limited Boltzmann machine.
  • a stacking processing unit 4032 is configured to perform stacking processing on the effective convolution-limited Boltzmann machine to obtain a convolutional deep confidence network model.
  • the speech discrimination result acquisition module 40 includes a speech recognition probability value acquisition unit 41 and a speech discrimination result acquisition unit 42.
  • the speech recognition probability value obtaining unit 41 is configured to input a speech feature to be tested into a pre-trained convolutional deep confidence network model for recognition, and obtain a speech recognition probability value.
  • the speech discrimination result acquisition unit 42 is configured to acquire a speech discrimination result based on a speech recognition probability value.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the database of the computer equipment is used to store data obtained or generated during the method of speech discrimination.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a method of speech discrimination.
  • a computer device which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: obtaining the original test voice data and preprocessing it to obtain preprocessed voice data; performing endpoint detection processing on the preprocessed voice data to obtain the voice data to be tested; performing feature extraction on the voice data to be tested to obtain the voice features to be tested; and inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain the speech discrimination result.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: the short-term energy feature value calculation formula is used to process the preprocessed voice data and obtain the corresponding short-term energy feature value, and the preprocessed voice data whose short-term energy feature value is less than the first threshold is removed to obtain the first test voice data; the short-term energy feature value calculation formula is E = Σ_(n=1)^N s(n)², where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of preprocessed voice data in the time domain. The short-time zero-crossing rate calculation formula is then used to process the preprocessed voice data and obtain the corresponding short-time zero-crossing rate, and the preprocessed voice data whose short-time zero-crossing rate is less than the second threshold is removed to obtain the second test voice data; the short-time zero-crossing rate calculation formula is ZCR = (1/2) Σ_(n=2)^N |sgn(s(n)) − sgn(s(n−1))|, with N and s(n) defined as above.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: fast Fourier transform processing is performed on each single frame of voice data to obtain the power spectrum of the voice data to be tested; a Mel filter bank is used to perform dimensionality reduction processing on the power spectrum to obtain the Mel spectrum; and cepstrum analysis is performed on the Mel spectrum to obtain the voice features to be tested.
  • pre-training the convolutional deep belief network model includes: acquiring voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data; inputting the standard training voice data and the interference training voice data into the convolutional deep belief network model in equal proportion for model training to obtain the original convolutional restricted Boltzmann machine; and stacking the original convolutional restricted Boltzmann machine to obtain the convolutional deep belief network model.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and stacking the effective convolutional restricted Boltzmann machine to obtain a convolutional deep belief network model.
  • when the processor executes the computer-readable instructions, the following steps are further implemented: inputting the voice features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain a speech recognition probability value, and obtaining the speech discrimination result based on the speech recognition probability value.
  • one or more non-volatile readable storage media storing computer-readable instructions are provided.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors implement the following steps: obtaining the original test voice data and preprocessing it to obtain preprocessed voice data; performing endpoint detection processing on the preprocessed voice data to obtain the voice data to be tested; performing feature extraction on the voice data to be tested to obtain the voice features to be tested; and inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain the speech discrimination result.
  • when the computer-readable instructions are executed, the following steps are further implemented: the original test voice data is framed to obtain framed voice data, and the framed voice data is windowed to obtain the preprocessed voice data.
  • the windowing formula is s''_n = w_n · s'_n, where w_n is the Hamming window value at time n, w_n = 0.54 − 0.46·cos(2πn/(N − 1)), N is the window length of the Hamming window, s'_n is the signal amplitude in the time domain at time n, and s''_n is the signal amplitude in the time domain at time n after windowing.
  • the short-term energy feature value calculation formula is used to process the preprocessed voice data and obtain the corresponding short-term energy feature value, and the preprocessed voice data whose short-term energy feature value is less than the first threshold is removed to obtain the first test voice data; the formula is E = Σ_(n=1)^N s(n)², where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of preprocessed voice data in the time domain.
  • the short-time zero-crossing rate calculation formula is used to process the preprocessed voice data and obtain the corresponding short-time zero-crossing rate, and the preprocessed voice data whose short-time zero-crossing rate is less than the second threshold is removed to obtain the second test voice data; the formula is ZCR = (1/2) Σ_(n=2)^N |sgn(s(n)) − sgn(s(n−1))|, where N is the number of frames in the preprocessed voice data, N ≥ 2, and s(n) is the signal amplitude of the n-th frame of voice data in the time domain.
  • the following steps are further implemented: performing fast Fourier transform processing on each single frame of voice data to obtain the power spectrum of the voice data to be tested; performing dimensionality reduction on the power spectrum using a Mel filter bank to obtain the Mel spectrum; and performing cepstrum analysis on the Mel spectrum to obtain the voice features to be tested.
  • pre-training the convolutional deep belief network model includes: acquiring voice data to be trained, where the voice data to be trained includes standard training voice data and interference training voice data; inputting the standard training voice data and the interference training voice data into the convolutional deep belief network model in equal proportion for model training to obtain the original convolutional restricted Boltzmann machine; and stacking the original convolutional restricted Boltzmann machine to obtain the convolutional deep belief network model.
  • the following steps are further implemented: performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and stacking the effective convolutional restricted Boltzmann machine to obtain a convolutional deep belief network model.
  • the following steps are also implemented: inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech recognition probability value, and obtaining the speech discrimination result based on the speech recognition probability value.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application discloses a voice distinguishing method and device, a computer device, and a storage medium. The method comprises: acquiring original test voice data and preprocessing it to obtain preprocessed voice data; performing endpoint detection processing on the preprocessed voice data to obtain voice data to be tested; extracting features of the voice data to be tested to obtain voice features to be tested; and inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a voice distinguishing result. The method improves the accuracy of voice distinguishing and makes the obtained voice distinguishing result more accurate.

Description

Speech distinguishing method, device, computer equipment and storage medium

This application is based on the Chinese invention patent application No. 201810561695.9, filed on June 4, 2018 and entitled "Voice distinguishing method, device, computer equipment and storage medium", and claims its priority.

Technical field

The present application relates to the technical field of speech recognition, and in particular, to a speech distinguishing method, a device, a computer device, and a storage medium.

Background

Speech data generally includes a target speech and interfering speech. The target speech is the part of the speech data in which the voiceprint changes continuously and noticeably. The interfering speech can be the part that carries no pronunciation because of silence (the silent segment) or the part that is environmental noise (the noise segment). Speech distinguishing means filtering the silence out of the input speech and keeping only the speech data that is meaningful for recognition (the target speech). At present, endpoint detection is the main technique used to distinguish speech data. With this approach, when the target speech is mixed with noise, the louder the noise, the harder the distinguishing task becomes and the less accurate the endpoint detection result is. The recognition result of endpoint-detection-based speech distinguishing is therefore easily affected by external factors, which makes the distinguishing result inaccurate.

Summary of the invention

The embodiments of the present application provide a speech distinguishing method, device, computer device and storage medium to solve the problem of inaccurate speech distinguishing results.
An embodiment of the present application provides a speech distinguishing method, including:

acquiring original test speech data, and preprocessing the original test speech data to obtain preprocessed speech data;

performing endpoint detection on the preprocessed speech data to obtain speech data to be tested;

performing feature extraction on the speech data to be tested to obtain speech features to be tested; and

inputting the speech features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech distinguishing result.

An embodiment of the present application provides a speech distinguishing device, including:

an original test speech data processing module, configured to acquire original test speech data and preprocess the original test speech data to obtain preprocessed speech data;

a to-be-tested speech data acquisition module, configured to perform endpoint detection on the preprocessed speech data to obtain speech data to be tested;

a to-be-tested speech feature acquisition module, configured to perform feature extraction on the speech data to be tested to obtain speech features to be tested; and

a speech distinguishing result acquisition module, configured to input the speech features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain a speech distinguishing result.
An embodiment of the present application provides a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:

acquiring original test speech data, and preprocessing the original test speech data to obtain preprocessed speech data;

performing endpoint detection on the preprocessed speech data to obtain speech data to be tested;

performing feature extraction on the speech data to be tested to obtain speech features to be tested; and

inputting the speech features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech distinguishing result.

The embodiments of the present application provide one or more non-volatile readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors implement the following steps:

acquiring original test speech data, and preprocessing the original test speech data to obtain preprocessed speech data;

performing endpoint detection on the preprocessed speech data to obtain speech data to be tested;

performing feature extraction on the speech data to be tested to obtain speech features to be tested; and

inputting the speech features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech distinguishing result.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the application will become apparent from the description, the drawings, and the claims.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an application scenario diagram of a speech distinguishing method according to an embodiment of the present application;

FIG. 2 is a flowchart of a speech distinguishing method according to an embodiment of the present application;

FIG. 3 is a specific flowchart of step S10 in FIG. 2;

FIG. 4 is a specific flowchart of step S20 in FIG. 2;

FIG. 5 is a specific flowchart of step S30 in FIG. 2;

FIG. 6 is another flowchart of a speech distinguishing method according to an embodiment of the present application;

FIG. 7 is a specific flowchart of step S403 in FIG. 6;

FIG. 8 is a specific flowchart of step S40 in FIG. 2;

FIG. 9 is a schematic diagram of a speech distinguishing device according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
The speech distinguishing method provided in the embodiments of the present application can be used in the application environment shown in FIG. 1. A terminal device sends the collected original test speech data to the corresponding server through a network. After obtaining the original test speech data, the server connected to the terminal device first performs endpoint detection on it to obtain the speech data to be tested, then performs feature extraction on the speech data to be tested to obtain the speech features to be tested, and finally inputs the speech features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech distinguishing result, thereby distinguishing the target speech from the interfering speech in the speech data. The terminal device is a device that can interact with a user, including but not limited to personal computers, notebook computers, smart phones and tablet computers. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a speech distinguishing method is provided, and the method includes the following steps:

S10: Acquire original test speech data, and preprocess the original test speech data to obtain preprocessed speech data.
The original test speech data is the speaker's speech data collected by the terminal device. It includes a target speech and interfering speech, where the target speech is the part of the speech data in which the voiceprint changes continuously and noticeably; correspondingly, the interfering speech is the part of the speech data other than the target speech. Specifically, the interfering speech includes silent segments and noise segments. A silent segment is a part of the speech data that carries no pronunciation because of silence; for example, a speaker thinks and breathes while speaking, and since no sound is produced while thinking or breathing, that part of the speech is a silent segment. A noise segment is the environmental noise part of the speech data; sounds such as the opening and closing of doors and windows or the collision of objects can all be regarded as noise segments.

Specifically, the terminal device acquires a piece of original test speech data through a sound acquisition module (such as a recording module). The original test speech data is a piece of speech data, containing both target speech and interfering speech, on which speech distinguishing needs to be performed. After the original test speech data is acquired, it is preprocessed to obtain preprocessed speech data, i.e. the speech data obtained after the original test speech data has gone through preprocessing.

The preprocessing in this embodiment specifically includes pre-emphasis, framing and windowing of the original test speech data. The pre-emphasis formula $s'_n = s_n - a \cdot s_{n-1}$ is applied to the original test speech data to eliminate the influence of the speaker's vocal cords and lips on the speech and to improve its high-frequency resolution, where $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and a is the pre-emphasis coefficient. The pre-emphasized original test speech data is then framed. During framing, discontinuities appear at the start and end of each frame, and the more frames there are, the larger the error relative to the original test speech data. To preserve the frequency characteristics of each frame of speech data, windowing is also required. Preprocessing the original test speech data to obtain preprocessed speech data provides the data source for the subsequent distinguishing of the original test speech data.
S20: Perform endpoint detection on the preprocessed speech data to obtain the speech data to be tested.

Endpoint detection is a processing technique that determines the start point and end point of the target speech in a piece of speech data. Interfering speech inevitably exists in a piece of speech data. Therefore, after the terminal device acquires the original test speech data and preprocesses it, a preliminary detection is performed on the preprocessed speech data to remove the interfering speech, and the remaining speech data is kept as the speech data to be tested. The speech data to be tested includes the target speech as well as interfering speech that has not been completely removed.

Specifically, after the preprocessed speech data is obtained, the short-time energy feature value and the short-time zero-crossing rate corresponding to the preprocessed speech data are obtained. The short-time energy feature value is the energy value corresponding to one frame of speech in the speech data at any moment. The short-time zero-crossing rate is the number of intersections between the speech signal corresponding to the speech data and the horizontal axis (zero level). In this embodiment, the server performs endpoint detection on the preprocessed speech data, which reduces the processing time of speech distinguishing and improves its quality.
Understandably, endpoint detection on the preprocessed speech data only roughly removes the speech data corresponding to the silent and noise segments. To remove the silent and noise segments from the preprocessed speech data more accurately, steps S30 and S40 still need to be performed after the speech data to be tested is obtained, so as to obtain a more accurate target speech.
S30: Perform feature extraction on the speech data to be tested to obtain the speech features to be tested.

The speech features to be tested include, but are not limited to, spectral features, voice quality features and voiceprint features. Spectral features distinguish different speech data, such as target speech and interfering speech, according to the vibration frequency of the sound. Voice quality features and voiceprint features identify the speaker corresponding to the speech data to be tested according to the voiceprint and the timbre of the voice. Since speech distinguishing only separates the target speech from the interfering speech in the speech data, acquiring the spectral features of the speech data to be tested is sufficient to complete the task. The spectrum is short for frequency spectral density, and spectral features are parameters reflecting the frequency spectral density.

S40: Input the speech features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech distinguishing result.
The convolutional deep belief network (CDBN) model is a neural network model pre-trained to distinguish the target speech from the interfering speech in the speech data to be tested. The speech distinguishing result is the recognition result, produced by the convolutional deep belief network model, that separates the target speech from the interfering speech in the speech data to be tested. The pre-trained convolutional deep belief network model recognizes the speech data to be tested and outputs a speech recognition probability value. This value is compared with a preset probability value: the speech data to be tested whose speech recognition probability value is greater than or equal to the preset probability value is the target speech, and the speech data to be tested whose value is smaller than the preset probability value is interfering speech. That is, in this embodiment, the target speech with the higher recognition probability is kept and the interfering speech with the lower recognition probability is removed. Using the convolutional deep belief network model to recognize the speech data to be tested improves the recognition accuracy and makes the speech distinguishing result more accurate.

In the speech distinguishing method provided in this embodiment, performing endpoint detection on the preprocessed speech data to obtain the speech data to be tested reduces the processing time of speech distinguishing and improves the quality of speech processing. Feature extraction is then performed on the speech data to be tested, and the speech features to be tested are input into a pre-trained convolutional deep belief network model for recognition to obtain the speech distinguishing result, which improves the accuracy of speech distinguishing and makes the obtained result more accurate.
In an embodiment, as shown in FIG. 3, step S10 of preprocessing the original test speech data to obtain the preprocessed speech data specifically includes the following steps:

S11: Perform pre-emphasis on the original test speech data. The pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, where $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and a is the pre-emphasis coefficient.

Specifically, to eliminate the influence of the speaker's vocal cords and lips on the speech and to improve its high-frequency resolution, the formula $s'_n = s_n - a \cdot s_{n-1}$ is applied to the original test speech data. The speech signal amplitude is the amplitude of the speech expressed by the speech data in the time domain, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; in general, a value of a = 0.97 works well.

S12: Frame the pre-emphasized original test speech data to obtain framed speech data.

The speech signal corresponding to the pre-emphasized speech data is non-stationary, but it is short-time stationary, i.e. the speech signal is stationary within a short time range (such as 10 ms-30 ms). Therefore, after the pre-emphasized speech data is obtained, framing is performed to divide it into frame-by-frame speech data; the framed speech data corresponds to speech segments within this short time range, and each such segment is called a frame. In general, to keep two adjacent frames of speech data continuous during framing, the speech data of adjacent frames is made to overlap; this overlapping portion, equal to 1/2 of the frame length, is called the frame shift.
S13: Perform windowing on the framed speech data to obtain the preprocessed speech data. The windowing formulas are

$w_n = 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$,

and $s''_n = w_n \cdot s'_n$, where $w_n$ is the Hamming window at time n, N is the Hamming window length, $s'_n$ is the signal amplitude in the time domain at time n, and $s''_n$ is the windowed signal amplitude in the time domain at time n.
After framing, discontinuities appear at the start and end of each frame of speech data, and the more frames there are, the larger the error relative to the original test speech data. To preserve the frequency characteristics of each frame, windowing is applied to the framed speech data. In this embodiment a Hamming window is used: the Hamming window function $w_n = 0.54 - 0.46\cos\left(\dfrac{2\pi n}{N-1}\right)$ is applied first, and the windowed signal amplitude is then obtained with the formula $s''_n = w_n \cdot s'_n$.

Through steps S11-S13, by pre-emphasizing, framing and windowing the original test speech data, preprocessed speech data with high resolution, good stationarity and a small error relative to the original test speech data is obtained. This improves the efficiency of the subsequent endpoint detection that yields the speech data to be tested and guarantees the quality of that data.
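The preprocessing pipeline of steps S11-S13 can be summarized in a short numpy sketch. This is a minimal illustration, not the patented implementation: the 16 kHz sample rate, the 25 ms frame length (400 samples) with a 50% frame shift, and all function and parameter names are assumptions made for the example.

```python
import numpy as np

def preprocess(signal, a=0.97, frame_len=400, frame_shift=200):
    """Pre-emphasis (S11), framing (S12) and Hamming windowing (S13)."""
    # S11: pre-emphasis, s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # S12: overlapping frames; frame shift is 1/2 of the frame length
    # (assumes the signal is at least one frame long)
    num_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len]
        for i in range(num_frames)
    ])

    # S13: Hamming window, w_n = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    n = np.arange(frame_len)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    return frames * hamming
```

Each row of the returned array is one preprocessed frame, ready for the endpoint detection of step S20.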
In an embodiment, as shown in FIG. 4, step S20 of performing endpoint detection on the preprocessed speech data to obtain the speech data to be tested specifically includes the following steps:
S21: Process the preprocessed speech data with the short-time energy feature value formula to obtain the short-time energy feature value corresponding to the preprocessed speech data, and remove the preprocessed speech data whose short-time energy feature value is less than a first threshold to obtain the first test speech data. The short-time energy feature value formula is

$E = \sum_{n=1}^{N} s(n)^2$,

where N is the number of frames in the preprocessed speech data, N ≥ 2, s(n) is the signal amplitude of the n-th frame of preprocessed speech data in the time domain, and E is the short-time energy feature value of the preprocessed speech data.

The first threshold is a preset threshold for distinguishing, on the basis of the short-time energy feature value, the silent segments of the interfering speech from the target speech.

In this embodiment, the short-time energy feature value is obtained and compared with the first threshold; the preprocessed speech data whose short-time energy feature value is less than the first threshold is removed, and the remaining preprocessed speech data is used as the first test speech data. Understandably, the first test speech data is the speech data left after the silent segments have been excluded from the preprocessed speech data for the first time.
S22: Process the preprocessed speech data with the short-time zero-crossing rate formula to obtain the short-time zero-crossing rate corresponding to the preprocessed speech data, and remove the preprocessed speech data whose short-time zero-crossing rate is less than a second threshold to obtain the second test speech data. The short-time zero-crossing rate formula is

$ZCR = \dfrac{1}{2}\sum_{n=2}^{N} \left|\operatorname{sgn}\big(s(n)\big) - \operatorname{sgn}\big(s(n-1)\big)\right|$,

where N is the number of frames in the preprocessed speech data, N ≥ 2, s(n) is the signal amplitude of the n-th frame of speech data in the time domain, and ZCR is the short-time zero-crossing rate of the preprocessed speech data.

The second threshold is a preset threshold for distinguishing, on the basis of the short-time zero-crossing rate, the silent segments of the interfering speech from the target speech. In this embodiment, the short-time zero-crossing rate is obtained and compared with the second threshold; the preprocessed speech data whose short-time zero-crossing rate is less than the second threshold is removed, and the remaining preprocessed speech data is used as the second test speech data. Understandably, the second test speech data is the speech data obtained after the silent segments have been excluded from the preprocessed speech data for the second time.

For example, two thresholds are preset for endpoint detection: a first threshold T1 corresponding to the short-time energy feature value and a second threshold T2 corresponding to the short-time zero-crossing rate. In this embodiment, T1 is set to 10 and T2 is set to 15. If the short-time energy feature value of the preprocessed speech data is less than 10, the corresponding preprocessed speech data is a silent segment and is removed; if it is not less than 10, the corresponding preprocessed speech data is not a silent segment and is kept. If the short-time zero-crossing rate of the preprocessed speech data is less than 15, the corresponding preprocessed speech data is a silent segment and is removed; if it is not less than 15, the corresponding preprocessed speech data is not a silent segment and is kept.
S23: Perform de-noising on the first test speech data and the second test speech data to obtain the speech data to be tested.

Specifically, after the first test speech data and the second test speech data, from which the silent segments have been removed, are obtained, the preprocessed speech data present in both is taken as the common speech data, and the common speech data is then de-noised to obtain the speech data to be tested. De-noising the first test speech data and the second test speech data means removing the noise segments from them; these noise segments include, but are not limited to, the sounds of opening and closing doors and windows or of colliding objects.

Further, de-noising the common speech data to obtain the speech data to be tested specifically includes the following steps: (1) obtain the speech signal energy of the common speech data and determine at least one maximum and one minimum of that energy; (2) obtain the change time between adjacent maxima and minima; (3) if this change time is less than a preset minimum time threshold, the speech signal energy of the common speech data changes abruptly within a short time, so the common speech data corresponding to that change time is a noise segment and is removed in order to obtain the speech data to be tested. The minimum time threshold is a preset time value used to identify the noise segments in the common speech data.

In steps S21-S23, by obtaining the short-time energy feature value and the short-time zero-crossing rate of the preprocessed speech data and comparing them with the first and second thresholds to obtain the first and second test speech data respectively, the preprocessed speech data corresponding to the silent segments can be excluded. De-noising the first and second test speech data then keeps the speech data to be tested that corresponds to the target speech, reducing the amount of data to be processed during the subsequent feature extraction.
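Steps S21-S23 amount to two threshold tests followed by keeping their intersection. The sketch below illustrates this at the per-frame level, reusing the frames produced by the preprocessing sketch above; the threshold values mirror the worked example (T1 = 10, T2 = 15), and the function name and the decision to apply both tests per frame are assumptions of the example, not the patented implementation.

```python
import numpy as np

def endpoint_detect(frames, t1=10.0, t2=15.0):
    """Keep frames that pass both the short-time energy test (S21)
    and the short-time zero-crossing-rate test (S22)."""
    # S21: short-time energy, E = sum of squared amplitudes
    energy = np.sum(frames ** 2, axis=1)

    # S22: short-time zero-crossing rate,
    # ZCR = 1/2 * sum |sgn(s(n)) - sgn(s(n-1))|
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

    # The frames kept by both tests form the common speech data;
    # a further de-noising pass (S23) would drop short energy bursts.
    keep = (energy >= t1) & (zcr >= t2)
    return frames[keep]
```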
In an embodiment, since the speech data to be tested is obtained by preprocessing, framing and windowing the original test speech data and then performing endpoint detection, it consists of multiple frames of single-frame speech data. The subsequent feature extraction on the speech data to be tested can therefore be performed on each single frame of speech data in it.

In an embodiment, as shown in FIG. 5, step S30 of performing feature extraction on the speech data to be tested to obtain the speech features to be tested specifically includes the following steps:
S31: Perform a fast Fourier transform on each single frame of speech data to obtain the power spectrum of the speech data to be tested.

Each single frame of speech data in the speech data to be tested is transformed with the fast Fourier transform (FFT)

$s(k) = \sum_{n=1}^{N} s(n)\, e^{-j 2\pi kn/N}$, $1 \le k \le N$,

to obtain the spectrum of the speech data to be tested, where N is the number of frames in the speech data to be tested, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude of the n-th frame of speech data in the time domain, and j is the imaginary unit. After the spectrum of the speech data to be tested is obtained, the power spectrum of the single frame of speech data is computed from it as

$P(k) = \dfrac{1}{N}\left|s(k)\right|^2$, $1 \le k \le N$,

where P(k) is the power spectrum of the speech data to be tested. Obtaining the power spectrum makes it convenient to obtain the Mel spectrum in step S32.
S32: Apply a Mel filter bank to the power spectrum for dimensionality reduction to obtain the Mel spectrum.

Since the human auditory system behaves like a complex nonlinear system, the power spectrum obtained in step S31 does not capture the nonlinear characteristics of speech data well. A Mel filter bank is therefore applied to reduce the dimensionality of the spectrum, so that the resulting spectrum of the speech data to be tested is closer to the frequencies perceived by the human ear. The Mel filter bank consists of several overlapping triangular band-pass filters, each carrying three frequencies: a lower limit frequency, a cutoff frequency and a center frequency. The center frequencies of these triangular band-pass filters are equally spaced on the Mel scale, which grows linearly below 1000 Hz and logarithmically above 1000 Hz. The conversion between the Mel spectrum and the power spectrum is

$\text{mel}(n) = \sum_{k=l_n}^{h_n} w_n(k)\, P(k)$,

where n indexes the triangular band-pass filters, $w_n$ is the conversion coefficient, $l_n$ is the lower limit frequency, $h_n$ is the cutoff frequency, P(k) is the power spectrum, and k indexes the power spectrum values.
S33: Perform cepstral analysis on the Mel spectrum to obtain the speech features to be tested.

The cepstrum of a signal is the inverse Fourier transform of the logarithm of its Fourier spectrum; since the Fourier spectrum is in general complex, the cepstrum is also called the complex cepstrum.

Specifically, after the Mel spectrum is obtained, its logarithm $X = \log \text{mel}(n)$ is taken, and a discrete cosine transform (DCT) is then applied to X to obtain the Mel frequency cepstral coefficients (MFCC), which are the speech features to be tested. The discrete cosine transform is

$c_i = \sum_{n=1}^{N} X(n) \cos\left(\dfrac{\pi i\,(n - 0.5)}{N}\right)$, $i = 1, 2, 3, \dots, N$,

where $c_i$ is the i-th Mel frequency cepstral coefficient, and the number of Mel frequency cepstral coefficients is tied to the number of Mel filters; if there are 13 Mel filters, there can likewise be 13 Mel frequency cepstral coefficients.

Further, to make the features easier to observe and to better reflect the characteristics of the speech signal corresponding to the speech data to be tested, the MFCCs are normalized after being obtained. The normalization proceeds as follows: the mean of all $c_i$ is computed, and the mean is subtracted from each $c_i$ to obtain its normalized value. The normalized values of the $c_i$ are the Mel frequency cepstral coefficients (MFCC) of the speech data to be tested, i.e. its speech features to be tested.
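Steps S31-S33 correspond to a textbook MFCC pipeline, which can be sketched as follows. This is an illustrative implementation under assumed settings (16 kHz sample rate, 512-point FFT, 13 triangular filters, mean-over-frames normalization); none of the names or constants come from the patent itself.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(frames, sample_rate=16000, n_filters=13, n_fft=512):
    """FFT power spectrum (S31) -> Mel filter bank (S32) -> log + DCT (S33)."""
    # S31: power spectrum of each windowed frame, P(k) = |S(k)|^2 / N
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # S32: triangular filters with centers equally spaced on the Mel scale
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_points = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]  # lower, center, cutoff
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel_spec = power @ fbank.T

    # S33: log, DCT, then mean subtraction as normalization
    coeffs = dct(np.log(mel_spec + 1e-10), type=2, axis=1, norm='ortho')
    return coeffs - coeffs.mean(axis=0)
```

The returned matrix has one row of normalized cepstral coefficients per frame, which is the per-frame feature fed to the model in step S40.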
In an embodiment, as shown in FIG. 6, before step S40 of inputting the speech features to be tested into the pre-trained convolutional deep belief network model for recognition, the speech distinguishing method further includes pre-training the convolutional deep belief network model.

Pre-training the convolutional deep belief network model specifically includes the following steps:

S401: Acquire the speech data to be trained, which includes standard training speech data and interference training speech data.

The speech data to be trained is the speech data used to train the convolutional deep belief network model; it includes standard training speech data, i.e. pure speech data containing neither silent nor noise segments, and interference training speech data, i.e. speech data containing silent and noise segments. The speech data to be trained can be obtained from a pre-labeled speech database storing standard and interference training speech data, or from an open-source speech training set. In this embodiment, the speech data to be trained is labeled in advance, and the ratio of standard training speech data to interference training speech data is 1:1, which facilitates training the convolutional deep belief network (CDBN) model on the acquired data, improves training efficiency and avoids overfitting.

S402: Input the standard training speech data and the interference training speech data in equal proportion into the convolutional deep belief network model for model training to obtain the original convolution-restricted Boltzmann machines.

The convolutional deep belief network (CDBN) model is composed of multiple convolutional restricted Boltzmann machines (CRBM). Therefore, inputting the standard and interference training speech data in equal proportion into the convolutional deep belief network model for training means training each convolutional restricted Boltzmann machine (CRBM) in the CDBN model.
Specifically, the CDBN contains n CRBMs. A CRBM has two layers: the upper layer is the hidden layer h, used to extract the speech features of the speech data to be trained (standard and interference training speech data in a 1:1 ratio); the lower layer is the visible layer v, used to input the speech data to be trained. The hidden and visible layers contain multiple hidden units and visible units. Assume the values in the visible and hidden units are binary variables $v_i \in \{0, 1\}$ and $h_j \in \{0, 1\}$, where $v_i$ is the state of the i-th binary variable of the visible layer and $h_j$ is the state of the j-th binary variable of the hidden layer; the number of visible units is n and the number of hidden units is m. Training on the standard and interference training speech data in equal proportion then proceeds as follows. First, (v, h) is determined with the energy function built into the CRBM,

$E(v, h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} v_i\, w_{ij}\, h_j - \sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j$.

Once the parameters (v, h) are determined, the corresponding probability distribution is

$p(v, h) = \dfrac{1}{z(\theta)}\, e^{-E(v, h)}$,

where $z(\theta)$ is the normalization factor, $z(\theta) = \sum_{v}\sum_{h} e^{-E(v, h)}$.

Then, based on the formulas

$p(h_j = 1 \mid v) = \sigma(b_j + w_{ij} *_v v)$  (1),

$p(v_i = 1 \mid h) = \sigma(a_i + w_{ji} *_f h)$  (2),

and the parameter update rule (3), the training speech features are trained, and the bias parameters of the visible and hidden layers and the weights between them are adjusted to obtain the original convolution-restricted Boltzmann machines. Here $\theta = \{w_{ij}, a_i, b_j\}$, $a_i$ is the bias parameter of the visible layer, $b_j$ is the bias parameter of the hidden layer, $w_{ij}$ is the weight on the connection between the i-th visible unit and the j-th hidden unit, $w_{ji}$ is the weight on the connection between the j-th hidden unit and the i-th visible unit with $w_{ji} = w_{ij}$, $\sigma$ is the sigmoid activation function, $*_v$ denotes valid convolution, $*_f$ denotes full convolution, and v and h denote the states of the visible and hidden layers respectively.
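To make the role of formulas (1) and (2) concrete, the following is a deliberately simplified, fully connected (non-convolutional) binary restricted Boltzmann machine trained with one step of contrastive divergence; contrastive divergence is a common way to adjust an RBM's biases and weights, offered here only as an illustrative stand-in for the CRBM training described above. The class, the learning rate and the update step itself are assumptions of the sketch, not details taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BinaryRBM:
    """Simplified, fully connected restricted Boltzmann machine."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible-layer bias
        self.b = np.zeros(n_hidden)    # hidden-layer bias
        self.rng = rng

    def hidden_prob(self, v):
        # formula (1): p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
        return sigmoid(self.b + v @ self.w)

    def visible_prob(self, h):
        # formula (2): p(v_i = 1 | h) = sigma(a_i + sum_j h_j w_ij)
        return sigmoid(self.a + h @ self.w.T)

    def cd1_update(self, v0, lr=0.05):
        """One contrastive-divergence step on a batch of visible vectors."""
        ph0 = self.hidden_prob(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
        v1 = self.visible_prob(h0)      # reconstruction of the input
        ph1 = self.hidden_prob(v1)
        n = v0.shape[0]
        self.w += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.a += lr * np.mean(v0 - v1, axis=0)
        self.b += lr * np.mean(ph0 - ph1, axis=0)
```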
S403: Stack the original convolution-restricted Boltzmann machines to obtain the convolutional deep belief network model.

After the original convolution-restricted Boltzmann machines are obtained, they are stacked: the output data of the first effective convolution-restricted Boltzmann machine is used as the input data of the second original convolution-restricted Boltzmann machine, the output data of the second effective convolution-restricted Boltzmann machine is used as the input data of the third original convolution-restricted Boltzmann machine, and so on, so that multiple original convolution-restricted Boltzmann machines generate one convolutional deep belief network model.

The pre-labeled standard and interference training speech data are input into the convolutional deep belief network model, and the bias parameters and weights in the model are iteratively updated through the CRBM formulas given in step S402 to obtain the original convolution-restricted Boltzmann machines. The original convolution-restricted Boltzmann machines are then stacked to obtain the convolutional deep belief network model, so that the model is continually updated and its recognition accuracy improves.
In an embodiment, as shown in FIG. 7, step S403 of stacking the original convolution-restricted Boltzmann machines to obtain the convolutional deep belief network model specifically includes the following steps:

S4031: Apply probabilistic max pooling and sparse regularization to the original convolution-restricted Boltzmann machines to obtain the effective convolution-restricted Boltzmann machines.

Specifically, when the convolutional deep belief network model stacks the original convolution-restricted Boltzmann machines, overfitting and overlap may occur. Overfitting means that when the model is used to recognize speech data to be tested, the recognition accuracy is very high if the input is the speech data used during training, but very low if the input is non-training speech data. Overlap means that adjacent original convolution-restricted Boltzmann machines overlap. Therefore, when the original convolution-restricted Boltzmann machines are stacked into the convolutional deep belief network model, probabilistic max pooling and sparse regularization also need to be applied to them to avoid overfitting and overlap: probabilistic max pooling is the operation that prevents overlap, and sparse regularization is the operation that prevents overfitting. Applying both to the original convolution-restricted Boltzmann machines effectively reduces the workload of the stacking process while improving the recognition accuracy of the convolution-restricted Boltzmann machines.

S4032: Stack the effective convolution-restricted Boltzmann machines to obtain the convolutional deep belief network model.

After the probabilistic max pooling and sparse regularization, the resulting effective convolution-restricted Boltzmann machines are stacked to obtain the convolutional deep belief network model. In this embodiment, the obtained model adapts better to its environment and avoids overfitting and overlap, so that it recognizes any speech data to be tested more accurately.
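Greedy layer-wise stacking, in which each trained machine's hidden activations become the next machine's visible input, can be sketched by reusing the simplified BinaryRBM above. The layer sizes and epoch count are illustrative assumptions; a faithful CDBN would stack convolutional machines with probabilistic max pooling between them rather than fully connected ones.

```python
def train_stack(data, layer_sizes=(256, 128, 64), epochs=10):
    """Greedy layer-wise training: each machine feeds the next (S403)."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = BinaryRBM(n_visible=x.shape[1], n_hidden=n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(x)
        rbms.append(rbm)
        x = rbm.hidden_prob(x)   # output of this layer is input to the next
    return rbms
```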
In an embodiment, as shown in FIG. 8, step S40 of inputting the speech features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain the speech distinguishing result specifically includes the following steps:

S41: Input the speech features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain a speech recognition probability value.

The speech features to be tested are input into the pre-trained convolutional deep belief network model for recognition; following the model's recognition process, the output for the speech features to be tested is a probability value, which is the speech recognition probability value.

Further, when the speech features to be tested are input into the pre-trained convolutional deep belief network model, in order to reduce the model's computation and improve the accuracy of recognizing the speech features to be tested, the model divides the speech to be tested before recognition: the single frames of speech data in the speech data to be tested are divided, in equal numbers, into at least two speech segments for recognition. The convolutional deep belief network model recognizes the speech features to be tested corresponding to each speech segment and obtains the speech recognition probability value of each segment. The speech recognition probability values of the at least two speech segments are then averaged, and the resulting mean is the speech recognition probability value corresponding to the speech data to be tested. A speech segment is a segment containing multiple single frames of speech data.
S42: Obtain the speech distinguishing result based on the speech recognition probability value.

After the speech recognition probability values are obtained, the convolutional deep belief network model compares the speech recognition probability value of each segment with the preset probability value: a speech segment whose value is smaller than the preset probability value is interfering speech, and a speech segment whose value is greater than or equal to the preset probability value is target speech. Further, after obtaining the speech recognition probability values, the model removes the speech segments whose values are smaller than the preset probability value and keeps only the segments whose values are greater than or equal to it, so that only the speech data to be tested corresponding to the target speech remains.

Judging the target speech and the interfering speech in the speech data to be tested against the preset probability value, removing the speech data to be tested corresponding to the interfering speech, and keeping the speech data to be tested corresponding to the target speech realizes the function of distinguishing the target speech from the interfering speech in the speech data to be tested.
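Steps S41-S42 reduce to scoring segments and thresholding. A minimal sketch, assuming the trained model exposes a per-segment scoring call (the `model.predict_proba` interface and the 0.5 threshold are hypothetical, not part of the patent):

```python
import numpy as np

def distinguish(model, segments, threshold=0.5):
    """Keep the segments whose recognition probability reaches the
    preset threshold (S42); also return the utterance-level mean (S41)."""
    probs = np.array([model.predict_proba(seg) for seg in segments])
    target = [seg for seg, p in zip(segments, probs) if p >= threshold]
    return target, probs.mean()
```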
Pre-emphasizing, framing and windowing the original test speech data yields the preprocessed speech data, and performing endpoint detection on it through the short-time energy feature value and the short-time zero-crossing rate yields the speech data to be tested; this preliminarily removes the speech data to be tested that corresponds to interfering speech and effectively shortens the time the convolutional deep belief network model needs to recognize the speech data to be tested. Feature extraction is performed on the speech data to be tested to obtain the speech features to be tested, which are input into the pre-trained convolutional deep belief network model for recognition to obtain the speech distinguishing result, improving the accuracy of speech distinguishing and making the obtained result more accurate.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process is determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.

In an embodiment, a speech distinguishing device is provided, and the device corresponds one-to-one to the speech distinguishing method in the above embodiments. As shown in FIG. 9, the device includes an original test speech data processing module 10, a to-be-tested speech data acquisition module 20, a to-be-tested speech feature acquisition module 30 and a speech distinguishing result acquisition module 40. The functions implemented by these modules correspond one-to-one to the steps of the speech distinguishing method in the above embodiments; to avoid repetition, they are not described here in detail.
原始测试语音数据处理模块10,用于获取原始测试语音数据,对原始测试语音数据进行预处理,获取预处理语音数据。The raw test voice data processing module 10 is configured to obtain raw test voice data, preprocess the raw test voice data, and obtain preprocessed voice data.
待测试语音数据获取模块20,用于对预处理语音数据进行端点检测处理,获取待测试语音数据。The voice data to be tested module 20 is configured to perform endpoint detection processing on the preprocessed voice data to obtain voice data to be tested.
待测试语音特征获取模块30,用于对待测试语音数据进行特征提取,获取待测试语 音特征。The feature-to-be-tested voice acquisition module 30 is configured to perform feature extraction on the to-be-tested voice data to obtain the feature to be tested.
语音区分结果获取模块40,用于将待测试语音特征输入到预先训练好的卷积深度置信网络模型中进行识别,获取语音区分结果。The speech discrimination result acquisition module 40 is configured to input a voice feature to be tested into a pre-trained convolutional deep belief network model for recognition, and obtain a speech discrimination result.
Specifically, the original test voice data processing module 10 includes a first processing unit 11, a second processing unit 12, and a third processing unit 13.
The first processing unit 11 is configured to perform pre-emphasis on the original test voice data. The pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, where $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and $a$ is the pre-emphasis coefficient.
The second processing unit 12 is configured to perform framing on the pre-emphasized original test voice data to obtain framed voice data.
The third processing unit 13 is configured to perform windowing on the framed voice data to obtain the preprocessed voice data. The windowing formulas are

$$w_n = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$

and $s''_n = w_n \cdot s'_n$, where $w_n$ is the Hamming window at time n, $N$ is the Hamming window length, $s'_n$ is the signal amplitude in the time domain at time n, and $s''_n$ is the signal amplitude in the time domain at time n after windowing.
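To make the preprocessing concrete, the following is a minimal Python sketch of units 11-13 (pre-emphasis, framing, and Hamming windowing). The pre-emphasis coefficient, frame length, and frame shift are illustrative assumptions (typical 16 kHz values); the patent does not prescribe them.

```python
import numpy as np

def preprocess(signal, a=0.97, frame_len=400, hop=160):
    """Sketch of units 11-13; assumes len(signal) >= frame_len.

    a, frame_len and hop are assumed values (25 ms frames with a
    10 ms shift at 16 kHz), not values taken from the patent.
    """
    # Unit 11 -- pre-emphasis: s'_n = s_n - a * s_{n-1}.
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Unit 12 -- framing: split the signal into overlapping frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])

    # Unit 13 -- windowing: s''_n = w_n * s'_n; np.hamming implements
    # w_n = 0.54 - 0.46 * cos(2*pi*n / (N - 1)).
    return frames * np.hamming(frame_len)
```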
Specifically, the to-be-tested voice data acquisition module 20 includes a first test voice data acquisition unit 21, a second test voice data acquisition unit 22, and a to-be-tested voice data acquisition unit 23.
The first test voice data acquisition unit 21 is configured to process the preprocessed voice data with the short-time energy feature value formula, obtain the short-time energy feature value corresponding to the preprocessed voice data, and remove the preprocessed voice data whose short-time energy feature value is less than a first threshold to obtain the first test voice data. The short-time energy feature value formula is

$$E = \sum_{n=1}^{N} s(n)^2,$$

where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of preprocessed voice data in the time domain.
The second test voice data acquisition unit 22 is configured to process the preprocessed voice data with the short-time zero-crossing rate formula, obtain the short-time zero-crossing rate corresponding to the preprocessed voice data, and remove the preprocessed voice data whose short-time zero-crossing rate is less than a second threshold to obtain the second test voice data. The short-time zero-crossing rate formula is

$$Z = \frac{1}{2}\sum_{n=2}^{N} \left|\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]\right|,$$

where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of voice data in the time domain.
The to-be-tested voice data acquisition unit 23 is configured to perform denoising on the first test voice data and the second test voice data to obtain the voice data to be tested.
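A hedged sketch of units 21-23 follows: the short-time energy and zero-crossing rate are computed per frame, and frames below the thresholds are dropped. The two threshold values, and the choice to combine the two tests as a union before denoising, are assumptions the patent leaves unspecified; the denoising pass itself is omitted here.

```python
import numpy as np

def endpoint_detect(frames, energy_thresh, zcr_thresh):
    """Sketch of units 21-23; thresholds are assumed to be tuned empirically."""
    # Unit 21 -- short-time energy per frame: E = sum_n s(n)^2.
    energy = np.sum(frames ** 2, axis=1)

    # Unit 22 -- short-time zero-crossing rate per frame:
    # Z = 1/2 * sum_n |sgn(s(n)) - sgn(s(n-1))|.
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

    # Unit 23 -- keep frames passing either test (union is an assumption);
    # a further denoising pass would follow here.
    keep = (energy >= energy_thresh) | (zcr >= zcr_thresh)
    return frames[keep]
```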
Specifically, the voice data to be tested includes single-frame voice data.
The to-be-tested voice feature acquisition module 30 includes a power spectrum acquisition unit 31, a Mel spectrum acquisition unit 32, and a to-be-tested voice feature acquisition unit 33.
The power spectrum acquisition unit 31 is configured to perform a fast Fourier transform on the single-frame voice data to obtain the power spectrum of the voice data to be tested.
The Mel spectrum acquisition unit 32 is configured to perform dimensionality reduction on the power spectrum with a Mel filter bank to obtain the Mel spectrum.
The to-be-tested voice feature acquisition unit 33 is configured to perform cepstral analysis on the Mel spectrum to obtain the voice features to be tested.
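The following sketch illustrates units 31-33 as a standard MFCC-style pipeline. The sampling rate, FFT size, filter count, and cepstral dimension are assumptions, and mel_filterbank() is the usual triangular Mel filter bank construction rather than anything specified in the patent.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the Mel scale (standard construction)."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(to_mel(0), to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def extract_features(frame, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """MFCC-style sketch of units 31-33; all parameter values are assumptions."""
    # Unit 31 -- power spectrum of the single frame via FFT.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # Unit 32 -- Mel filter bank reduces the spectrum to n_mels dimensions.
    mel_spec = mel_filterbank(sr, n_fft, n_mels) @ power
    # Unit 33 -- cepstral analysis: log compression followed by a DCT.
    return dct(np.log(mel_spec + 1e-10), type=2, norm='ortho')[:n_ceps]
```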
Specifically, the speech discrimination apparatus is further configured to train the convolutional deep belief network model in advance.
The speech discrimination apparatus further includes a to-be-trained voice data acquisition unit 401, a model training unit 402, and a model acquisition unit 403.
The to-be-trained voice data acquisition unit 401 is configured to acquire voice data to be trained, which includes standard training voice data and interference training voice data.
The model training unit 402 is configured to input the standard training voice data and the interference training voice data in equal proportions into the convolutional deep belief network model for model training to obtain an original convolutional restricted Boltzmann machine.
The model acquisition unit 403 is configured to stack the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
Specifically, the model acquisition unit 403 includes a pooling and regularization unit 4031 and a stacking unit 4032.
The pooling and regularization unit 4031 is configured to perform probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine.
The stacking unit 4032 is configured to stack the effective convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
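Probabilistic max pooling is the step that most distinguishes a convolutional RBM from plain max pooling: within each pooling block at most one detection unit is active, and the pooling unit turns off only when all of them are off. Below is a conceptual numpy sketch of that block-wise softmax; the block size, input shape, and the omission of biases and of the sparse regularization penalty are simplifying assumptions.

```python
import numpy as np

def prob_max_pool(activations, block=2):
    """Block-wise softmax of probabilistic max pooling (conceptual sketch).

    activations: 2-D array of detection-unit inputs I; assumes its
    dimensions are divisible by `block`. Biases and the sparsity
    penalty applied during training are omitted for brevity.
    """
    act = np.asarray(activations, dtype=float)
    hidden = np.zeros_like(act)
    pooled = np.zeros((act.shape[0] // block, act.shape[1] // block))
    for i in range(0, act.shape[0], block):
        for j in range(0, act.shape[1], block):
            blk = act[i:i + block, j:j + block]
            m = blk.max()
            e = np.exp(blk - m)              # shift for numerical stability
            denom = e.sum() + np.exp(-m)     # extra term: the all-off state
            hidden[i:i + block, j:j + block] = e / denom      # P(h_k = 1)
            pooled[i // block, j // block] = e.sum() / denom  # P(pool on)
    return hidden, pooled
```

Stacking then feeds the pooled outputs of one trained CRBM as the visible input of the next, which is how the deep belief network is assembled layer by layer.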
Specifically, the speech discrimination result acquisition module 40 includes a speech recognition probability value acquisition unit 41 and a speech discrimination result acquisition unit 42.
The speech recognition probability value acquisition unit 41 is configured to input the voice features to be tested into the pre-trained convolutional deep belief network model for recognition and obtain a speech recognition probability value.
The speech discrimination result acquisition unit 42 is configured to obtain the speech discrimination result based on the speech recognition probability value.
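Turning the model's output into a discrimination result then reduces to comparing the speech recognition probability value against the preset probability value mentioned earlier. A one-line sketch, assuming a threshold of 0.5 (the patent does not fix the value):

```python
def discriminate(probabilities, preset=0.5):
    """Keep indices of frames whose probability marks them as target speech."""
    return [i for i, p in enumerate(probabilities) if p >= preset]
```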
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database is used to store data acquired or generated during the speech discrimination method. The network interface communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement a speech discrimination method.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring original test voice data and preprocessing it to obtain preprocessed voice data; performing endpoint detection on the preprocessed voice data to obtain voice data to be tested; performing feature extraction on the voice data to be tested to obtain voice features to be tested; and inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: performing pre-emphasis on the original test voice data, where the pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and $a$ is the pre-emphasis coefficient; performing framing on the pre-emphasized original test voice data to obtain framed voice data; and performing windowing on the framed voice data to obtain the preprocessed voice data, where the windowing formulas are $w_n = 0.54 - 0.46\cos(2\pi n/(N-1))$, $0 \le n \le N-1$, and $s''_n = w_n \cdot s'_n$, with $w_n$ the Hamming window at time n, $N$ the Hamming window length, $s'_n$ the signal amplitude in the time domain at time n, and $s''_n$ the signal amplitude in the time domain at time n after windowing.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: processing the preprocessed voice data with the short-time energy feature value formula, obtaining the short-time energy feature value corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time energy feature value is less than a first threshold to obtain the first test voice data, where the short-time energy feature value formula is $E = \sum_{n=1}^{N} s(n)^2$, $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of preprocessed voice data in the time domain; processing the preprocessed voice data with the short-time zero-crossing rate formula, obtaining the short-time zero-crossing rate corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time zero-crossing rate is less than a second threshold to obtain the second test voice data, where the short-time zero-crossing rate formula is $Z = \frac{1}{2}\sum_{n=2}^{N} |\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]|$, $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of voice data in the time domain; and performing denoising on the first test voice data and the second test voice data to obtain the voice data to be tested.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: performing a fast Fourier transform on the single-frame voice data to obtain the power spectrum of the voice data to be tested; performing dimensionality reduction on the power spectrum with a Mel filter bank to obtain the Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the voice features to be tested.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following step: training the convolutional deep belief network model in advance. Specifically, training the convolutional deep belief network model in advance includes: acquiring voice data to be trained, which includes standard training voice data and interference training voice data; inputting the standard training voice data and the interference training voice data in equal proportions into the convolutional deep belief network model for model training to obtain an original convolutional restricted Boltzmann machine; and stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and stacking the effective convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
In one embodiment, when executing the computer-readable instructions, the processor further implements the following steps: inputting the voice features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain a speech recognition probability value; and obtaining the speech discrimination result based on the speech recognition probability value.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided. When executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the following steps: acquiring original test voice data and preprocessing it to obtain preprocessed voice data; performing endpoint detection on the preprocessed voice data to obtain voice data to be tested; performing feature extraction on the voice data to be tested to obtain voice features to be tested; and inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result.
In one embodiment, when executed by the processor, the computer-readable instructions further implement the following steps: performing pre-emphasis on the original test voice data, where the pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and $a$ is the pre-emphasis coefficient; performing framing on the pre-emphasized original test voice data to obtain framed voice data; and performing windowing on the framed voice data to obtain the preprocessed voice data, where the windowing formulas are $w_n = 0.54 - 0.46\cos(2\pi n/(N-1))$, $0 \le n \le N-1$, and $s''_n = w_n \cdot s'_n$, with $w_n$ the Hamming window at time n, $N$ the Hamming window length, $s'_n$ the signal amplitude in the time domain at time n, and $s''_n$ the signal amplitude in the time domain at time n after windowing.
In one embodiment, when executed by the processor, the computer-readable instructions further implement the following steps: processing the preprocessed voice data with the short-time energy feature value formula, obtaining the short-time energy feature value corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time energy feature value is less than a first threshold to obtain the first test voice data, where the short-time energy feature value formula is $E = \sum_{n=1}^{N} s(n)^2$, $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of preprocessed voice data in the time domain; processing the preprocessed voice data with the short-time zero-crossing rate formula, obtaining the short-time zero-crossing rate corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time zero-crossing rate is less than a second threshold to obtain the second test voice data, where the short-time zero-crossing rate formula is $Z = \frac{1}{2}\sum_{n=2}^{N} |\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]|$, $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of voice data in the time domain; and performing denoising on the first test voice data and the second test voice data to obtain the voice data to be tested.
In one embodiment, when executed by the processor, the computer-readable instructions further implement the following steps: performing a fast Fourier transform on the single-frame voice data to obtain the power spectrum of the voice data to be tested; performing dimensionality reduction on the power spectrum with a Mel filter bank to obtain the Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the voice features to be tested.
In one embodiment, when executed by the processor, the computer-readable instructions further implement the following step: training the convolutional deep belief network model in advance. Specifically, training the convolutional deep belief network model in advance includes: acquiring voice data to be trained, which includes standard training voice data and interference training voice data; inputting the standard training voice data and the interference training voice data in equal proportions into the convolutional deep belief network model for model training to obtain an original convolutional restricted Boltzmann machine; and stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
In one embodiment, when executed by the processor, the computer-readable instructions further implement the following steps: performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and stacking the effective convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
In one embodiment, when executed by the processor, the computer-readable instructions further implement the following steps: inputting the voice features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain a speech recognition probability value; and obtaining the speech discrimination result based on the speech recognition probability value.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile storage medium of a computer device and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is only used as an example. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims (20)

1. A speech discrimination method, comprising:
acquiring original test voice data, and preprocessing the original test voice data to obtain preprocessed voice data;
performing endpoint detection on the preprocessed voice data to obtain voice data to be tested;
performing feature extraction on the voice data to be tested to obtain voice features to be tested; and
inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result.

2. The speech discrimination method according to claim 1, wherein the preprocessing the original test voice data to obtain preprocessed voice data comprises:
performing pre-emphasis on the original test voice data, wherein the pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, where $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and $a$ is the pre-emphasis coefficient;
performing framing on the pre-emphasized original test voice data to obtain framed voice data; and
performing windowing on the framed voice data to obtain the preprocessed voice data, wherein the windowing formulas are
$$w_n = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$
and $s''_n = w_n \cdot s'_n$, where $w_n$ is the Hamming window at time n, $N$ is the Hamming window length, $s'_n$ is the signal amplitude in the time domain at time n, and $s''_n$ is the signal amplitude in the time domain at time n after windowing.

3. The speech discrimination method according to claim 2, wherein the performing endpoint detection on the preprocessed voice data to obtain voice data to be tested comprises:
processing the preprocessed voice data with a short-time energy feature value formula, obtaining the short-time energy feature value corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time energy feature value is less than a first threshold to obtain first test voice data, wherein the short-time energy feature value formula is
$$E = \sum_{n=1}^{N} s(n)^2,$$
where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of preprocessed voice data in the time domain;
processing the preprocessed voice data with a short-time zero-crossing rate formula, obtaining the short-time zero-crossing rate corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time zero-crossing rate is less than a second threshold to obtain second test voice data, wherein the short-time zero-crossing rate formula is
$$Z = \frac{1}{2}\sum_{n=2}^{N} \left|\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]\right|,$$
where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of voice data in the time domain; and
performing denoising on the first test voice data and the second test voice data to obtain the voice data to be tested.

4. The speech discrimination method according to claim 1, wherein the voice data to be tested comprises single-frame voice data; and
the performing feature extraction on the voice data to be tested to obtain voice features to be tested comprises:
performing a fast Fourier transform on the single-frame voice data to obtain a power spectrum of the voice data to be tested;
performing dimensionality reduction on the power spectrum with a Mel filter bank to obtain a Mel spectrum; and
performing cepstral analysis on the Mel spectrum to obtain the voice features to be tested.

5. The speech discrimination method according to claim 1, wherein before the step of inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition, the speech discrimination method further comprises: training the convolutional deep belief network model in advance; and
the training the convolutional deep belief network model in advance comprises:
acquiring voice data to be trained, wherein the voice data to be trained comprises standard training voice data and interference training voice data;
inputting the standard training voice data and the interference training voice data in equal proportions into the convolutional deep belief network model for model training to obtain an original convolutional restricted Boltzmann machine; and
stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.

6. The speech discrimination method according to claim 5, wherein the stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model comprises:
performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and
stacking the effective convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.

7. The speech discrimination method according to claim 1, wherein the inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result comprises:
inputting the voice features to be tested into the pre-trained convolutional deep belief network model for recognition to obtain a speech recognition probability value; and
obtaining the speech discrimination result based on the speech recognition probability value.
8. A speech discrimination apparatus, comprising:
an original test voice data processing module, configured to acquire original test voice data and preprocess the original test voice data to obtain preprocessed voice data;
a to-be-tested voice data acquisition module, configured to perform endpoint detection on the preprocessed voice data to obtain voice data to be tested;
a to-be-tested voice feature acquisition module, configured to perform feature extraction on the voice data to be tested to obtain voice features to be tested; and
a speech discrimination result acquisition module, configured to input the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result.
9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein when executing the computer-readable instructions, the processor implements the following steps:
acquiring original test voice data, and preprocessing the original test voice data to obtain preprocessed voice data;
performing endpoint detection on the preprocessed voice data to obtain voice data to be tested;
performing feature extraction on the voice data to be tested to obtain voice features to be tested; and
inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result.

10. The computer device according to claim 9, wherein the preprocessing the original test voice data to obtain preprocessed voice data comprises:
performing pre-emphasis on the original test voice data, wherein the pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, where $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and $a$ is the pre-emphasis coefficient;
performing framing on the pre-emphasized original test voice data to obtain framed voice data; and
performing windowing on the framed voice data to obtain the preprocessed voice data, wherein the windowing formulas are
$$w_n = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$
and $s''_n = w_n \cdot s'_n$, where $w_n$ is the Hamming window at time n, $N$ is the Hamming window length, $s'_n$ is the signal amplitude in the time domain at time n, and $s''_n$ is the signal amplitude in the time domain at time n after windowing.

11. The computer device according to claim 10, wherein the performing endpoint detection on the preprocessed voice data to obtain voice data to be tested comprises:
processing the preprocessed voice data with a short-time energy feature value formula, obtaining the short-time energy feature value corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time energy feature value is less than a first threshold to obtain first test voice data, wherein the short-time energy feature value formula is
$$E = \sum_{n=1}^{N} s(n)^2,$$
where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of preprocessed voice data in the time domain;
processing the preprocessed voice data with a short-time zero-crossing rate formula, obtaining the short-time zero-crossing rate corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time zero-crossing rate is less than a second threshold to obtain second test voice data, wherein the short-time zero-crossing rate formula is
$$Z = \frac{1}{2}\sum_{n=2}^{N} \left|\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]\right|,$$
where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of voice data in the time domain; and
performing denoising on the first test voice data and the second test voice data to obtain the voice data to be tested.

12. The computer device according to claim 9, wherein the voice data to be tested comprises single-frame voice data; and
the performing feature extraction on the voice data to be tested to obtain voice features to be tested comprises:
performing a fast Fourier transform on the single-frame voice data to obtain a power spectrum of the voice data to be tested;
performing dimensionality reduction on the power spectrum with a Mel filter bank to obtain a Mel spectrum; and
performing cepstral analysis on the Mel spectrum to obtain the voice features to be tested.

13. The computer device according to claim 9, wherein before the step of inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition, the processor further implements the following step when executing the computer-readable instructions: training the convolutional deep belief network model in advance; and
the training the convolutional deep belief network model in advance comprises:
acquiring voice data to be trained, wherein the voice data to be trained comprises standard training voice data and interference training voice data;
inputting the standard training voice data and the interference training voice data in equal proportions into the convolutional deep belief network model for model training to obtain an original convolutional restricted Boltzmann machine; and
stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.

14. The computer device according to claim 13, wherein the stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model comprises:
performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and
stacking the effective convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
15. One or more non-volatile readable storage media storing computer-readable instructions, wherein when executed by one or more processors, the computer-readable instructions cause the one or more processors to implement the following steps:
acquiring original test voice data, and preprocessing the original test voice data to obtain preprocessed voice data;
performing endpoint detection on the preprocessed voice data to obtain voice data to be tested;
performing feature extraction on the voice data to be tested to obtain voice features to be tested; and
inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition to obtain a speech discrimination result.

16. The non-volatile readable storage media according to claim 15, wherein the preprocessing the original test voice data to obtain preprocessed voice data comprises:
performing pre-emphasis on the original test voice data, wherein the pre-emphasis formula is $s'_n = s_n - a \cdot s_{n-1}$, where $s'_n$ is the speech signal amplitude at time n after pre-emphasis, $s_n$ is the speech signal amplitude at time n, $s_{n-1}$ is the speech signal amplitude at time n-1, and $a$ is the pre-emphasis coefficient;
performing framing on the pre-emphasized original test voice data to obtain framed voice data; and
performing windowing on the framed voice data to obtain the preprocessed voice data, wherein the windowing formulas are
$$w_n = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1,$$
and $s''_n = w_n \cdot s'_n$, where $w_n$ is the Hamming window at time n, $N$ is the Hamming window length, $s'_n$ is the signal amplitude in the time domain at time n, and $s''_n$ is the signal amplitude in the time domain at time n after windowing.

17. The non-volatile readable storage media according to claim 16, wherein the performing endpoint detection on the preprocessed voice data to obtain voice data to be tested comprises:
processing the preprocessed voice data with a short-time energy feature value formula, obtaining the short-time energy feature value corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time energy feature value is less than a first threshold to obtain first test voice data, wherein the short-time energy feature value formula is
$$E = \sum_{n=1}^{N} s(n)^2,$$
where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of preprocessed voice data in the time domain;
processing the preprocessed voice data with a short-time zero-crossing rate formula, obtaining the short-time zero-crossing rate corresponding to the preprocessed voice data, and removing the preprocessed voice data whose short-time zero-crossing rate is less than a second threshold to obtain second test voice data, wherein the short-time zero-crossing rate formula is
$$Z = \frac{1}{2}\sum_{n=2}^{N} \left|\operatorname{sgn}[s(n)] - \operatorname{sgn}[s(n-1)]\right|,$$
where $N$ is the number of frames in the preprocessed voice data, $N \ge 2$, and $s(n)$ is the signal amplitude of the n-th frame of voice data in the time domain; and
performing denoising on the first test voice data and the second test voice data to obtain the voice data to be tested.

18. The non-volatile readable storage media according to claim 15, wherein the voice data to be tested comprises single-frame voice data; and
the performing feature extraction on the voice data to be tested to obtain voice features to be tested comprises:
performing a fast Fourier transform on the single-frame voice data to obtain a power spectrum of the voice data to be tested;
performing dimensionality reduction on the power spectrum with a Mel filter bank to obtain a Mel spectrum; and
performing cepstral analysis on the Mel spectrum to obtain the voice features to be tested.

19. The non-volatile readable storage media according to claim 15, wherein before the step of inputting the voice features to be tested into a pre-trained convolutional deep belief network model for recognition, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to implement the following step: training the convolutional deep belief network model in advance; and
the training the convolutional deep belief network model in advance comprises:
acquiring voice data to be trained, wherein the voice data to be trained comprises standard training voice data and interference training voice data;
inputting the standard training voice data and the interference training voice data in equal proportions into the convolutional deep belief network model for model training to obtain an original convolutional restricted Boltzmann machine; and
stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.

20. The non-volatile readable storage media according to claim 19, wherein the stacking the original convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model comprises:
performing probabilistic max pooling and sparse regularization on the original convolutional restricted Boltzmann machine to obtain an effective convolutional restricted Boltzmann machine; and
stacking the effective convolutional restricted Boltzmann machines to obtain the convolutional deep belief network model.
PCT/CN2018/094200 2018-06-04 2018-07-03 Voice distinguishing method and device, computer device and storage medium WO2019232848A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810561695.9 2018-06-04
CN201810561695.9A CN108922561A (en) 2018-06-04 2018-06-04 Speech differentiation method, apparatus, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2019232848A1 true WO2019232848A1 (en) 2019-12-12

Family

ID=64410753

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094200 WO2019232848A1 (en) 2018-06-04 2018-07-03 Voice distinguishing method and device, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN108922561A (en)
WO (1) WO2019232848A1 (en)

Also Published As

Publication number Publication date
CN108922561A (en) 2018-11-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18921446

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18921446

Country of ref document: EP

Kind code of ref document: A1