CN108198547A - Voice endpoint detection method, apparatus, computer device and storage medium - Google Patents

Voice endpoint detection method, apparatus, computer device and storage medium

Info

Publication number
CN108198547A
CN108198547A
Authority
CN
China
Prior art keywords
acoustic feature
voice
spectral feature
noisy speech
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810048223.3A
Other languages
Chinese (zh)
Other versions
CN108198547B (en)
Inventor
黄石磊
刘轶
王昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Beike Risound Polytron Technologies Inc
Original Assignee
Shenzhen Beike Risound Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Beike Risound Polytron Technologies Inc filed Critical Shenzhen Beike Risound Polytron Technologies Inc
Priority to CN201810048223.3A priority Critical patent/CN108198547B/en
Publication of CN108198547A publication Critical patent/CN108198547A/en
Application granted granted Critical
Publication of CN108198547B publication Critical patent/CN108198547B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032 - Quantisation or dequantisation of spectral components
    • G10L 19/038 - Vector quantisation, e.g. TwinVQ audio
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application relates to a voice endpoint detection method, an apparatus, a computer device and a storage medium. The method includes: obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal; transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added; parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal; and determining the start point and end point of the speech signal according to the time sequence of the speech signal. This method can effectively improve the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, apparatus, computer device and storage medium
Technical field
This application relates to the field of signal processing technology, and in particular to a voice endpoint detection method, apparatus, computer device and storage medium.
Background technology
With the continuous development of speech technology, voice endpoint detection has come to occupy a highly important position in speech recognition. Voice endpoint detection locates the start point and end point of the speech portion within a segment of continuous noisy speech, so that the speech can be recognized efficiently.
There are two traditional approaches to voice endpoint detection. The first exploits the differing time-domain and frequency-domain characteristics of speech and noise signals: features are extracted from each signal segment and compared with a preset threshold to detect voice endpoints. However, this approach is only applicable under stationary noise conditions; its noise robustness is poor and it struggles to distinguish clean speech from noise, so the accuracy of endpoint detection is low. The second approach is based on neural networks, using a trained model to perform endpoint detection on the speech signal. However, the input vectors of most such models contain only features of the noisy speech, so their noise robustness is also poor, resulting in low endpoint detection accuracy. How to effectively improve the accuracy of voice endpoint detection has therefore become a technical problem that needs to be addressed.
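The traditional threshold approach described above can be illustrated with a minimal sketch: a short-time energy detector that flags each frame as speech or non-speech. The frame length and threshold below are illustrative values, not taken from the patent.

```python
import numpy as np

def energy_threshold_vad(signal, frame_len=160, threshold=0.01):
    """Classify each frame as speech (True) or non-speech (False)
    by comparing its short-time energy with a fixed threshold, as in
    the traditional method described above."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)  # average power of the frame
        flags.append(bool(energy > threshold))
    return flags
```

As the background notes, such a fixed threshold only works under stationary noise: any noise segment whose energy exceeds the threshold is misclassified as speech.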
Summary of the invention
In view of the above technical problem, it is necessary to provide a voice endpoint detection method, apparatus, computer device and storage medium that can effectively improve the accuracy of voice endpoint detection.
A voice endpoint detection method, including:
Obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
Transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
Obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
Parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal;
Determining the start point and end point of the speech signal according to the time sequence of the speech signal.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes:
Converting the noisy speech signal into a noisy speech spectrum;
Performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes:
Converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum;
Performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, to obtain a noise amplitude spectrum;
Estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum;
Generating the spectral features corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum and the speech amplitude spectrum.
In one embodiment, transforming the acoustic features and spectral features includes:
Extracting a preset number of frames before and after the current frame in the acoustic features and the spectral features;
Calculating a mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame;
Performing a log-domain transformation on the acoustic features and spectral features after the mean vector and/or variance vector of the current frame is calculated, to obtain the transformed acoustic feature vectors and spectral feature vectors.
In one embodiment, before the step of obtaining a classifier, the method further includes:
Obtaining noisy speech data with speech class labels added, and training on the noisy speech data to obtain a preliminary classifier;
Obtaining a first validation set, the first validation set containing multiple items of first speech data;
Inputting the multiple items of first speech data into the classifier, to obtain classification probabilities corresponding to the multiple items of first speech data;
Screening the classification probabilities corresponding to the multiple items of first speech data, adding class labels to the selected first speech data, and obtaining a validation set with class labels added;
Training with the labeled validation set and the training set, to obtain a validated classifier;
Obtaining a second validation set, the second validation set containing multiple items of second speech data;
Inputting the multiple items of second speech data into the validated classifier, to obtain classification probabilities corresponding to the multiple items of second speech data;
When the classification probabilities corresponding to the multiple items of second speech data reach a preset probability value, obtaining the required classifier.
In one embodiment, the step of classifying the acoustic feature vectors and spectral feature vectors using the classifier includes:
Taking the acoustic feature vectors and spectral feature vectors as input to the classifier, and obtaining decision values corresponding to the acoustic feature vectors and spectral feature vectors;
When a decision value equals a first threshold, adding a speech label to the acoustic feature vector or spectral feature vector;
When a decision value equals a second threshold, adding a non-speech label to the acoustic feature vector or spectral feature vector.
A voice endpoint detection apparatus, including:
an extraction module, for obtaining a noisy speech signal and extracting acoustic features and spectral features corresponding to the noisy speech signal;
a conversion module, for transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
a classification module, for obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into the classifier, to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
a parsing module, for parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal, and determining the start point and end point of the speech signal according to the time sequence of the speech signal.
In one embodiment, the conversion module is further configured to extract a preset number of frames before and after the current frame in the acoustic features and the spectral features; to calculate a mean vector and/or variance vector corresponding to the current frame using those frames; and to perform a log-domain transformation on the acoustic features and spectral features after the mean vector and/or variance vector is calculated, obtaining the transformed acoustic feature vectors and spectral feature vectors.
A computer device, including a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
Obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
Transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
Obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
Parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal;
Determining the start point and end point of the speech signal according to the time sequence of the speech signal.
A computer-readable storage medium on which a computer program is stored, the computer program implementing the following steps when executed by a processor:
Obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
Transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
Obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
Parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal;
Determining the start point and end point of the speech signal according to the time sequence of the speech signal.
With the above voice endpoint detection method, apparatus, computer device and storage medium, a noisy speech signal is obtained and its corresponding acoustic features and spectral features are extracted; by transforming the acoustic features and spectral features, corresponding acoustic feature vectors and spectral feature vectors are obtained. A classifier is obtained, and the acoustic feature vectors and spectral feature vectors are input into it, yielding acoustic feature vectors and spectral feature vectors with speech labels added; the feature vectors can thus be classified effectively, so that speech and non-speech are identified effectively. The labeled acoustic feature vectors and spectral feature vectors are parsed to obtain the corresponding speech signal, and the start point and end point of the speech signal are determined according to its time sequence. The start point and end point of the noisy speech signal can thereby be identified accurately, effectively improving the accuracy of voice endpoint detection.
Brief description of the drawings
Fig. 1 is a flow chart of the voice endpoint detection method in one embodiment;
Fig. 2 is an internal structure diagram of the voice endpoint detection apparatus in one embodiment;
Fig. 3 is an internal structure diagram of the computer device in one embodiment.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of this application clearer, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and are not intended to limit it. The terms "first", "second" and so on used herein may describe various elements, but the elements are not limited by these terms; the terms serve only to distinguish one element from another.
In one embodiment, as shown in Fig. 1, a voice endpoint detection method is provided. Taking its application to a terminal as an example, the method includes the following steps:
Step 102: obtain a noisy speech signal, and extract the acoustic features and spectral features corresponding to the noisy speech signal.
Typically, speech signals collected in practice contain noise of some strength. When the noise is strong, it significantly affects the performance of speech applications, for example lowering speech recognition efficiency and degrading endpoint detection accuracy.
The terminal can obtain speech that the user inputs through a speech input device. The terminal may be a smart phone, tablet computer, laptop, desktop computer or the like, and includes a speech input device, for example a microphone or another device capable of recording speech. The user speech obtained by the terminal is usually a noise-containing noisy speech signal, such as call voice input, recorded audio or a spoken command. After obtaining the noisy speech signal, the terminal extracts its corresponding acoustic features and spectral features. The acoustic features may include information such as the unvoiced sounds, voiced sounds, vowels and consonants of the noisy speech signal; the spectral features may include information such as the vibration frequency and vibration amplitude of the signal, as well as its loudness and timbre.
Specifically, after obtaining the noisy speech signal, the terminal applies windowing and framing. For example, a Hanning window can be used to divide the signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, splitting the noisy speech signal into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal, from which the acoustic features and spectral features can then be extracted.
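The windowing, framing and FFT just described can be sketched as follows. The 16 kHz sample rate and the 25 ms frame length (within the 10-30 ms range stated above) are assumptions for illustration.

```python
import numpy as np

def frame_and_fft(signal, sr=16000, frame_ms=25, shift_ms=10):
    """Split a noisy speech signal into overlapping Hanning-windowed
    frames and take the FFT of each frame, as in step 102.
    Returns an array of shape (n_frames, n_bins) of complex spectra."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples = 10 ms
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.fft.rfft(frame))  # one-sided spectrum
    return np.array(spectra)
```

For one second of 16 kHz audio this produces 98 frames of 201 frequency bins each; all later feature extraction in this method operates on these per-frame spectra.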
Step 104: transform the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.
After extracting the acoustic features and spectral features corresponding to the noisy speech signal, the terminal transforms them, converting the acoustic features into corresponding acoustic feature vectors and the spectral features into corresponding spectral feature vectors.
Step 106: obtain a classifier, input the acoustic feature vectors and spectral feature vectors into the classifier, and obtain acoustic feature vectors and spectral feature vectors with speech labels added.
The terminal obtains a classifier that has been trained before endpoint detection is performed. By adding speech labels and non-speech labels to the acoustic feature vectors and spectral feature vectors, the classifier divides the input vectors into speech-class acoustic and spectral feature vectors and non-speech-class acoustic and spectral feature vectors. The terminal inputs the acoustic and spectral feature vectors of the noisy speech into the classifier and classifies them with it: when an input acoustic or spectral feature vector belongs to the speech class, a speech label is added to it; when it belongs to the non-speech class, a non-speech label is added. Speech and non-speech can thus be recognized accurately. After the terminal classifies the acoustic and spectral feature vectors, it obtains the acoustic feature vectors and spectral feature vectors with speech labels added.
Further, the terminal may also take the acoustic feature vectors and spectral feature vectors as the classifier input and obtain the corresponding decision values, then add speech or non-speech labels to the vectors according to the decision values, thereby achieving accurate classification of the acoustic and spectral feature vectors.
Step 108: parse the acoustic feature vectors and spectral feature vectors with speech labels added, obtaining the speech signal with speech labels added.
Step 110: determine the start point and end point of the speech signal according to the speech labels and time sequence of the speech signal.
After classifying the acoustic and spectral feature vectors, the terminal needs to parse the acoustic feature vectors and spectral feature vectors to which speech labels were added. Specifically, the terminal parses the labeled acoustic and spectral feature vectors to obtain the spectra corresponding to the labeled acoustic features and spectral features, then converts those spectra into the corresponding speech signal according to the time sequence of the noisy speech signal, thereby recovering the corresponding speech signal.
The noisy speech signal has a time sequence, and the time sequence of the labeled speech signal still corresponds to that of the noisy speech signal. The terminal resolves the labeled acoustic feature vectors and spectral feature vectors into the corresponding labeled speech signal, so it can determine the start point and end point of the noisy speech signal according to the speech labels and the time sequence of the speech signal.
For example, after the classifier classifies the input acoustic and spectral feature vectors, the resulting decision value may be a value between 0 and 1. When the decision value is 1, the terminal adds a speech label to the acoustic or spectral feature vector; when the decision value is 0, it adds a non-speech label. The acoustic and spectral feature vectors can thus be classified accurately. After parsing the labeled vectors, the terminal obtains the speech signal with speech labels added. According to the time sequence of the labeled speech signal, the first frame carrying a speech label is the start point of the noisy speech signal, and the last frame carrying a speech label is its end point. Further, the start point of the speech signal can be determined from a jump of the decision value from 0 to 1, and the end point from a jump from 1 to 0. The start point and end point of the noisy speech signal can thereby be determined accurately.
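The endpoint rule in this example (the start point is the first jump of the decision value from 0 to 1, the end point the last jump from 1 to 0) can be sketched as a small helper. Returning frame indices is an assumed convention for illustration.

```python
def find_endpoints(decisions):
    """Given per-frame decision values (1 = speech, 0 = non-speech),
    return the frame indices of the first and last speech frames,
    i.e. the start point and end point described above.
    Returns (None, None) if no speech frame is present."""
    start = end = None
    for i, d in enumerate(decisions):
        if d == 1:
            if start is None:
                start = i   # first 0 -> 1 jump
            end = i         # updated until the last 1 -> 0 jump
    return start, end
```

Multiplying the frame index by the frame shift (e.g. 10 ms) converts these indices back to times in the noisy speech signal.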
In this embodiment, after obtaining the noisy speech signal, the terminal extracts its corresponding acoustic features and spectral features, and transforms them to obtain corresponding acoustic feature vectors and spectral feature vectors. By inputting the acoustic and spectral feature vectors into the classifier, it obtains the acoustic feature vectors and spectral feature vectors with speech labels added, so that the vectors are classified effectively and speech and non-speech are identified effectively. The terminal parses the labeled acoustic and spectral feature vectors to obtain the corresponding speech signal, and determines the start point and end point of the speech signal according to its time sequence. The start point and end point of the noisy speech signal are thereby identified accurately, effectively improving the accuracy of voice endpoint detection.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
In phonetics, speech features can be divided into acoustic categories such as vowels, consonants, unvoiced sounds, voiced sounds and silence. After obtaining the noisy speech signal, the terminal applies windowing and framing, for example using a Hanning window to divide the signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, splitting the noisy speech signal into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal.
Further, the terminal can perform time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum, thereby obtaining the acoustic features corresponding to the noisy speech signal.
For example, the terminal may extract the acoustic features of the noisy speech signal using MFCC (Mel-Frequency Cepstrum Coefficients). After windowing and framing, the terminal converts the noisy speech signal into its spectrum, converts that spectrum into the noisy speech cepstrum, performs cepstral analysis on the cepstrum, and applies a discrete cosine transform to obtain the acoustic features of each frame, thereby obtaining effective acoustic features for the noisy speech.
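The cepstral step in this example can be sketched for a single frame: take the log of the power spectrum, then apply a discrete cosine transform to obtain cepstral coefficients. This is a simplification of MFCC (the mel filter bank is omitted for brevity), and the choice of 13 coefficients is a common convention, not a value from the patent.

```python
import numpy as np

def cepstral_features(power_spectrum, n_coeffs=13):
    """Cepstral analysis of one frame: log power spectrum followed
    by an orthonormal DCT-II, keeping the first n_coeffs
    coefficients, mirroring the MFCC step described above."""
    log_spec = np.log(np.asarray(power_spectrum, dtype=float) + 1e-10)  # avoid log(0)
    n = len(log_spec)
    k = np.arange(n_coeffs).reshape(-1, 1)
    m = np.arange(n)
    # DCT-II: X_k = 2 * sum_m x_m * cos(pi * (m + 0.5) * k / n)
    coeffs = 2.0 * (np.cos(np.pi * (m + 0.5) * k / n) @ log_spec)
    # orthonormal scaling (matches scipy's norm='ortho' convention)
    scale = np.full(n_coeffs, np.sqrt(1.0 / (2 * n)))
    scale[0] = np.sqrt(1.0 / (4 * n))
    return coeffs * scale
```

Applying this to each frame's power spectrum yields the per-frame acoustic feature vectors referred to in step 104.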
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, to obtain a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum and the speech amplitude spectrum.
After obtaining the noisy speech signal, the terminal applies windowing and framing, for example using a Hanning window to divide the signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, splitting the noisy speech signal into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal. Here, the spectrum of the noisy speech signal may be the energy amplitude spectrum of the noisy speech after the fast Fourier transform.
Further, the terminal can calculate the noisy speech amplitude spectrum and the noisy speech phase spectrum from the noisy speech spectrum, and perform dynamic noise estimation on the noisy speech spectrum according to them. Specifically, the terminal can apply an improved minima-controlled recursive averaging algorithm to the noisy speech spectrum to perform dynamic noise estimation, thereby obtaining the noise amplitude spectrum. From the noisy speech amplitude spectrum, the noisy speech phase spectrum and the noise amplitude spectrum, the terminal then estimates the speech amplitude spectrum of the clean speech signal; for example, it can use a log-magnitude spectrum nonlinear estimation method.
The terminal generates the spectral features corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum and the estimated clean-speech amplitude spectrum, and can thereby efficiently extract the spectral features of the noisy speech signal.
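The three amplitude spectra can then be stacked into the per-frame spectral feature. The sketch below substitutes plain spectral subtraction for the improved minima-controlled recursive averaging and log-magnitude estimation named above, so it illustrates the feature layout rather than the patent's actual estimators.

```python
import numpy as np

def spectral_features(noisy_mag, noise_mag):
    """Build the per-frame spectral feature from the noisy amplitude
    spectrum, the estimated noise amplitude spectrum, and a
    clean-speech amplitude estimate. The clean estimate here is
    simple spectral subtraction, a stand-in for the estimators
    described in the text."""
    # crude clean-speech estimate: subtract noise, floor at zero
    clean_mag = np.maximum(noisy_mag - noise_mag, 0.0)
    # stack the three spectra into one feature vector per frame
    return np.concatenate([noisy_mag, noise_mag, clean_mag], axis=-1)
```

Feeding the noise and clean-speech estimates to the classifier alongside the noisy spectrum is what gives the method its improved noise robustness over models whose input contains only noisy-speech features.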
In one embodiment, transforming the acoustic features and spectral features includes: extracting a preset number of frames before and after the current frame of the acoustic features and spectral features; calculating a mean vector and/or a variance vector for the current frame from those surrounding frames; and performing a log-domain conversion on the acoustic features and spectral features for which the mean vector and/or variance vector has been calculated, obtaining the transformed acoustic feature vectors and spectral feature vectors.
After the terminal obtains the noisy speech signal, it applies windowing and framing so that the noisy speech signal is divided into multiple frames. The terminal then performs a fast Fourier transform on the framed signal, obtaining the spectrum of the noisy speech signal. From this spectrum, the terminal can extract the acoustic features and spectral features corresponding to the noisy speech signal.
After extracting the acoustic features and spectral features of the noisy speech signal, the terminal converts them into acoustic feature vectors and spectral feature vectors. The terminal extracts a preset number of frames before and after the current frame of the acoustic feature vectors and spectral feature vectors, and uses those frames to calculate the mean vector or variance vector of the current frame, thereby smoothing the acoustic features and spectral features and obtaining the smoothed acoustic feature vectors and spectral feature vectors.
For example, the terminal may take the noisy speech spectra of the five frames preceding and the five frames following the current frame of the acoustic or spectral features — eleven frames in total — and obtain the mean vector of the current frame by averaging these eleven frames. Specifically, the terminal may obtain a filter bank in which each filter has a triangular shape, the triangular window representing the filter window. Given the characteristics of triangular filters, the filters can have equal bandwidth across the noisy speech spectral range. The terminal can use this filter bank to calculate the mean vector of the current frame, thereby smoothing the noisy speech spectrum and obtaining the smoothed acoustic feature vectors and spectral feature vectors.
After smoothing the noisy speech spectrum, the terminal computes the log domain of the smoothed acoustic feature vectors and smoothed spectral feature vectors, obtaining the transformed acoustic feature vectors and spectral feature vectors. Specifically, the terminal can calculate the log energy of the acoustic features and spectral features output by each filter, thereby obtaining the log-domain acoustic feature vectors and log-domain spectral feature vectors — that is, effectively obtaining the transformed acoustic feature vectors and spectral feature vectors.
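The ±5-frame smoothing and the log-domain conversion can be sketched together as follows. This is a simplified illustration that averages raw feature frames directly rather than through a triangular filter bank, and the epsilon floor is an assumption to keep the logarithm defined:

```python
import numpy as np

def smooth_and_log(features, context=5, eps=1e-10):
    """Smooth each frame by averaging over `context` frames on either
    side (11 frames for context=5), then apply a log-domain conversion."""
    n = len(features)
    smoothed = np.empty_like(features, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - context), min(n, i + context + 1)
        smoothed[i] = features[lo:hi].mean(axis=0)  # mean vector of frame i
    return np.log(smoothed + eps)  # log energies of the smoothed features
```

At the sequence edges the window is truncated rather than padded; whether the embodiment pads, truncates, or skips edge frames is not specified above.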
In one embodiment, before the step of obtaining the classifier, the method further includes: obtaining noisy speech data with speech class labels added, and training on the noisy speech data to obtain an initial classifier; obtaining a first verification set containing multiple first voice data; inputting the multiple first voice data into the classifier to obtain the class probability corresponding to each; screening the class probabilities corresponding to the multiple first voice data, adding class labels to the selected first voice data, and obtaining a verification set with class labels added; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple second speech data; inputting the multiple second speech data into the verification classifier to obtain their corresponding class probabilities; and, when the class probabilities corresponding to the multiple second speech data reach a preset probability value, obtaining the required classifier.
Before the classifier is obtained, it needs to be trained on a large amount of noisy speech data. This noisy speech data can be data the terminal obtains from a database or from the Internet. When training the classifier, the noisy speech data is first labeled manually, and the manually labeled noisy speech data is then used for training to obtain the classifier.
Specifically, after the terminal extracts the acoustic features and spectral features of the noisy speech data, it converts them into the corresponding acoustic feature vectors and spectral feature vectors. Annotators can label the acoustic feature vectors and spectral feature vectors according to a class reference table, adding a speech label or a non-speech label to each frame of the noisy speech signal. The terminal then obtains the noisy speech data labeled by the annotators according to the class reference table.
The terminal combines the labeled acoustic feature vectors and spectral feature vectors and inputs them to the input layer of a BLSTM (Bidirectional Long Short-Term Memory) neural network. The nonlinear hidden layers of the network learn new features from the input vectors, and the class to which an input vector belongs can be computed through activation functions. Specifically, each LSTM unit has three gates: a forget gate, a candidate gate, and an output gate. The specific calculation formula can be:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)
where σ denotes the activation function, W_f denotes the forget-gate weight matrix between the input layer and the hidden layer, and b_f denotes the forget-gate bias. The forget gate linearly combines the output h_(t-1) of the previous hidden layer with the current input x_t, then compresses the result to between 0 and 1 using the activation function. The closer the output value is to 1, the more information the memory cell retains; conversely, the closer it is to 0, the less information the memory cell retains.
The candidate gate computes the cell state for the current input; the specific formula can be:

C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c)
where C̃_t denotes the candidate cell state for the current input, whose output values are normalized to between -1 and 1 by the tanh activation function.
The output gate controls the amount of updated memory information passed to the next network layer; the formula can be expressed as:

O_t = σ(W_o · [h_(t-1), x_t] + b_o)
where O_t denotes the amount of updated memory information for the next network layer.
The final output can be computed by the LSTM unit; the formula can be expressed as:
h_t = O_t × tanh(C_t)
The final acoustic feature vector or spectral feature vector is obtained by combining the forward and backward passes; the formula can be expressed as:

h_i = [h→_i, h←_i]
where h→_i is the forward output vector, h←_i is the backward output vector, and h_i is the final set of acoustic feature vectors or spectral feature vectors bearing class labels.
Further, the output layer of the BLSTM can compute the value of the output unit according to a preset decision function. The value of the output unit lies between 0 and 1, where 1 represents the speech class and 0 represents the non-speech class.
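A single time step of the LSTM unit described above can be sketched in NumPy as follows. Note that the standard LSTM cell also uses an input gate to weight the candidate cell state when updating C_t; the stacked weight layout and gate ordering here are assumptions for illustration, not the patent's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x] to the stacked
    forget / input / candidate / output pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = len(h_prev)
    f = sigmoid(z[0:H])              # forget gate, compressed to (0, 1)
    i = sigmoid(z[H:2*H])            # input gate weighting the candidate
    c_tilde = np.tanh(z[2*H:3*H])    # candidate cell state, in (-1, 1)
    o = sigmoid(z[3*H:4*H])          # output gate
    c = f * c_prev + i * c_tilde     # new cell state
    h = o * np.tanh(c)               # h_t = O_t * tanh(C_t)
    return h, c
```

A bidirectional layer would run this recurrence once forward and once backward over the frame sequence and concatenate the two hidden states per frame, as in the combination formula above.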
Using the multiple acoustic feature vectors and spectral feature vectors bearing speech class labels, the terminal calculates the probability that each acoustic feature and spectral feature belongs to the speech class and the non-speech class in the class reference table, extracts for each acoustic feature vector and spectral feature vector the class with the maximum probability value in the class reference table, and adds to the vector the speech class label corresponding to that maximum-probability class.
The terminal trains on the noisy speech data with speech class labels added, obtaining an initial classifier. The terminal obtains a first verification set containing multiple first voice data, inputs them into the classifier to obtain their corresponding class probabilities, and screens those probabilities. Annotators add speech class labels to the first voice data selected by the terminal; the terminal obtains the labeled first voice data and uses it to generate a verification set with speech class labels added. The terminal then retrains using this labeled verification set together with the noisy speech data, obtaining a verification classifier. The terminal obtains a second verification set containing multiple second speech data, inputs them into the verification classifier, and obtains their class probabilities. The terminal filters out the second speech data whose class probability falls within a preset range, has that data relabeled, and retrains on the relabeled second speech data together with the labeled noisy speech data to obtain a new classifier. Training continues in this way until the probability values of a preset number of acoustic feature vectors or spectral feature vectors in all verification sets fall within the preset probability range, at which point training stops and the required classifier is obtained. A classifier with higher accuracy can thus be obtained, enabling accurate classification of the acoustic feature vectors and spectral feature vectors and, in turn, accurate discrimination between speech and non-speech.
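The iterative train-verify-relabel scheme above can be outlined as follows. This is a schematic sketch: `train_fn` stands in for BLSTM training, the classifier is abstracted as a function returning a speech probability, and the confidence band (0.2, 0.8) marking "uncertain" frames for manual relabeling is an assumption:

```python
def self_training_loop(train_fn, labeled_data, verification_sets,
                       prob_lo=0.2, prob_hi=0.8):
    """Sketch of the iterative scheme above: train, score a verification
    set, send uncertain frames back for relabeling, and retrain until the
    verification probabilities clear the confidence band."""
    classifier = train_fn(labeled_data)
    for verify_set in verification_sets:
        probs = [classifier(x) for x, _ in verify_set]
        uncertain = [(x, y) for (x, y), p in zip(verify_set, probs)
                     if prob_lo < p < prob_hi]   # needs manual relabeling
        if not uncertain:
            break                                # all frames are confident
        labeled_data = labeled_data + uncertain  # after (re)annotation
        classifier = train_fn(labeled_data)
    return classifier
```

Each verification set plays the role of the first and second verification sets in the embodiment; the loop terminates once a verification pass produces no probabilities inside the uncertain band.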
In one embodiment, the step of classifying the acoustic feature vectors and spectral feature vectors with the classifier includes: taking the acoustic feature vectors and spectral feature vectors as the input of the classifier and obtaining the corresponding decision values; when a decision value equals a first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals a second threshold, adding a non-speech label to the acoustic feature vector or spectral feature vector.
After the terminal obtains the noisy speech signal, it extracts the corresponding acoustic features and spectral features, converts them, and obtains the corresponding acoustic feature vectors and spectral feature vectors. After obtaining the classifier, the terminal inputs the acoustic feature vectors and spectral feature vectors into it. Once the classifier has classified the input vectors, the corresponding decision values can be obtained. When an obtained decision value equals the preset first threshold, the terminal adds a speech label to the acoustic feature vector or spectral feature vector; the first threshold can also be a range of values. When an obtained decision value equals the preset second threshold, the terminal adds a non-speech label to the acoustic feature vector or spectral feature vector. By classifying the acoustic feature vectors and spectral feature vectors accurately with the classifier, the speech and non-speech portions of the noisy speech signal can be identified accurately.
For example, the obtained decision value can be a value between 0 and 1. The preset first threshold can be 1 and the preset second threshold can be 0. When the obtained decision value is 1, the terminal adds a speech label to the acoustic feature vector or spectral feature vector; when it is 0, the terminal adds a non-speech label. The acoustic feature vectors and spectral feature vectors can thus be classified accurately.
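A thresholding sketch of this labeling step follows. Here a single 0.5 cut-off is assumed for illustration; the embodiment above compares the decision value against separate first and second thresholds, which may themselves be ranges:

```python
def label_frames(decision_values, first_threshold=0.5):
    """Map classifier decision values to 'speech' / 'non-speech' labels;
    treating values at or above the threshold as speech is an assumption
    (the embodiment above uses exact decision values of 1 and 0)."""
    return ['speech' if v >= first_threshold else 'non-speech'
            for v in decision_values]
```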
In one embodiment, as shown in Fig. 2, a speech endpoint detection device is provided, including an extraction module 202, a conversion module 204, a classification module 206, and a parsing module 208, wherein:
The extraction module 202 is configured to obtain a noisy speech signal and extract its corresponding acoustic features and spectral features.

The conversion module 204 is configured to convert the acoustic features and spectral features into the corresponding acoustic feature vectors and spectral feature vectors.

The classification module 206 is configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into it, obtaining acoustic feature vectors and spectral feature vectors with speech labels added.

The parsing module 208 is configured to parse the speech-labeled acoustic feature vectors and spectral feature vectors to obtain the corresponding speech signal, and to determine the starting point and ending point of the speech signal according to its time sequence.
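The parsing module's final step — deriving starting and ending points from the time sequence of speech-labeled frames — can be sketched as follows. This is a minimal illustration; the 10 ms frame shift and the representation of labels as 1/0 per frame are assumptions:

```python
def find_endpoints(frame_labels, frame_shift_ms=10):
    """Derive (start_ms, end_ms) speech segments from per-frame
    speech(1) / non-speech(0) labels, using the frame sequence order."""
    segments, start = [], None
    for i, is_speech in enumerate(frame_labels):
        if is_speech and start is None:
            start = i                      # speech segment begins
        elif not is_speech and start is not None:
            segments.append((start * frame_shift_ms, i * frame_shift_ms))
            start = None                   # speech segment ends
    if start is not None:                  # speech ran to the last frame
        segments.append((start * frame_shift_ms,
                         len(frame_labels) * frame_shift_ms))
    return segments
```

Each returned pair gives one speech segment's starting and ending point in milliseconds relative to the start of the signal.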
In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum, and to perform time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum and calculate a noisy speech amplitude spectrum from it; perform dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimate the amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generate the spectral features of the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, the conversion module 204 is further configured to extract a preset number of frames before and after the current frame of the acoustic features and spectral features; calculate the mean vector and/or variance vector of the current frame from those frames; and perform a log-domain conversion on the resulting acoustic features and spectral features, obtaining the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, the device further includes a training module configured to obtain noisy speech data with speech class labels added and train on it to obtain an initial classifier; obtain a first verification set containing multiple first voice data; input the first voice data into the initial classifier to obtain their class probabilities; screen those probabilities and add class labels to the selected first voice data, obtaining a labeled verification set; train with the labeled verification set and the labeled noisy speech data to obtain a verification classifier; obtain a second verification set containing multiple second speech data; input the second speech data into the verification classifier to obtain their class probabilities; and, when those probabilities reach the preset probability value, obtain the required classifier.

In one embodiment, the classification module 206 is further configured to take the acoustic feature vectors and spectral feature vectors as the classifier's input and obtain the corresponding decision values; when a decision value equals the first threshold, add a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals the second threshold, add a non-speech label.
In one embodiment, a computer device is provided. The computer device can be a terminal, and its internal structure can be as shown in Fig. 3. For example, the terminal can be, but is not limited to, a smartphone, a tablet computer, a laptop, a personal computer, a portable wearable device, or any other device with a speech input function. The computer device includes a processor, a memory, a network interface, and a speech input device connected through a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and computer program in the non-volatile storage medium. The network interface communicates with external terminals through a network connection. When the computer program is executed by the processor, a speech endpoint detection method is implemented. The speech input device can include a microphone, and can also include an external headset or the like.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the partial structure relevant to the present solution and does not limit the server to which the solution is applied; a specific server can include more or fewer components than shown, combine certain components, or use a different component arrangement.
In one embodiment, a computer device is provided, including a memory storing a computer program and a processor. When executing the computer program, the processor implements the following steps: obtaining a noisy speech signal and extracting its corresponding acoustic features and spectral features; converting the acoustic features and spectral features into the corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into it, obtaining speech-labeled acoustic feature vectors and spectral feature vectors; parsing the speech-labeled acoustic feature vectors and spectral feature vectors to obtain the corresponding speech signal; and determining the starting point and ending point of the speech signal according to its time sequence.

In one embodiment, when executing the computer program, the processor further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, when executing the computer program, the processor further implements the following steps: converting the noisy speech signal into a noisy speech spectrum and calculating a noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features of the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, when executing the computer program, the processor further implements the following steps: extracting a preset number of frames before and after the current frame of the acoustic features and spectral features; calculating the mean vector and/or variance vector of the current frame from those frames; and performing a log-domain conversion on the resulting acoustic features and spectral features, obtaining the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, when executing the computer program, the processor further implements the following steps: obtaining noisy speech data with speech class labels added and training on it to obtain an initial classifier; obtaining a first verification set containing multiple first voice data; inputting the first voice data into the classifier to obtain their class probabilities; screening those probabilities and adding class labels to the selected first voice data, obtaining a labeled verification set; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple second speech data; inputting the second speech data into the verification classifier to obtain their class probabilities; and, when those probabilities reach the preset probability value, obtaining the required classifier.

In one embodiment, when executing the computer program, the processor further implements the following steps: taking the acoustic feature vectors and spectral feature vectors as the classifier's input and obtaining the corresponding decision values; when a decision value equals the first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals the second threshold, adding a non-speech label.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining a noisy speech signal and extracting its corresponding acoustic features and spectral features; converting the acoustic features and spectral features into the corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into it, obtaining speech-labeled acoustic feature vectors and spectral feature vectors; parsing the speech-labeled acoustic feature vectors and spectral feature vectors to obtain the corresponding speech signal; and determining the starting point and ending point of the speech signal according to its time sequence.

In one embodiment, when executed by the processor, the computer program further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, when executed by the processor, the computer program further implements the following steps: converting the noisy speech signal into a noisy speech spectrum and calculating a noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features of the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, when executed by the processor, the computer program further implements the following steps: extracting a preset number of frames before and after the current frame of the acoustic features and spectral features; calculating the mean vector and/or variance vector of the current frame from those frames; and performing a log-domain conversion on the resulting acoustic features and spectral features, obtaining the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, when executed by the processor, the computer program further implements the following steps: obtaining noisy speech data with speech class labels added and training on it to obtain an initial classifier; obtaining a first verification set containing multiple first voice data; inputting the first voice data into the classifier to obtain their class probabilities; screening those probabilities and adding class labels to the selected first voice data, obtaining a labeled verification set; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple second speech data; inputting the second speech data into the verification classifier to obtain their class probabilities; and, when those probabilities reach the preset probability value, obtaining the required classifier.

In one embodiment, when executed by the processor, the computer program further implements the following steps: taking the acoustic feature vectors and spectral feature vectors as the classifier's input and obtaining the corresponding decision values; when a decision value equals the first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals the second threshold, adding a non-speech label.
Those of ordinary skill in the art will understand that all or part of the flow in the above embodiment methods can be accomplished by instructing the relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed it can include the flows of the embodiments of each of the above methods. Any reference to memory, storage, databases, or other media used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

1. A speech endpoint detection method, comprising:
obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
parsing the speech-labeled acoustic feature vectors and spectral feature vectors to obtain a corresponding speech signal; and
determining a starting point and an ending point of the speech signal according to the time sequence of the speech signal.
2. The method according to claim 1, wherein, before the extracting of the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises:
converting the noisy speech signal into a noisy speech spectrum; and
performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
3. The method according to claim 1, wherein before the extracting of the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises:
converting the noisy speech signal into a noisy speech spectrum, and computing a noisy speech amplitude spectrum from the noisy speech spectrum;
performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum;
estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and
generating the spectral features corresponding to the noisy speech signal from the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.
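The steps of claim 3 — tracking a noise amplitude spectrum from the noisy amplitude spectrum and then estimating the clean-speech amplitude — are in the spirit of spectral subtraction. The following is a minimal sketch under stated assumptions: the recursive smoothing factor, the number of initial noise-only frames, and the spectral floor are illustrative values, not parameters fixed by the patent.

```python
import numpy as np

def estimate_noise(noisy_mag, alpha=0.95, init_frames=5):
    """Dynamic noise estimate: initialise from the first frames
    (assumed noise-only), then track with recursive smoothing,
    updating faster where the signal drops below the estimate.
    noisy_mag: (n_frames, n_bins) amplitude spectrogram."""
    noise = noisy_mag[:init_frames].mean(axis=0)
    est = np.empty_like(noisy_mag)
    for t in range(noisy_mag.shape[0]):
        # adopt smaller values immediately; smooth upward movements
        noise = np.where(noisy_mag[t] < noise,
                         noisy_mag[t],
                         alpha * noise + (1 - alpha) * noisy_mag[t])
        est[t] = noise
    return est

def estimate_clean(noisy_mag, noise_mag, floor=0.01):
    """Estimate the clean-speech amplitude spectrum by spectral
    subtraction with a spectral floor to avoid negative amplitudes."""
    return np.maximum(noisy_mag - noise_mag, floor * noisy_mag)
```

On a stationary input the estimate converges to the input itself, so the subtracted output falls to the spectral floor — the expected behaviour for a noise-only signal.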
4. The method according to claim 1, wherein the converting of the acoustic features and spectral features comprises:
extracting a preset number of frames before and after the current frame in the acoustic features and the spectral features;
computing a mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and
performing a log-domain conversion on the acoustic features and spectral features after computing the mean vector and/or variance vector of the current frame, to obtain the converted acoustic feature vectors and spectral feature vectors.
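The conversion of claim 4 — per-frame mean and variance vectors over a context window of surrounding frames, followed by a log-domain conversion — might look like the sketch below. The window size, the use of the statistics for normalisation, and the particular log mapping are all illustrative assumptions.

```python
import numpy as np

def convert_features(feats, context=2, eps=1e-8):
    """For each frame, compute mean and variance vectors over
    `context` frames before and after it, normalise the frame with
    those statistics, and map the result to the log domain.
    feats: (n_frames, dim) feature matrix."""
    n = feats.shape[0]
    out = np.empty_like(feats, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - context), min(n, t + context + 1)
        window = feats[lo:hi]
        mean = window.mean(axis=0)          # mean vector over the window
        var = window.var(axis=0)            # variance vector over the window
        normed = (feats[t] - mean) / np.sqrt(var + eps)
        out[t] = np.log1p(np.abs(normed))   # log-domain conversion
    return out
```

Clamping the window at the signal boundaries means the first and last frames use a one-sided context rather than requiring padding.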
5. The method according to claim 1, wherein before the step of obtaining a classifier, the method further comprises:
obtaining noisy speech data with voice class labels added, and training on the noisy speech data to obtain an initial classifier;
obtaining a first verification set, the first verification set comprising a plurality of first speech data;
inputting the plurality of first speech data into the initial classifier to obtain class probabilities corresponding to the plurality of first speech data;
screening the class probabilities corresponding to the plurality of first speech data, and adding class labels to the selected first speech data to obtain a class-labeled verification set;
training with the class-labeled verification set and the voice-class-labeled noisy speech data to obtain a verification classifier;
obtaining a second verification set, the second verification set comprising a plurality of second speech data;
inputting the plurality of second speech data into the verification classifier to obtain class probabilities corresponding to the plurality of second speech data; and
obtaining the required classifier when the class probabilities corresponding to the plurality of second speech data reach a preset probability value.
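The training procedure of claim 5 resembles self-training: the initial classifier's predictions on an unlabelled verification set are screened by class probability, and the confident ones are fed back as labelled data. A minimal sketch of the screening step follows; the probability threshold is an illustrative assumption, and the per-sample probabilities stand in for the output of any classifier that returns a voice-class probability.

```python
def screen_by_probability(samples, probs, threshold=0.9):
    """Keep only samples the classifier is confident about, and
    attach the predicted class as a pseudo-label.
    probs: per-sample probability of the 'voice' class."""
    labelled = []
    for sample, p in zip(samples, probs):
        if p >= threshold:
            labelled.append((sample, 1))    # confident voice
        elif p <= 1 - threshold:
            labelled.append((sample, 0))    # confident non-voice
        # otherwise: uncertain, excluded from the augmented training set
    return labelled
```

Discarding the uncertain middle band keeps noisy pseudo-labels out of the retraining data, which is the usual rationale for this kind of screening.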
6. The method according to any one of claims 1 to 5, wherein the step of classifying the acoustic feature vectors and spectral feature vectors with the classifier comprises:
taking the acoustic feature vectors and spectral feature vectors as the input of the classifier to obtain decision values corresponding to the acoustic feature vectors and spectral feature vectors;
adding a voice label to an acoustic feature vector or spectral feature vector when its decision value is a first threshold; and
adding a non-voice label to an acoustic feature vector or spectral feature vector when its decision value is a second threshold.
7. A voice endpoint detection apparatus, comprising:
an extraction module, configured to obtain a noisy speech signal and extract acoustic features and spectral features corresponding to the noisy speech signal;
a conversion module, configured to convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
a classification module, configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into the classifier, obtaining acoustic feature vectors and spectral feature vectors with voice labels added; and
a parsing module, configured to parse the voice-labeled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal, and to determine the start point and end point of the voice signal according to the time sequence of the voice signal.
8. The apparatus according to claim 7, wherein the conversion module is further configured to extract a preset number of frames before and after the current frame in the acoustic features and the spectral features; compute the mean vector and/or variance vector of the current frame using the preset number of frames before and after the current frame; and perform a log-domain conversion on the acoustic features and spectral features after computing the mean vector and/or variance vector of the current frame, obtaining the converted acoustic feature vectors and spectral feature vectors.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN201810048223.3A 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium Active CN108198547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108198547A true CN108198547A (en) 2018-06-22
CN108198547B CN108198547B (en) 2020-10-23

Family

ID=62589616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810048223.3A Active CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108198547B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004272201A (en) * 2002-09-27 2004-09-30 Matsushita Electric Ind Co Ltd Method and device for detecting speech end point
US20060155537A1 (en) * 2005-01-12 2006-07-13 Samsung Electronics Co., Ltd. Method and apparatus for discriminating between voice and non-voice using sound model
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
US20090254341A1 (en) * 2008-04-03 2009-10-08 Kabushiki Kaisha Toshiba Apparatus, method, and computer program product for judging speech/non-speech
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN103730124A (en) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 Noise robustness endpoint detection method based on likelihood ratio test
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
US20160379632A1 (en) * 2015-06-29 2016-12-29 Amazon Technologies, Inc. Language model speech endpointing
CN107393526A (en) * 2017-07-19 2017-11-24 腾讯科技(深圳)有限公司 Speech silence detection method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bao Wujie: "Voice Endpoint Detection Based on a Speech Enhancement Method", Modern Electronics Technique (《现代电子技术》) *
Wang Lucai: "An Improved Endpoint Detection Method for Noisy Speech", Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN108922556A (en) * 2018-07-16 2018-11-30 百度在线网络技术(北京)有限公司 sound processing method, device and equipment
CN110752973A (en) * 2018-07-24 2020-02-04 Tcl集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN110752973B (en) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN109036471A (en) * 2018-08-20 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method and equipment
WO2020173488A1 (en) * 2019-02-28 2020-09-03 北京字节跳动网络技术有限公司 Audio starting point detection method and apparatus
US12119023B2 (en) 2019-02-28 2024-10-15 Beijing Bytedance Network Technology Co., Ltd. Audio onset detection method and apparatus
CN110265032A (en) * 2019-06-05 2019-09-20 平安科技(深圳)有限公司 Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110415704A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium are put down in court's trial
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN110808061B (en) * 2019-11-11 2022-03-15 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110808061A (en) * 2019-11-11 2020-02-18 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110910906A (en) * 2019-11-12 2020-03-24 国网山东省电力公司临沂供电公司 Audio endpoint detection and noise reduction method based on power intranet
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN111626061A (en) * 2020-05-27 2020-09-04 深圳前海微众银行股份有限公司 Conference record generation method, device, equipment and readable storage medium
CN111916060A (en) * 2020-08-12 2020-11-10 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN111916060B (en) * 2020-08-12 2022-03-01 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium
CN113327626A (en) * 2021-06-23 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113744725A (en) * 2021-08-19 2021-12-03 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN113744725B (en) * 2021-08-19 2024-07-05 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN114974258A (en) * 2022-07-27 2022-08-30 深圳市北科瑞声科技股份有限公司 Speaker separation method, device, equipment and storage medium based on voice processing
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Method, device, equipment and medium for training and detecting voice activity detection model

Also Published As

Publication number Publication date
CN108198547B (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN108198547A (en) Sound end detecting method, device, computer equipment and storage medium
CN108877775B (en) Voice data processing method and device, computer equipment and storage medium
EP3955246B1 (en) Voiceprint recognition method and device based on memory bottleneck feature
US9792897B1 (en) Phoneme-expert assisted speech recognition and re-synthesis
Reynolds An overview of automatic speaker recognition technology
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
Revathi et al. Speaker independent continuous speech and isolated digit recognition using VQ and HMM
Kim et al. Robust DTW-based recognition algorithm for hand-held consumer devices
Mandel et al. Audio super-resolution using concatenative resynthesis
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
WO2003065352A1 (en) Method and apparatus for speech detection using time-frequency variance
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
Gambhir et al. Residual networks for text-independent speaker identification: Unleashing the power of residual learning
Muralikrishna et al. HMM based isolated Kannada digit recognition system using MFCC
Petrovska-Delacrétaz et al. Text-independent speaker verification: state of the art and challenges
Mardhotillah et al. Speaker recognition for digital forensic audio analysis using support vector machine
Nasibov Decision fusion of voice activity detectors
Shome et al. A robust technique for end point detection under practical environment
CN113658599A (en) Conference record generation method, device, equipment and medium based on voice recognition
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Alkhatib et al. ASR Features Extraction Using MFCC And LPC: A Comparative Study
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
Gbadamosi Text independent biometric speaker recognition system
Marković et al. Recognition of normal and whispered speech based on RASTA filtering and DTW algorithm
Yousafzai et al. Tuning support vector machines for robust phoneme classification with acoustic waveforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant