CN108198547A - Voice endpoint detection method and apparatus, computer device and storage medium - Google Patents
- Publication number
- CN108198547A CN108198547A CN201810048223.3A CN201810048223A CN108198547A CN 108198547 A CN108198547 A CN 108198547A CN 201810048223 A CN201810048223 A CN 201810048223A CN 108198547 A CN108198547 A CN 108198547A
- Authority
- CN
- China
- Prior art keywords
- acoustic feature
- voice
- spectral feature
- noisy speech
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
- G10L19/038—Vector quantisation, e.g. TwinVQ audio
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Telephonic Communication Services (AREA)
Abstract
This application relates to a voice endpoint detection method and apparatus, a computer device, and a storage medium. The method includes: obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal; converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors tagged with voice labels; parsing the voice-labelled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal; and determining the start point and end point of the voice signal according to the timing of the voice signal. The method can effectively improve the accuracy of voice endpoint detection.
Description
Technical field
This application relates to the field of signal processing technology, and in particular to a voice endpoint detection method and apparatus, a computer device, and a storage medium.
Background technology
With the continuous development of voice technology, speech terminals detection technology is occupied highly important in speech recognition technology
Status.Speech terminals detection is the starting point and ending point that phonological component is detected from one section of continuous noise speech, so as to
Voice can be efficiently identified out.
There are two types of traditional speech terminals detection modes, and a kind of is the spy of the time domain according to voice and noise signal and frequency domain
Sign is different, extracts the feature of each segment signal, the feature of each segment signal is compared with the threshold value set, so as to carry out language
Voice endpoint detects.But this mode is only applicable to detect under the conditions of stationary noise, and noise robustness is poor, it is difficult to distinguish clean speech
And noise, the accuracy for leading to speech terminals detection are relatively low..It is another then be the mode based on neural network, by using instruction
Practice model and end-point detection is carried out to voice signal.However the input vector of big multi-model contains only the feature of noisy speech so that
Noise robustness is poor, relatively low so as to cause the accuracy of speech terminals detection.Therefore, how speech terminals detection is effectively improved
Accuracy becomes the current technical issues that need to address.
Summary
In view of the above, it is necessary to provide a voice endpoint detection method and apparatus, a computer device, and a storage medium that can effectively improve the accuracy of voice endpoint detection.

A voice endpoint detection method includes:

obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;

converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors tagged with voice labels;

parsing the voice-labelled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal;

determining the start point and end point of the voice signal according to the timing of the voice signal.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the voice amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features corresponding to the noisy speech signal from the noisy speech amplitude spectrum, the noise amplitude spectrum, and the voice amplitude spectrum.
In one embodiment, converting the acoustic features and spectral features includes: extracting a preset number of frames before and after the current frame from the acoustic features and the spectral features; calculating a mean vector and/or variance vector for the current frame from the preset number of preceding and following frames; and performing log-domain conversion on the acoustic features and spectral features after calculating the mean vector and/or variance vector for the current frame, to obtain the transformed acoustic feature vectors and spectral feature vectors.
In one embodiment, before the step of obtaining the classifier, the method further includes:

obtaining noisy speech data tagged with voice class labels, and training on the noisy speech data to obtain a preliminary classifier;

obtaining a first verification set containing multiple items of first voice data;

inputting the multiple items of first voice data into the classifier to obtain the class probabilities corresponding to the first voice data;

screening the class probabilities of the first voice data, and adding class labels to the selected first voice data to obtain a verification set with class labels;

training with the labelled verification set and the training set to obtain a verification classifier;

obtaining a second verification set containing multiple items of second voice data;

inputting the multiple items of second voice data into the verification classifier to obtain the class probabilities corresponding to the second voice data;

when the class probabilities of the second voice data reach a preset probability value, obtaining the required classifier.
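The two-stage verification training above can be sketched as follows. This is a minimal illustration using scikit-learn and synthetic two-dimensional features as stand-ins for the acoustic and spectral feature vectors; the screening threshold (0.9), acceptance threshold (0.8), and classifier type are illustrative assumptions, none of which the patent fixes.

```python
# Sketch of: preliminary training -> screen verification set 1 by class
# probability -> pseudo-label confident samples -> retrain -> check
# probabilities on verification set 2. Thresholds are assumed, not specified.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
shift = np.array([[2.0, 2.0]])
X_train = rng.normal(0, 1, (200, 2)) + shift * (np.arange(200) % 2)[:, None]
y_train = np.arange(200) % 2                         # labelled noisy-speech data
X_ver1 = rng.normal(0, 1, (100, 2)) + shift * (np.arange(100) % 2)[:, None]
X_ver2 = rng.normal(0, 1, (100, 2)) + shift * (np.arange(100) % 2)[:, None]

clf = LogisticRegression().fit(X_train, y_train)     # preliminary classifier
probs = clf.predict_proba(X_ver1).max(axis=1)        # class probabilities, set 1
mask = probs > 0.9                                   # screen confident samples
pseudo = clf.predict(X_ver1[mask])                   # add class labels to them
clf = LogisticRegression().fit(                      # verification classifier:
    np.vstack([X_train, X_ver1[mask]]),              # training set + labelled
    np.concatenate([y_train, pseudo]))               # verification set
ok = clf.predict_proba(X_ver2).max(axis=1).mean() > 0.8   # acceptance on set 2
```

When `ok` holds, the retrained model would serve as the required classifier; otherwise another round of screening and retraining could follow.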
In one embodiment, the step of classifying the acoustic feature vectors and spectral feature vectors with the classifier includes:

taking the acoustic feature vectors and spectral feature vectors as the input of the classifier, and obtaining the decision values corresponding to the acoustic feature vectors and spectral feature vectors;

when the decision value equals a first threshold, adding a voice label to the acoustic feature vector or spectral feature vector;

when the decision value equals a second threshold, adding a non-voice label to the acoustic feature vector or spectral feature vector.
A voice endpoint detection apparatus includes:

an extraction module, configured to obtain a noisy speech signal and extract the acoustic features and spectral features corresponding to the noisy speech signal;

a conversion module, configured to convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

a classification module, configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors tagged with voice labels;

a parsing module, configured to parse the voice-labelled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal, and to determine the start point and end point of the voice signal according to the timing of the voice signal.
In one embodiment, the conversion module is further configured to extract a preset number of frames before and after the current frame from the acoustic features and the spectral features; calculate a mean vector and/or variance vector for the current frame from the preset number of preceding and following frames; and perform log-domain conversion on the acoustic features and spectral features after calculating the mean vector and/or variance vector, to obtain the transformed acoustic feature vectors and spectral feature vectors.
A computer device includes a memory and a processor, the memory storing a computer program. When executing the computer program, the processor implements the following steps:

obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;

converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors tagged with voice labels;

parsing the voice-labelled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal;

determining the start point and end point of the voice signal according to the timing of the voice signal.
A computer-readable storage medium stores a computer program. When the computer program is executed by a processor, the following steps are implemented:

obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;

converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;

obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors tagged with voice labels;

parsing the voice-labelled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal;

determining the start point and end point of the voice signal according to the timing of the voice signal.
With the above voice endpoint detection method and apparatus, computer device, and storage medium, a noisy speech signal is obtained and its corresponding acoustic features and spectral features are extracted; by converting the acoustic features and spectral features, corresponding acoustic feature vectors and spectral feature vectors are obtained. A classifier is obtained, and the acoustic feature vectors and spectral feature vectors are input into it, yielding acoustic feature vectors and spectral feature vectors tagged with voice labels; the feature vectors can thus be classified effectively, so that voice and non-voice are identified reliably. The voice-labelled acoustic feature vectors and spectral feature vectors are parsed to obtain the corresponding voice signal, and the start point and end point of the voice signal are determined from its timing. The start point and end point of the noisy speech signal can thereby be identified accurately, effectively improving the accuracy of voice endpoint detection.
Brief description of the drawings

Fig. 1 is a flowchart of a voice endpoint detection method in one embodiment;

Fig. 2 is an internal structure diagram of a voice endpoint detection apparatus in one embodiment;

Fig. 3 is an internal structure diagram of a computer device in one embodiment.
Detailed description

To make the objectives, technical solutions, and advantages of this application more clearly understood, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the application, not to limit it. The terms "first", "second", and the like used herein may describe various elements, but those elements are not limited by these terms; the terms serve only to distinguish one element from another.
In one embodiment, as shown in Fig. 1, a voice endpoint detection method is provided. Taking its application on a terminal as an example, the method includes the following steps.

Step 102: obtain a noisy speech signal, and extract the acoustic features and spectral features corresponding to the noisy speech signal.

In practice, captured voice signals usually contain noise of some strength. When the noise is strong, it significantly degrades voice applications, for example lowering speech recognition efficiency and endpoint detection accuracy.

The terminal can obtain the voice input by a user through a voice input device. The terminal may be a smartphone, tablet computer, laptop, desktop computer, or the like; the voice input device may be any device with a voice recording function, such as a microphone. The voice the terminal receives is usually a noise-containing noisy speech signal, such as call audio, a recording, or a voice command. After obtaining the noisy speech signal, the terminal extracts its corresponding acoustic features and spectral features. The acoustic features may include characteristics such as the unvoiced sounds, voiced sounds, vowels, and consonants of the noisy speech signal. The spectral features may include characteristics such as the vibration frequency and amplitude, loudness, and timbre of the noisy speech signal.

Specifically, after obtaining the noisy speech signal, the terminal applies windowing and framing to it. For example, a Hanning window may be used to divide the noisy speech signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, so that the signal is divided into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the framed noisy speech signal, obtaining its spectrum, from which the corresponding acoustic features and spectral features can then be extracted.
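The windowing, framing, and FFT step described above can be sketched as follows. The function name, the 16 kHz sample rate, a 25 ms frame length within the stated 10-30 ms range, and the 512-point FFT are illustrative assumptions, not values fixed by the text.

```python
# Minimal sketch: Hanning-windowed framing (25 ms frames, 10 ms shift)
# followed by a per-frame FFT magnitude spectrum.
import numpy as np

def frame_and_fft(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Split a 1-D signal into windowed frames; return each frame's magnitude spectrum."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)            # 160-sample frame shift
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT per frame -> (n_frames, n_fft // 2 + 1)
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

spec = frame_and_fft(np.random.randn(16000))          # 1 s of noise -> 98 frames
print(spec.shape)  # (98, 257)
```

Each row of `spec` is one frame's spectrum, from which the acoustic and spectral features of that frame can be derived.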
Step 104: convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.

After extracting the acoustic features and spectral features corresponding to the noisy speech signal, the terminal converts them: the acoustic features are converted into corresponding acoustic feature vectors, and the spectral features into corresponding spectral feature vectors.

Step 106: obtain a classifier, and input the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors tagged with voice labels.

The terminal obtains a classifier that was trained before voice endpoint detection is performed. By adding voice labels and non-voice labels to the acoustic feature vectors and spectral feature vectors, the classifier divides the input vectors into voice-class acoustic feature vectors and spectral feature vectors and non-voice-class acoustic feature vectors and spectral feature vectors. The terminal inputs the acoustic feature vectors and spectral feature vectors corresponding to the noisy speech into the classifier and classifies them. When an input acoustic feature vector or spectral feature vector belongs to the voice class, a voice label is added to it; when it belongs to the non-voice class, a non-voice label is added. Voice and non-voice can thus be recognized accurately. After the classifier has processed the acoustic feature vectors and spectral feature vectors, the voice-labelled acoustic feature vectors and spectral feature vectors are obtained.

Further, the terminal may take the acoustic feature vectors and spectral feature vectors as the classifier's input and obtain the corresponding decision values, then add voice or non-voice labels to the vectors according to the decision values, thereby classifying the acoustic feature vectors and spectral feature vectors accurately.
Step 108: parse the voice-labelled acoustic feature vectors and spectral feature vectors to obtain the voice signal with voice labels added.

Step 110: determine the start point and end point of the voice signal according to its voice labels and timing.

After classifying the acoustic feature vectors and spectral feature vectors, the terminal needs to parse the vectors that carry voice labels. Specifically, the terminal parses the voice-labelled acoustic feature vectors and spectral feature vectors to recover the spectra corresponding to the voice-labelled acoustic and spectral features. Following the timing of the noisy speech signal, the terminal converts those spectra back into the corresponding voice signal.

The noisy speech signal has a temporal order, and the timing of the voice-labelled signal still corresponds to that of the noisy speech signal. Having resolved the voice-labelled acoustic feature vectors and spectral feature vectors into the corresponding voice-labelled signal, the terminal can determine the start point and end point of the noisy speech signal from the voice labels and timing of the voice signal.

For example, after the classifier has classified the input acoustic feature vectors and spectral feature vectors, the resulting decision value may be a value between 0 and 1. When the decision value is 1, the terminal adds a voice label to the acoustic feature vector or spectral feature vector; when it is 0, a non-voice label is added. The feature vectors can thereby be classified accurately. After parsing the voice-labelled feature vectors, the voice-labelled signal is obtained. Following its timing, the frame where a voice label first appears is the start point of the noisy speech signal, and the frame where a voice label last appears is its end point. Further, a jump of the decision value from 0 to 1 can determine a start point of the voice signal, and a jump from 1 to 0 an end point. The start point and end point of the noisy speech signal can thus be determined accurately.
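The jump rule above (0-to-1 marks a start point, 1-to-0 an end point) can be sketched over a per-frame decision sequence; the helper name and the example frame values are illustrative, not from the text.

```python
# Locate (start, end) frame pairs from binary per-frame decision values.
import numpy as np

def endpoints(decisions):
    """Return (start, end) frame-index pairs from a 0/1 decision sequence."""
    d = np.asarray(decisions)
    edges = np.diff(np.concatenate([[0], d, [0]]))   # pad so edges at borders count
    starts = np.where(edges == 1)[0]                 # 0 -> 1 jumps
    ends = np.where(edges == -1)[0] - 1              # 1 -> 0 jumps (last voiced frame)
    return list(zip(starts, ends))

segs = endpoints([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
print(segs)  # [(2, 4), (7, 8)]
```

Multiplying the frame indices by the frame shift (e.g. 10 ms) converts these endpoints back to times in the original signal.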
In this embodiment, after obtaining the noisy speech signal, the terminal extracts the corresponding acoustic features and spectral features and, by converting them, obtains corresponding acoustic feature vectors and spectral feature vectors. By inputting the acoustic feature vectors and spectral feature vectors into the classifier, voice-labelled acoustic feature vectors and spectral feature vectors are obtained, so the feature vectors are classified effectively and voice and non-voice are identified reliably. The terminal parses the voice-labelled acoustic feature vectors and spectral feature vectors to obtain the corresponding voice signal, and determines the start point and end point of the voice signal according to its timing. The start point and end point of the noisy speech signal can thereby be identified accurately, effectively improving the accuracy of voice endpoint detection.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In phonetics, speech features can be divided into acoustic categories such as vowels, consonants, unvoiced sounds, voiced sounds, and silence. After obtaining the noisy speech signal, the terminal applies windowing and framing to it. For example, a Hanning window may divide the noisy speech signal into frames of 10-30 ms (milliseconds) with a 10 ms frame shift, so the signal becomes multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the framed noisy speech signal, obtaining its spectrum.

Further, the terminal can perform time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

For example, the terminal may extract the acoustic features using MFCCs (Mel-Frequency Cepstrum Coefficients). After windowing and framing, the noisy speech signal is converted into its spectrum; the terminal converts the spectrum of the noisy speech signal into the noisy speech cepstrum, performs cepstral analysis on it, and applies a discrete cosine transform to obtain the acoustic features of each frame, thereby obtaining effective acoustic features of the noisy speech.
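The MFCC pipeline named above can be sketched compactly: power spectrum, mel filter bank, log, then discrete cosine transform. The 26-filter bank and 13 coefficients are common defaults assumed here; the patent does not specify them.

```python
# Minimal MFCC sketch from per-frame power spectra.
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power(power_frames, sample_rate=16000, n_filters=26, n_ceps=13):
    """power_frames: (n_frames, n_fft//2 + 1) power spectrum of windowed frames."""
    n_fft = 2 * (power_frames.shape[1] - 1)
    # Triangular filters equally spaced on the mel scale
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, power_frames.shape[1]))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energy = np.log(power_frames @ fbank.T + 1e-10)   # cepstral analysis
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]  # DCT

feats = mfcc_from_power(np.abs(np.fft.rfft(np.random.randn(10, 400), n=512)) ** 2)
print(feats.shape)  # (10, 13)
```

The resulting per-frame coefficient rows would serve as the acoustic features of the noisy speech frames.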
In one embodiment, it before the corresponding acoustic feature of extraction Noisy Speech Signal and spectrum signature, further includes:
Noisy Speech Signal is converted into noisy speech frequency spectrum, noisy speech amplitude spectrum is calculated according to noisy speech frequency spectrum;It is made an uproar according to band
Voice amplitudes spectrum carries out dynamic noise estimation to noisy speech frequency spectrum, obtains noise amplitude spectrum;According to noisy speech amplitude spectrum and
The voice amplitudes spectrum of noise amplitude Power estimation clean speech signal;Utilize noisy speech amplitude spectrum, noise amplitude spectrum and voice width
The corresponding spectrum signature of degree spectrum generation Noisy Speech Signal.
After the terminal obtains the noisy speech signal, it applies windowing and framing to the signal. For example, a Hanning window may be used to divide the noisy speech signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, so that the noisy speech signal is divided into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, thereby obtaining the spectrum of the noisy speech signal. The spectrum of the noisy speech signal may be the energy amplitude spectrum of the noisy speech after the fast Fourier transform.
Further, the terminal can compute the noisy speech amplitude spectrum and the noisy speech phase spectrum from the noisy speech spectrum. The terminal performs dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum and the noisy speech phase spectrum. Specifically, the terminal may perform dynamic noise estimation on the noisy speech spectrum using an improved minima controlled recursive averaging (IMCRA) algorithm, thereby obtaining the noise amplitude spectrum. The terminal then estimates the speech amplitude spectrum of the speech signal from the noisy speech amplitude spectrum, the noisy speech phase spectrum, and the noise amplitude spectrum. For example, the terminal may estimate the speech amplitude spectrum using a nonlinear estimation method on the log-amplitude spectrum.

The terminal generates the spectral feature corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the estimated speech amplitude spectrum of the clean speech signal, so that the terminal can efficiently extract the spectral feature corresponding to the noisy speech signal.
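A minimal sketch of this spectral-feature pipeline follows. The patent uses improved minima controlled recursive averaging for noise estimation and a log-amplitude-spectrum estimator for the clean speech; here a simple recursive minimum tracker stands in for IMCRA and spectral subtraction stands in for the amplitude estimator, so the smoothing constants and the stacked feature layout are illustrative assumptions only:

```python
import numpy as np

def spectral_features(noisy, frame_len=400, hop=160, n_fft=512):
    """Noisy amplitude spectrum -> noise estimate -> clean-speech amplitude
    estimate -> stacked per-frame spectral feature (illustrative layout)."""
    win = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(noisy) - frame_len) // hop)
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    noisy_amp = np.abs(np.fft.rfft(frames, n_fft))   # noisy speech amplitude spectrum
    # Dynamic noise estimation: recursive minimum tracking, a stand-in for IMCRA
    noise_amp = np.empty_like(noisy_amp)
    floor = noisy_amp[0].copy()
    for t in range(n_frames):
        floor = np.minimum(0.998 * floor + 0.002 * noisy_amp[t], noisy_amp[t])
        noise_amp[t] = floor
    # Clean-speech amplitude estimate by spectral subtraction, floored at zero
    speech_amp = np.maximum(noisy_amp - noise_amp, 0.0)
    # Spectral feature: log amplitudes of the three spectra stacked per frame
    eps = 1e-10
    return np.concatenate([np.log(noisy_amp + eps),
                           np.log(noise_amp + eps),
                           np.log(speech_amp + eps)], axis=1)
```

Each frame's feature concatenates the three log-amplitude spectra, mirroring the text's use of all three spectra to form the spectral feature.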
In one embodiment, converting the acoustic feature and the spectral feature includes: extracting a preset number of frames before and after the current frame in the acoustic feature and the spectral feature; computing the mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and performing a log-domain conversion on the acoustic feature and the spectral feature after computing the mean vector and/or variance vector corresponding to the current frame, obtaining the converted acoustic feature vector and spectral feature vector.
After the terminal obtains the noisy speech signal, it applies windowing and framing, thereby dividing the noisy speech signal into multiple frames. The terminal then applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal, from which it can extract the acoustic feature and spectral feature corresponding to the noisy speech signal.

After the terminal extracts the acoustic feature and spectral feature corresponding to the noisy speech signal, it converts them to an acoustic feature vector and a spectral feature vector. The terminal extracts a preset number of frames before and after the current frame in the acoustic feature vector and the spectral feature vector, and uses those frames to compute the mean vector or variance vector corresponding to the current frame, thereby smoothing the acoustic feature and the spectral feature and obtaining the smoothed acoustic feature vector and spectral feature vector.
For example, the terminal may take the five frames of noisy speech spectrum before and the five frames after the current frame of the acoustic feature or spectral feature, eleven frames in total. By averaging these eleven frames, the mean vector of the current frame can be obtained. Specifically, the terminal may obtain a filter bank in which each filter is triangular, the triangular window representing the filter window. Within the noisy speech spectral range, these filters may have equal bandwidth. The terminal can use the filter bank to compute the mean vector of the current frame, thereby smoothing the noisy speech spectrum and obtaining the smoothed acoustic feature vector and spectral feature vector.

After smoothing the noisy speech spectrum, the terminal converts the smoothed acoustic feature vector and spectral feature vector to the log domain, obtaining the converted acoustic feature vector and spectral feature vector. Specifically, the terminal can compute the logarithmic energy of the acoustic feature and spectral feature output by each filter, thereby obtaining the log-domain acoustic feature vector and spectral feature vector and effectively obtaining the converted acoustic feature vector and spectral feature vector.
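The smoothing-and-log-conversion step can be sketched as follows, assuming a symmetric context of five frames on each side (eleven frames total) and triangular weights standing in for the triangular filter window; both choices are illustrative, not fixed by the text:

```python
import numpy as np

def smooth_and_log(features, context=5):
    """Smooth each frame with `context` frames on either side (11 frames total
    for context=5) using triangular weights, then convert to the log domain."""
    n, d = features.shape
    # Triangular weights peak at the current frame, like a triangular filter window
    w = np.concatenate([np.arange(1, context + 2),
                        np.arange(context, 0, -1)]).astype(float)
    w /= w.sum()
    # Edge-pad so the first and last frames also have a full context window
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    smoothed = np.stack([padded[i:i + 2 * context + 1].T @ w for i in range(n)])
    # Log-domain conversion of the smoothed, non-negative features
    return np.log(np.maximum(smoothed, 1e-10))
```

The output keeps the input's frame-by-dimension shape, so it can be fed directly to the classifier stage.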
In one embodiment, before the step of obtaining the classifier, the method further includes: obtaining noisy speech data with speech class labels added, and training on the noisy speech data to obtain a preliminary classifier; obtaining a first verification set containing multiple pieces of first speech data; inputting the multiple pieces of first speech data to the classifier to obtain the class probability corresponding to each piece of first speech data; screening the class probabilities corresponding to the first speech data, adding class labels to the selected first speech data, and obtaining a verification set with class labels added; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple pieces of second speech data; inputting the multiple pieces of second speech data to the verification classifier to obtain the corresponding class probabilities; and, when the class probabilities corresponding to the second speech data reach a preset probability value, obtaining the required classifier.
Before the classifier is obtained, it needs to be trained on a large amount of noisy speech data, which may be noisy speech data the terminal obtains from a database or from the internet. When training the classifier, the noisy speech data is first labeled manually, and the manually labeled noisy speech data is then used for training to obtain the classifier.

Specifically, after the terminal extracts the acoustic feature and spectral feature corresponding to the noisy speech data, it converts them to the corresponding acoustic feature vector and spectral feature vector. Annotators can label the acoustic feature vectors and spectral feature vectors according to a class reference table, adding a speech label or a non-speech label to each frame of the noisy speech signal. The terminal obtains the noisy speech data labeled by the annotators according to the class reference table.
The terminal combines the labeled acoustic feature vectors and spectral feature vectors and inputs them to the input layer of a BLSTM (Bidirectional Long Short-Term Memory neural network). The nonlinear hidden layers of the LSTM network learn new features from the input vectors and can compute, through activation functions, the class to which each input vector belongs. Specifically, each LSTM unit has three gates: a forget gate, a candidate gate, and an output gate. The specific calculation formula may be:

ft = σ(Wf·xt + Uf·ht-1 + bf)

where σ denotes the activation function, Wf denotes the forget-gate weight matrix, Uf denotes the forget-gate weight matrix between the input layer and the hidden layer, and bf denotes the forget-gate bias. The forget gate linearly combines the output ht-1 of the previous hidden layer with the current input xt, then compresses the output value to between 0 and 1 through the activation function. The closer the output value is to 1, the more information the memory cell retains; conversely, the closer to 0, the less information the memory cell retains.
The candidate gate computes the cell state of the current input; the specific formula may be:

C̃t = tanh(Wc·xt + Uc·ht-1 + bc)

where C̃t denotes the candidate cell state of the current input, whose output value the tanh activation function normalizes to between -1 and 1.
The output gate controls the amount of updated memory information passed to the next network layer; the formula may be expressed as:

Ot = σ(Wo·xt + Uo·ht-1 + bo)

where Ot denotes the amount of updated memory information for the next network layer.
The final output can then be computed by the LSTM unit; the formula may be expressed as:

ht = Ot × tanh(Ct)

where Ct is the cell state updated from the forget gate, the previous cell state, and the candidate cell state.
The final acoustic feature vectors or spectral feature vectors are computed from the forward and reverse passes; the formula may be expressed as:

hi = concat(hi→, hi←)

where hi→ is the forward output vector, hi← is the reverse output vector, and hi gives the final acoustic feature vectors or spectral feature vectors bearing class labels.
Further, the output layer of the LSTM can compute the value of the output unit Ci according to a preset decision function, where the value of the output unit Ci lies between 0 and 1: 1 denotes the speech class, and 0 denotes the non-speech class.
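As a concrete illustration, the gate computations above can be sketched as a single LSTM time step in NumPy. The parameter layout (one stacked weight matrix over the concatenated [ht-1, xt]) and the explicit input gate are conventional-LSTM assumptions, not details fixed by the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev, x_t] to the four
    stacked gate pre-activations; shapes are illustrative assumptions."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f_t = sigmoid(z[0:H])              # forget gate: how much of c_prev to keep
    i_t = sigmoid(z[H:2 * H])          # input gate on the candidate cell state
    c_hat = np.tanh(z[2 * H:3 * H])    # candidate cell state, in (-1, 1)
    o_t = sigmoid(z[3 * H:4 * H])      # output gate
    c_t = f_t * c_prev + i_t * c_hat   # updated cell state
    h_t = o_t * np.tanh(c_t)           # ht = Ot x tanh(Ct)
    return h_t, c_t
```

A bidirectional layer would run this step forward and backward over the frame sequence and concatenate the two outputs per frame, as in the formula above.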
Using the multiple acoustic feature vectors and spectral feature vectors bearing speech class labels, the terminal computes the probability that each acoustic feature and spectral feature belongs to the speech class and the non-speech class in the class reference table, extracts the class with the highest probability value for each acoustic feature vector and spectral feature vector, and adds to the acoustic feature vector or spectral feature vector the speech class label corresponding to that highest-probability class.
The terminal trains on the noisy speech data with speech class labels added, obtaining a preliminary classifier. The terminal obtains a first verification set containing multiple pieces of first speech data, inputs the first speech data to the classifier to obtain the corresponding class probabilities, and screens those probabilities. Annotators add speech class labels to the selected first speech data; the terminal obtains the labeled first speech data and uses it to generate a verification set with speech class labels added. The terminal trains again with the labeled verification set and the noisy speech data, obtaining a verification classifier. The terminal then obtains a second verification set containing multiple pieces of second speech data, inputs the second speech data to the verification classifier, and obtains the corresponding class probabilities. The terminal filters out the second speech data whose class probabilities fall within a preset range, relabels that second speech data, and retrains with the relabeled second speech data and the labeled noisy speech data to obtain a new classifier. Training continues until the probability values of a preset number of acoustic feature vectors or spectral feature vectors in all verification sets fall within the preset probability range, at which point training stops and the required classifier is obtained. A classifier with higher accuracy can thus be obtained, so that the acoustic feature vectors and spectral feature vectors can be classified accurately, and speech and non-speech can in turn be identified accurately.
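The training procedure above (preliminary classifier, screening a first verification set, retraining, then checking a second verification set) can be sketched as a self-training loop. The MeanClassifier below is a deliberately trivial stand-in for the BLSTM; only the loop structure mirrors the text, and the confidence and stopping thresholds are illustrative assumptions:

```python
import numpy as np

class MeanClassifier:
    """Toy stand-in for the LSTM classifier: a sigmoid of a frame's mean
    feature value minus a learned threshold. Placeholder model only."""
    def fit(self, X, y):
        pos = X[y == 1].mean() if np.any(y == 1) else 0.0
        neg = X[y == 0].mean() if np.any(y == 0) else 0.0
        self.thresh = (pos + neg) / 2.0
        return self

    def predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-(X.mean(axis=1) - self.thresh)))

def train_with_verification(X_train, y_train, X_ver1, X_ver2,
                            conf=0.9, target=0.8, max_rounds=5):
    clf = MeanClassifier().fit(X_train, y_train)     # preliminary classifier
    for _ in range(max_rounds):
        p1 = clf.predict_proba(X_ver1)               # screen first verification set
        keep = (p1 > conf) | (p1 < 1 - conf)         # confidently classified frames
        X_aug = np.vstack([X_train, X_ver1[keep]])
        y_aug = np.concatenate([y_train, (p1[keep] > 0.5).astype(int)])
        clf = MeanClassifier().fit(X_aug, y_aug)     # verification classifier
        p2 = clf.predict_proba(X_ver2)               # check second verification set
        if np.mean((p2 > target) | (p2 < 1 - target)) == 1.0:
            break                                    # probabilities in preset range
    return clf
```

Swapping the placeholder model for the BLSTM leaves the loop unchanged; the manual relabeling step in the text is approximated here by trusting the classifier's own confident predictions.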
In one embodiment, the step of classifying the acoustic feature vector and the spectral feature vector with the classifier includes: taking the acoustic feature vector and the spectral feature vector as the input of the classifier to obtain the decision value corresponding to each vector; when the decision value is a first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when the decision value is a second threshold, adding a non-speech label to the acoustic feature vector or spectral feature vector.
After the terminal obtains the noisy speech signal, it extracts the corresponding acoustic feature and spectral feature and converts them to the corresponding acoustic feature vector and spectral feature vector. After obtaining the classifier, the terminal inputs the acoustic feature vector and spectral feature vector to it. Once the classifier has classified the input acoustic feature vectors and spectral feature vectors, the decision value corresponding to each vector can be obtained. When the obtained decision value is the preset first threshold, the terminal adds a speech label to the acoustic feature vector or spectral feature vector; the first threshold may be a value range. When the obtained decision value is the preset second threshold, the terminal adds a non-speech label to the acoustic feature vector or spectral feature vector. By classifying the acoustic feature vectors and spectral feature vectors accurately with the classifier, the speech signal and the non-speech signal within the noisy speech signal can be identified accurately.
For example, the obtained decision value may be a value between 0 and 1; the preset first threshold may be 1, and the preset second threshold may be 0. When the obtained decision value is 1, the terminal adds a speech label to the acoustic feature vector or spectral feature vector; when the obtained decision value is 0, the terminal adds a non-speech label. The acoustic feature vectors and spectral feature vectors can thus be classified accurately.
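A minimal sketch of this labeling step, treating each threshold as a range as the text allows; the specific ranges chosen here (0.5-1 for speech, 0-0.5 for non-speech) are illustrative assumptions, since the text only fixes the endpoints 1 and 0:

```python
def label_frames(decisions, first=(0.5, 1.0), second=(0.0, 0.5)):
    """Map each decision value in [0, 1] to a label: values in the first
    threshold range get a speech label, values in the second a non-speech label."""
    labels = []
    for d in decisions:
        if first[0] <= d <= first[1]:
            labels.append('speech')
        elif second[0] <= d < second[1]:
            labels.append('non-speech')
        else:
            labels.append('unknown')  # outside both ranges (shouldn't occur in [0, 1])
    return labels
```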
In one embodiment, as shown in Fig. 2, a voice endpoint detection device is provided, including an extraction module 202, a conversion module 204, a classification module 206, and a parsing module 208, wherein:

The extraction module 202 is used to obtain a noisy speech signal and extract the acoustic feature and spectral feature corresponding to the noisy speech signal.

The conversion module 204 is used to convert the acoustic feature and the spectral feature, obtaining the corresponding acoustic feature vector and spectral feature vector.

The classification module 206 is used to obtain a classifier and input the acoustic feature vector and spectral feature vector to the classifier, obtaining the acoustic feature vector with a speech label added and the spectral feature vector with a speech label added.

The parsing module 208 is used to parse the acoustic feature vector with the speech label added and the spectral feature vector with the speech label added, obtaining the corresponding speech signal; and to determine the starting point and ending point corresponding to the speech signal according to the temporal sequence of the speech signal.
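The parsing step, recovering speech segments and their starting and ending points from the sequence of per-frame speech labels, can be sketched as follows; the list-of-pairs return format is an illustrative assumption:

```python
def find_endpoints(labels):
    """From a per-frame label sequence (1 = speech, 0 = non-speech), return
    (start, end) frame-index pairs for each contiguous speech segment."""
    segments, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i                      # speech segment begins
        elif lab == 0 and start is not None:
            segments.append((start, i - 1))  # speech segment ended at previous frame
            start = None
    if start is not None:                  # speech runs to the final frame
        segments.append((start, len(labels) - 1))
    return segments
```

Multiplying frame indices by the frame shift (e.g. 10 ms) converts these endpoints back to times in the signal.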
In one embodiment, the extraction module 202 is further used to convert the noisy speech signal to a noisy speech spectrum, and to perform time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum, obtaining the acoustic feature corresponding to the noisy speech signal.
In one embodiment, the extraction module 202 is further used to convert the noisy speech signal to a noisy speech spectrum and compute the noisy speech amplitude spectrum from it; to perform dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, obtaining a noise amplitude spectrum; to estimate the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and to generate the spectral feature corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.
In one embodiment, the conversion module 204 is further used to extract a preset number of frames before and after the current frame in the acoustic feature and the spectral feature; to compute the mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and to perform a log-domain conversion on the acoustic feature and spectral feature after computing the mean vector and/or variance vector corresponding to the current frame, obtaining the converted acoustic feature vector and spectral feature vector.
In one embodiment, the device further includes a training module, used to obtain noisy speech data with speech class labels added and train on the noisy speech data, obtaining a preliminary classifier; to obtain a first verification set containing multiple pieces of first speech data; to input the first speech data to the preliminary classifier, obtaining the corresponding class probabilities; to screen those class probabilities and add class labels to the selected first speech data, obtaining a verification set with class labels added; to train with the labeled verification set and the noisy speech data with speech class labels added, obtaining a verification classifier; to obtain a second verification set containing multiple pieces of second speech data; to input the second speech data to the verification classifier, obtaining the corresponding class probabilities; and, when the class probabilities corresponding to the second speech data reach a preset probability value, to obtain the required classifier.
In one embodiment, the classification module 206 is further used to take the acoustic feature vector and spectral feature vector as the input of the classifier, obtaining the corresponding decision values; to add a speech label to the acoustic feature vector or spectral feature vector when the decision value is a first threshold; and to add a non-speech label to the acoustic feature vector or spectral feature vector when the decision value is a second threshold.
In one embodiment, a computer device is provided; the computer device may be a terminal, and its internal structure may be as shown in Fig. 3. For example, the terminal may be, but is not limited to, any device with a speech input function, such as a smartphone, a tablet computer, a laptop, a personal computer, or a portable wearable device. The computer device includes a processor, a memory, a network interface, and a speech input device connected through a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program; the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer program, when executed by the processor, implements a voice endpoint detection method. The speech input device of the computer device may include a microphone, and may also include an external earphone or the like.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of part of the structure relevant to the present solution and does not limit the device to which the present solution is applied; a specific device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the following steps: obtaining a noisy speech signal and extracting the acoustic feature and spectral feature corresponding to the noisy speech signal; converting the acoustic feature and spectral feature, obtaining the corresponding acoustic feature vector and spectral feature vector; obtaining a classifier and inputting the acoustic feature vector and spectral feature vector to the classifier, obtaining the acoustic feature vector with a speech label added and the spectral feature vector with a speech label added; parsing the labeled acoustic feature vector and spectral feature vector, obtaining the corresponding speech signal; and determining the starting point and ending point corresponding to the speech signal according to the temporal sequence of the speech signal.

In one embodiment, when executing the computer program, the processor also implements the following steps: converting the noisy speech signal to a noisy speech spectrum; performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum, obtaining the acoustic feature corresponding to the noisy speech signal.

In one embodiment, when executing the computer program, the processor also implements the following steps: converting the noisy speech signal to a noisy speech spectrum and computing the noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, obtaining a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral feature corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, when executing the computer program, the processor also implements the following steps: extracting a preset number of frames before and after the current frame in the acoustic feature and the spectral feature; computing the mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and performing a log-domain conversion on the acoustic feature and spectral feature after computing the mean vector and/or variance vector corresponding to the current frame, obtaining the converted acoustic feature vector and spectral feature vector.

In one embodiment, when executing the computer program, the processor also implements the following steps: obtaining noisy speech data with speech class labels added and training on the noisy speech data, obtaining a preliminary classifier; obtaining a first verification set containing multiple pieces of first speech data; inputting the first speech data to the classifier, obtaining the corresponding class probabilities; screening those class probabilities and adding class labels to the selected first speech data, obtaining a verification set with class labels added; training with the labeled verification set and the training set, obtaining a verification classifier; obtaining a second verification set containing multiple pieces of second speech data; inputting the second speech data to the verification classifier, obtaining the corresponding class probabilities; and, when the class probabilities corresponding to the second speech data reach a preset probability value, obtaining the required classifier.

In one embodiment, when executing the computer program, the processor also implements the following steps: taking the acoustic feature vector and spectral feature vector as the input of the classifier, obtaining the corresponding decision values; adding a speech label to the acoustic feature vector or spectral feature vector when the decision value is a first threshold; and adding a non-speech label to the acoustic feature vector or spectral feature vector when the decision value is a second threshold.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps: obtaining a noisy speech signal and extracting the acoustic feature and spectral feature corresponding to the noisy speech signal; converting the acoustic feature and spectral feature, obtaining the corresponding acoustic feature vector and spectral feature vector; obtaining a classifier and inputting the acoustic feature vector and spectral feature vector to the classifier, obtaining the acoustic feature vector with a speech label added and the spectral feature vector with a speech label added; parsing the labeled acoustic feature vector and spectral feature vector, obtaining the corresponding speech signal; and determining the starting point and ending point corresponding to the speech signal according to the temporal sequence of the speech signal.

In one embodiment, when executed by the processor, the computer program also implements the following steps: converting the noisy speech signal to a noisy speech spectrum; performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum, obtaining the acoustic feature corresponding to the noisy speech signal.

In one embodiment, when executed by the processor, the computer program also implements the following steps: converting the noisy speech signal to a noisy speech spectrum and computing the noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, obtaining a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral feature corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, when executed by the processor, the computer program also implements the following steps: extracting a preset number of frames before and after the current frame in the acoustic feature and the spectral feature; computing the mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and performing a log-domain conversion on the acoustic feature and spectral feature after computing the mean vector and/or variance vector corresponding to the current frame, obtaining the converted acoustic feature vector and spectral feature vector.

In one embodiment, when executed by the processor, the computer program also implements the following steps: obtaining noisy speech data with speech class labels added and training on the noisy speech data, obtaining a preliminary classifier; obtaining a first verification set containing multiple pieces of first speech data; inputting the first speech data to the classifier, obtaining the corresponding class probabilities; screening those class probabilities and adding class labels to the selected first speech data, obtaining a verification set with class labels added; training with the labeled verification set and the training set, obtaining a verification classifier; obtaining a second verification set containing multiple pieces of second speech data; inputting the second speech data to the verification classifier, obtaining the corresponding class probabilities; and, when the class probabilities corresponding to the second speech data reach a preset probability value, obtaining the required classifier.

In one embodiment, when executed by the processor, the computer program also implements the following steps: taking the acoustic feature vector and spectral feature vector as the input of the classifier, obtaining the corresponding decision values; adding a speech label to the acoustic feature vector or spectral feature vector when the decision value is a first threshold; and adding a non-speech label to the acoustic feature vector or spectral feature vector when the decision value is a second threshold.
Those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments may be completed by instructing relevant hardware through a computer program; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed may include the flows of the embodiments of each of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For conciseness of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The embodiments described above express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent should be determined by the appended claims.
Claims (10)
1. A voice endpoint detection method, comprising:
acquiring a noisy speech signal, and extracting an acoustic feature and a spectral feature corresponding to the noisy speech signal;
converting the acoustic feature and the spectral feature to obtain a corresponding acoustic feature vector and spectral feature vector;
acquiring a classifier, and inputting the acoustic feature vector and the spectral feature vector into the classifier to obtain a voice-labeled acoustic feature vector and a voice-labeled spectral feature vector;
parsing the voice-labeled acoustic feature vector and the voice-labeled spectral feature vector to obtain a corresponding voice signal; and
determining a starting point and an ending point of the voice signal according to the time sequence of the voice signal.
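The last step of claim 1 — recovering starting and ending points from the time sequence of labelled frames — can be sketched as a run-length scan over per-frame voice/non-voice labels. This is an illustrative sketch, not the patented implementation; the frame indexing convention is an assumption.

```python
def endpoints_from_labels(labels):
    """Return (start_frame, end_frame) pairs for each run of voice frames
    (label 1) in time order; end_frame is exclusive."""
    segments, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i                          # a speech segment begins here
        elif lab == 0 and start is not None:
            segments.append((start, i))        # segment ended on previous frame
            start = None
    if start is not None:                      # speech runs to the end of the signal
        segments.append((start, len(labels)))
    return segments

labels = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
print(endpoints_from_labels(labels))   # → [(2, 5), (7, 9)]
```

Multiplying the frame indices by the analysis hop size converts them to times.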
2. The method according to claim 1, characterized in that, before the extracting an acoustic feature and a spectral feature corresponding to the noisy speech signal, the method further comprises:
converting the noisy speech signal into a noisy speech spectrum; and
performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic feature corresponding to the noisy speech signal.
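As one concrete instance of the "time-domain analysis" named in claim 2, two classic frame-level acoustic features are short-time energy and zero-crossing rate. The patent does not fix which features are used; this sketch assumes these two, with windowing omitted.

```python
import math

def short_time_energy(frame):
    """Sum of squared samples over one analysis frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# toy frame: 5 cycles of a sinusoid over 100 samples (half-sample phase
# offset avoids samples landing exactly on zero)
frame = [math.sin(2 * math.pi * 5 * (t + 0.5) / 100) for t in range(100)]
print(round(short_time_energy(frame), 2))   # → 50.0
print(zero_crossing_rate(frame))
```

Voiced speech tends to show high energy and low zero-crossing rate; unvoiced or noise-like frames show the opposite, which is why these two features are a common basis for endpoint decisions.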
3. The method according to claim 1, characterized in that, before the extracting an acoustic feature and a spectral feature corresponding to the noisy speech signal, the method further comprises:
converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum;
performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum;
estimating a speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and
generating the spectral feature corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.
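The flow of claim 3 can be sketched with simple stand-ins: a running minimum over frames as a crude form of dynamic noise estimation, and magnitude spectral subtraction with half-wave rectification to recover the clean speech amplitude spectrum. The patent does not specify these exact estimators; both are assumptions for illustration.

```python
import numpy as np

def estimate_clean_amplitude(noisy_frames):
    """noisy_frames: (n_frames, n_bins) STFT magnitudes of the noisy speech.
    Returns (clean amplitude spectrum, noise amplitude spectrum)."""
    mag = np.abs(noisy_frames)
    # dynamic noise estimate: per-bin running minimum over time
    noise = np.minimum.accumulate(mag, axis=0)
    # spectral subtraction, floored at zero (half-wave rectification)
    clean = np.maximum(mag - noise, 0.0)
    return clean, noise

mag = np.array([[3.0, 2.0],
                [5.0, 1.0],
                [4.0, 6.0]])
clean, noise = estimate_clean_amplitude(mag)
print(noise)   # per-bin running minimum of the magnitudes
print(clean)   # noisy magnitudes minus the noise track, clipped at 0
```

The three spectra (noisy, noise, clean) together then form the spectral feature fed to the classifier.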
4. The method according to claim 1, characterized in that the converting the acoustic feature and the spectral feature comprises:
extracting a preset number of frames before and after a current frame in the acoustic feature and the spectral feature;
calculating a mean vector and/or a variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and
performing log-domain conversion on the acoustic feature and the spectral feature after calculating the mean vector and/or the variance vector corresponding to the current frame, to obtain the converted acoustic feature vector and spectral feature vector.
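Claim 4's conversion can be sketched as follows: for each frame, gather a preset number of context frames on each side, compute the windowed mean and variance vectors, then apply a log-domain conversion. The window size `k` and the exact use of the statistics are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def context_log_features(feats, k=2, eps=1e-8):
    """feats: (n_frames, dim) feature matrix; k: context frames on each side.
    Returns per-frame log-domain mean/variance vectors, shape (n_frames, 2*dim)."""
    n = len(feats)
    out = []
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)   # clip window at the edges
        window = feats[lo:hi]
        mean = window.mean(axis=0)                  # mean vector for the current frame
        var = window.var(axis=0)                    # variance vector for the current frame
        # log-domain conversion of the per-frame statistics (eps avoids log(0))
        out.append(np.log(np.concatenate([mean, var]) + eps))
    return np.stack(out)

feats = np.arange(12, dtype=float).reshape(6, 2) + 1.0   # toy 6-frame, 2-dim features
vecs = context_log_features(feats, k=1)
print(vecs.shape)   # → (6, 4): log-mean and log-variance per dimension
```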
5. The method according to claim 1, characterized in that, before the step of acquiring a classifier, the method further comprises:
acquiring noisy speech data to which voice class labels have been added, and training on the noisy speech data to obtain an initial classifier;
acquiring a first verification set, the first verification set comprising a plurality of pieces of first voice data;
inputting the plurality of pieces of first voice data into the initial classifier to obtain classification probabilities corresponding to the plurality of pieces of first voice data;
screening the classification probabilities corresponding to the plurality of pieces of first voice data, and adding class labels to the selected first voice data to obtain a class-labeled verification set;
training using the class-labeled verification set and the class-labeled noisy speech data to obtain a verification classifier;
acquiring a second verification set, the second verification set comprising a plurality of pieces of second speech data;
inputting the plurality of pieces of second speech data into the verification classifier to obtain classification probabilities corresponding to the plurality of pieces of second speech data; and
obtaining the required classifier when the classification probabilities corresponding to the plurality of pieces of second speech data reach a predetermined probability value.
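Claim 5 describes a self-training loop: an initial classifier labels a first verification set, the confidently labelled samples are folded back into training, and the retrained classifier is accepted once a second verification set is classified above a predetermined probability. The sketch below uses a deliberately tiny stand-in classifier (nearest class mean on 1-D features); the patent's actual classifier, probability model, thresholds, and data are not specified here and are all illustrative assumptions.

```python
import statistics

class MeanClassifier:
    """Toy binary classifier: predicts via distance to the two class means."""
    def fit(self, xs, ys):
        self.m0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
        self.m1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
        return self

    def predict_proba(self, x):
        d0, d1 = abs(x - self.m0), abs(x - self.m1)
        return d0 / (d0 + d1) if d0 + d1 else 0.5   # pseudo-probability of class 1

# 1) initial classifier trained on labelled noisy speech data
xs, ys = [0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1]
clf = MeanClassifier().fit(xs, ys)

# 2) screen a first verification set by classification probability;
#    keep only confidently classified samples and label them
first_set = [0.05, 0.5, 0.95]
confident = [(x, round(clf.predict_proba(x))) for x in first_set
             if abs(clf.predict_proba(x) - 0.5) > 0.3]

# 3) retrain on the original labels plus the newly labelled verification data
xs2 = xs + [x for x, _ in confident]
ys2 = ys + [y for _, y in confident]
verified = MeanClassifier().fit(xs2, ys2)

# 4) accept the classifier once a second verification set is classified
#    with probabilities beyond the predetermined margin
second_set = [0.0, 1.1]
ok = all(abs(verified.predict_proba(x) - 0.5) > 0.3 for x in second_set)
print(ok)   # → True
```

The same loop structure applies unchanged when the stand-in classifier is replaced by a neural network emitting real class probabilities.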
6. The method according to any one of claims 1 to 5, characterized in that the step of classifying the acoustic feature vector and the spectral feature vector using the classifier comprises:
taking the acoustic feature vector and the spectral feature vector as inputs of the classifier to obtain decision values corresponding to the acoustic feature vector and the spectral feature vector;
adding a voice label to the acoustic feature vector or the spectral feature vector when the decision value is a first threshold; and
adding a non-voice label to the acoustic feature vector or the spectral feature vector when the decision value is a second threshold.
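Claim 6 reduces to attaching a label by comparing each decision value against the two thresholds. The sketch below assumes a binary classifier whose decision values coincide with the thresholds (1 = voice, 0 = non-voice); the actual threshold values are not given by the claim.

```python
VOICE, NON_VOICE = "voice", "non-voice"

def label_vectors(decision_values, first_threshold=1, second_threshold=0):
    """Attach a voice/non-voice label to each vector's decision value."""
    labels = []
    for d in decision_values:
        if d == first_threshold:
            labels.append(VOICE)
        elif d == second_threshold:
            labels.append(NON_VOICE)
        else:
            labels.append(None)   # neither threshold matched: left undecided
    return labels

print(label_vectors([1, 0, 1]))   # → ['voice', 'non-voice', 'voice']
```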
7. A voice endpoint detection device, comprising:
an extraction module, configured to acquire a noisy speech signal and extract an acoustic feature and a spectral feature corresponding to the noisy speech signal;
a conversion module, configured to convert the acoustic feature and the spectral feature to obtain a corresponding acoustic feature vector and spectral feature vector;
a classification module, configured to acquire a classifier and input the acoustic feature vector and the spectral feature vector into the classifier to obtain a voice-labeled acoustic feature vector and a voice-labeled spectral feature vector; and
a parsing module, configured to parse the voice-labeled acoustic feature vector and the voice-labeled spectral feature vector to obtain a corresponding voice signal, and to determine a starting point and an ending point of the voice signal according to the time sequence of the voice signal.
8. The device according to claim 7, characterized in that the conversion module is further configured to: extract a preset number of frames before and after a current frame in the acoustic feature and the spectral feature; calculate a mean vector and/or a variance vector of the current frame using the preset number of frames before and after the current frame; and perform log-domain conversion on the acoustic feature and the spectral feature after calculating the mean vector and/or the variance vector corresponding to the current frame, to obtain the converted acoustic feature vector and spectral feature vector.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810048223.3A CN108198547B (en) | 2018-01-18 | 2018-01-18 | Voice endpoint detection method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810048223.3A CN108198547B (en) | 2018-01-18 | 2018-01-18 | Voice endpoint detection method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108198547A true CN108198547A (en) | 2018-06-22 |
CN108198547B CN108198547B (en) | 2020-10-23 |
Family
ID=62589616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810048223.3A Active CN108198547B (en) | 2018-01-18 | 2018-01-18 | Voice endpoint detection method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108198547B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108922556A (en) * | 2018-07-16 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | sound processing method, device and equipment |
CN109036471A (en) * | 2018-08-20 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method and equipment |
CN110265032A (en) * | 2019-06-05 | 2019-09-20 | 平安科技(深圳)有限公司 | Conferencing data analysis and processing method, device, computer equipment and storage medium |
CN110322872A (en) * | 2019-06-05 | 2019-10-11 | 平安科技(深圳)有限公司 | Conference voice data processing method, device, computer equipment and storage medium |
CN110415704A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium are put down in court's trial |
CN110428853A (en) * | 2019-08-30 | 2019-11-08 | 北京太极华保科技股份有限公司 | Voice activity detection method, Voice activity detection device and electronic equipment |
CN110752973A (en) * | 2018-07-24 | 2020-02-04 | Tcl集团股份有限公司 | Terminal equipment control method and device and terminal equipment |
CN110808061A (en) * | 2019-11-11 | 2020-02-18 | 广州国音智能科技有限公司 | Voice separation method and device, mobile terminal and computer readable storage medium |
CN110910906A (en) * | 2019-11-12 | 2020-03-24 | 国网山东省电力公司临沂供电公司 | Audio endpoint detection and noise reduction method based on power intranet |
CN111179972A (en) * | 2019-12-12 | 2020-05-19 | 中山大学 | Human voice detection algorithm based on deep learning |
CN111192600A (en) * | 2019-12-27 | 2020-05-22 | 北京网众共创科技有限公司 | Sound data processing method and device, storage medium and electronic device |
WO2020173488A1 (en) * | 2019-02-28 | 2020-09-03 | 北京字节跳动网络技术有限公司 | Audio starting point detection method and apparatus |
CN111626061A (en) * | 2020-05-27 | 2020-09-04 | 深圳前海微众银行股份有限公司 | Conference record generation method, device, equipment and readable storage medium |
CN111916060A (en) * | 2020-08-12 | 2020-11-10 | 四川长虹电器股份有限公司 | Deep learning voice endpoint detection method and system based on spectral subtraction |
CN112652324A (en) * | 2020-12-28 | 2021-04-13 | 深圳万兴软件有限公司 | Speech enhancement optimization method, speech enhancement optimization system and readable storage medium |
CN113327626A (en) * | 2021-06-23 | 2021-08-31 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device, equipment and storage medium |
CN113744725A (en) * | 2021-08-19 | 2021-12-03 | 清华大学苏州汽车研究院(相城) | Training method of voice endpoint detection model and voice noise reduction method |
CN114974258A (en) * | 2022-07-27 | 2022-08-30 | 深圳市北科瑞声科技股份有限公司 | Speaker separation method, device, equipment and storage medium based on voice processing |
CN115497511A (en) * | 2022-10-31 | 2022-12-20 | 广州方硅信息技术有限公司 | Method, device, equipment and medium for training and detecting voice activity detection model |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004272201A (en) * | 2002-09-27 | 2004-09-30 | Matsushita Electric Ind Co Ltd | Method and device for detecting speech end point |
US20060155537A1 (en) * | 2005-01-12 | 2006-07-13 | Samsung Electronics Co., Ltd. | Method and apparatus for discriminating between voice and non-voice using sound model |
CN101308653A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | End-point detecting method applied to speech identification system |
US20090254341A1 (en) * | 2008-04-03 | 2009-10-08 | Kabushiki Kaisha Toshiba | Apparatus, method, and computer program product for judging speech/non-speech |
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN103489454A (en) * | 2013-09-22 | 2014-01-01 | 浙江大学 | Voice endpoint detection method based on waveform morphological characteristic clustering |
CN103730124A (en) * | 2013-12-31 | 2014-04-16 | 上海交通大学无锡研究院 | Noise robustness endpoint detection method based on likelihood ratio test |
CN104021789A (en) * | 2014-06-25 | 2014-09-03 | 厦门大学 | Self-adaption endpoint detection method using short-time time-frequency value |
CN105023572A (en) * | 2014-04-16 | 2015-11-04 | 王景芳 | Noised voice end point robustness detection method |
CN105118502A (en) * | 2015-07-14 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | End point detection method and system of voice identification system |
US20160379632A1 (en) * | 2015-06-29 | 2016-12-29 | Amazon Technologies, Inc. | Language model speech endpointing |
CN107393526A (en) * | 2017-07-19 | 2017-11-24 | 腾讯科技(深圳)有限公司 | Speech silence detection method, device, computer equipment and storage medium |
2018-01-18: Application CN201810048223.3A filed in China; granted as CN108198547B (status: Active)
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004272201A (en) * | 2002-09-27 | 2004-09-30 | Matsushita Electric Ind Co Ltd | Method and device for detecting speech end point |
US20060155537A1 (en) * | 2005-01-12 | 2006-07-13 | Samsung Electronics Co., Ltd. | Method and apparatus for discriminating between voice and non-voice using sound model |
US20090254341A1 (en) * | 2008-04-03 | 2009-10-08 | Kabushiki Kaisha Toshiba | Apparatus, method, and computer program product for judging speech/non-speech |
CN101308653A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | End-point detecting method applied to speech identification system |
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN103489454A (en) * | 2013-09-22 | 2014-01-01 | 浙江大学 | Voice endpoint detection method based on waveform morphological characteristic clustering |
CN103730124A (en) * | 2013-12-31 | 2014-04-16 | 上海交通大学无锡研究院 | Noise robustness endpoint detection method based on likelihood ratio test |
CN105023572A (en) * | 2014-04-16 | 2015-11-04 | 王景芳 | Noised voice end point robustness detection method |
CN104021789A (en) * | 2014-06-25 | 2014-09-03 | 厦门大学 | Self-adaption endpoint detection method using short-time time-frequency value |
US20160379632A1 (en) * | 2015-06-29 | 2016-12-29 | Amazon Technologies, Inc. | Language model speech endpointing |
CN105118502A (en) * | 2015-07-14 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | End point detection method and system of voice identification system |
CN107393526A (en) * | 2017-07-19 | 2017-11-24 | 腾讯科技(深圳)有限公司 | Speech silence detection method, device, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Bao Wujie: "Voice Endpoint Detection Based on a Speech Enhancement Method", Modern Electronics Technique *
Wang Lucai: "An Improved Endpoint Detection Method for Noisy Speech", Computer Engineering and Applications *
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108922556B (en) * | 2018-07-16 | 2019-08-27 | 百度在线网络技术(北京)有限公司 | Sound processing method, device and equipment |
CN108922556A (en) * | 2018-07-16 | 2018-11-30 | 百度在线网络技术(北京)有限公司 | sound processing method, device and equipment |
CN110752973A (en) * | 2018-07-24 | 2020-02-04 | Tcl集团股份有限公司 | Terminal equipment control method and device and terminal equipment |
CN110752973B (en) * | 2018-07-24 | 2020-12-25 | Tcl科技集团股份有限公司 | Terminal equipment control method and device and terminal equipment |
CN109036471A (en) * | 2018-08-20 | 2018-12-18 | 百度在线网络技术(北京)有限公司 | Sound end detecting method and equipment |
WO2020173488A1 (en) * | 2019-02-28 | 2020-09-03 | 北京字节跳动网络技术有限公司 | Audio starting point detection method and apparatus |
US12119023B2 (en) | 2019-02-28 | 2024-10-15 | Beijing Bytedance Network Technology Co., Ltd. | Audio onset detection method and apparatus |
CN110265032A (en) * | 2019-06-05 | 2019-09-20 | 平安科技(深圳)有限公司 | Conferencing data analysis and processing method, device, computer equipment and storage medium |
CN110322872A (en) * | 2019-06-05 | 2019-10-11 | 平安科技(深圳)有限公司 | Conference voice data processing method, device, computer equipment and storage medium |
CN110415704A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium are put down in court's trial |
CN110428853A (en) * | 2019-08-30 | 2019-11-08 | 北京太极华保科技股份有限公司 | Voice activity detection method, Voice activity detection device and electronic equipment |
CN110808061B (en) * | 2019-11-11 | 2022-03-15 | 广州国音智能科技有限公司 | Voice separation method and device, mobile terminal and computer readable storage medium |
CN110808061A (en) * | 2019-11-11 | 2020-02-18 | 广州国音智能科技有限公司 | Voice separation method and device, mobile terminal and computer readable storage medium |
CN110910906A (en) * | 2019-11-12 | 2020-03-24 | 国网山东省电力公司临沂供电公司 | Audio endpoint detection and noise reduction method based on power intranet |
CN111179972A (en) * | 2019-12-12 | 2020-05-19 | 中山大学 | Human voice detection algorithm based on deep learning |
CN111192600A (en) * | 2019-12-27 | 2020-05-22 | 北京网众共创科技有限公司 | Sound data processing method and device, storage medium and electronic device |
CN111626061A (en) * | 2020-05-27 | 2020-09-04 | 深圳前海微众银行股份有限公司 | Conference record generation method, device, equipment and readable storage medium |
CN111916060A (en) * | 2020-08-12 | 2020-11-10 | 四川长虹电器股份有限公司 | Deep learning voice endpoint detection method and system based on spectral subtraction |
CN111916060B (en) * | 2020-08-12 | 2022-03-01 | 四川长虹电器股份有限公司 | Deep learning voice endpoint detection method and system based on spectral subtraction |
CN112652324A (en) * | 2020-12-28 | 2021-04-13 | 深圳万兴软件有限公司 | Speech enhancement optimization method, speech enhancement optimization system and readable storage medium |
CN113327626A (en) * | 2021-06-23 | 2021-08-31 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device, equipment and storage medium |
CN113327626B (en) * | 2021-06-23 | 2023-09-08 | 深圳市北科瑞声科技股份有限公司 | Voice noise reduction method, device, equipment and storage medium |
CN113744725A (en) * | 2021-08-19 | 2021-12-03 | 清华大学苏州汽车研究院(相城) | Training method of voice endpoint detection model and voice noise reduction method |
CN113744725B (en) * | 2021-08-19 | 2024-07-05 | 清华大学苏州汽车研究院(相城) | Training method of voice endpoint detection model and voice noise reduction method |
CN114974258A (en) * | 2022-07-27 | 2022-08-30 | 深圳市北科瑞声科技股份有限公司 | Speaker separation method, device, equipment and storage medium based on voice processing |
CN115497511A (en) * | 2022-10-31 | 2022-12-20 | 广州方硅信息技术有限公司 | Method, device, equipment and medium for training and detecting voice activity detection model |
Also Published As
Publication number | Publication date |
---|---|
CN108198547B (en) | 2020-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108198547A (en) | Sound end detecting method, device, computer equipment and storage medium | |
CN108877775B (en) | Voice data processing method and device, computer equipment and storage medium | |
EP3955246B1 (en) | Voiceprint recognition method and device based on memory bottleneck feature | |
US9792897B1 (en) | Phoneme-expert assisted speech recognition and re-synthesis | |
Reynolds | An overview of automatic speaker recognition technology | |
CN111145786A (en) | Speech emotion recognition method and device, server and computer readable storage medium | |
Revathi et al. | Speaker independent continuous speech and isolated digit recognition using VQ and HMM | |
Kim et al. | Robust DTW-based recognition algorithm for hand-held consumer devices | |
Mandel et al. | Audio super-resolution using concatenative resynthesis | |
CN112216285B (en) | Multi-user session detection method, system, mobile terminal and storage medium | |
WO2003065352A1 (en) | Method and apparatus for speech detection using time-frequency variance | |
CN111429919B (en) | Crosstalk prevention method based on conference real recording system, electronic device and storage medium | |
Gambhir et al. | Residual networks for text-independent speaker identification: Unleashing the power of residual learning | |
Muralikrishna et al. | HMM based isolated Kannada digit recognition system using MFCC | |
Petrovska-Delacrétaz et al. | Text-independent speaker verification: state of the art and challenges | |
Mardhotillah et al. | Speaker recognition for digital forensic audio analysis using support vector machine | |
Nasibov | Decision fusion of voice activity detectors | |
Shome et al. | A robust technique for end point detection under practical environment | |
CN113658599A (en) | Conference record generation method, device, equipment and medium based on voice recognition | |
Tzudir et al. | Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients | |
Alkhatib et al. | ASR Features Extraction Using MFCC And LPC: A Comparative Study | |
Sharma et al. | Speech recognition of Punjabi numerals using synergic HMM and DTW approach | |
Gbadamosi | Text independent biometric speaker recognition system | |
Marković et al. | Recognition of normal and whispered speech based on RASTA filtering and DTW algorithm | |
Yousafzai et al. | Tuning support vector machines for robust phoneme classification with acoustic waveforms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||