CN108198547A - Voice endpoint detection method, apparatus, computer device and storage medium - Google Patents

Voice endpoint detection method, apparatus, computer device and storage medium

Info

Publication number
CN108198547A
CN108198547A
Authority
CN
China
Prior art keywords
acoustic feature
voice
spectral feature
noisy speech
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810048223.3A
Other languages
Chinese (zh)
Other versions
CN108198547B (en)
Inventor
黄石磊
刘轶
王昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Beike Risound Polytron Technologies Inc
Original Assignee
Shenzhen Beike Risound Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Beike Risound Polytron Technologies Inc filed Critical Shenzhen Beike Risound Polytron Technologies Inc
Priority to CN201810048223.3A priority Critical patent/CN108198547B/en
Publication of CN108198547A publication Critical patent/CN108198547A/en
Application granted granted Critical
Publication of CN108198547B publication Critical patent/CN108198547B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032 - Quantisation or dequantisation of spectral components
    • G10L 19/038 - Vector quantisation, e.g. TwinVQ audio
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

This application relates to a voice endpoint detection method, an apparatus, a computer device and a storage medium. The method includes: obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal; transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added; parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal; and determining the start point and end point of the speech signal according to the time sequence of the speech signal. This method can effectively improve the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, apparatus, computer device and storage medium
Technical field
This application relates to the field of signal processing technology, and in particular to a voice endpoint detection method, apparatus, computer device and storage medium.
Background technology
With the continuous development of speech technology, voice endpoint detection has come to occupy a highly important position in speech recognition. Voice endpoint detection locates the start point and end point of the speech portion within a segment of continuous noisy speech, so that the speech can be recognized efficiently.
There are two traditional approaches to voice endpoint detection. The first exploits the differing time-domain and frequency-domain characteristics of speech and noise signals: features are extracted from each signal segment and compared with a preset threshold to detect voice endpoints. However, this approach is only applicable under stationary noise conditions; its noise robustness is poor and it struggles to distinguish clean speech from noise, so the accuracy of endpoint detection is low. The second approach is based on neural networks, using a trained model to perform endpoint detection on the speech signal. However, the input vectors of most such models contain only features of the noisy speech, so their noise robustness is also poor, resulting in low endpoint detection accuracy. How to effectively improve the accuracy of voice endpoint detection has therefore become a technical problem that needs to be addressed.
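The traditional threshold approach described above can be illustrated with a minimal sketch: a short-time energy detector that flags each frame as speech or non-speech. The frame length and threshold below are illustrative values, not taken from the patent.

```python
import numpy as np

def energy_threshold_vad(signal, frame_len=160, threshold=0.01):
    """Classify each frame as speech (True) or non-speech (False)
    by comparing its short-time energy with a fixed threshold, as in
    the traditional method described above."""
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)  # average power of the frame
        flags.append(bool(energy > threshold))
    return flags
```

As the background notes, such a fixed threshold only works under stationary noise: any noise segment whose energy exceeds the threshold is misclassified as speech.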
Summary of the invention
In view of the above technical problem, it is necessary to provide a voice endpoint detection method, apparatus, computer device and storage medium that can effectively improve the accuracy of voice endpoint detection.
A voice endpoint detection method, including:
Obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
Transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
Obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
Parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal;
Determining the start point and end point of the speech signal according to the time sequence of the speech signal.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes:
Converting the noisy speech signal into a noisy speech spectrum;
Performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes:
Converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum;
Performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, to obtain a noise amplitude spectrum;
Estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum;
Generating the spectral features corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum and the speech amplitude spectrum.
In one embodiment, transforming the acoustic features and spectral features includes:
Extracting a preset number of frames before and after the current frame in the acoustic features and the spectral features;
Calculating a mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame;
Performing a log-domain transformation on the acoustic features and spectral features after the mean vector and/or variance vector of the current frame is calculated, to obtain the transformed acoustic feature vectors and spectral feature vectors.
In one embodiment, before the step of obtaining a classifier, the method further includes:
Obtaining noisy speech data with speech class labels added, and training on the noisy speech data to obtain a preliminary classifier;
Obtaining a first validation set, the first validation set containing multiple items of first speech data;
Inputting the multiple items of first speech data into the classifier, to obtain classification probabilities corresponding to the multiple items of first speech data;
Screening the classification probabilities corresponding to the multiple items of first speech data, adding class labels to the selected first speech data, and obtaining a validation set with class labels added;
Training with the labeled validation set and the training set, to obtain a validated classifier;
Obtaining a second validation set, the second validation set containing multiple items of second speech data;
Inputting the multiple items of second speech data into the validated classifier, to obtain classification probabilities corresponding to the multiple items of second speech data;
When the classification probabilities corresponding to the multiple items of second speech data reach a preset probability value, obtaining the required classifier.
In one embodiment, the step of classifying the acoustic feature vectors and spectral feature vectors using the classifier includes:
Taking the acoustic feature vectors and spectral feature vectors as input to the classifier, and obtaining decision values corresponding to the acoustic feature vectors and spectral feature vectors;
When a decision value equals a first threshold, adding a speech label to the acoustic feature vector or spectral feature vector;
When a decision value equals a second threshold, adding a non-speech label to the acoustic feature vector or spectral feature vector.
A voice endpoint detection apparatus, including:
an extraction module, for obtaining a noisy speech signal and extracting acoustic features and spectral features corresponding to the noisy speech signal;
a conversion module, for transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
a classification module, for obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into the classifier, to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
a parsing module, for parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal, and determining the start point and end point of the speech signal according to the time sequence of the speech signal.
In one embodiment, the conversion module is further configured to extract a preset number of frames before and after the current frame in the acoustic features and the spectral features; to calculate a mean vector and/or variance vector corresponding to the current frame using those frames; and to perform a log-domain transformation on the acoustic features and spectral features after the mean vector and/or variance vector is calculated, obtaining the transformed acoustic feature vectors and spectral feature vectors.
A computer device, including a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
Obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
Transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
Obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
Parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal;
Determining the start point and end point of the speech signal according to the time sequence of the speech signal.
A computer-readable storage medium on which a computer program is stored, the computer program implementing the following steps when executed by a processor:
Obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
Transforming the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
Obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
Parsing the acoustic feature vectors and spectral feature vectors with speech labels added to obtain a corresponding speech signal;
Determining the start point and end point of the speech signal according to the time sequence of the speech signal.
With the above voice endpoint detection method, apparatus, computer device and storage medium, a noisy speech signal is obtained and its corresponding acoustic features and spectral features are extracted; by transforming the acoustic features and spectral features, corresponding acoustic feature vectors and spectral feature vectors are obtained. A classifier is obtained, and the acoustic feature vectors and spectral feature vectors are input into it, yielding acoustic feature vectors and spectral feature vectors with speech labels added; the feature vectors can thus be classified effectively, so that speech and non-speech are identified effectively. The labeled acoustic feature vectors and spectral feature vectors are parsed to obtain the corresponding speech signal, and the start point and end point of the speech signal are determined according to its time sequence. The start point and end point of the noisy speech signal can thereby be identified accurately, effectively improving the accuracy of voice endpoint detection.
Brief description of the drawings
Fig. 1 is a flow chart of the voice endpoint detection method in one embodiment;
Fig. 2 is an internal structure diagram of the voice endpoint detection apparatus in one embodiment;
Fig. 3 is an internal structure diagram of the computer device in one embodiment.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of this application clearer, the application is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and are not intended to limit it. The terms "first", "second" and so on used herein may describe various elements, but the elements are not limited by these terms; the terms serve only to distinguish one element from another.
In one embodiment, as shown in Fig. 1, a voice endpoint detection method is provided. Taking its application to a terminal as an example, the method includes the following steps:
Step 102: obtain a noisy speech signal, and extract the acoustic features and spectral features corresponding to the noisy speech signal.
Typically, speech signals collected in practice contain noise of some strength. When the noise is strong, it significantly affects the performance of speech applications, for example lowering speech recognition efficiency and degrading endpoint detection accuracy.
The terminal can obtain speech that the user inputs through a speech input device. The terminal may be a smart phone, tablet computer, laptop, desktop computer or the like, and includes a speech input device, for example a microphone or another device capable of recording speech. The user speech obtained by the terminal is usually a noise-containing noisy speech signal, such as call voice input, recorded audio or a spoken command. After obtaining the noisy speech signal, the terminal extracts its corresponding acoustic features and spectral features. The acoustic features may include information such as the unvoiced sounds, voiced sounds, vowels and consonants of the noisy speech signal; the spectral features may include information such as the vibration frequency and vibration amplitude of the signal, as well as its loudness and timbre.
Specifically, after obtaining the noisy speech signal, the terminal applies windowing and framing. For example, a Hanning window can be used to divide the signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, splitting the noisy speech signal into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal, from which the acoustic features and spectral features can then be extracted.
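The windowing, framing and FFT just described can be sketched as follows. The 16 kHz sample rate and the 25 ms frame length (within the 10-30 ms range stated above) are assumptions for illustration.

```python
import numpy as np

def frame_and_fft(signal, sr=16000, frame_ms=25, shift_ms=10):
    """Split a noisy speech signal into overlapping Hanning-windowed
    frames and take the FFT of each frame, as in step 102.
    Returns an array of shape (n_frames, n_bins) of complex spectra."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples = 10 ms
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.fft.rfft(frame))  # one-sided spectrum
    return np.array(spectra)
```

For one second of 16 kHz audio this produces 98 frames of 201 frequency bins each; all later feature extraction in this method operates on these per-frame spectra.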
Step 104: transform the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors.
After extracting the acoustic features and spectral features corresponding to the noisy speech signal, the terminal transforms them, converting the acoustic features into corresponding acoustic feature vectors and the spectral features into corresponding spectral feature vectors.
Step 106: obtain a classifier, input the acoustic feature vectors and spectral feature vectors into the classifier, and obtain acoustic feature vectors and spectral feature vectors with speech labels added.
The terminal obtains a classifier that has been trained before endpoint detection is performed. By adding speech labels and non-speech labels to the acoustic feature vectors and spectral feature vectors, the classifier divides the input vectors into speech-class acoustic and spectral feature vectors and non-speech-class acoustic and spectral feature vectors. The terminal inputs the acoustic and spectral feature vectors of the noisy speech into the classifier and classifies them with it: when an input acoustic or spectral feature vector belongs to the speech class, a speech label is added to it; when it belongs to the non-speech class, a non-speech label is added. Speech and non-speech can thus be recognized accurately. After the terminal classifies the acoustic and spectral feature vectors, it obtains the acoustic feature vectors and spectral feature vectors with speech labels added.
Further, the terminal may also take the acoustic feature vectors and spectral feature vectors as the classifier input and obtain the corresponding decision values, then add speech or non-speech labels to the vectors according to the decision values, thereby achieving accurate classification of the acoustic and spectral feature vectors.
Step 108: parse the acoustic feature vectors and spectral feature vectors with speech labels added, obtaining the speech signal with speech labels added.
Step 110: determine the start point and end point of the speech signal according to the speech labels and time sequence of the speech signal.
After classifying the acoustic and spectral feature vectors, the terminal needs to parse the acoustic feature vectors and spectral feature vectors to which speech labels were added. Specifically, the terminal parses the labeled acoustic and spectral feature vectors to obtain the spectra corresponding to the labeled acoustic features and spectral features, then converts those spectra into the corresponding speech signal according to the time sequence of the noisy speech signal, thereby recovering the corresponding speech signal.
The noisy speech signal has a time sequence, and the time sequence of the labeled speech signal still corresponds to that of the noisy speech signal. The terminal resolves the labeled acoustic feature vectors and spectral feature vectors into the corresponding labeled speech signal, so it can determine the start point and end point of the noisy speech signal according to the speech labels and the time sequence of the speech signal.
For example, after the classifier classifies the input acoustic and spectral feature vectors, the resulting decision value may be a value between 0 and 1. When the decision value is 1, the terminal adds a speech label to the acoustic or spectral feature vector; when the decision value is 0, it adds a non-speech label. The acoustic and spectral feature vectors can thus be classified accurately. After parsing the labeled vectors, the terminal obtains the speech signal with speech labels added. According to the time sequence of the labeled speech signal, the first frame carrying a speech label is the start point of the noisy speech signal, and the last frame carrying a speech label is its end point. Further, the start point of the speech signal can be determined from a jump of the decision value from 0 to 1, and the end point from a jump from 1 to 0. The start point and end point of the noisy speech signal can thereby be determined accurately.
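The endpoint rule in this example (the start point is the first jump of the decision value from 0 to 1, the end point the last jump from 1 to 0) can be sketched as a small helper. Returning frame indices is an assumed convention for illustration.

```python
def find_endpoints(decisions):
    """Given per-frame decision values (1 = speech, 0 = non-speech),
    return the frame indices of the first and last speech frames,
    i.e. the start point and end point described above.
    Returns (None, None) if no speech frame is present."""
    start = end = None
    for i, d in enumerate(decisions):
        if d == 1:
            if start is None:
                start = i   # first 0 -> 1 jump
            end = i         # updated until the last 1 -> 0 jump
    return start, end
```

Multiplying the frame index by the frame shift (e.g. 10 ms) converts these indices back to times in the noisy speech signal.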
In this embodiment, after obtaining the noisy speech signal, the terminal extracts its corresponding acoustic features and spectral features, and transforms them to obtain corresponding acoustic feature vectors and spectral feature vectors. By inputting the acoustic and spectral feature vectors into the classifier, it obtains the acoustic feature vectors and spectral feature vectors with speech labels added, so that the vectors are classified effectively and speech and non-speech are identified effectively. The terminal parses the labeled acoustic and spectral feature vectors to obtain the corresponding speech signal, and determines the start point and end point of the speech signal according to its time sequence. The start point and end point of the noisy speech signal are thereby identified accurately, effectively improving the accuracy of voice endpoint detection.
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
In phonetics, speech features can be divided into acoustic categories such as vowels, consonants, unvoiced sounds, voiced sounds and silence. After obtaining the noisy speech signal, the terminal applies windowing and framing, for example using a Hanning window to divide the signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, splitting the noisy speech signal into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal.
Further, the terminal can perform time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum, thereby obtaining the acoustic features corresponding to the noisy speech signal.
For example, the terminal may extract the acoustic features of the noisy speech signal using MFCC (Mel-Frequency Cepstrum Coefficients). After windowing and framing, the terminal converts the noisy speech signal into its spectrum, converts that spectrum into the noisy speech cepstrum, performs cepstral analysis on the cepstrum, and applies a discrete cosine transform to obtain the acoustic features of each frame, thereby obtaining effective acoustic features for the noisy speech.
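The cepstral step in this example can be sketched for a single frame: take the log of the power spectrum, then apply a discrete cosine transform to obtain cepstral coefficients. This is a simplification of MFCC (the mel filter bank is omitted for brevity), and the choice of 13 coefficients is a common convention, not a value from the patent.

```python
import numpy as np

def cepstral_features(power_spectrum, n_coeffs=13):
    """Cepstral analysis of one frame: log power spectrum followed
    by an orthonormal DCT-II, keeping the first n_coeffs
    coefficients, mirroring the MFCC step described above."""
    log_spec = np.log(np.asarray(power_spectrum, dtype=float) + 1e-10)  # avoid log(0)
    n = len(log_spec)
    k = np.arange(n_coeffs).reshape(-1, 1)
    m = np.arange(n)
    # DCT-II: X_k = 2 * sum_m x_m * cos(pi * (m + 0.5) * k / n)
    coeffs = 2.0 * (np.cos(np.pi * (m + 0.5) * k / n) @ log_spec)
    # orthonormal scaling (matches scipy's norm='ortho' convention)
    scale = np.full(n_coeffs, np.sqrt(1.0 / (2 * n)))
    scale[0] = np.sqrt(1.0 / (4 * n))
    return coeffs * scale
```

Applying this to each frame's power spectrum yields the per-frame acoustic feature vectors referred to in step 104.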
In one embodiment, before extracting the acoustic features and spectral features corresponding to the noisy speech signal, the method further includes: converting the noisy speech signal into a noisy speech spectrum, and calculating a noisy speech amplitude spectrum from the noisy speech spectrum; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum, to obtain a noise amplitude spectrum; estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum and the speech amplitude spectrum.
After obtaining the noisy speech signal, the terminal applies windowing and framing, for example using a Hanning window to divide the signal into frames of 10-30 ms (milliseconds) with a frame shift of 10 ms, splitting the noisy speech signal into multiple frames. After windowing and framing, the terminal applies a fast Fourier transform to the windowed frames, obtaining the spectrum of the noisy speech signal. Here, the spectrum of the noisy speech signal may be the energy amplitude spectrum of the noisy speech after the fast Fourier transform.
Further, the terminal can calculate the noisy speech amplitude spectrum and the noisy speech phase spectrum from the noisy speech spectrum, and perform dynamic noise estimation on the noisy speech spectrum according to them. Specifically, the terminal can apply an improved minima-controlled recursive averaging algorithm to the noisy speech spectrum to perform dynamic noise estimation, thereby obtaining the noise amplitude spectrum. From the noisy speech amplitude spectrum, the noisy speech phase spectrum and the noise amplitude spectrum, the terminal then estimates the speech amplitude spectrum of the clean speech signal; for example, it can use a log-magnitude spectrum nonlinear estimation method.
The terminal generates the spectral features corresponding to the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum and the estimated clean-speech amplitude spectrum, and can thereby efficiently extract the spectral features of the noisy speech signal.
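The three amplitude spectra can then be stacked into the per-frame spectral feature. The sketch below substitutes plain spectral subtraction for the improved minima-controlled recursive averaging and log-magnitude estimation named above, so it illustrates the feature layout rather than the patent's actual estimators.

```python
import numpy as np

def spectral_features(noisy_mag, noise_mag):
    """Build the per-frame spectral feature from the noisy amplitude
    spectrum, the estimated noise amplitude spectrum, and a
    clean-speech amplitude estimate. The clean estimate here is
    simple spectral subtraction, a stand-in for the estimators
    described in the text."""
    # crude clean-speech estimate: subtract noise, floor at zero
    clean_mag = np.maximum(noisy_mag - noise_mag, 0.0)
    # stack the three spectra into one feature vector per frame
    return np.concatenate([noisy_mag, noise_mag, clean_mag], axis=-1)
```

Feeding the noise and clean-speech estimates to the classifier alongside the noisy spectrum is what gives the method its improved noise robustness over models whose input contains only noisy-speech features.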
In one embodiment, transforming the acoustic features and spectral features includes: extracting a preset number of frames before and after the current frame of the acoustic features and spectral features; calculating a mean vector and/or a variance vector for the current frame from those surrounding frames; and performing a log-domain conversion on the acoustic features and spectral features for which the mean vector and/or variance vector has been calculated, obtaining the transformed acoustic feature vectors and spectral feature vectors.
After the terminal obtains the noisy speech signal, it applies windowing and framing so that the noisy speech signal is divided into multiple frames. The terminal then performs a fast Fourier transform on the framed signal, obtaining the spectrum of the noisy speech signal. From this spectrum, the terminal can extract the acoustic features and spectral features corresponding to the noisy speech signal.
After extracting the acoustic features and spectral features of the noisy speech signal, the terminal converts them into acoustic feature vectors and spectral feature vectors. The terminal extracts a preset number of frames before and after the current frame of the acoustic feature vectors and spectral feature vectors, and uses those frames to calculate the mean vector or variance vector of the current frame, thereby smoothing the acoustic features and spectral features and obtaining the smoothed acoustic feature vectors and spectral feature vectors.
For example, the terminal may take the noisy speech spectra of the five frames preceding and the five frames following the current frame of the acoustic or spectral features — eleven frames in total — and obtain the mean vector of the current frame by averaging these eleven frames. Specifically, the terminal may obtain a filter bank in which each filter has a triangular shape, the triangular window representing the filter window. Given the characteristics of triangular filters, the filters can have equal bandwidth across the noisy speech spectral range. The terminal can use this filter bank to calculate the mean vector of the current frame, thereby smoothing the noisy speech spectrum and obtaining the smoothed acoustic feature vectors and spectral feature vectors.
After smoothing the noisy speech spectrum, the terminal computes the log domain of the smoothed acoustic feature vectors and smoothed spectral feature vectors, obtaining the transformed acoustic feature vectors and spectral feature vectors. Specifically, the terminal can calculate the log energy of the acoustic features and spectral features output by each filter, thereby obtaining the log-domain acoustic feature vectors and log-domain spectral feature vectors — that is, effectively obtaining the transformed acoustic feature vectors and spectral feature vectors.
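The ±5-frame smoothing and the log-domain conversion can be sketched together as follows. This is a simplified illustration that averages raw feature frames directly rather than through a triangular filter bank, and the epsilon floor is an assumption to keep the logarithm defined:

```python
import numpy as np

def smooth_and_log(features, context=5, eps=1e-10):
    """Smooth each frame by averaging over `context` frames on either
    side (11 frames for context=5), then apply a log-domain conversion."""
    n = len(features)
    smoothed = np.empty_like(features, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - context), min(n, i + context + 1)
        smoothed[i] = features[lo:hi].mean(axis=0)  # mean vector of frame i
    return np.log(smoothed + eps)  # log energies of the smoothed features
```

At the sequence edges the window is truncated rather than padded; whether the embodiment pads, truncates, or skips edge frames is not specified above.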
In one embodiment, before the step of obtaining the classifier, the method further includes: obtaining noisy speech data with speech class labels added, and training on the noisy speech data to obtain an initial classifier; obtaining a first verification set containing multiple first voice data; inputting the multiple first voice data into the classifier to obtain the class probability corresponding to each; screening the class probabilities corresponding to the multiple first voice data, adding class labels to the selected first voice data, and obtaining a verification set with class labels added; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple second speech data; inputting the multiple second speech data into the verification classifier to obtain their corresponding class probabilities; and, when the class probabilities corresponding to the multiple second speech data reach a preset probability value, obtaining the required classifier.
Before the classifier is obtained, it needs to be trained on a large amount of noisy speech data. This noisy speech data can be data the terminal obtains from a database or from the Internet. When training the classifier, the noisy speech data is first labeled manually, and the manually labeled noisy speech data is then used for training to obtain the classifier.
Specifically, after the terminal extracts the acoustic features and spectral features of the noisy speech data, it converts them into the corresponding acoustic feature vectors and spectral feature vectors. Annotators can label the acoustic feature vectors and spectral feature vectors according to a class reference table, adding a speech label or a non-speech label to each frame of the noisy speech signal. The terminal then obtains the noisy speech data labeled by the annotators according to the class reference table.
The terminal combines the labeled acoustic feature vectors and spectral feature vectors and inputs them to the input layer of a BLSTM (Bidirectional Long Short-Term Memory) neural network. The nonlinear hidden layers of the network learn new features from the input vectors, and the class to which an input vector belongs can be computed through activation functions. Specifically, each LSTM unit has three gates: a forget gate, a candidate gate, and an output gate. The specific calculation formula can be:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)
where σ denotes the activation function, W_f denotes the forget-gate weight matrix between the input layer and the hidden layer, and b_f denotes the forget-gate bias. The forget gate linearly combines the output h_(t-1) of the previous hidden layer with the current input x_t, then compresses the result to between 0 and 1 using the activation function. The closer the output value is to 1, the more information the memory cell retains; conversely, the closer it is to 0, the less information the memory cell retains.
The candidate gate computes the cell state for the current input; the specific formula can be:

C̃_t = tanh(W_c · [h_(t-1), x_t] + b_c)
where C̃_t denotes the candidate cell state for the current input, whose output values are normalized to between -1 and 1 by the tanh activation function.
The output gate controls the amount of updated memory information passed to the next network layer; the formula can be expressed as:

O_t = σ(W_o · [h_(t-1), x_t] + b_o)
where O_t denotes the amount of updated memory information for the next network layer.
The final output can be computed by the LSTM unit; the formula can be expressed as:
h_t = O_t × tanh(C_t)
The final acoustic feature vector or spectral feature vector is obtained by combining the forward and backward passes; the formula can be expressed as:

h_i = [h→_i, h←_i]
where h→_i is the forward output vector, h←_i is the backward output vector, and h_i is the final set of acoustic feature vectors or spectral feature vectors bearing class labels.
Further, the output layer of the BLSTM can compute the value of the output unit according to a preset decision function. The value of the output unit lies between 0 and 1, where 1 represents the speech class and 0 represents the non-speech class.
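A single time step of the LSTM unit described above can be sketched in NumPy as follows. Note that the standard LSTM cell also uses an input gate to weight the candidate cell state when updating C_t; the stacked weight layout and gate ordering here are assumptions for illustration, not the patent's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x] to the stacked
    forget / input / candidate / output pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = len(h_prev)
    f = sigmoid(z[0:H])              # forget gate, compressed to (0, 1)
    i = sigmoid(z[H:2*H])            # input gate weighting the candidate
    c_tilde = np.tanh(z[2*H:3*H])    # candidate cell state, in (-1, 1)
    o = sigmoid(z[3*H:4*H])          # output gate
    c = f * c_prev + i * c_tilde     # new cell state
    h = o * np.tanh(c)               # h_t = O_t * tanh(C_t)
    return h, c
```

A bidirectional layer would run this recurrence once forward and once backward over the frame sequence and concatenate the two hidden states per frame, as in the combination formula above.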
Using the multiple acoustic feature vectors and spectral feature vectors bearing speech class labels, the terminal calculates the probability that each acoustic feature and spectral feature belongs to the speech class and the non-speech class in the class reference table, extracts for each acoustic feature vector and spectral feature vector the class with the maximum probability value in the class reference table, and adds to the vector the speech class label corresponding to that maximum-probability class.
The terminal trains on the noisy speech data with speech class labels added, obtaining an initial classifier. The terminal obtains a first verification set containing multiple first voice data, inputs them into the classifier to obtain their corresponding class probabilities, and screens those probabilities. Annotators add speech class labels to the first voice data selected by the terminal; the terminal obtains the labeled first voice data and uses it to generate a verification set with speech class labels added. The terminal then retrains using this labeled verification set together with the noisy speech data, obtaining a verification classifier. The terminal obtains a second verification set containing multiple second speech data, inputs them into the verification classifier, and obtains their class probabilities. The terminal filters out the second speech data whose class probability falls within a preset range, has that data relabeled, and retrains on the relabeled second speech data together with the labeled noisy speech data to obtain a new classifier. Training continues in this way until the probability values of a preset number of acoustic feature vectors or spectral feature vectors in all verification sets fall within the preset probability range, at which point training stops and the required classifier is obtained. A classifier with higher accuracy can thus be obtained, enabling accurate classification of the acoustic feature vectors and spectral feature vectors and, in turn, accurate discrimination between speech and non-speech.
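The iterative train-verify-relabel scheme above can be outlined as follows. This is a schematic sketch: `train_fn` stands in for BLSTM training, the classifier is abstracted as a function returning a speech probability, and the confidence band (0.2, 0.8) marking "uncertain" frames for manual relabeling is an assumption:

```python
def self_training_loop(train_fn, labeled_data, verification_sets,
                       prob_lo=0.2, prob_hi=0.8):
    """Sketch of the iterative scheme above: train, score a verification
    set, send uncertain frames back for relabeling, and retrain until the
    verification probabilities clear the confidence band."""
    classifier = train_fn(labeled_data)
    for verify_set in verification_sets:
        probs = [classifier(x) for x, _ in verify_set]
        uncertain = [(x, y) for (x, y), p in zip(verify_set, probs)
                     if prob_lo < p < prob_hi]   # needs manual relabeling
        if not uncertain:
            break                                # all frames are confident
        labeled_data = labeled_data + uncertain  # after (re)annotation
        classifier = train_fn(labeled_data)
    return classifier
```

Each verification set plays the role of the first and second verification sets in the embodiment; the loop terminates once a verification pass produces no probabilities inside the uncertain band.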
In one embodiment, the step of classifying the acoustic feature vectors and spectral feature vectors with the classifier includes: taking the acoustic feature vectors and spectral feature vectors as the input of the classifier and obtaining the corresponding decision values; when a decision value equals a first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals a second threshold, adding a non-speech label to the acoustic feature vector or spectral feature vector.
After the terminal obtains the noisy speech signal, it extracts the corresponding acoustic features and spectral features, converts them, and obtains the corresponding acoustic feature vectors and spectral feature vectors. After obtaining the classifier, the terminal inputs the acoustic feature vectors and spectral feature vectors into it. Once the classifier has classified the input vectors, the corresponding decision values can be obtained. When an obtained decision value equals the preset first threshold, the terminal adds a speech label to the acoustic feature vector or spectral feature vector; the first threshold can also be a range of values. When an obtained decision value equals the preset second threshold, the terminal adds a non-speech label to the acoustic feature vector or spectral feature vector. By classifying the acoustic feature vectors and spectral feature vectors accurately with the classifier, the speech and non-speech portions of the noisy speech signal can be identified accurately.
For example, the obtained decision value can be a value between 0 and 1. The preset first threshold can be 1 and the preset second threshold can be 0. When the obtained decision value is 1, the terminal adds a speech label to the acoustic feature vector or spectral feature vector; when it is 0, the terminal adds a non-speech label. The acoustic feature vectors and spectral feature vectors can thus be classified accurately.
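A thresholding sketch of this labeling step follows. Here a single 0.5 cut-off is assumed for illustration; the embodiment above compares the decision value against separate first and second thresholds, which may themselves be ranges:

```python
def label_frames(decision_values, first_threshold=0.5):
    """Map classifier decision values to 'speech' / 'non-speech' labels;
    treating values at or above the threshold as speech is an assumption
    (the embodiment above uses exact decision values of 1 and 0)."""
    return ['speech' if v >= first_threshold else 'non-speech'
            for v in decision_values]
```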
In one embodiment, as shown in Fig. 2, a speech endpoint detection device is provided, including an extraction module 202, a conversion module 204, a classification module 206, and a parsing module 208, wherein:
The extraction module 202 is configured to obtain a noisy speech signal and extract its corresponding acoustic features and spectral features.

The conversion module 204 is configured to convert the acoustic features and spectral features into the corresponding acoustic feature vectors and spectral feature vectors.

The classification module 206 is configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into it, obtaining acoustic feature vectors and spectral feature vectors with speech labels added.

The parsing module 208 is configured to parse the speech-labeled acoustic feature vectors and spectral feature vectors to obtain the corresponding speech signal, and to determine the starting point and ending point of the speech signal according to its time sequence.
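The parsing module's final step — deriving starting and ending points from the time sequence of speech-labeled frames — can be sketched as follows. This is a minimal illustration; the 10 ms frame shift and the representation of labels as 1/0 per frame are assumptions:

```python
def find_endpoints(frame_labels, frame_shift_ms=10):
    """Derive (start_ms, end_ms) speech segments from per-frame
    speech(1) / non-speech(0) labels, using the frame sequence order."""
    segments, start = [], None
    for i, is_speech in enumerate(frame_labels):
        if is_speech and start is None:
            start = i                      # speech segment begins
        elif not is_speech and start is not None:
            segments.append((start * frame_shift_ms, i * frame_shift_ms))
            start = None                   # speech segment ends
    if start is not None:                  # speech ran to the last frame
        segments.append((start * frame_shift_ms,
                         len(frame_labels) * frame_shift_ms))
    return segments
```

Each returned pair gives one speech segment's starting and ending point in milliseconds relative to the start of the signal.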
In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum, and to perform time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, the extraction module 202 is further configured to convert the noisy speech signal into a noisy speech spectrum and calculate a noisy speech amplitude spectrum from it; perform dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimate the amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generate the spectral features of the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, the conversion module 204 is further configured to extract a preset number of frames before and after the current frame of the acoustic features and spectral features; calculate the mean vector and/or variance vector of the current frame from those frames; and perform a log-domain conversion on the resulting acoustic features and spectral features, obtaining the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, the device further includes a training module configured to obtain noisy speech data with speech class labels added and train on it to obtain an initial classifier; obtain a first verification set containing multiple first voice data; input the first voice data into the initial classifier to obtain their class probabilities; screen those probabilities and add class labels to the selected first voice data, obtaining a labeled verification set; train with the labeled verification set and the labeled noisy speech data to obtain a verification classifier; obtain a second verification set containing multiple second speech data; input the second speech data into the verification classifier to obtain their class probabilities; and, when those probabilities reach the preset probability value, obtain the required classifier.

In one embodiment, the classification module 206 is further configured to take the acoustic feature vectors and spectral feature vectors as the classifier's input and obtain the corresponding decision values; when a decision value equals the first threshold, add a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals the second threshold, add a non-speech label.
In one embodiment, a computer device is provided. The computer device can be a terminal, and its internal structure can be as shown in Fig. 3. For example, the terminal can be, but is not limited to, a smartphone, a tablet computer, a laptop, a personal computer, a portable wearable device, or any other device with a speech input function. The computer device includes a processor, a memory, a network interface, and a speech input device connected through a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and computer program in the non-volatile storage medium. The network interface communicates with external terminals through a network connection. When the computer program is executed by the processor, a speech endpoint detection method is implemented. The speech input device can include a microphone, and can also include an external headset or the like.
Those skilled in the art will understand that the structure shown in Fig. 3 is only a block diagram of the partial structure relevant to the present solution and does not limit the server to which the solution is applied; a specific server can include more or fewer components than shown, combine certain components, or use a different component arrangement.
In one embodiment, a computer device is provided, including a memory storing a computer program and a processor. When executing the computer program, the processor implements the following steps: obtaining a noisy speech signal and extracting its corresponding acoustic features and spectral features; converting the acoustic features and spectral features into the corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into it, obtaining speech-labeled acoustic feature vectors and spectral feature vectors; parsing the speech-labeled acoustic feature vectors and spectral feature vectors to obtain the corresponding speech signal; and determining the starting point and ending point of the speech signal according to its time sequence.

In one embodiment, when executing the computer program, the processor further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, when executing the computer program, the processor further implements the following steps: converting the noisy speech signal into a noisy speech spectrum and calculating a noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features of the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, when executing the computer program, the processor further implements the following steps: extracting a preset number of frames before and after the current frame of the acoustic features and spectral features; calculating the mean vector and/or variance vector of the current frame from those frames; and performing a log-domain conversion on the resulting acoustic features and spectral features, obtaining the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, when executing the computer program, the processor further implements the following steps: obtaining noisy speech data with speech class labels added and training on it to obtain an initial classifier; obtaining a first verification set containing multiple first voice data; inputting the first voice data into the classifier to obtain their class probabilities; screening those probabilities and adding class labels to the selected first voice data, obtaining a labeled verification set; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple second speech data; inputting the second speech data into the verification classifier to obtain their class probabilities; and, when those probabilities reach the preset probability value, obtaining the required classifier.

In one embodiment, when executing the computer program, the processor further implements the following steps: taking the acoustic feature vectors and spectral feature vectors as the classifier's input and obtaining the corresponding decision values; when a decision value equals the first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals the second threshold, adding a non-speech label.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining a noisy speech signal and extracting its corresponding acoustic features and spectral features; converting the acoustic features and spectral features into the corresponding acoustic feature vectors and spectral feature vectors; obtaining a classifier and inputting the acoustic feature vectors and spectral feature vectors into it, obtaining speech-labeled acoustic feature vectors and spectral feature vectors; parsing the speech-labeled acoustic feature vectors and spectral feature vectors to obtain the corresponding speech signal; and determining the starting point and ending point of the speech signal according to its time sequence.

In one embodiment, when executed by the processor, the computer program further implements the following steps: converting the noisy speech signal into a noisy speech spectrum; and performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.

In one embodiment, when executed by the processor, the computer program further implements the following steps: converting the noisy speech signal into a noisy speech spectrum and calculating a noisy speech amplitude spectrum from it; performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum; estimating the amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and generating the spectral features of the noisy speech signal using the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.

In one embodiment, when executed by the processor, the computer program further implements the following steps: extracting a preset number of frames before and after the current frame of the acoustic features and spectral features; calculating the mean vector and/or variance vector of the current frame from those frames; and performing a log-domain conversion on the resulting acoustic features and spectral features, obtaining the transformed acoustic feature vectors and spectral feature vectors.

In one embodiment, when executed by the processor, the computer program further implements the following steps: obtaining noisy speech data with speech class labels added and training on it to obtain an initial classifier; obtaining a first verification set containing multiple first voice data; inputting the first voice data into the classifier to obtain their class probabilities; screening those probabilities and adding class labels to the selected first voice data, obtaining a labeled verification set; training with the labeled verification set and the training set to obtain a verification classifier; obtaining a second verification set containing multiple second speech data; inputting the second speech data into the verification classifier to obtain their class probabilities; and, when those probabilities reach the preset probability value, obtaining the required classifier.

In one embodiment, when executed by the processor, the computer program further implements the following steps: taking the acoustic feature vectors and spectral feature vectors as the classifier's input and obtaining the corresponding decision values; when a decision value equals the first threshold, adding a speech label to the acoustic feature vector or spectral feature vector; and when a decision value equals the second threshold, adding a non-speech label.
Those of ordinary skill in the art will understand that all or part of the flow in the above embodiment methods can be accomplished by instructing the relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed it can include the flows of the embodiments of each of the above methods. Any reference to memory, storage, databases, or other media used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

1. A speech endpoint detection method, comprising:
obtaining a noisy speech signal, and extracting acoustic features and spectral features corresponding to the noisy speech signal;
converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
obtaining a classifier, and inputting the acoustic feature vectors and spectral feature vectors into the classifier to obtain acoustic feature vectors and spectral feature vectors with speech labels added;
parsing the speech-labeled acoustic feature vectors and spectral feature vectors to obtain a corresponding speech signal; and
determining a starting point and an ending point of the speech signal according to the time sequence of the speech signal.
2. The method according to claim 1, wherein, before the extracting of the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises:
converting the noisy speech signal into a noisy speech spectrum; and
performing time-domain analysis and/or frequency-domain analysis and/or transform-domain analysis on the noisy speech spectrum to obtain the acoustic features corresponding to the noisy speech signal.
3. The method according to claim 1, wherein before the extracting of the acoustic features and spectral features corresponding to the noisy speech signal, the method further comprises:
converting the noisy speech signal into a noisy speech spectrum, and computing a noisy speech amplitude spectrum from the noisy speech spectrum;
performing dynamic noise estimation on the noisy speech spectrum according to the noisy speech amplitude spectrum to obtain a noise amplitude spectrum;
estimating the speech amplitude spectrum of the clean speech signal from the noisy speech amplitude spectrum and the noise amplitude spectrum; and
generating the spectral features corresponding to the noisy speech signal from the noisy speech amplitude spectrum, the noise amplitude spectrum, and the speech amplitude spectrum.
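The steps of claim 3 — tracking a noise amplitude spectrum from the noisy amplitude spectrum and then estimating the clean-speech amplitude — are in the spirit of spectral subtraction. The following is a minimal sketch under stated assumptions: the recursive smoothing factor, the number of initial noise-only frames, and the spectral floor are illustrative values, not parameters fixed by the patent.

```python
import numpy as np

def estimate_noise(noisy_mag, alpha=0.95, init_frames=5):
    """Dynamic noise estimate: initialise from the first frames
    (assumed noise-only), then track with recursive smoothing,
    updating faster where the signal drops below the estimate.
    noisy_mag: (n_frames, n_bins) amplitude spectrogram."""
    noise = noisy_mag[:init_frames].mean(axis=0)
    est = np.empty_like(noisy_mag)
    for t in range(noisy_mag.shape[0]):
        # adopt smaller values immediately; smooth upward movements
        noise = np.where(noisy_mag[t] < noise,
                         noisy_mag[t],
                         alpha * noise + (1 - alpha) * noisy_mag[t])
        est[t] = noise
    return est

def estimate_clean(noisy_mag, noise_mag, floor=0.01):
    """Estimate the clean-speech amplitude spectrum by spectral
    subtraction with a spectral floor to avoid negative amplitudes."""
    return np.maximum(noisy_mag - noise_mag, floor * noisy_mag)
```

On a stationary input the estimate converges to the input itself, so the subtracted output falls to the spectral floor — the expected behaviour for a noise-only signal.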
4. The method according to claim 1, wherein the converting of the acoustic features and spectral features comprises:
extracting a preset number of frames before and after the current frame in the acoustic features and the spectral features;
computing a mean vector and/or variance vector corresponding to the current frame using the preset number of frames before and after the current frame; and
performing a log-domain conversion on the acoustic features and spectral features after computing the mean vector and/or variance vector of the current frame, to obtain the converted acoustic feature vectors and spectral feature vectors.
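The conversion of claim 4 — per-frame mean and variance vectors over a context window of surrounding frames, followed by a log-domain conversion — might look like the sketch below. The window size, the use of the statistics for normalisation, and the particular log mapping are all illustrative assumptions.

```python
import numpy as np

def convert_features(feats, context=2, eps=1e-8):
    """For each frame, compute mean and variance vectors over
    `context` frames before and after it, normalise the frame with
    those statistics, and map the result to the log domain.
    feats: (n_frames, dim) feature matrix."""
    n = feats.shape[0]
    out = np.empty_like(feats, dtype=float)
    for t in range(n):
        lo, hi = max(0, t - context), min(n, t + context + 1)
        window = feats[lo:hi]
        mean = window.mean(axis=0)          # mean vector over the window
        var = window.var(axis=0)            # variance vector over the window
        normed = (feats[t] - mean) / np.sqrt(var + eps)
        out[t] = np.log1p(np.abs(normed))   # log-domain conversion
    return out
```

Clamping the window at the signal boundaries means the first and last frames use a one-sided context rather than requiring padding.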
5. The method according to claim 1, wherein before the step of obtaining a classifier, the method further comprises:
obtaining noisy speech data with voice class labels added, and training on the noisy speech data to obtain an initial classifier;
obtaining a first verification set, the first verification set comprising a plurality of first speech data;
inputting the plurality of first speech data into the initial classifier to obtain class probabilities corresponding to the plurality of first speech data;
screening the class probabilities corresponding to the plurality of first speech data, and adding class labels to the selected first speech data to obtain a class-labeled verification set;
training with the class-labeled verification set and the voice-class-labeled noisy speech data to obtain a verification classifier;
obtaining a second verification set, the second verification set comprising a plurality of second speech data;
inputting the plurality of second speech data into the verification classifier to obtain class probabilities corresponding to the plurality of second speech data; and
obtaining the required classifier when the class probabilities corresponding to the plurality of second speech data reach a preset probability value.
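The training procedure of claim 5 resembles self-training: the initial classifier's predictions on an unlabelled verification set are screened by class probability, and the confident ones are fed back as labelled data. A minimal sketch of the screening step follows; the probability threshold is an illustrative assumption, and the per-sample probabilities stand in for the output of any classifier that returns a voice-class probability.

```python
def screen_by_probability(samples, probs, threshold=0.9):
    """Keep only samples the classifier is confident about, and
    attach the predicted class as a pseudo-label.
    probs: per-sample probability of the 'voice' class."""
    labelled = []
    for sample, p in zip(samples, probs):
        if p >= threshold:
            labelled.append((sample, 1))    # confident voice
        elif p <= 1 - threshold:
            labelled.append((sample, 0))    # confident non-voice
        # otherwise: uncertain, excluded from the augmented training set
    return labelled
```

Discarding the uncertain middle band keeps noisy pseudo-labels out of the retraining data, which is the usual rationale for this kind of screening.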
6. The method according to any one of claims 1 to 5, wherein the step of classifying the acoustic feature vectors and spectral feature vectors with the classifier comprises:
taking the acoustic feature vectors and spectral feature vectors as the input of the classifier to obtain decision values corresponding to the acoustic feature vectors and spectral feature vectors;
adding a voice label to an acoustic feature vector or spectral feature vector when its decision value is a first threshold; and
adding a non-voice label to an acoustic feature vector or spectral feature vector when its decision value is a second threshold.
7. A voice endpoint detection apparatus, comprising:
an extraction module, configured to obtain a noisy speech signal and extract acoustic features and spectral features corresponding to the noisy speech signal;
a conversion module, configured to convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
a classification module, configured to obtain a classifier and input the acoustic feature vectors and spectral feature vectors into the classifier, obtaining acoustic feature vectors and spectral feature vectors with voice labels added; and
a parsing module, configured to parse the voice-labeled acoustic feature vectors and spectral feature vectors to obtain a corresponding voice signal, and to determine the start point and end point of the voice signal according to the time sequence of the voice signal.
8. The apparatus according to claim 7, wherein the conversion module is further configured to extract a preset number of frames before and after the current frame in the acoustic features and the spectral features; compute the mean vector and/or variance vector of the current frame using the preset number of frames before and after the current frame; and perform a log-domain conversion on the acoustic features and spectral features after computing the mean vector and/or variance vector of the current frame, obtaining the converted acoustic feature vectors and spectral feature vectors.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN201810048223.3A 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium Active CN108198547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810048223.3A CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108198547A true CN108198547A (en) 2018-06-22
CN108198547B CN108198547B (en) 2020-10-23

Family

ID=62589616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810048223.3A Active CN108198547B (en) 2018-01-18 2018-01-18 Voice endpoint detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108198547B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004272201A (en) * 2002-09-27 2004-09-30 Matsushita Electric Ind Co Ltd Method and device for detecting speech end point
US20060155537A1 (en) * 2005-01-12 2006-07-13 Samsung Electronics Co., Ltd. Method and apparatus for discriminating between voice and non-voice using sound model
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
US20090254341A1 (en) * 2008-04-03 2009-10-08 Kabushiki Kaisha Toshiba Apparatus, method, and computer program product for judging speech/non-speech
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN103730124A (en) * 2013-12-31 2014-04-16 上海交通大学无锡研究院 Noise robustness endpoint detection method based on likelihood ratio test
CN104021789A (en) * 2014-06-25 2014-09-03 厦门大学 Self-adaption endpoint detection method using short-time time-frequency value
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
US20160379632A1 (en) * 2015-06-29 2016-12-29 Amazon Technologies, Inc. Language model speech endpointing
CN107393526A (en) * 2017-07-19 2017-11-24 腾讯科技(深圳)有限公司 Speech silence detection method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bao Wujie: "Voice Endpoint Detection Based on a Speech Enhancement Method", Modern Electronics Technique (《现代电子技术》) *
Wang Lucai: "An Improved Endpoint Detection Method for Noisy Speech", Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN108922556A (en) * 2018-07-16 2018-11-30 百度在线网络技术(北京)有限公司 sound processing method, device and equipment
CN110752973A (en) * 2018-07-24 2020-02-04 Tcl集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN110752973B (en) * 2018-07-24 2020-12-25 Tcl科技集团股份有限公司 Terminal equipment control method and device and terminal equipment
CN109036471A (en) * 2018-08-20 2018-12-18 百度在线网络技术(北京)有限公司 Sound end detecting method and equipment
WO2020173488A1 (en) * 2019-02-28 2020-09-03 北京字节跳动网络技术有限公司 Audio starting point detection method and apparatus
US12119023B2 (en) 2019-02-28 2024-10-15 Beijing Bytedance Network Technology Co., Ltd. Audio onset detection method and apparatus
CN110265032A (en) * 2019-06-05 2019-09-20 平安科技(深圳)有限公司 Conferencing data analysis and processing method, device, computer equipment and storage medium
CN110322872A (en) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Conference voice data processing method, device, computer equipment and storage medium
CN110415704A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium are put down in court's trial
CN110428853A (en) * 2019-08-30 2019-11-08 北京太极华保科技股份有限公司 Voice activity detection method, Voice activity detection device and electronic equipment
CN110808061B (en) * 2019-11-11 2022-03-15 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110808061A (en) * 2019-11-11 2020-02-18 广州国音智能科技有限公司 Voice separation method and device, mobile terminal and computer readable storage medium
CN110910906A (en) * 2019-11-12 2020-03-24 国网山东省电力公司临沂供电公司 Audio endpoint detection and noise reduction method based on power intranet
CN111179972A (en) * 2019-12-12 2020-05-19 中山大学 Human voice detection algorithm based on deep learning
CN111192600A (en) * 2019-12-27 2020-05-22 北京网众共创科技有限公司 Sound data processing method and device, storage medium and electronic device
CN111626061A (en) * 2020-05-27 2020-09-04 深圳前海微众银行股份有限公司 Conference record generation method, device, equipment and readable storage medium
CN111916060A (en) * 2020-08-12 2020-11-10 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN111916060B (en) * 2020-08-12 2022-03-01 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium
CN113327626A (en) * 2021-06-23 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113744725A (en) * 2021-08-19 2021-12-03 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN113744725B (en) * 2021-08-19 2024-07-05 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN114974258A (en) * 2022-07-27 2022-08-30 深圳市北科瑞声科技股份有限公司 Speaker separation method, device, equipment and storage medium based on voice processing
CN115497511A (en) * 2022-10-31 2022-12-20 广州方硅信息技术有限公司 Method, device, equipment and medium for training and detecting voice activity detection model

Also Published As

Publication number Publication date
CN108198547B (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN108198547A (en) Sound end detecting method, device, computer equipment and storage medium
CN108877775B (en) Voice data processing method and device, computer equipment and storage medium
EP3955246B1 (en) Voiceprint recognition method and device based on memory bottleneck feature
US9792897B1 (en) Phoneme-expert assisted speech recognition and re-synthesis
Reynolds An overview of automatic speaker recognition technology
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
Revathi et al. Speaker independent continuous speech and isolated digit recognition using VQ and HMM
Kim et al. Robust DTW-based recognition algorithm for hand-held consumer devices
Mandel et al. Audio super-resolution using concatenative resynthesis
CN112216285B (en) Multi-user session detection method, system, mobile terminal and storage medium
WO2003065352A1 (en) Method and apparatus for speech detection using time-frequency variance
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
Gambhir et al. Residual networks for text-independent speaker identification: Unleashing the power of residual learning
Muralikrishna et al. HMM based isolated Kannada digit recognition system using MFCC
Petrovska-Delacrétaz et al. Text-independent speaker verification: state of the art and challenges
Mardhotillah et al. Speaker recognition for digital forensic audio analysis using support vector machine
Nasibov Decision fusion of voice activity detectors
Shome et al. A robust technique for end point detection under practical environment
CN113658599A (en) Conference record generation method, device, equipment and medium based on voice recognition
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Alkhatib et al. ASR Features Extraction Using MFCC And LPC: A Comparative Study
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
Gbadamosi Text independent biometric speaker recognition system
Marković et al. Recognition of normal and whispered speech based on RASTA filtering and DTW algorithm
Yousafzai et al. Tuning support vector machines for robust phoneme classification with acoustic waveforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant