CN108806725A - Speech differentiation method, apparatus, computer equipment and storage medium - Google Patents
- Publication number: CN108806725A
- Application number: CN201810561789.6A
- Authority
- CN
- China
- Prior art keywords
- voice data
- asr
- distinguished
- target
- biasing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING; G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
- G10L25/03—Characterised by the type of extracted parameters
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/21—Extracted parameters being power information
- G10L25/24—Extracted parameters being the cepstrum
- G10L25/45—Characterised by the type of analysis window
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Abstract
The invention discloses a speech differentiation method and apparatus, a computer device, and a storage medium. The speech differentiation method includes: processing original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished; obtaining corresponding ASR speech features based on the target voice data to be distinguished; and inputting the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result. The method distinguishes target speech from interfering speech well, and can still perform accurate speech differentiation even when the voice data suffers heavy noise interference.
Description
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech differentiation method and apparatus, a computer device, and a storage medium.
Background technology
Speech differentiation refers to screening silence out of input speech so that only the speech segments more meaningful for recognition (i.e., the target speech) are retained. Current speech differentiation methods have serious shortcomings, especially in the presence of noise: as the noise grows, speech differentiation becomes harder, target speech and interfering speech can no longer be separated accurately, and the differentiation result is unsatisfactory.
Summary of the invention
Embodiments of the present invention provide a speech differentiation method and apparatus, a computer device, and a storage medium, to solve the problem that speech differentiation performs poorly.
An embodiment of the present invention provides a speech differentiation method, including:
processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
obtaining corresponding ASR speech features based on the target voice data to be distinguished;
inputting the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
An embodiment of the present invention provides a speech differentiation apparatus, including:
a target voice-data acquisition module, configured to process original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
a speech-feature acquisition module, configured to obtain corresponding ASR speech features based on the target voice data to be distinguished;
a target differentiation-result acquisition module, configured to input the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
An embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the speech differentiation method when executing the computer program.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the speech differentiation method when executed by a processor.
In the speech differentiation method and apparatus, computer device, and storage medium provided by the embodiments of the present invention, the original voice data to be distinguished is first processed with a voice activity detection algorithm to obtain target voice data to be distinguished: the original data passes through an initial round of differentiation that narrows its range and preliminarily removes most non-speech. Corresponding ASR speech features are then obtained based on the target voice data to be distinguished, laying the technical groundwork for the subsequent recognition of those features by the ASR-DBN model. Finally, the ASR speech features are input into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result. Because the ASR-DBN model is a recognition model specially trained on ASR speech features to distinguish speech accurately, it can correctly separate target speech from interfering speech in the target voice data to be distinguished, improving the accuracy of speech differentiation.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is an application-environment diagram of the speech differentiation method in an embodiment of the present invention;
Fig. 2 is a flowchart of the speech differentiation method in an embodiment of the present invention;
Fig. 3 is a detailed flowchart of step S10 in Fig. 2;
Fig. 4 is a detailed flowchart of step S20 in Fig. 2;
Fig. 5 is a detailed flowchart of step S21 in Fig. 4;
Fig. 6 is a detailed flowchart of step S24 in Fig. 4;
Fig. 7 is a detailed flowchart of the steps before step S30 in Fig. 2;
Fig. 8 is a schematic diagram of the speech differentiation apparatus in an embodiment of the present invention;
Fig. 9 is a schematic diagram of a computer device in an embodiment of the present invention.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows the application environment of the speech differentiation method provided by an embodiment of the present invention. The application environment includes a server side and a client connected through a network. The client is a device capable of human-computer interaction with a user, including but not limited to computers, smartphones, and tablets. The server side may be implemented as an independent server or as a server cluster composed of multiple servers. The speech differentiation method provided by the embodiment of the present invention is applied on the server side.
As shown in Fig. 2, which is a flowchart of the speech differentiation method in this embodiment, the speech differentiation method includes the following steps:
S10: Process the original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished.
Here, the purpose of voice activity detection (VAD) is to identify and eliminate long silent periods from the audio signal stream, so as to save traffic resources without degrading the quality of service; it saves valuable bandwidth, reduces end-to-end delay, and improves the user experience. A voice activity detection algorithm (VAD algorithm) is the concrete algorithm used to perform voice activity detection, and many such algorithms exist. It will be appreciated that VAD can be applied to speech differentiation to distinguish target speech from interfering speech. Target speech refers to the portions of the voice data in which the voiceprint varies continuously and markedly; interfering speech may be portions without vocalization due to silence, or environmental noise. The original voice data to be distinguished is the rawest voice data obtained, i.e., the voice data on which the VAD algorithm will perform preliminary differentiation. The target voice data to be distinguished is the voice data, obtained after the VAD algorithm has processed the original voice data to be distinguished, on which speech differentiation is then performed.
In this embodiment, the VAD algorithm processes the original voice data to be distinguished, preliminarily screening target speech and interfering speech out of it, and the target-speech portion obtained by this preliminary screening serves as the target voice data to be distinguished. It will be appreciated that the interfering speech removed by the preliminary screening need not be examined again, which improves the efficiency of speech differentiation. However, the target speech preliminarily screened out of the original voice data still contains interfering content; in particular, the noisier the original voice data, the more interfering speech (e.g., noise) is mixed into the preliminarily screened target speech, and the VAD algorithm alone clearly cannot differentiate the speech effectively in that case. The preliminarily screened target speech, still mixed with interfering speech, should therefore be taken as the target voice data to be distinguished and subjected to more precise differentiation. By applying the VAD algorithm to the original voice data for preliminary speech differentiation, the server can repartition the original voice data according to the preliminary screening, removing a large amount of interfering speech while facilitating further speech differentiation later.
In a specific embodiment, as shown in Fig. 3, step S10 of processing the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished includes the following steps:
S11: Process the original voice data to be distinguished according to the short-time energy calculation formula to obtain the corresponding short-time energy characteristic values; retain the original data to be distinguished whose short-time energy characteristic value exceeds a first threshold, and determine it as the first original differentiation voice data. The short-time energy calculation formula is E = Σ s(n)², summed over n = 1 to N, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time index.
Here, the short-time energy characteristic value describes the energy of one frame of speech (a frame generally spans 10-30 ms) in the time domain; the "short time" is understood as the duration of one frame (the speech frame length). Because the short-time energy characteristic value of target speech is much higher than that of interfering speech (silence), this characteristic value can be used to distinguish target speech from interfering speech.
In this embodiment, the original voice data to be distinguished is processed according to the short-time energy calculation formula (the original data must first be framed), and the short-time energy characteristic value of each frame of the original voice data is calculated. The characteristic value of each frame is compared with the preset first threshold; the original voice data exceeding the first threshold is retained and determined as the first original differentiation voice data. The first threshold is the cut-off value for judging whether a short-time energy characteristic value belongs to target speech or to interfering speech. In this embodiment, the comparison between the short-time energy characteristic values and the first threshold allows the target speech in the original voice data to be preliminarily distinguished from the perspective of short-time energy, effectively removing a large amount of interfering speech from the original voice data to be distinguished.
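As a rough sketch of the energy screen in step S11 (not part of the patent; the function names and the example threshold are invented for illustration), the per-frame computation can be written as:

```python
def short_time_energy(frame):
    # E = sum over n of s(n)^2 for one frame of time-domain amplitudes
    return sum(s * s for s in frame)

def screen_by_energy(frames, first_threshold):
    # Retain frames whose short-time energy exceeds the first threshold (step S11)
    return [f for f in frames if short_time_energy(f) > first_threshold]
```

A voiced frame such as [0.5, -0.6, 0.4, -0.5] has energy 1.02, while near-silence stays close to zero, so a first threshold placed between the two separates them.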
S12: Process the original voice data to be distinguished according to the zero-crossing-rate calculation formula to obtain the corresponding zero-crossing-rate characteristic values; retain the original voice data to be distinguished whose zero-crossing-rate characteristic value is below a second threshold, and determine it as the second original differentiation voice data. The zero-crossing-rate calculation formula is Z = (1/(2N)) Σ |sgn(s(n)) − sgn(s(n−1))|, summed over n = 2 to N, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time index.
Here, the zero-crossing-rate characteristic value describes the number of times the speech waveform crosses the horizontal axis (zero level) within one frame. Because the zero-crossing-rate characteristic value of target speech is much lower than that of interfering speech, target speech and interfering speech can be distinguished according to the zero-crossing-rate characteristic value.
In this embodiment, the original voice data to be distinguished is processed according to the zero-crossing-rate calculation formula, and the zero-crossing-rate characteristic value of each frame of the original voice data is calculated. The characteristic value of each frame is compared with the preset second threshold; the original voice data below the second threshold is retained and determined as the second original differentiation voice data. The second threshold is the cut-off value for judging whether a zero-crossing-rate characteristic value belongs to target speech or to interfering speech. In this embodiment, the comparison between the zero-crossing-rate characteristic values and the second threshold allows the target speech in the original voice data to be preliminarily distinguished from the perspective of zero-crossing rate, effectively removing a large amount of interfering speech from the original voice data to be distinguished.
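A minimal sketch of the zero-crossing screen in step S12 (again illustrative, not from the patent): a frame that alternates in sign every sample crosses zero at nearly every step, while sustained voiced speech crosses far less often.

```python
def zero_crossing_rate(frame):
    # Z = (1/(2N)) * sum over n of |sgn(s(n)) - sgn(s(n-1))|
    sgn = lambda x: 1 if x >= 0 else -1
    N = len(frame)
    crossings = sum(abs(sgn(frame[n]) - sgn(frame[n - 1])) for n in range(1, N))
    return crossings / (2.0 * N)

def screen_by_zcr(frames, second_threshold):
    # Retain frames whose zero-crossing rate is below the second threshold (step S12)
    return [f for f in frames if zero_crossing_rate(f) < second_threshold]
```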
S13: Take the first original differentiation voice data and the second original differentiation voice data as the target voice data to be distinguished.
In this embodiment, the first original differentiation voice data is distinguished from the original voice data from the perspective of short-time energy, and the second original differentiation voice data is distinguished from it from the perspective of zero-crossing rate. The two proceed from different angles on speech differentiation, and both angles distinguish speech well; therefore the first original differentiation voice data and the second original differentiation voice data are merged (by taking their intersection) to form the target voice data to be distinguished.
Steps S11-S13 can preliminarily and effectively remove most of the interfering voice data from the original voice data to be distinguished, retaining the portion that mixes target speech with a small amount of interfering speech (e.g., noise); taking this portion as the target voice data to be distinguished achieves an effective preliminary differentiation of the original voice data.
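The intersection merge of steps S11-S13 can be sketched as follows (a self-contained illustration, so the two helper measures are repeated here; all names and thresholds are invented): a frame is kept only when it passes both screens.

```python
def short_time_energy(frame):
    # E = sum of squared time-domain amplitudes (step S11)
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    # Z = (1/(2N)) * sum of |sgn(s(n)) - sgn(s(n-1))| (step S12)
    sgn = lambda x: 1 if x >= 0 else -1
    N = len(frame)
    return sum(abs(sgn(frame[n]) - sgn(frame[n - 1])) for n in range(1, N)) / (2.0 * N)

def vad_intersection(frames, first_threshold, second_threshold):
    # Step S13: keep only frames in the intersection of the two screens,
    # i.e. high short-time energy AND low zero-crossing rate.
    return [f for f in frames
            if short_time_energy(f) > first_threshold
            and zero_crossing_rate(f) < second_threshold]
```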
S20: Obtain corresponding ASR speech features based on the target voice data to be distinguished.
Here, ASR (Automatic Speech Recognition) is a technology that converts voice data into computer-readable input, for example into key presses, binary codes, or character strings. ASR can extract the speech features in the target voice data to be distinguished; the features extracted from the speech are its corresponding ASR speech features. It will be appreciated that ASR converts voice data that a computer cannot read directly into ASR speech features that it can read, and these features may be represented as vectors.
In this embodiment, the target voice data to be distinguished is processed with ASR to obtain the corresponding ASR speech features. These features reflect the latent characteristics of the target voice data well, so the target voice data can be differentiated according to them, providing the technical premise for the subsequent recognition by the corresponding ASR-DBN model based on these ASR speech features.
In a specific embodiment, as shown in Fig. 4, step S20 of obtaining the corresponding ASR speech features based on the target voice data to be distinguished includes the following steps:
S21: Preprocess the target voice data to be distinguished to obtain preprocessed voice data.
In this embodiment, the target voice data to be distinguished is preprocessed to obtain the corresponding preprocessed voice data. Preprocessing the target voice data allows its ASR speech features to be extracted better, so that the extracted features are more representative of the target voice data and can be used for speech differentiation.
In a specific embodiment, as shown in Fig. 5, step S21 of preprocessing the target voice data to be distinguished to obtain the preprocessed voice data includes the following steps:
S211: Apply pre-emphasis to the target voice data to be distinguished. The pre-emphasis formula is s'(n) = s(n) − a·s(n−1), where s(n) is the signal amplitude in the time domain, s(n−1) is the signal amplitude at the moment preceding s(n), s'(n) is the time-domain signal amplitude after pre-emphasis, and a is the pre-emphasis coefficient with value range 0.9 < a < 1.0.
Here, pre-emphasis is a signal-processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is heavily attenuated during transmission; to let the receiving end obtain a good waveform, the impaired signal must be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the transmission line, compensating for their excessive attenuation in transit so that the receiving end obtains a better waveform. Pre-emphasis does not affect the noise, so it effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the target voice data with the formula s'(n) = s(n) − a·s(n−1), where s(n) is the time-domain signal amplitude, i.e., the amplitude of the speech expressed by the voice data in the time domain, s(n−1) is the amplitude at the preceding moment, s'(n) is the amplitude after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; taking a = 0.97 gives relatively good pre-emphasis results. This pre-emphasis eliminates the interference introduced by the vocal cords, lips, and the like during vocalization, effectively compensates the suppressed high-frequency part of the target voice data, highlights the high-frequency formants, and reinforces the signal amplitude of the target voice data, which helps extract the ASR speech features.
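The pre-emphasis filter of step S211 can be sketched directly from its formula (the function name and the handling of the first sample, which has no predecessor, are choices made for this illustration):

```python
def pre_emphasis(signal, a=0.97):
    # s'(n) = s(n) - a * s(n-1); the first sample is kept as-is
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]
```

A constant (purely low-frequency) signal is strongly attenuated, while rapid sample-to-sample changes pass through almost unchanged, which is exactly the high-frequency boost described above.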
S212: Frame the pre-emphasized target voice data.
In this embodiment, after pre-emphasizing the target voice data, framing is also performed. Framing is the speech-processing technique of cutting a whole speech signal into several segments; each frame spans 10-30 ms, and generally half the frame length is used as the frame shift. The frame shift is the overlap between two adjacent frames, which avoids the problem of adjacent frames changing too abruptly. Framing divides the target voice data into several segments of voice data, subdividing it and facilitating the extraction of the ASR speech features.
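The framing of step S212 with a half-frame shift can be sketched as follows (an illustrative helper; real systems would work in samples derived from the 10-30 ms frame duration and the sampling rate):

```python
def frame_signal(samples, frame_len, frame_shift):
    # Cut the signal into frames of frame_len samples, advancing by frame_shift
    # samples each time (frame_shift = frame_len // 2 gives a half-frame shift,
    # i.e. 50% overlap between adjacent frames).
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames
```

With a half-frame shift, the second half of each frame reappears as the first half of the next, which is the overlap region described above.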
S213: Window each frame of the target voice data to be distinguished to obtain the preprocessed voice data. The windowing formula is s'(n) = s(n) × (0.54 − 0.46·cos(2πn/(N−1))), where N is the window length, n is the time index, s(n) is the time-domain signal amplitude, and s'(n) is the time-domain signal amplitude after windowing.
In this embodiment, after the target voice data is framed, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error relative to the target voice data. Windowing solves this problem: it makes the framed target voice data continuous and lets each frame exhibit the characteristics of a periodic function. Windowing specifically means processing the target voice data with a window function; the Hamming window may be chosen, in which case the windowing formula is s'(n) = s(n) × (0.54 − 0.46·cos(2πn/(N−1))), where N is the Hamming window length, n is the time index, s(n) is the time-domain signal amplitude, and s'(n) is the amplitude after windowing. Windowing the target voice data yields the preprocessed voice data, makes the time-domain signal of the framed target voice data continuous, and helps extract the ASR speech features of the target voice data.
The preprocessing of the target voice data in steps S211-S213 lays the foundation for extracting the ASR speech features of the target voice data, so that the extracted features are more representative of it and speech differentiation can be performed according to them.
S22: Apply a Fast Fourier Transform to the pre-processed voice data to obtain the spectrum of the target voice data to be distinguished, and obtain the power spectrum of the target voice data to be distinguished from the spectrum.
Here, the Fast Fourier Transform (Fast Fourier Transformation, FFT for short) is the general term for the efficient, fast computer algorithms for calculating the discrete Fourier transform. Using this algorithm greatly reduces the number of multiplications the computer needs to calculate the discrete Fourier transform; the more sampling points are transformed, the more significant the saving in computation achieved by the FFT algorithm.
In the present embodiment, a Fast Fourier Transform is applied to the pre-processed voice data to convert its signal amplitude in the time domain into the signal amplitude in the frequency domain (the spectrum). The formula for calculating the spectrum is s(k) = Σ_{n=1}^{N} s(n)·e^{−2πikn/N}, 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time index, and i is the imaginary unit. After the spectrum of the pre-processed voice data is obtained, the power spectrum of the pre-processed voice data can be computed directly from it; below, the power spectrum of the pre-processed voice data is referred to as the power spectrum of the target voice data to be distinguished. The formula for calculating the power spectrum of the target voice data to be distinguished is P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the pre-processed voice data from the time-domain signal amplitude to the frequency-domain signal amplitude, and then obtaining the power spectrum of the target voice data to be distinguished from that frequency-domain amplitude, provides an important technical foundation for extracting ASR speech features from the power spectrum of the target voice data to be distinguished.
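Step S22 can be sketched as follows (assuming numpy; the test tone and frame size are illustrative, and the |s(k)|²/N normalization follows the power-spectrum formula above):

```python
import numpy as np

def power_spectrum(frame):
    """Spectrum s(k) via the FFT, then power spectrum P(k) = |s(k)|^2 / N."""
    N = len(frame)
    spectrum = np.fft.fft(frame)        # signal amplitude in the frequency domain
    return (np.abs(spectrum) ** 2) / N  # power spectrum of the frame

# A pure tone with exactly 50 cycles in a 400-sample frame
frame = np.sin(2 * np.pi * 50 * np.arange(400) / 400)
p = power_spectrum(frame)
```

For this tone the power concentrates in frequency bin 50 (and its mirror bin), illustrating the time-domain to frequency-domain conversion the step describes.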
S23: Process the power spectrum of the target voice data to be distinguished using the mel-scale filter bank to obtain the mel power spectrum of the target voice data to be distinguished.
Here, processing the power spectrum of the target voice data to be distinguished with the mel-scale filter bank amounts to performing mel-frequency analysis on the power spectrum, and mel-frequency analysis is analysis based on human auditory perception. Observation shows that the human ear works like a filter bank, attending only to certain specific frequency components (human hearing is frequency-selective); that is, the ear lets signals of certain frequencies through and simply ignores the frequency signals it does not wish to perceive. These filters, however, are not uniformly distributed on the frequency axis: there are many filters in the low-frequency region, densely distributed, while in the high-frequency region the filters become fewer and sparsely distributed. It should be understood that the mel-scale filter bank has high resolution at low frequencies, consistent with the auditory characteristics of the human ear, which is where the physical meaning of the mel scale lies.
In the present embodiment, the power spectrum of the target voice data to be distinguished is processed with the mel-scale filter bank to obtain the mel power spectrum of the target voice data to be distinguished. The frequency-domain signal is partitioned with the mel-scale filter bank so that each final frequency band corresponds to one value; if the number of filters is 22, then 22 energy values corresponding to the mel power spectrum of the target voice data to be distinguished are obtained. Performing mel-frequency analysis on the power spectrum of the target voice data to be distinguished ensures that the resulting mel power spectrum retains the frequency portions closely related to the characteristics of the human ear, and this mel power spectrum reflects the features of the target voice data to be distinguished well.
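A minimal sketch of a mel-scale filter bank with 22 triangular filters, as in the example above (the mel conversion constants 2595 and 700 are the conventional ones, assumed here rather than given by the patent):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale: dense at low
    frequencies, sparse at high frequencies, matching the human ear."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):              # rising slope of triangle i
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):             # falling slope of triangle i
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filterbank(22, 512, 16000)
mel_power = fbank @ np.ones(257)   # 22 energy values from a (flat) power spectrum
```

Multiplying the filter bank by a frame's power spectrum yields the 22 energy values described in the text; the low-frequency filters are visibly narrower than the high-frequency ones.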
S24: Carry out cepstral analysis on the mel power spectrum to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
Here, the cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the result is also known as the complex cepstrum.
In the present embodiment, cepstral analysis is carried out on the mel power spectrum, and the mel-frequency cepstrum coefficients of the target voice data to be distinguished are obtained from the cepstrum result. Through the cepstral analysis, the features contained in the mel power spectrum of the target voice data to be distinguished, whose original dimensionality is too high to use directly, are converted into easy-to-use features (mel-frequency cepstrum coefficient feature vectors used for training or recognition). The mel-frequency cepstrum coefficients can serve as the ASR speech features, coefficients capable of distinguishing different speech; these ASR speech features reflect the differences between voices and can be used to recognize and distinguish the target voice data to be distinguished.
In a specific embodiment, as shown in Fig. 6, step S24 of carrying out cepstral analysis on the mel power spectrum to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished includes the following steps:
S241: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
In the present embodiment, according to the definition of the cepstrum, the logarithm log is taken of the mel power spectrum to obtain the mel power spectrum m to be transformed.
S242: Apply a discrete cosine transform to the mel power spectrum to be transformed to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
In the present embodiment, a discrete cosine transform (Discrete Cosine Transform, DCT) is applied to the mel power spectrum m to be transformed to obtain the corresponding mel-frequency cepstrum coefficients of the target voice data to be distinguished; generally the 2nd to 13th coefficients are taken as the ASR speech features, and these ASR speech features can reflect the differences between voice data. The formula for applying the discrete cosine transform to the mel power spectrum m to be transformed is C(j) = Σ_{i=1}^{N} m(i)·cos(πj(i − 0.5)/N), where N is the frame length, m is the mel power spectrum to be transformed, and j is the index variable of the mel power spectrum to be transformed. Because the mel filters overlap, there is correlation between the energy values obtained with the mel-scale filters; the discrete cosine transform can perform dimensionality-reducing compression and abstraction on the mel power spectrum m to be transformed to obtain the corresponding ASR speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear advantage computationally.
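Steps S241-S242 can be sketched as follows (a numpy sketch; the DCT sum is written out directly so the block stays self-contained, and taking the 2nd through 13th coefficients follows the text):

```python
import numpy as np

def mfcc_from_mel_power(mel_power, num_ceps=12):
    """Cepstral analysis: log of the mel power spectrum (S241), then a DCT
    (S242). Returns the 2nd..13th coefficients as the ASR speech features."""
    m = np.log(mel_power)          # mel power spectrum to be transformed
    N = len(m)
    i = np.arange(N)
    # C(j) = sum_i m(i) * cos(pi * j * (i + 0.5) / N), the formula above
    ceps = np.array([np.sum(m * np.cos(np.pi * j * (i + 0.5) / N))
                     for j in range(N)])
    return ceps[1:1 + num_ceps]    # keep the 2nd through 13th coefficients

# A perfectly flat mel power spectrum (log = 1 everywhere) carries no
# spectral shape, so all retained coefficients come out (numerically) zero.
features = mfcc_from_mel_power(np.ones(22) * np.e)
```

This illustrates the decorrelating effect described above: the DCT concentrates whatever shape the log mel spectrum has into a few low-order coefficients.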
Steps S21-S24 perform feature-extraction processing on the target voice data to be distinguished based on ASR technology; the ASR speech features finally obtained represent the target voice data to be distinguished well, and training with these ASR speech features yields the corresponding ASR-DBN model, making the result of the trained ASR-DBN model more accurate when performing speech differentiation, so that even under very noisy conditions, noise and speech can be distinguished accurately.
It should be noted that the features extracted above are mel-frequency cepstrum coefficients, but the ASR speech features should not be limited here to mel-frequency cepstrum coefficients alone; they should be understood as the speech features obtained using ASR technology, and any feature that effectively reflects the characteristics of the voice data can be used as an ASR speech feature for recognition and model training.
S30: Input the ASR speech features into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result.
Here, the ASR-DBN model refers to a neural network model trained with ASR speech features, and DBN refers to a Deep Belief Network. The ASR-DBN model is trained with the ASR speech features extracted from the voice data to be trained, so the model can recognize ASR speech features and thereby distinguish speech according to them. In particular, unlike a traditional neural network, the ASR-DBN model is stacked from several ASR-RBM models, where ASR-RBM is the building unit of ASR-DBN and RBM refers to the Restricted Boltzmann Machine. The voice data to be trained includes target voice and noise; when the voice data to be trained is used to train the ASR-DBN model, the ASR speech features of the target voice and the ASR speech features of the noise are extracted, so that the ASR-DBN model can recognize the target voice and the interfering voice according to the ASR speech features (because VAD differentiation has already removed most of the interfering voice from the original voice data to be distinguished, such as the unvoiced silent portions and a part of the noise, the interfering voice distinguished by the ASR-DBN model here specifically refers to the noise portion), achieving the purpose of effectively distinguishing target voice from interfering voice.
In the present embodiment, the ASR speech features are input into the pre-trained ASR-DBN model for differentiation. Since the ASR speech features reflect the characteristics of the voice data, the ASR-DBN model can recognize the ASR speech features extracted from the target voice data to be distinguished, and thereby make an accurate speech distinction for the target voice data to be distinguished according to those features. The pre-trained ASR-DBN model combines ASR speech features with a deep belief network that extracts deep-level features, distinguishing speech from the features of the voice data, and it maintains a very high accuracy rate even under very severe noise conditions. Specifically, since the ASR-extracted features also include the ASR speech features of noise, noise is likewise accurately distinguished in the ASR-DBN model, which effectively solves the problem that current speech differentiation methods (including but not limited to VAD) cannot differentiate speech effectively under conditions of heavy noise influence.
In a specific embodiment, before step S30 of inputting the ASR speech features into the pre-trained ASR-DBN model for differentiation and obtaining the target differentiation result, the speech differentiation method further includes the following step: obtaining the ASR-DBN model.
As shown in Fig. 7, the step of obtaining the ASR-DBN model specifically includes:
S31: Obtain the voice data to be trained and extract the ASR speech features to be trained of the voice data to be trained.
Here, the voice data to be trained refers to the voice-data training sample set needed to train the ASR-DBN model, including target voice and noise. The voice data to be trained may use an open-source voice training set, or a voice training set built by collecting a large number of sample voice data. The voice data to be trained is divided into labeled voice data to be trained and unlabeled voice data to be trained. Labeled voice data to be trained refers to voice data in which voice and noise are distinguished in advance by means of labels, for example target voice labeled "1" and noise labeled "0". Unlabeled voice data to be trained is the opposite of labeled voice data to be trained: it does not distinguish voice and noise by labels. The unlabeled voice data to be trained is used in the ASR-RBM model training stage, and the labeled voice data to be trained is used in the ASR-DBN tuning stage; before each stage, the corresponding ASR speech features to be trained should be extracted.
In the present embodiment, the voice data to be trained is obtained and its features, the ASR speech features to be trained, are extracted; the steps for extracting the ASR speech features to be trained are the same as steps S21-S24 and are not repeated here. The voice data to be trained includes a target voice portion and a noise portion, each with its own ASR speech features; therefore, an ASR-DBN model can be trained and obtained from the extracted ASR speech features to be trained, so that the ASR-DBN model obtained by training on these ASR speech features can accurately distinguish target voice from noise (noise belongs to interfering voice).
S32: Train the ASR-RBM models in turn using the ASR speech features to be trained, and update the parameters according to the error produced by training to obtain each ASR-RBM model, where the parameters include weights and biases, the biases comprising the biases of the visible-layer neurons and the biases of the hidden-layer neurons. The formula for updating the weights W is W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weight before the update, W' is the updated weight, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer. The formula for updating the bias a is a' = a + λ(v1 − v2), where a is the bias before the visible-layer neuron update and a' is the updated visible-layer neuron bias. The formula for updating the bias b is b' = b + λ(h1 − h2), where b is the bias before the hidden-layer neuron update and b' is the updated hidden-layer neuron bias.
ASR-RBM is the building unit of ASR-DBN. Obtaining the ASR-DBN model requires first fully training the first ASR-RBM model and fixing its weights and biases, then using the ASR speech features output by that ASR-RBM model's hidden layer as the input ASR speech features of the second ASR-RBM model, and then fully training the second ASR-RBM model. After the second ASR-RBM model is fully trained, it is stacked on top of the first ASR-RBM model, i.e. the hidden layer of the first ASR-RBM model serves as the visible layer of the second ASR-RBM model, and so on, fully training in turn to obtain the predetermined number of ASR-RBM models.
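The stacking described above can be sketched as follows (illustrative layer sizes and random weights stand in for trained ASR-RBM parameters; the point is only how one model's hidden output feeds the next model's visible layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_output(v, W, b):
    """Hidden-layer activation probabilities of an ASR-RBM; when stacking,
    this output serves as the visible-layer input of the next ASR-RBM."""
    return sigmoid(b + v @ W)

rng = np.random.default_rng(0)
v = rng.random(12)                               # an ASR speech feature vector
W1, b1 = rng.normal(size=(12, 8)), np.zeros(8)   # first (already trained) ASR-RBM
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)    # second ASR-RBM stacked on top

h1 = hidden_output(v, W1, b1)    # hidden layer of the first ASR-RBM
h2 = hidden_output(h1, W2, b2)   # first RBM's hidden layer is the second's visible layer
```

Each stacked layer reduces the feature dimensionality (here 12 to 8 to 4), matching the greedy layer-by-layer training order the text prescribes.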
In one embodiment, the predetermined number of ASR-RBM models are obtained by training in turn with the ASR speech features to be trained. The process of training the ASR-RBM models is an unsupervised learning method; compared with a supervised learning method (training on labeled voice data), it can retain the characteristics of the ASR speech features as much as possible while reducing the feature dimensionality, and when the sample data volume is large, it can reduce the large amount of time otherwise spent on labeling during data acquisition, clearly improving training efficiency.
In one embodiment, the ASR-DBN model is stacked from several ASR-RBM models, so each ASR-RBM model needs to be trained, where stacking means that the hidden layer of one ASR-RBM model serves as the visible layer of the next ASR-RBM model, and the output value of one ASR-RBM model is the input value of the next. During training, the next ASR-RBM model can be trained only after one ASR-RBM model has been fully trained, up to the last one. Each ASR-RBM model has two layers of neurons: a visible layer and a hidden layer. The visible layer is composed of visible units and is used to input the ASR speech features to be trained. Correspondingly, the hidden layer is composed of hidden units acting as feature detectors, able to produce the corresponding output values from the input values of the visible layer (the ASR speech features to be trained). The two layers of an ASR-RBM model (visible layer and hidden layer) are fully connected in both directions, but the neurons within the visible layer and within the hidden layer are not interconnected; only neurons of different layers have symmetric connections. Thus, given the values of all visible units, the values taken by the hidden units are mutually independent; likewise, given the hidden layer, the values of all visible units are also mutually independent. Consequently, there is no need to compute the neurons' values one at a time; a whole layer of visible-layer neurons or a whole layer of hidden-layer neurons can be computed simultaneously in parallel. In an ASR-RBM model, there is a weight W between any two connected neurons indicating the connection strength, each visible-layer neuron itself has a bias a, and each hidden-layer neuron itself has a bias b.
The process of training an ASR-RBM model updates the parameters W, a and b of the model. Specifically, in an ASR-RBM model, the probability that hidden-layer neuron h_j is activated is:
P(h_j = 1 | v) = σ(b_j + Σ_i W_{i,j}·v_i) -- formula (1)
Owing to the fully connected bidirectional relation, a visible-layer neuron v_i can likewise be activated by the hidden-layer neurons:
P(v_i = 1 | h) = σ(a_i + Σ_j W_{i,j}·h_j) -- formula (2)
In the formulas above, h denotes a hidden-layer neuron, v denotes a visible-layer neuron, i indexes the i-th visible-layer neuron, j indexes the j-th hidden-layer neuron, σ denotes the activation function, which can specifically be the sigmoid function, a denotes the bias of a visible-layer neuron, b denotes the bias of a hidden-layer neuron, W denotes the weights connecting visible-layer and hidden-layer neurons, and W_{i,j} is the weight connecting the i-th visible-layer neuron and the j-th hidden-layer neuron. After the ASR speech feature x to be trained is input to the visible layer, the ASR-RBM model calculates the probability P(h_j | x), j = 1, 2, ..., N_h, that each hidden-layer neuron is activated according to formula (1); a random number μ in [0, 1] is taken as a threshold (for example 0.5), and a hidden-layer neuron whose activation probability exceeds this threshold is activated, otherwise it is not activated, formulated as:
h_j = 1 if P(h_j = 1 | x) > μ, otherwise h_j = 0 -- formula (3)
From this, it can be determined whether each neuron of the hidden layer is activated. In addition, given the hidden layer, the calculation for the visible layer is similar, expressed as:
v_i = 1 if P(v_i = 1 | h) > μ, otherwise v_i = 0 -- formula (4)
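The activation probabilities of formulas (1) and (2) together with the threshold rule can be sketched as follows (illustrative dimensions; a fixed threshold of 0.5 is used, which the text gives as an example value of the random threshold μ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden(v, W, b):
    """Formula (1): probability each hidden neuron is activated given visible v."""
    return sigmoid(b + v @ W)

def p_visible(h, W, a):
    """Formula (2): probability each visible neuron is activated given hidden h."""
    return sigmoid(a + h @ W.T)

def sample(prob, mu=0.5):
    """Threshold rule: a neuron is activated (set to 1) when its
    activation probability exceeds the threshold mu."""
    return (prob > mu).astype(float)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 4))        # weights between 6 visible and 4 hidden neurons
a, b = np.zeros(6), np.zeros(4)    # visible biases a, hidden biases b
x = np.array([1., 0., 1., 1., 0., 1.])  # an ASR feature vector on the visible layer

h = sample(p_hidden(x, W, b))      # whether each hidden neuron is activated
v = sample(p_visible(h, W, a))     # visible layer inferred back from the hidden layer
```

Because the neurons within a layer are conditionally independent, each layer is computed in one vectorized step rather than neuron by neuron, exactly as the text notes.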
Since the training process produces errors, the weights and biases of the ASR-RBM must be updated according to the output values. Specifically, an error function is first built from the output values, which can specifically be a log error function; the weights and biases are then updated optimally based on that error function. For one sample of data, i.e. one ASR speech feature x to be trained, the weights and biases can be updated by the following steps:
(1) Input the ASR speech feature x to be trained into the original visible layer v1, and use formula (1) to calculate the probability P(h1 = 1 | v1) that each neuron in the original hidden layer h1 is activated;
(2) Use Gibbs sampling to draw a sample from the calculated probability distribution: h1 ~ P(h1 | v1);
(3) Use h1 to reconstruct the visible layer, obtaining the reconstructed visible layer v2, i.e. infer the visible layer back from the hidden layer, and use formula (2) to calculate the probability P(v2 | h1) that each neuron in the visible layer is activated;
(4) Likewise, use Gibbs sampling to draw a sample from the calculated probability distribution: v2 ~ P(v2 | h1);
(5) Use v2 to reconstruct the hidden layer, obtaining the reconstructed hidden layer h2, i.e. infer the hidden layer back from the visible layer, and use formula (1) to calculate the probability that each neuron in the hidden layer is activated, obtaining the probability distribution P(h2 | v2).
Then, following this method, the formula for updating the weights W is W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weight before the update, W' is the updated weight, and λ is the learning rate. The formula for updating the bias a is a' = a + λ(v1 − v2), where a is the bias before the visible-layer neuron update and a' is the updated visible-layer neuron bias; the formula for updating the bias b is b' = b + λ(h1 − h2), where b is the bias before the hidden-layer neuron update and b' is the updated hidden-layer neuron bias. After several rounds of training, the hidden layer can not only accurately represent the features of the visible layer but can also restore the visible layer; update training stops when the maximum number of training iterations is reached or the gradient change rate falls below a preset threshold. With the weight and bias update formulas above, the ASR-RBM models can be fully trained and each ASR-RBM model obtained; each trained ASR-RBM model retains the ASR speech features to be trained as much as possible while effectively reducing the feature dimensionality, and since it can represent the ASR speech features to be trained at a lower dimensionality, effective speech differentiation can be achieved from the features extracted by the ASR-RBM models.
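Steps (1)-(5) and the update formulas above can be sketched as one contrastive-divergence step (the outer product is the standard matrix reading of P(h1|v1)v1 − P(h2|v2)v2; dimensions, learning rate and the use of probabilities for h1, h2 in the bias update are illustrative choices, not fixed by the patent):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(x, W, a, b, lam=0.1, rng=np.random.default_rng(2)):
    """One update following steps (1)-(5) and the formulas
    W' = W + lam*(P(h1|v1)v1 - P(h2|v2)v2), a' = a + lam*(v1 - v2),
    b' = b + lam*(h1 - h2)."""
    v1 = x                                              # (1) original visible layer
    p_h1 = sigmoid(b + v1 @ W)                          #     P(h1=1|v1), formula (1)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)  # (2) Gibbs sample h1 ~ P(h1|v1)
    p_v2 = sigmoid(a + h1 @ W.T)                        # (3) reconstruct visible, formula (2)
    v2 = (rng.random(p_v2.shape) < p_v2).astype(float)  # (4) sample v2 ~ P(v2|h1)
    p_h2 = sigmoid(b + v2 @ W)                          # (5) P(h2|v2)
    W_new = W + lam * (np.outer(v1, p_h1) - np.outer(v2, p_h2))
    a_new = a + lam * (v1 - v2)
    b_new = b + lam * (p_h1 - p_h2)   # probabilities in place of sampled h1, h2
    return W_new, a_new, b_new

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(6, 4))   # 6 visible, 4 hidden neurons
a, b = np.zeros(6), np.zeros(4)
x = np.array([1., 1., 0., 0., 1., 0.])   # one ASR speech feature to be trained
W_upd, a_upd, b_upd = cd1_update(x, W, a, b)
```

Repeating this step over the training features, then stopping at the maximum iteration count or when the gradient change falls below the threshold, gives one fully trained ASR-RBM.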
S33: Obtain the output value of the last ASR-RBM model, perform parameter tuning based on that output value, and update the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
In one embodiment, after each ASR-RBM model has been trained, a back-propagation layer can additionally be built on the last ASR-RBM model; this layer carries out a supervised learning training method, using the output value of the last ASR-RBM model to tune the parameters (weights and biases). Using a supervised learning method here reduces the error produced by the training process and improves the recognition accuracy of the trained ASR-DBN model. Specifically, the labeled ASR speech features to be trained can be input at the visible layer of the last ASR-RBM model, and the output value of its hidden layer, i.e. the output value of the last ASR-RBM model, can be calculated by formula (3); a suitable error function is built from that output value, and according to the error function, parameter tuning is achieved with the BP (Back Propagation) algorithm. BP is the common algorithm for updating neural network parameters and is not described further here. It should be understood that each ASR-RBM model can only ensure that the weights and biases within its own layer are optimal for that layer's feature-vector mapping, not that the feature-vector mapping of the entire ASR-DBN model is optimal; the back-propagation layer propagates the error produced by the supervised training process downward through each ASR-RBM model, fine-tuning the entire ASR-DBN model. Building a back-propagation layer on the last ASR-RBM model allows effective tuning of the parameters (weights and biases) and further optimizes them, so that the trained ASR-DBN model possesses higher recognition accuracy.
Steps S31-S33 extract the ASR speech features to be trained, and train and obtain the ASR-DBN model from those features. Using an unsupervised learning method for each ASR-RBM model during training preserves the ASR speech features to be trained well while effectively reducing the feature dimensionality, and when the sample data volume is large, it reduces the large amount of time otherwise spent on labeling during data acquisition, clearly improving training efficiency. Furthermore, building a back-propagation layer on the last ASR-RBM model and tuning the parameters there with a supervised learning method reduces the error produced by the training process, improves the recognition accuracy of the ASR-DBN model, and achieves accurate speech differentiation.
In the speech differentiation method provided by this embodiment, the original voice data to be distinguished is first processed based on the voice activity detection algorithm (VAD) to obtain the target voice data to be distinguished: the original voice data to be distinguished is first differentiated once by the voice activity detection algorithm to obtain target voice data to be distinguished of smaller scope, which preliminarily and effectively removes interfering voice from the original voice data to be distinguished while retaining the portion of the original voice data to be distinguished that still mixes target voice and interfering voice, and that retained portion serves as the target voice data to be distinguished; this performs an effective preliminary speech differentiation on the original voice data to be distinguished and removes a large amount of interfering voice. Then, based on the target voice data to be distinguished, the corresponding ASR speech features are obtained; these ASR speech features make the result of speech differentiation more accurate, so that even under very noisy conditions, interfering voice (such as noise) and target voice can be accurately distinguished, providing an important technical premise for the subsequent ASR-DBN model recognition according to the ASR speech features. Finally, the ASR speech features are input into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result; the ASR-DBN model is a recognition model specially trained on ASR speech features for effectively differentiating speech, which can correctly distinguish target voice and interfering voice in the target voice data to be distinguished that mixes target voice and interfering voice (since VAD differentiation has already been used once, interfering voice here specifically refers to noise), improving the accuracy of speech differentiation.
It should be understood that the size of the serial numbers of the steps in the above embodiments does not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Fig. 8 shows a functional block diagram of a speech differentiation device in one-to-one correspondence with the speech differentiation method of the embodiment. As shown in Fig. 8, the speech differentiation device includes a target voice data to be distinguished acquisition module 10, a speech feature acquisition module 20 and a target differentiation result acquisition module 30. The functions realized by the target voice data to be distinguished acquisition module 10, the speech feature acquisition module 20 and the target differentiation result acquisition module 30 correspond one-to-one with the corresponding steps of the speech differentiation method in the embodiment; to avoid repetition, this embodiment does not describe them in detail one by one.
The target voice data to be distinguished acquisition module 10 is configured to process the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished.
The speech feature acquisition module 20 is configured to obtain the corresponding ASR speech features based on the target voice data to be distinguished.
The target differentiation result acquisition module 30 is configured to input the ASR speech features into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result.
Preferably, the target voice data to be distinguished acquisition module 10 includes a first original differentiation voice data acquisition unit 11, a second original differentiation voice data acquisition unit 12 and a target voice data to be distinguished acquisition unit 13.
The first original differentiation voice data acquisition unit 11 is configured to process the original voice data to be distinguished according to the short-time energy characteristic value calculation formula to obtain the corresponding short-time energy characteristic value, and to retain the original data to be distinguished whose short-time energy characteristic value is greater than a first threshold, determining it as the first original differentiation voice data, where the short-time energy characteristic value calculation formula is E = Σ_{n=1}^{N} s(n)², N is the voice frame length, s(n) is the signal amplitude in the time domain, and n is the time index.
The second original differentiation voice data acquisition unit 12 is configured to process the original voice data to be distinguished according to the zero-crossing rate characteristic value calculation formula to obtain the corresponding zero-crossing rate characteristic value, and to retain the original voice data to be distinguished whose zero-crossing rate characteristic value is less than a second threshold, determining it as the second original differentiation voice data, where the zero-crossing rate characteristic value calculation formula is Z = (1/2)·Σ_{n=2}^{N} |sgn(s(n)) − sgn(s(n−1))|, N is the voice frame length, s(n) is the signal amplitude in the time domain, and n is the time index.
The target voice data to be distinguished acquisition unit 13 is configured to take the first original differentiation voice data and the second original differentiation voice data as the target voice data to be distinguished.
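The two characteristic values and the retention rules of units 11-13 can be sketched as follows (thresholds and test frames are illustrative, and treating the combination of the two retained sets as an OR over frames is one reading of the module descriptions):

```python
import numpy as np

def short_time_energy(frame):
    """Short-time energy characteristic value: E = sum_n s(n)^2."""
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    """Zero-crossing rate characteristic value:
    Z = (1/2) * sum_n |sgn(s(n)) - sgn(s(n-1))|."""
    signs = np.sign(frame)
    return float(0.5 * np.sum(np.abs(signs[1:] - signs[:-1])))

def is_retained(frame, energy_threshold, zcr_threshold):
    """Keep a frame if its energy exceeds the first threshold (unit 11)
    or its zero-crossing rate is below the second threshold (unit 12)."""
    return (short_time_energy(frame) > energy_threshold
            or zero_crossing_rate(frame) < zcr_threshold)

voiced = 0.5 * np.sin(2 * np.pi * np.arange(160) / 80)  # low-frequency, high-energy frame
noisy = 0.01 * np.array([1.0, -1.0] * 80)               # low-energy, rapidly alternating frame

e1, z1 = short_time_energy(voiced), zero_crossing_rate(voiced)
e2, z2 = short_time_energy(noisy), zero_crossing_rate(noisy)
```

The voiced-like frame has high energy and few zero crossings, while the noise-like frame has low energy and many zero crossings, which is exactly the contrast the two characteristic values exploit.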
Preferably, the speech feature acquisition module 20 includes a pre-processed voice data acquisition unit 21, a power spectrum acquisition unit 22, a mel power spectrum acquisition unit 23 and a mel-frequency cepstrum coefficient unit 24.
The pre-processing unit 21 is configured to pre-process the target voice data to be distinguished to obtain pre-processed voice data.
The power spectrum acquisition unit 22 is configured to apply a Fast Fourier Transform to the pre-processed voice data to obtain the spectrum of the target voice data to be distinguished, and to obtain the power spectrum of the target voice data to be distinguished from the spectrum.
The mel power spectrum acquisition unit 23 is configured to process the power spectrum of the target voice data to be distinguished using the mel-scale filter bank to obtain the mel power spectrum of the target voice data to be distinguished.
The mel-frequency cepstrum coefficient unit 24 is configured to carry out cepstral analysis on the mel power spectrum to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
Preferably, the pre-processing unit 21 includes a pre-emphasis subunit 211, a framing subunit 212 and a windowing subunit 213.
The pre-emphasis subunit 211 is configured to apply pre-emphasis processing to the target voice data to be distinguished; the calculation formula of the pre-emphasis processing is s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the previous moment corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, a is the pre-emphasis coefficient, and the value range of a is 0.9 < a < 1.0.
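The pre-emphasis formula can be sketched as follows (a = 0.97 is a common choice within the stated range 0.9 < a < 1.0; passing the first sample through unchanged is an assumption, since the formula leaves s_{n−1} undefined there):

```python
import numpy as np

def preemphasis(signal, a=0.97):
    """Pre-emphasis s'_n = s_n - a * s_{n-1}; the first sample has no
    predecessor and is passed through unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

s = np.array([1.0, 1.0, 1.0, 1.0])   # a constant (purely low-frequency) signal
out = preemphasis(s)                  # high-pass effect: near-zero output
```

A constant signal is attenuated almost to zero while fast variations pass through, which is the high-frequency boost pre-emphasis is meant to provide.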
Framing subelement 212 is used for performing framing processing on the preemphasized target voice data to be distinguished.
Windowing subelement 213 is used for performing windowing processing on the framed target voice data to be distinguished to obtain the pretreatment voice data. The calculation formula of the windowing is s'(n) = s(n)·(0.54 − 0.46·cos(2πn/(N−1))), 0 ≤ n ≤ N−1 (a Hamming window), where N is the window length, n is the time index, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
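A sketch of the framing and windowing steps, assuming a Hamming window and illustrative frame-length and hop parameters (none of these specifics are fixed by the disclosure):

```python
import numpy as np

def frame_and_window(signal, frame_len, hop):
    """Split the pre-emphasized signal into frames of length N = frame_len,
    then apply s'(n) = s(n) * (0.54 - 0.46*cos(2*pi*n/(N-1))) to each frame
    (a Hamming window, assumed here as the windowing function)."""
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len] * window)
    return np.array(frames)
```

Overlapping frames (hop smaller than frame_len) are the usual choice so that the window taper does not lose signal energy at frame edges.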
Preferably, mel-frequency cepstrum coefficient unit 24 includes Mel power spectrum to be transformed obtaining subelement 241 and mel-frequency cepstrum coefficient subelement 242.
Mel power spectrum to be transformed obtaining subelement 241 is used for taking the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed.
Mel-frequency cepstrum coefficient subelement 242 is used for performing a discrete cosine transform on the Mel power spectrum to be transformed to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
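The chain from power spectrum to mel-frequency cepstrum coefficients (units 22-24 and subelements 241-242) can be sketched for a single windowed frame as follows; the sample rate, filter-bank size, coefficient count and log-flooring constant are illustrative assumptions:

```python
import numpy as np

def mfcc_from_frame(frame, sample_rate=16000, n_filters=26, n_ceps=13):
    """FFT -> power spectrum -> Mel filter bank -> log -> DCT,
    yielding mel-frequency cepstrum coefficients for one windowed frame."""
    N = len(frame)
    spectrum = np.fft.rfft(frame)
    power = (np.abs(spectrum) ** 2) / N                # power spectrum

    # Mel-scale filter bank: triangular filters equally spaced on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((N + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)

    mel_power = fbank @ power                          # Mel power spectrum
    log_mel = np.log(mel_power + 1e-10)                # Mel power spectrum to be transformed

    # Discrete cosine transform (DCT-II), keeping the first n_ceps coefficients.
    m = np.arange(n_filters)
    mfcc = np.array([np.sum(log_mel * np.cos(np.pi * j * (2 * m + 1) / (2 * n_filters)))
                     for j in range(n_ceps)])
    return mfcc
```

The small constant added before the logarithm only guards against empty filters; the disclosure itself does not specify it.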
Preferably, the speech differentiation device further includes ASR-DBN model acquisition module 40, which is used for obtaining the ASR-DBN model.
ASR-DBN model acquisition module 40 includes ASR phonetic features to be trained acquiring unit 41, ASR-RBM model acquiring unit 42 and tuning unit 43.
ASR phonetic features to be trained acquiring unit 41 is used for obtaining the voice data to be trained and extracting the ASR phonetic features to be trained of the voice data to be trained.
ASR-RBM model acquiring unit 42 is used for training the ASR-RBM models layer by layer using the ASR phonetic features to be trained, and updating the parameters according to the error generated by training to obtain each ASR-RBM model, wherein the parameters include weights and biases, and the biases include the bias of the visible-layer neurons and the bias of the hidden-layer neurons. The formula for updating the weights W is: W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weights before the update, W' is the updated weights, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer. The formula for updating the bias a is: a' = a + λ(v1 − v2), where a is the bias of the visible-layer neurons before the update and a' is the updated bias of the visible-layer neurons. The formula for updating the bias b is: b' = b + λ(h1 − h2), where b is the bias of the hidden-layer neurons before the update and b' is the updated bias of the hidden-layer neurons.
Tuning unit 43 is used for obtaining the output value of the last ASR-RBM model, performing parameter tuning based on that output value, and updating the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
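The ASR-RBM parameter updates can be sketched as a single contrastive-divergence step. The sigmoid activations, the sampling of the hidden layer, and the use of activation probabilities as h1 and h2 are standard RBM-training choices assumed here, not details fixed by the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v1, lam=0.1):
    """One contrastive-divergence (CD-1) step matching the update formulas:
    W' = W + lam * (P(h1|v1) v1^T - P(h2|v2) v2^T)
    a' = a + lam * (v1 - v2)   (visible-layer bias)
    b' = b + lam * (h1 - h2)   (hidden-layer bias)
    Shapes and the learning rate lam are illustrative."""
    h1 = sigmoid(W @ v1 + b)                       # P(h1 | v1), original hidden layer
    h1_sample = (rng.random(h1.shape) < h1).astype(float)
    v2 = sigmoid(W.T @ h1_sample + a)              # reconstructed visible layer
    h2 = sigmoid(W @ v2 + b)                       # P(h2 | v2), reconstructed hidden layer

    W_new = W + lam * (np.outer(h1, v1) - np.outer(h2, v2))
    a_new = a + lam * (v1 - v2)
    b_new = b + lam * (h1 - h2)
    return W_new, a_new, b_new
```

Stacking trained RBMs layer by layer, then back-propagating from the last layer's output, is the usual DBN pre-train/fine-tune pattern the tuning unit describes.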
The present embodiment provides a computer readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the speech differentiation method in the embodiment is realized; to avoid repetition, it is not described here again. Alternatively, when the computer program is executed by a processor, the functions of each module/unit of the speech differentiation device in the embodiment are realized; to avoid repetition, they are not described here again.
It is to be appreciated that the computer readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electric carrier signal, a telecommunication signal, and the like.
Fig. 9 is a schematic diagram of the computer equipment of the present embodiment. As shown in Fig. 9, computer equipment 50 includes processor 51, memory 52 and a computer program 53 stored in memory 52 and runnable on processor 51. When processor 51 executes computer program 53, each step of the speech differentiation method in the embodiment is realized, such as steps S10, S20 and S30 shown in Fig. 2. Alternatively, when processor 51 executes computer program 53, the functions of each module/unit of the speech differentiation device in the embodiment are realized, such as the functions of target voice data to be distinguished acquisition module 10, phonetic feature acquisition module 20 and target differentiation result acquisition module 30 shown in Fig. 8.
It is apparent to those skilled in the art that, for convenience and conciseness of description, the division of the above functional units and modules is only used as an example. In practical application, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may physically exist alone, or two or more units may be integrated in one unit. The above integrated unit may be realized either in the form of hardware or in the form of a software functional unit.
The embodiments described above are merely illustrative of the technical solutions of the present invention, rather than limiting them. Although the present invention is explained in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that the technical solutions recorded in the foregoing embodiments may still be modified, or some of the technical features may be equivalently replaced; these modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the various embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. A speech differentiation method, characterized in that it includes:
processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
obtaining corresponding ASR phonetic features based on the target voice data to be distinguished;
inputting the ASR phonetic features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
2. The speech differentiation method according to claim 1, characterized in that, before the step of inputting the ASR phonetic features into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result, the speech differentiation method further includes: obtaining the ASR-DBN model;
the step of obtaining the ASR-DBN model includes:
obtaining voice data to be trained, and extracting ASR phonetic features to be trained of the voice data to be trained;
training ASR-RBM models layer by layer using the ASR phonetic features to be trained, and updating parameters according to the error generated by training to obtain each ASR-RBM model, wherein the parameters include weights and biases, the biases include the bias of the visible-layer neurons and the bias of the hidden-layer neurons, and the formula for updating the weights W is: W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weights before the update, W' is the updated weights, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer; the formula for updating the bias a is: a' = a + λ(v1 − v2), where a is the bias of the visible-layer neurons before the update and a' is the updated bias of the visible-layer neurons; the formula for updating the bias b is: b' = b + λ(h1 − h2), where b is the bias of the hidden-layer neurons before the update and b' is the updated bias of the hidden-layer neurons;
obtaining the output value of the last ASR-RBM model, performing parameter tuning based on the output value, and updating the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
3. The speech differentiation method according to claim 1, characterized in that processing the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished includes:
processing the original voice data to be distinguished according to a short-time energy characteristic value calculation formula to obtain the corresponding short-time energy characteristic value, and retaining the original voice data to be distinguished whose short-time energy characteristic value is greater than a first threshold, determined as the first original differentiation voice data, the short-time energy characteristic value calculation formula being E = Σ_{n=0}^{N−1} s(n)², where N is the voice frame length, s(n) is the signal amplitude in the time domain, and n is the time index;
processing the original voice data to be distinguished according to a zero-crossing rate characteristic value calculation formula to obtain the corresponding zero-crossing rate characteristic value, and retaining the original voice data to be distinguished whose zero-crossing rate characteristic value is less than a second threshold, determined as the second original differentiation voice data, the zero-crossing rate characteristic value calculation formula being Z = (1/2)·Σ_{n=1}^{N−1} |sgn(s(n)) − sgn(s(n−1))|, where N is the voice frame length, s(n) is the signal amplitude in the time domain, and n is the time index;
taking the first original differentiation voice data and the second original differentiation voice data as the target voice data to be distinguished.
4. The speech differentiation method according to claim 1, characterized in that obtaining the corresponding ASR phonetic features based on the target voice data to be distinguished includes:
pre-processing the target voice data to be distinguished to obtain pretreatment voice data;
performing a Fast Fourier Transform (FFT) on the pretreatment voice data to obtain the frequency spectrum of the target voice data to be distinguished, and obtaining the power spectrum of the target voice data to be distinguished according to the frequency spectrum;
processing the power spectrum of the target voice data to be distinguished using a Mel-scale filter bank to obtain the Mel power spectrum of the target voice data to be distinguished;
performing cepstral analysis on the Mel power spectrum to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
5. The speech differentiation method according to claim 4, characterized in that pre-processing the target voice data to be distinguished to obtain the pretreatment voice data includes:
performing preemphasis processing on the target voice data to be distinguished, the calculation formula of the preemphasis processing being s'(n) = s(n) − a·s(n−1), where s(n) is the signal amplitude in the time domain, s(n−1) is the signal amplitude at the previous moment, s'(n) is the signal amplitude in the time domain after preemphasis, and a is the preemphasis factor with value range 0.9 < a < 1.0;
performing framing processing on the preemphasized target voice data to be distinguished;
performing windowing processing on the framed target voice data to be distinguished to obtain the pretreatment voice data, the calculation formula of the windowing being s'(n) = s(n)·(0.54 − 0.46·cos(2πn/(N−1))), 0 ≤ n ≤ N−1 (a Hamming window), where N is the window length, n is the time index, s(n) is the signal amplitude in the time domain, and s'(n) is the signal amplitude in the time domain after windowing.
6. The speech differentiation method according to claim 4, characterized in that performing cepstral analysis on the Mel power spectrum to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished includes:
taking the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed;
performing a discrete cosine transform on the Mel power spectrum to be transformed to obtain the mel-frequency cepstrum coefficients of the target voice data to be distinguished.
7. A speech differentiation device, characterized in that it includes:
a target voice data to be distinguished acquisition module, used for processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
a phonetic feature acquisition module, used for obtaining corresponding ASR phonetic features based on the target voice data to be distinguished;
a target differentiation result acquisition module, used for inputting the ASR phonetic features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
8. The speech differentiation device according to claim 7, characterized in that the speech differentiation device further includes an ASR-DBN model acquisition module, the ASR-DBN model acquisition module including:
an ASR phonetic features to be trained acquiring unit, used for obtaining voice data to be trained and extracting ASR phonetic features to be trained of the voice data to be trained;
an ASR-RBM model acquiring unit, used for training ASR-RBM models layer by layer using the ASR phonetic features to be trained, and updating parameters according to the error generated by training to obtain each ASR-RBM model, wherein the parameters include weights and biases, the biases include the bias of the visible-layer neurons and the bias of the hidden-layer neurons, the formula for updating the weights W being: W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weights before the update, W' is the updated weights, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer; the formula for updating the bias a being: a' = a + λ(v1 − v2), where a is the bias of the visible-layer neurons before the update and a' is the updated bias of the visible-layer neurons; the formula for updating the bias b being: b' = b + λ(h1 − h2), where b is the bias of the hidden-layer neurons before the update and b' is the updated bias of the hidden-layer neurons;
a tuning unit, used for obtaining the output value of the last ASR-RBM model, performing parameter tuning based on the output value, and updating the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
9. A computer equipment, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, realizes the steps of the speech differentiation method according to any one of claims 1 to 6.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, realizes the steps of the speech differentiation method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561789.6A CN108806725A (en) | 2018-06-04 | 2018-06-04 | Speech differentiation method, apparatus, computer equipment and storage medium |
PCT/CN2018/094342 WO2019232867A1 (en) | 2018-06-04 | 2018-07-03 | Voice discrimination method and apparatus, and computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561789.6A CN108806725A (en) | 2018-06-04 | 2018-06-04 | Speech differentiation method, apparatus, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108806725A true CN108806725A (en) | 2018-11-13 |
Family
ID=64090244
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810561789.6A Pending CN108806725A (en) | 2018-06-04 | 2018-06-04 | Speech differentiation method, apparatus, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108806725A (en) |
WO (1) | WO2019232867A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109739112A (en) * | 2018-12-29 | 2019-05-10 | 张卫校 | A kind of wobble objects control method and wobble objects |
CN110047510A (en) * | 2019-04-15 | 2019-07-23 | 北京达佳互联信息技术有限公司 | Audio identification methods, device, computer equipment and storage medium |
WO2019232867A1 (en) * | 2018-06-04 | 2019-12-12 | 平安科技(深圳)有限公司 | Voice discrimination method and apparatus, and computer device, and storage medium |
CN112652324A (en) * | 2020-12-28 | 2021-04-13 | 深圳万兴软件有限公司 | Speech enhancement optimization method, speech enhancement optimization system and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | Dalian University of Technology | Speaker recognition method based on deep learning |
KR101561651B1 (en) * | 2014-05-23 | 2015-11-02 | Sogang University Industry-Academic Cooperation Foundation | Interest detecting method and apparatus based on feature data of voice signal using Deep Belief Network, recording medium recording program of the method |
CN107358966A (en) * | 2017-06-27 | 2017-11-17 | Beijing Institute of Technology | No-reference objective speech quality evaluation method based on deep-learning speech enhancement |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100580770C (en) * | 2005-08-08 | 2010-01-13 | Institute of Acoustics, Chinese Academy of Sciences | Voice end detection method based on energy and harmonic |
US10229700B2 (en) * | 2015-09-24 | 2019-03-12 | Google Llc | Voice activity detection |
CN108806725A (en) * | 2018-06-04 | 2018-11-13 | Ping An Technology (Shenzhen) Co., Ltd. | Speech differentiation method, apparatus, computer equipment and storage medium |
-
2018
- 2018-06-04 CN CN201810561789.6A patent/CN108806725A/en active Pending
- 2018-07-03 WO PCT/CN2018/094342 patent/WO2019232867A1/en active Application Filing
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101561651B1 (en) * | 2014-05-23 | 2015-11-02 | Sogang University Industry-Academic Cooperation Foundation | Interest detecting method and apparatus based on feature data of voice signal using Deep Belief Network, recording medium recording program of the method |
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | Dalian University of Technology | Speaker recognition method based on deep learning |
CN107358966A (en) * | 2017-06-27 | 2017-11-17 | Beijing Institute of Technology | No-reference objective speech quality evaluation method based on deep-learning speech enhancement |
Non-Patent Citations (4)
Title |
---|
XIAO-LEI ZHANG: "Deep Belief Networks Based Voice Activity Detection", IEEE Transactions on Audio, Speech, and Language Processing, 30 April 2013 (2013-04-30), pages 697 - 709 * |
SONG Zhiyong: "MATLAB Speech Signal Analysis and Synthesis", 30 January 2018, pages: 118 * |
ZHANG Xueying: "Fundamentals of Digital Signal Processing", Xidian University Press, pages: 212 - 214 * |
HE Hongbing: "Radar Target Recognition: Principles and Experimental Techniques", 30 December 2017, pages: 132 * |
Also Published As
Publication number | Publication date |
---|---|
WO2019232867A1 (en) | 2019-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110827801B (en) | Automatic voice recognition method and system based on artificial intelligence | |
CN108922513A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN110428842A (en) | Speech model training method, device, equipment and computer readable storage medium | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN113012720B (en) | Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
CN105023580A (en) | Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
CN109036470B (en) | Voice distinguishing method, device, computer equipment and storage medium | |
CN111341319B (en) | Audio scene identification method and system based on local texture features | |
Mallidi et al. | Novel neural network based fusion for multistream ASR | |
CN111128209A (en) | Speech enhancement method based on mixed masking learning target | |
CN111951796A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN111883181A (en) | Audio detection method and device, storage medium and electronic device | |
CN110136726A (en) | A kind of estimation method, device, system and the storage medium of voice gender | |
Thomas et al. | Acoustic and data-driven features for robust speech activity detection | |
Hasan et al. | Preprocessing of continuous bengali speech for feature extraction | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
Sekkate et al. | Speaker identification for OFDM-based aeronautical communication system | |
CN112735466A (en) | Audio detection method and device | |
KR20190135916A (en) | Apparatus and method for determining user stress using speech signal | |
CN114613387A (en) | Voice separation method and device, electronic equipment and storage medium | |
CN114724589A (en) | Voice quality inspection method and device, electronic equipment and storage medium | |
CN113744715A (en) | Vocoder speech synthesis method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181113 |