CN108922513A - Speech differentiation method, apparatus, computer equipment and storage medium - Google Patents
Speech differentiation method, apparatus, computer equipment and storage medium
- Publication number: CN108922513A
- Application number: CN201810561788.1A
- Authority
- CN
- China
- Prior art keywords
- voice data
- distinguished
- asr
- indicate
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L25/84 — Detection of presence or absence of voice signals for discriminating voice from noise
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L25/18 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/21 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/78 — Detection of presence or absence of voice signals
Abstract
The invention discloses a speech differentiation method, apparatus, computer equipment and storage medium. The speech differentiation method includes: processing original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished; obtaining corresponding ASR speech features based on the target voice data to be distinguished; and inputting the ASR speech features into a pre-trained ASR-RNN model for differentiation to obtain a target differentiation result. The method distinguishes target speech from interference speech well, and still performs accurate speech differentiation when the voice data is heavily contaminated by noise.
Description
Technical field
The present invention relates to the field of speech processing, and in particular to a speech differentiation method, apparatus, computer equipment and storage medium.
Background technique
Speech differentiation refers to screening out silence from input speech so that only the segments meaningful for recognition (i.e. the target speech) are retained. Current speech differentiation methods have significant shortcomings, especially in the presence of noise: as the noise becomes stronger, differentiation becomes harder and target speech can no longer be separated accurately from interference speech, so the differentiation result is unsatisfactory.
Summary of the invention
The embodiments of the present invention provide a speech differentiation method, apparatus, computer equipment and storage medium to solve the problem of unsatisfactory speech differentiation.
An embodiment of the present invention provides a speech differentiation method, including:
processing original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished;
obtaining corresponding ASR speech features based on the target voice data to be distinguished;
inputting the ASR speech features into a pre-trained ASR-RNN model for differentiation to obtain a target differentiation result.
An embodiment of the present invention provides a speech differentiation apparatus, including:
a target voice data acquisition module, configured to process original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished;
a speech feature acquisition module, configured to obtain corresponding ASR speech features based on the target voice data to be distinguished;
a target differentiation result acquisition module, configured to input the ASR speech features into a pre-trained ASR-RNN model for differentiation to obtain a target differentiation result.
An embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the speech differentiation method when executing the computer program.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the speech differentiation method when executed by a processor.
In the speech differentiation method, apparatus, computer equipment and storage medium provided by the embodiments of the present invention, the original voice data to be distinguished is first processed with a voice activity detection algorithm to obtain target voice data to be distinguished. Because the original voice data is differentiated once by the voice activity detection algorithm, the resulting target voice data covers a smaller range and non-speech is preliminarily and effectively removed. ASR speech features are then obtained from the target voice data to be distinguished, which provides the technical basis for the subsequent recognition by the corresponding ASR-RNN model. Finally, the ASR speech features are input into a pre-trained ASR-RNN model for differentiation to obtain the target differentiation result. Since the ASR-RNN model is a recognition model specially trained on ASR speech features and on the temporal characteristics of speech, it can correctly separate target speech from interference speech in the target voice data to be distinguished, improving the accuracy of speech differentiation.
Detailed description of the invention
In order to explain the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is an application environment diagram of the speech differentiation method in an embodiment of the invention;
Fig. 2 is a flow chart of the speech differentiation method in an embodiment of the invention;
Fig. 3 is a detailed flow chart of step S10 in Fig. 2;
Fig. 4 is a detailed flow chart of step S20 in Fig. 2;
Fig. 5 is a detailed flow chart of step S21 in Fig. 4;
Fig. 6 is a detailed flow chart of step S24 in Fig. 4;
Fig. 7 is a detailed flow chart of the step preceding step S30 in Fig. 2;
Fig. 8 is a schematic diagram of the speech differentiation apparatus in an embodiment of the invention;
Fig. 9 is a schematic diagram of the computer device in an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows the application environment of the speech differentiation method provided by an embodiment of the present invention. The application environment includes a server side and a client, which are connected through a network. The client is a device capable of human-computer interaction with a user, including but not limited to computers, smart phones and tablets. The server side may be implemented as an independent server or as a server cluster composed of multiple servers. The speech differentiation method provided by the embodiments of the present invention is applied to the server side.
As shown in Fig. 2, the speech differentiation method in this embodiment includes the following steps:
S10: Process the original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished.
Voice activity detection (VAD) aims to identify and eliminate long silent periods from an audio signal stream so as to save bandwidth without degrading quality of service, reduce end-to-end delay and improve user experience. A voice activity detection algorithm (VAD algorithm) is the specific algorithm used for voice activity detection, and many such algorithms exist. It should be understood that VAD can be applied to speech differentiation to distinguish target speech from interference speech. Target speech refers to the portion of the voice data in which the voiceprint changes continuously and obviously; interference speech may be the silent, non-voiced portion of the voice data, or environmental noise. The original voice data to be distinguished is the earliest voice data obtained, i.e. the voice data on which the VAD algorithm is to perform a preliminary differentiation. The target voice data to be distinguished is the voice data obtained after the original voice data has been processed by the voice activity detection algorithm and used for further speech differentiation.
In this embodiment, the VAD algorithm is used to process the original voice data to be distinguished, preliminarily screening out target speech and interference speech, and the target speech portion obtained by this preliminary screening is taken as the target voice data to be distinguished. The interference speech removed by the preliminary screening does not need to be differentiated again, which improves the efficiency of speech differentiation. However, the target speech obtained by the preliminary screening may still contain interference speech; in particular, when the noise in the original voice data is strong, the preliminarily screened target speech mixes in a large amount of interference speech (such as noise), and the VAD algorithm alone clearly cannot differentiate the speech effectively. Therefore the preliminarily screened target speech that still mixes interference speech is taken as the target voice data to be distinguished and subjected to a more accurate differentiation. Performing a preliminary speech differentiation on the original voice data with the VAD algorithm repartitions the original voice data, removes a large amount of interference speech, and facilitates the subsequent, finer speech differentiation.
In a specific embodiment, as shown in Fig. 3, step S10 of processing the original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished includes the following steps:
S11: Process the original voice data to be distinguished according to a short-time energy feature value calculation formula to obtain the corresponding short-time energy feature values, retain the original voice data whose short-time energy feature value is greater than a first threshold, and take it as first original differentiation voice data. The short-time energy feature value is calculated as E = Σ_{n=1}^{N} s(n)², where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
The short-time energy feature value describes the energy of one frame of speech (a frame usually spans 10-30 ms) in the time domain; "short-time" refers to the duration of one frame (i.e. the speech frame length). Because the short-time energy feature value of target speech is much higher than that of interference speech (silence), it can be used to distinguish target speech from interference speech.
In this embodiment, the original voice data to be distinguished is processed according to the short-time energy formula (the original voice data must first be divided into frames), the short-time energy feature value of each frame is calculated and compared with the preset first threshold, and the frames whose value is greater than the first threshold are retained and taken as the first original differentiation voice data. The first threshold is the cut-off value used to decide whether a short-time energy feature value belongs to target speech or interference speech. By comparing the short-time energy feature value with the first threshold, the target speech in the original voice data can be preliminarily distinguished from the perspective of short-time energy, and a large amount of interference speech in the original voice data is effectively removed.
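The following is a minimal NumPy sketch of the short-time energy screening described in step S11. The 16 kHz sample rate, the frame and hop lengths, the function names and the threshold choice are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=200):
    """Split a 1-D signal into overlapping frames (25 ms frames, 12.5 ms hop at an
    assumed 16 kHz rate); assumes len(signal) >= frame_len."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """Per-frame energy: E = sum_n s(n)^2."""
    return np.sum(frames.astype(np.float64) ** 2, axis=1)

signal = np.random.randn(16000)            # stand-in for one second of audio
frames = frame_signal(signal)
energy = short_time_energy(frames)
first_threshold = 0.5 * np.mean(energy)    # illustrative choice, not specified by the patent
first_selection = frames[energy > first_threshold]
```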
S12: Process the original voice data to be distinguished according to a zero-crossing rate feature value calculation formula to obtain the corresponding zero-crossing rate feature values, retain the original voice data whose zero-crossing rate feature value is less than a second threshold, and take it as second original differentiation voice data. The zero-crossing rate feature value is calculated as Z = (1/2)·Σ_{n=2}^{N} |sgn(s(n)) − sgn(s(n−1))|, where N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is time.
The zero-crossing rate feature value describes how many times the speech waveform crosses the horizontal axis (zero level) within one frame. Because the zero-crossing rate feature value of target speech is much lower than that of interference speech, it can be used to distinguish target speech from interference speech.
In this embodiment, the original voice data to be distinguished is processed according to the zero-crossing rate formula, the zero-crossing rate feature value of each frame is calculated and compared with the preset second threshold, and the frames whose value is less than the second threshold are retained and taken as the second original differentiation voice data. The second threshold is the cut-off value used to decide whether a zero-crossing rate feature value belongs to target speech or interference speech. By comparing the zero-crossing rate feature value with the second threshold, the target speech in the original voice data can be preliminarily distinguished from the perspective of the zero-crossing rate, and a large amount of interference speech in the original voice data is effectively removed.
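A corresponding sketch of the zero-crossing-rate screening in step S12, under the same assumptions as the previous block; `frames` denotes the frame matrix produced by `frame_signal()` above, and the threshold is again an illustrative choice rather than a value from the patent.

```python
import numpy as np

def zero_crossing_rate(frames):
    """Per-frame crossing count: Z = 1/2 * sum_n |sgn(s(n)) - sgn(s(n-1))|."""
    signs = np.sign(frames)
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

# zcr = zero_crossing_rate(frames)
# second_threshold = 1.5 * np.mean(zcr)          # illustrative choice
# second_selection = frames[zcr < second_threshold]
```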
S13: Take the first original differentiation voice data and the second original differentiation voice data as the target voice data to be distinguished.
In this embodiment, the first original differentiation voice data is obtained from the original voice data from the perspective of short-time energy, while the second original differentiation voice data is obtained from the perspective of the zero-crossing rate. Since the two start from different criteria that each distinguish speech well, the first and second original differentiation voice data are merged (by taking their intersection) and used together as the target voice data to be distinguished.
Steps S11-S13 preliminarily and effectively remove most of the interference voice data from the original voice data, keeping the portion that mixes target speech with a small amount of interference speech (such as noise). Taking this portion as the target voice data to be distinguished constitutes an effective preliminary differentiation of the original voice data.
S20: Obtain corresponding ASR speech features based on the target voice data to be distinguished.
ASR (Automatic Speech Recognition) is the technology of converting voice data into computer-readable input, for example into keys, binary codes or character strings. Speech features can be extracted from the target voice data with ASR techniques; the extracted features are the corresponding ASR speech features. In other words, ASR converts voice data that a computer cannot read directly into ASR speech features that it can read, and these features can be represented as vectors.
In this embodiment, the target voice data to be distinguished is processed with ASR techniques to obtain the corresponding ASR speech features. These features reflect the latent characteristics of the target voice data well, so the target voice data can be differentiated according to them; this provides an important technical premise for the subsequent recognition by the corresponding ASR-RNN model (RNN: recurrent neural network) based on the ASR speech features.
In a specific embodiment, as shown in Fig. 4, step S20 of obtaining the corresponding ASR speech features based on the target voice data to be distinguished includes the following steps:
S21: Preprocess the target voice data to be distinguished to obtain preprocessed voice data.
In this embodiment, the target voice data to be distinguished is preprocessed to obtain the corresponding preprocessed voice data. Preprocessing makes it possible to extract ASR speech features that better represent the target voice data, so that speech differentiation can be carried out with those features.
In a specific embodiment, as shown in Fig. 5, step S21 of preprocessing the target voice data to be distinguished to obtain preprocessed voice data includes the following steps:
S211: Apply pre-emphasis to the target voice data to be distinguished. The pre-emphasis is calculated as s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the previous moment, s'_n is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with 0.9 < a < 1.0.
Pre-emphasis is a signal processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is strongly attenuated during transmission; to let the receiving end obtain a reasonably good waveform, the attenuated signal must be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the transmission line so as to compensate their excessive attenuation during transmission. Pre-emphasis does not affect the noise, so it effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the target voice data with the formula s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude of the voice data in the time domain, s_{n−1} is the signal amplitude at the previous moment, s'_n is the pre-emphasized amplitude, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; a value of 0.97 gives a relatively good pre-emphasis effect. Pre-emphasis removes interference caused by the vocal cords and lips during speech production, effectively compensates the suppressed high-frequency part of the target voice data, highlights the high-frequency formants and strengthens the signal amplitude of the target voice data, which helps extract the ASR speech features.
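A one-function sketch of the pre-emphasis in step S211, assuming the signal is a NumPy array; the default a = 0.97 follows the value suggested in the text, and the function name is illustrative.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """s'(n) = s(n) - a * s(n-1); the first sample is kept unchanged."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```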
S212: Divide the pre-emphasized target voice data to be distinguished into frames.
In this embodiment, after pre-emphasis, the target voice data is divided into frames. Framing is the speech processing technique of cutting the whole speech signal into several segments; each frame spans 10-30 ms, and the frame shift is typically half the frame length. The frame shift is the overlapping region between two adjacent frames and avoids excessive change between them. Framing divides the target voice data into short segments, which facilitates the extraction of the ASR speech features.
S213: Apply a window to each frame of the target voice data to be distinguished to obtain the preprocessed voice data. The windowing is calculated as s'_n = s_n × (0.54 − 0.46·cos(2πn/(N−1))), where N is the window length, n is time, s_n is the signal amplitude in the time domain and s'_n is the windowed signal amplitude in the time domain.
In this embodiment, after the target voice data has been framed, discontinuities appear at the beginning and end of each frame, so the more frames there are, the larger the error with respect to the target voice data. Windowing solves this problem: it makes the framed voice data continuous and lets every frame exhibit the characteristics of a periodic function. Windowing means processing the target voice data with a window function; a Hamming window can be chosen, giving the formula s'_n = s_n × (0.54 − 0.46·cos(2πn/(N−1))), where N is the Hamming window length, n is time, s_n is the signal amplitude in the time domain and s'_n is the windowed amplitude. Windowing the target voice data yields the preprocessed voice data and makes the framed time-domain signal continuous, which facilitates extracting the ASR speech features of the target voice data.
The preprocessing of the target voice data in steps S211-S213 lays the foundation for extracting the ASR speech features of the target voice data, so that the extracted features are more representative of it and speech differentiation can be carried out with them.
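Below is a minimal sketch of the framing (S212) and Hamming windowing (S213) steps. The 16 kHz rate, 25 ms frame length and 12.5 ms frame shift are illustrative assumptions consistent with, but not dictated by, the 10-30 ms range and half-frame shift mentioned above.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=200):
    """Cut the signal into overlapping frames, then multiply each frame by a Hamming
    window 0.54 - 0.46*cos(2*pi*n/(N-1)); assumes len(signal) >= frame_len."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window
```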
S22: Apply a fast Fourier transform to the preprocessed voice data to obtain the spectrum of the target voice data to be distinguished, and obtain the power spectrum of the target voice data from the spectrum.
The fast Fourier transform (FFT) is the collective name for the efficient, fast algorithms for computing the discrete Fourier transform on a computer. It greatly reduces the number of multiplications required; the more sampling points are transformed, the more significant the savings in computation.
In this embodiment, a fast Fourier transform converts the preprocessed voice data from its signal amplitude in the time domain to its signal amplitude in the frequency domain (the spectrum). The spectrum is calculated as s(k) = Σ_{n=1}^{N} s(n)·e^{−2πi·kn/N}, 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is time and i is the imaginary unit. Once the spectrum of the preprocessed voice data is obtained, its power spectrum can be derived directly; in the following it is called the power spectrum of the target voice data to be distinguished. The power spectrum is calculated as P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the preprocessed voice data from the time domain to the frequency domain and then obtaining its power spectrum provides an important technical basis for extracting the ASR speech features from the power spectrum of the target voice data.
S23: Process the power spectrum of the target voice data to be distinguished with a mel-scale filter bank to obtain the mel power spectrum of the target voice data.
Processing the power spectrum with a mel-scale filter bank performs a mel-frequency analysis of the power spectrum, and mel-frequency analysis is based on human auditory perception. Observation shows that the human ear behaves like a filter bank, focusing only on certain frequency components (human hearing is frequency-selective): it lets signals of certain frequencies pass and simply ignores frequencies it does not wish to perceive. These filters, however, are not uniformly distributed on the frequency axis: there are many densely spaced filters in the low-frequency region, while in the high-frequency region the filters become few and sparsely distributed. In other words, the mel-scale filter bank has high resolution at low frequencies, which matches the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
In this embodiment, the power spectrum of the target voice data is processed with a mel-scale filter bank to obtain its mel power spectrum: the filter bank slices the frequency-domain signal so that each frequency band finally corresponds to one value. If the number of filters is 22, the mel power spectrum of the target voice data consists of 22 energy values. Performing mel-frequency analysis on the power spectrum of the target voice data makes the resulting mel power spectrum retain the frequency portions closely related to the characteristics of the human ear, so it reflects the features of the target voice data well.
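The following sketch builds a triangular mel-scale filter bank and shows how it would be applied to the power spectrum from step S22. The 22 filters follow the example given in the text, while the sample rate, FFT size and function names are assumptions made for illustration.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, sample_rate=16000):
    """Triangular filters evenly spaced on the mel scale, dense at low frequencies."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

# mel power spectrum: one energy value per filter and frame
# mel_power = power_spectrum(windowed_frames) @ mel_filterbank().T
```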
S24: Perform cepstral analysis on the mel power spectrum to obtain the mel-frequency cepstral coefficients of the target voice data to be distinguished.
The cepstrum is the inverse Fourier transform of the logarithm of the Fourier spectrum of a signal; since the ordinary Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
In this embodiment, cepstral analysis is performed on the mel power spectrum and the mel-frequency cepstral coefficients of the target voice data are obtained from the cepstral result. The cepstral analysis converts the features contained in the mel power spectrum, whose dimensionality would otherwise be too high to use directly, into an easy-to-use feature (a mel-frequency cepstral coefficient feature vector suitable for training or recognition). These mel-frequency cepstral coefficients serve as the ASR speech features; they reflect the differences between speech signals and can therefore be used to recognize and distinguish the target voice data to be distinguished.
In a specific embodiment, as shown in Fig. 6, step S24 of performing cepstral analysis on the mel power spectrum to obtain the mel-frequency cepstral coefficients of the target voice data to be distinguished includes the following steps:
S241: Take the logarithm of the mel power spectrum to obtain the mel power spectrum to be transformed.
In this embodiment, following the definition of the cepstrum, the logarithm log of the mel power spectrum is taken, yielding the mel power spectrum to be transformed, m.
S242: Apply a discrete cosine transform to the mel power spectrum to be transformed to obtain the mel-frequency cepstral coefficients of the target voice data to be distinguished.
In this embodiment, a discrete cosine transform (DCT) is applied to the mel power spectrum m to be transformed, giving the mel-frequency cepstral coefficients of the target voice data; the 2nd to 13th coefficients are usually kept as the ASR speech features, which reflect the differences between voice data. The discrete cosine transform of m is C(i) = Σ_{j=1}^{N} m(j)·cos(π·i·(j − 0.5)/N), where N is the frame length, m is the mel power spectrum to be transformed, and j is its independent variable. Because the mel filters overlap, the energy values obtained with the mel-scale filter bank are correlated; the discrete cosine transform compresses and abstracts the mel power spectrum m to be transformed into the corresponding ASR speech features, and compared with the Fourier transform its result has no imaginary part, which is a clear computational advantage.
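A compact sketch of steps S241-S242: take the logarithm of the mel power spectrum and apply a DCT, keeping the 2nd to 13th coefficients as the ASR speech features. The 1e-10 offset and the helper name are illustrative assumptions.

```python
import numpy as np

def mfcc_from_mel_power(mel_power, n_coeffs=12):
    """Log of the mel power spectrum followed by C(i) = sum_j m(j)*cos(pi*i*(j-0.5)/N)."""
    log_mel = np.log(mel_power + 1e-10)              # small offset avoids log(0)
    n = log_mel.shape[1]
    j = np.arange(1, n + 1)                          # 1-based index as in the formula
    basis = np.array([np.cos(np.pi * i * (j - 0.5) / n) for i in range(n)])
    cepstra = log_mel @ basis.T
    return cepstra[:, 1:1 + n_coeffs]                # drop C(0), keep the next 12
```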
Steps S21-S24 perform ASR-based feature extraction on the target voice data to be distinguished, and the resulting ASR speech features represent the target voice data well. These features can be used to train a deep network and obtain the ASR-RNN model, which makes the differentiation result of the trained model more precise, so that noise and speech can be distinguished accurately even under very noisy conditions.
It should be noted that the features extracted above are mel-frequency cepstral coefficients; however, the ASR speech features should not be limited to mel-frequency cepstral coefficients alone. They should be understood as any speech features obtained with ASR technology: any feature that effectively reflects the characteristics of the voice data can be used as an ASR speech feature for recognition and model training.
S30: Input the ASR speech features into the pre-trained ASR-RNN model for differentiation to obtain the target differentiation result.
The ASR-RNN model is the recurrent neural network (RNN) model obtained by training with ASR speech features. Because it is trained with the ASR speech features extracted from the training voice data, the model can recognize ASR speech features and thus differentiate speech according to them. Specifically, the training voice data includes target speech and noise; when training the ASR-RNN model, the ASR speech features of both the target speech and the noise are extracted, so that the trained ASR-RNN model recognizes, from the ASR speech features, the target speech and the interference speech in the form of noise. (The VAD differentiation of the original voice data has already eliminated most of the interference speech, such as the silent, non-voiced parts of the voice data and part of the noise, so the interference speech distinguished here by the ASR-RNN model refers specifically to noise components.) This achieves the purpose of effectively distinguishing target speech from interference speech.
In this embodiment, the ASR speech features are input into the pre-trained ASR-RNN model for differentiation. Because the ASR speech features reflect the characteristics of the voice data, the ASR-RNN model can recognize the ASR speech features extracted from the target voice data to be distinguished and thereby make an accurate speech differentiation. The pre-trained ASR-RNN model combines the ASR speech features with a recurrent neural network that extracts deep characteristics from those features, and differentiates speech from the ASR speech features of the voice data; it still achieves a very high accuracy even under severe noise conditions. In particular, since the extracted ASR features also contain the ASR speech features of noise, the ASR-RNN model can distinguish the noise accurately, which solves the problem that current speech differentiation methods (including but not limited to VAD) cannot differentiate speech effectively under strong noise.
In a specific embodiment, before step S30 of inputting the ASR speech features into the pre-trained ASR-RNN model for differentiation to obtain the target differentiation result, the speech differentiation method further includes the step of obtaining the ASR-RNN model.
As shown in Fig. 7, the step of obtaining the ASR-RNN model specifically includes:
S31: Obtain training voice data and extract the ASR speech features to be trained from the training voice data.
The training voice data is the voice data training sample set needed to train the ASR-RNN model; it may be an open-source speech training set used directly, or a training set built by collecting a large number of sample voice recordings. The training voice data has its target speech and interference speech (here specifically noise) distinguished well in advance, for example by assigning different label values to target speech and noise: the target speech parts of the training voice data are uniformly labeled 1 (representing "true") and the noise parts are uniformly labeled 0 (representing "false"). The pre-set label values make it possible to check the recognition accuracy of the ASR-RNN model, provide a reference for improvement, update the network parameters of the ASR-RNN model and continuously optimize it. In this embodiment, the ratio of target speech to noise may be 1:1; this ratio avoids the overfitting that can occur when the amounts of target speech and noise in the training voice data differ. (Overfitting refers to the phenomenon where, in order to obtain a consistent hypothesis, the hypothesis is made overly strict; avoiding overfitting is a core task in classifier design.)
In this embodiment, the training voice data is obtained and its features, i.e. the ASR speech features to be trained, are extracted; the extraction steps are the same as steps S21-S24 and are not repeated here. The training voice data includes training samples of target speech and training samples of noise, each with its own ASR speech features, so the corresponding ASR-RNN model can be trained with the ASR speech features to be trained, and the model obtained from this training can accurately distinguish target speech from noise (noise being a kind of interference speech).
S32: Initialize the RNN model.
The RNN model is a recurrent neural network model consisting of an input layer, a hidden layer and an output layer, each made of neurons. It includes the weights and biases connecting the neurons between layers, and these weights and biases determine the properties and recognition performance of the RNN model. Compared with traditional neural networks such as the DNN (deep neural network), an RNN is a neural network that models sequential data (such as time series): the current output of a sequence is related to the outputs that came before it. Concretely, the network remembers the previous hidden-layer state and applies it to the computation of the current output; the nodes of the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. Since voice data has temporal characteristics, training the RNN model with the training voice data can accurately extract the deep temporal features of both target speech and interference speech and achieve accurate speech differentiation.
In this embodiment, initializing the RNN model means setting initial values for the weights and biases in the RNN model. The initial values can be set to small values, for example within the interval [−0.3, 0.3]. Reasonable initialization gives the model good adjustment flexibility in the early stage, so that it can be adjusted effectively during training; a poor adjustment capability at the initial stage would lead to a trained model with poor differentiation performance.
S33: Input the ASR speech features to be trained into the RNN model and obtain the output value of the RNN model according to the forward propagation algorithm. The output value is expressed as ŷ_t = σ(V·h_t + c), where σ is the activation function, V is the weight connecting the hidden layer and the output layer, h_t is the hidden state at time t, and c is the bias between the hidden layer and the output layer.
In this embodiment, forward propagation in the RNN consists of a series of linear operations and activation operations carried out, in time order, on the input ASR speech features to be trained according to the weights and biases connecting the neurons of the RNN model, yielding the output value of each layer of the network. In particular, since an RNN models sequential data (here, time series), computing the hidden state h_t of the hidden layer at time t requires both the hidden state h_{t−1} of the hidden layer at time t−1 and the ASR speech feature input at time t. From the forward propagation process the forward propagation algorithm of the RNN model is obtained: for any time t, the output of the hidden layer (i.e. the hidden state h_t) computed from the input ASR speech features is h_t = σ(U·x_t + W·h_{t−1} + b), where σ is the activation function (here specifically tanh, which over the recurrence continuously amplifies the differences between the ASR speech features to be trained and thus helps distinguish target speech from noise), U is the weight connecting the input layer to the hidden layer, W is the weight connecting the hidden layer to itself across time steps, h_{t−1} is the hidden state at time t−1, and b is the bias between the input layer and the hidden layer. The output of the output layer (i.e. the output value of the RNN model) computed from the hidden layer is ŷ_t = σ(V·h_t + c), where the activation function σ can specifically be the softmax function (which works well for classification problems), V is the weight connecting the hidden layer and the output layer, h_t is the hidden state at time t, and c is the bias between the hidden layer and the output layer. The output value ŷ_t of the RNN model, computed layer by layer in sequence by the forward propagation algorithm, can be called the predicted output value. After the server obtains the output value of the RNN model, the network parameters (weights and biases) of the RNN model can be updated and adjusted according to it, so that the obtained RNN model differentiates according to the temporal characteristics of speech and derives an accurate recognition result from the differences, over time, between the ASR speech features of target speech and those of interference speech.
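A minimal NumPy sketch of the forward propagation in step S33, using tanh for the hidden layer and softmax for the output layer as described above; the input dimension, hidden size, number of classes and random initial weights are illustrative assumptions.

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, b, c):
    """h_t = tanh(U x_t + W h_{t-1} + b); y_hat_t = softmax(V h_t + c)."""
    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / np.sum(e)
    h = np.zeros(W.shape[0])
    hs, ys = [], []
    for x_t in x_seq:
        h = np.tanh(U @ x_t + W @ h + b)
        ys.append(softmax(V @ h + c))
        hs.append(h)
    return np.array(hs), np.array(ys)

# illustrative shapes: 12-dim MFCC input, 32 hidden units, 2 classes (target speech / noise)
rng = np.random.default_rng(0)
U, W = 0.3 * rng.standard_normal((32, 12)), 0.3 * rng.standard_normal((32, 32))
V = 0.3 * rng.standard_normal((2, 32))
b, c = np.zeros(32), np.zeros(2)
hs, y_hat = rnn_forward(rng.standard_normal((20, 12)), U, W, V, b, c)
```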
S34:Error-duration model is carried out based on output valve, weight and the biasing of each layer of RNN model is updated, obtains ASR-RNN model,
Wherein, the formula of update weight V is:V indicates to connect between hidden layer and output layer before updating
Weight, V' indicates that the weight that connects between hidden layer and output layer after updating, α indicate that learning rate, t indicate t moment, τ table
Show total duration,Indicate prediction output valve, ytIndicate true output, htIndicate the hidden state of t moment, T representing matrix transposition
Operation;Update biasing c formula be:C indicates the biasing before updated between hidden layer and output layer, c' table
Show the biasing after updating between hidden layer and output layer;Update weight U formula be:
U indicates that input layer is to the weight connected between hidden layer before updating, and U' indicates input layer after updating to connecting between hidden layer
Weight, diag () indicate one diagonal matrix of construction or return to the square of diagonal entry on a matrix in vector form
Battle array operation, δtIndicate the gradient of hiding layer state, xtIndicate the ASR phonetic feature to be trained of t moment input;Update the public affairs of weight W
Formula is:W indicates that the weight connected between hidden layer before updating, W' indicate more
The weight connected between hidden layer after new;Update biasing b formula be:B indicates to update
Biasing between preceding input layer and hidden layer, b' indicate the biasing after updated between input layer and hidden layer.
In the present embodiment, server-side is obtaining the output valve of RNN model (prediction output valve) according to propagated forward algorithm
It afterwards, can basisWith the ASR phonetic feature to be trained for pre-setting label value, ASR phonetic feature to be trained is calculated at this
The error generated during RNN model training is measured by constructing a suitable error function (for example, a log error function may be used to express the generated error). The server then uses this error function to perform error back-propagation, adjusting and updating the weights (U, W and V) and the biases (b and c) of each layer of the RNN model. Specifically, the preset label value may be called the true output (it represents the objective fact: a label value of 1 represents target speech and a label value of 0 represents interfering speech) and is denoted y_t. During training of the RNN model there is an error between the output computed at each layer of the RNN model in the time series and the true output; this error can be measured by the error function L, expressed as $L=\sum_{t=1}^{\tau}L_{t}$, where t refers to moment t, τ denotes the total duration, and L_t denotes the error generated at moment t as expressed by the error function. After the server obtains the error function, the weights and biases of the RNN model can be updated according to BPTT (Back Propagation Through Time), obtaining the ASR-RNN model based on the ASR speech features to be trained. Specifically, the formula for updating the weight V is $V'=V-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})(h_{t})^{T}$, where V denotes the weight connecting the hidden layer and the output layer before updating, V' denotes the weight connecting the hidden layer and the output layer after updating, α denotes the learning rate, $\hat{y}_{t}$ denotes the predicted output value, y_t denotes the true output, h_t denotes the hidden state at moment t, and T denotes the matrix transposition operation. The formula for updating the bias c is $c'=c-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})$, where c denotes the bias between the hidden layer and the output layer before updating, and c' denotes the bias between the hidden layer and the output layer after updating.
Compared with the weight V and the bias c, the gradient loss of the weight U, the weight W and the bias b at a given moment t is jointly determined, during back-propagation, by the gradient loss corresponding to the output at the current position and by the gradient loss at moment t+1. Updating the weight U, the weight W and the bias b therefore requires the gradient δ_t of the hidden layer state. The gradient of the hidden layer state at sequence moment t is expressed as $\delta_{t}=\frac{\partial L}{\partial h_{t}}$. There is a relationship between δ_{t+1} and δ_t, and δ_t can be obtained from δ_{t+1}; the relational expression is $\delta_{t}=W^{T}\,\mathrm{diag}(1-h_{t+1}\odot h_{t+1})\,\delta_{t+1}+V^{T}(\hat{y}_{t}-y_{t})$ (the diag(1 − h ⊙ h) factor corresponds to a tanh hidden-layer activation), where δ_{t+1} denotes the gradient of the hidden layer state at sequence moment t+1, diag(·) denotes a matrix-operation function that constructs a diagonal matrix or returns the diagonal entries of a matrix in vector form, and h_{t+1} denotes the hidden layer state at sequence moment t+1. The gradient δ_τ of the hidden layer state at moment τ can then be obtained, and by using the relational expression between δ_{t+1} and δ_t, δ_τ is back-propagated recursively, step by step, to obtain δ_t. Since there is no later moment after δ_τ, it can be obtained directly from the gradient calculation as $\delta_{\tau}=V^{T}(\hat{y}_{\tau}-y_{\tau})$, and δ_t is then obtained recursively from δ_τ. After δ_t is obtained, the weight U, the weight W and the bias b can be calculated. The formula for updating the weight U is $U'=U-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(x_{t})^{T}$, where U denotes the weight connecting the input layer to the hidden layer before updating, U' denotes the weight connecting the input layer to the hidden layer after updating, diag(·) denotes the matrix operation that constructs a diagonal matrix or returns the diagonal entries of a matrix in vector form, δ_t denotes the gradient of the hidden layer state, and x_t denotes the ASR speech feature to be trained input at moment t. The formula for updating the weight W is $W'=W-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(h_{t-1})^{T}$, where W denotes the weight connecting the hidden layers before updating and W' denotes the weight connecting the hidden layers after updating. The formula for updating the bias b is $b'=b-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}$, where b denotes the bias between the input layer and the hidden layer before updating, and b' denotes the bias between the input layer and the hidden layer after updating. Training may stop when the change values of all weights and biases are smaller than the stop-iteration threshold ∈, or when training reaches the maximum number of iterations MAX. Based on the error generated between the predicted output values of the ASR speech features to be trained in the RNN model and the preset label values (the true output), the weights and biases of each layer of the RNN model are updated, so that the finally obtained ASR-RNN model can be trained on the ASR speech features to learn deep features of the time series and thereby accurately distinguish speech.
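To make the forward pass and the BPTT updates described above concrete, the following is a minimal NumPy sketch of one training step for a single-output RNN classifier of the kind described. It assumes a tanh hidden-layer activation and a sigmoid output trained with the log error function (so the output-layer error term reduces to ŷ_t − y_t); the function name, dimensions and learning rate are illustrative only, while U, W, V, b, c and α follow the symbols used in this description.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_bptt_step(X, y, U, W, V, b, c, alpha=0.01):
    """One BPTT training step on a sequence X of shape (tau, input_dim) with per-frame
    labels y of shape (tau,) in {0, 1} (1 = target speech, 0 = interfering speech).
    Assumes tanh hidden units and a single sigmoid output with log (cross-entropy) loss."""
    tau = X.shape[0]
    hidden = b.shape[0]
    h = np.zeros((tau + 1, hidden))            # h[t + 1] holds the hidden state at moment t
    y_hat = np.zeros(tau)

    # Forward propagation: h_t = tanh(U x_t + W h_{t-1} + b), y_hat_t = sigma(V h_t + c)
    for t in range(tau):
        h[t + 1] = np.tanh(U @ X[t] + W @ h[t] + b)
        y_hat[t] = sigmoid(V @ h[t + 1] + c)

    # Backward pass (BPTT): delta_t is propagated from moment tau back to moment 1
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), 0.0
    delta_next = np.zeros(hidden)
    for t in reversed(range(tau)):
        err = y_hat[t] - y[t]                           # output-layer error at moment t
        delta = V * err                                 # delta_tau = V^T (y_hat - y)
        if t < tau - 1:                                 # add the part propagated from t + 1
            delta += W.T @ (delta_next * (1.0 - h[t + 2] ** 2))
        dV += err * h[t + 1]
        dc += err
        grad_h = (1.0 - h[t + 1] ** 2) * delta          # diag(1 - h_t ⊙ h_t) delta_t
        dU += np.outer(grad_h, X[t])
        dW += np.outer(grad_h, h[t])                    # h[t] is h_{t-1}
        db += grad_h
        delta_next = delta

    # Gradient-descent updates: V' = V - alpha * dV, and likewise for U, W, b, c
    return U - alpha * dU, W - alpha * dW, V - alpha * dV, b - alpha * db, c - alpha * dc
```

Repeated over many labeled sequences, updates of this form stop once all parameter changes fall below the stop-iteration threshold ∈ or the maximum number of iterations MAX is reached, as described above.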
Steps S31-S34 train the RNN model using the ASR speech features to be trained, so that the trained ASR-RNN model can learn deep features about the sequence (timing) from those features; it can then effectively distinguish speech according to the ASR speech features of the target speech and the interfering speech in combination with temporal factors. Even under severe noise interference, the target speech and the noise can still be distinguished accurately.
In the speech differentiation method provided by this embodiment, the original speech data to be distinguished is first processed based on the voice activity detection (VAD) algorithm to obtain the target speech data to be distinguished. The original speech data to be distinguished is first screened once by the voice activity detection algorithm, yielding target speech data to be distinguished with a smaller range; this preliminarily and effectively removes interfering speech data from the original speech data to be distinguished, retains the portion that mixes target speech and interfering speech, and takes that portion as the target speech data to be distinguished, so that the original speech data to be distinguished undergoes an effective first step of speech differentiation in which a large amount of interfering speech is removed. Then, based on the target speech data to be distinguished, the corresponding ASR speech features are obtained. These ASR speech features make the result of speech differentiation more accurate; even under very noisy conditions, interfering speech (such as noise) and target speech can still be distinguished accurately, which provides an important technical premise for the subsequent recognition by the ASR-RNN model according to the ASR speech features. Finally, the ASR speech features are input into the pre-trained ASR-RNN model for differentiation to obtain the target differentiation result. The ASR-RNN model is a recognition model specially trained, on the ASR speech features extracted from the speech data to be trained and on the temporal characteristics of speech, to distinguish speech effectively; it can correctly separate target speech and interfering speech (since VAD has already been applied once, most of the interfering speech here refers to noise) within the target speech data to be distinguished that mixes target speech and interfering speech, thereby improving the accuracy of speech differentiation.
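As a concrete illustration of the three stages just summarized (VAD screening, ASR feature extraction, ASR-RNN differentiation), the following is a minimal sketch of how they could be chained at inference time; the callables vad_keep, extract_features and rnn_score are hypothetical stand-ins for modules 10, 20 and 30 described below, not functions defined by this patent.

```python
from typing import Callable, Iterable, List
import numpy as np

def distinguish_speech(raw_frames: Iterable[np.ndarray],
                       vad_keep: Callable[[np.ndarray], bool],
                       extract_features: Callable[[np.ndarray], np.ndarray],
                       rnn_score: Callable[[np.ndarray], np.ndarray]) -> List[int]:
    """Sketch of the pipeline: VAD screening -> ASR (MFCC) features -> ASR-RNN scoring.
    The three callables are supplied by the caller and stand in for modules 10, 20 and 30."""
    # Stage 1: voice activity detection keeps frames likely to contain target speech
    kept = [frame for frame in raw_frames if vad_keep(frame)]
    if not kept:
        return []                                   # nothing survived the VAD screening

    # Stage 2: ASR speech features (e.g. MFCCs) for the retained frames
    features = np.stack([extract_features(frame) for frame in kept])

    # Stage 3: the pre-trained ASR-RNN model scores each frame (1 = target, 0 = interference)
    scores = rnn_score(features)
    return [int(score > 0.5) for score in scores]
```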
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Fig. 8 shows a functional block diagram of a speech differentiation device corresponding one-to-one to the speech differentiation method in the embodiment. As shown in Fig. 8, the speech differentiation device includes a target to-be-distinguished speech data acquisition module 10, a speech feature acquisition module 20 and a target differentiation result acquisition module 30. The functions realized by these modules correspond one-to-one to the steps of the speech differentiation method in the embodiment; to avoid redundancy, this embodiment does not describe them in detail one by one.
The target to-be-distinguished speech data acquisition module 10 is configured to process the original speech data to be distinguished based on the voice activity detection algorithm to obtain the target speech data to be distinguished.
The speech feature acquisition module 20 is configured to obtain the corresponding ASR speech features based on the target speech data to be distinguished.
The target differentiation result acquisition module 30 is configured to input the ASR speech features into the pre-trained ASR-RNN model for differentiation to obtain the target differentiation result.
Preferably, the target to-be-distinguished speech data acquisition module 10 includes a first original differentiation speech data acquisition unit 11, a second original differentiation speech data acquisition unit 12 and a target to-be-distinguished speech data acquisition unit 13.
The first original differentiation speech data acquisition unit 11 is configured to process the original speech data to be distinguished according to the short-time energy characteristic value calculation formula, obtain the corresponding short-time energy characteristic value, retain the original speech data to be distinguished whose short-time energy characteristic value is greater than the first threshold, and determine it as the first original differentiation speech data, wherein the short-time energy characteristic value calculation formula is $E=\sum_{n=0}^{N-1}s^{2}(n)$, N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
The second original differentiation speech data acquisition unit 12 is configured to process the original speech data to be distinguished according to the zero-crossing rate characteristic value calculation formula, obtain the corresponding zero-crossing rate characteristic value, retain the original speech data to be distinguished whose zero-crossing rate characteristic value is less than the second threshold, and determine it as the second original differentiation speech data, wherein the zero-crossing rate characteristic value calculation formula is $Z=\frac{1}{2}\sum_{n=1}^{N-1}\left|\mathrm{sgn}[s(n)]-\mathrm{sgn}[s(n-1)]\right|$, N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
The target to-be-distinguished speech data acquisition unit 13 is configured to take the first original differentiation speech data and the second original differentiation speech data as the target speech data to be distinguished.
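A short sketch of the two screening criteria used by units 11 and 12, and of their combination in unit 13, is given below; the thresholds are supplied by the caller and the function names are illustrative assumptions, not values or APIs prescribed by this patent.

```python
import numpy as np

def short_time_energy(frame: np.ndarray) -> float:
    """Short-time energy of one speech frame: E = sum_{n=0}^{N-1} s^2(n)."""
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Zero-crossing rate: Z = 1/2 * sum_{n=1}^{N-1} |sgn(s(n)) - sgn(s(n-1))|."""
    signs = np.sign(frame)
    return 0.5 * float(np.sum(np.abs(np.diff(signs))))

def vad_screen(frames, energy_thresh: float, zcr_thresh: float):
    """Keep frames whose energy exceeds the first threshold (unit 11) or whose
    zero-crossing rate is below the second threshold (unit 12); the union of the
    two retained sets forms the target speech data to be distinguished (unit 13)."""
    first = [f for f in frames if short_time_energy(f) > energy_thresh]
    second = [f for f in frames if zero_crossing_rate(f) < zcr_thresh]
    kept, seen = [], set()
    for f in first + second:          # merge while preserving order, no duplicates
        if id(f) not in seen:
            seen.add(id(f))
            kept.append(f)
    return kept
```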
Preferably, the speech feature acquisition module 20 includes a preprocessing unit 21, a power spectrum acquisition unit 22, a Mel power spectrum acquisition unit 23 and a Mel-frequency cepstral coefficient unit 24.
The preprocessing unit 21 is configured to preprocess the target speech data to be distinguished to obtain preprocessed speech data.
The power spectrum acquisition unit 22 is configured to perform a fast Fourier transform on the preprocessed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and to obtain the power spectrum of the target speech data to be distinguished according to the frequency spectrum.
The Mel power spectrum acquisition unit 23 is configured to process the power spectrum of the target speech data to be distinguished using a Mel-scale filter bank to obtain the Mel power spectrum of the target speech data to be distinguished.
The Mel-frequency cepstral coefficient unit 24 is configured to perform cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target speech data to be distinguished.
Preferably, the preprocessing unit 21 includes a pre-emphasis subunit 211, a framing subunit 212 and a windowing subunit 213.
The pre-emphasis subunit 211 is configured to perform pre-emphasis processing on the target speech data to be distinguished; the calculation formula of the pre-emphasis processing is $s'_{n}=s_{n}-a\cdot s_{n-1}$, wherein s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous moment corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, a is the pre-emphasis coefficient, and the value range of a is 0.9 < a < 1.0.
The framing subunit 212 is configured to perform framing processing on the target speech data to be distinguished after pre-emphasis.
The windowing subunit 213 is configured to perform windowing processing on the target speech data to be distinguished after framing to obtain the preprocessed speech data; the calculation formula of the windowing (a Hamming-type window) is $s'_{n}=s_{n}\times\left(0.54-0.46\cos\left(\frac{2\pi n}{N-1}\right)\right),\ 0\le n\le N-1$, wherein N is the window length, n is the time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
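A brief sketch of the three preprocessing subunits 211-213 (pre-emphasis, framing and windowing) follows; the frame length, frame shift and default pre-emphasis coefficient are illustrative assumptions consistent with the formulas above, not parameters fixed by this patent.

```python
import numpy as np

def preprocess(signal: np.ndarray, a: float = 0.95,
               frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Pre-emphasis, framing and windowing of a 1-D speech signal (subunits 211-213).
    Returns an array of shape (num_frames, frame_len) of windowed frames."""
    # Subunit 211: pre-emphasis  s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Subunit 212: framing with overlap
    num_frames = max(1 + (len(emphasized) - frame_len) // frame_shift, 0)
    if num_frames == 0:
        return np.empty((0, frame_len))
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])

    # Subunit 213: windowing  s'_n = s_n * (0.54 - 0.46 * cos(2*pi*n / (N - 1)))
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (frame_len - 1))
    return frames * window
```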
Preferably, the Mel-frequency cepstral coefficient unit 24 includes a to-be-transformed Mel power spectrum acquisition subunit 241 and a Mel-frequency cepstral coefficient subunit 242.
The to-be-transformed Mel power spectrum acquisition subunit 241 is configured to take the logarithm of the Mel power spectrum to obtain the to-be-transformed Mel power spectrum.
The Mel-frequency cepstral coefficient subunit 242 is configured to perform a discrete cosine transform on the to-be-transformed Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target speech data to be distinguished.
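Taken together, units 22-24 implement the familiar MFCC chain (FFT, power spectrum, Mel filter bank, logarithm, DCT). The sketch below illustrates that chain over the preprocessed frames; the FFT size, number of filters, number of cepstral coefficients and sampling rate are illustrative assumptions, not values prescribed by this patent.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(num_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular Mel-scale filter bank of shape (num_filters, n_fft // 2 + 1)."""
    low_mel, high_mel = 0.0, 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(low_mel, high_mel, num_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frames: np.ndarray, sample_rate: int = 16000, n_fft: int = 512,
         num_filters: int = 26, num_ceps: int = 13) -> np.ndarray:
    """MFCCs for windowed frames of shape (num_frames, frame_len) (units 22-24)."""
    # Unit 22: FFT and power spectrum  P = |FFT(frame)|^2 / n_fft
    spectrum = np.abs(np.fft.rfft(frames, n_fft, axis=1))
    power = (spectrum ** 2) / n_fft
    # Unit 23: Mel power spectrum through the Mel-scale filter bank
    mel_power = power @ mel_filterbank(num_filters, n_fft, sample_rate).T
    # Unit 24: logarithm (subunit 241) followed by DCT (subunit 242)
    log_mel = np.log(np.maximum(mel_power, 1e-10))
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :num_ceps]
```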
Preferably, the speech differentiation device further includes an ASR-RNN model acquisition module 40, and the ASR-RNN model acquisition module 40 includes a to-be-trained ASR speech feature acquisition unit 41, an initialization unit 42, an output value acquisition unit 43 and an updating unit 44.
The to-be-trained ASR speech feature acquisition unit 41 is configured to obtain the speech data to be trained and to extract the ASR speech features to be trained of the speech data to be trained.
The initialization unit 42 is configured to initialize the RNN model.
The output value acquisition unit 43 is configured to input the ASR speech features to be trained into the RNN model and obtain the output value of the RNN model according to the forward propagation algorithm; the output value is expressed as $\hat{y}_{t}=\sigma(Vh_{t}+c)$, where σ denotes the activation function, V denotes the weight connecting the hidden layer and the output layer, h_t denotes the hidden state at moment t, and c denotes the bias between the hidden layer and the output layer.
The updating unit 44 is configured to perform error back-propagation based on the output value, update the weights and biases of each layer of the RNN model, and obtain the ASR-RNN model, wherein the formula for updating the weight V is $V'=V-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})(h_{t})^{T}$, where V denotes the weight connecting the hidden layer and the output layer before updating, V' denotes the weight connecting the hidden layer and the output layer after updating, α denotes the learning rate, t denotes moment t, τ denotes the total duration, $\hat{y}_{t}$ denotes the predicted output value, y_t denotes the true output, h_t denotes the hidden state at moment t, and T denotes the matrix transposition operation; the formula for updating the bias c is $c'=c-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})$, where c denotes the bias between the hidden layer and the output layer before updating, and c' denotes the bias between the hidden layer and the output layer after updating; the formula for updating the weight U is $U'=U-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(x_{t})^{T}$, where U denotes the weight connecting the input layer to the hidden layer before updating, U' denotes the weight connecting the input layer to the hidden layer after updating, diag(·) denotes the matrix operation that constructs a diagonal matrix or returns the diagonal entries of a matrix in vector form, δ_t denotes the gradient of the hidden layer state, and x_t denotes the ASR speech feature to be trained input at moment t; the formula for updating the weight W is $W'=W-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(h_{t-1})^{T}$, where W denotes the weight connecting the hidden layers before updating and W' denotes the weight connecting the hidden layers after updating; the formula for updating the bias b is $b'=b-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}$, where b denotes the bias between the input layer and the hidden layer before updating, and b' denotes the bias between the input layer and the hidden layer after updating.
This embodiment provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the speech differentiation method in the embodiment is realized; to avoid repetition, it is not described again here. Alternatively, when the computer program is executed by a processor, the functions of each module/unit of the speech differentiation device in the embodiment are realized; to avoid repetition, they are not described again here.
It is to be appreciated that the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, and so on.
Fig. 9 is a schematic diagram of the computer equipment in this embodiment. As shown in Fig. 9, the computer equipment 50 includes a processor 51, a memory 52 and a computer program 53 that is stored in the memory 52 and can run on the processor 51. When executing the computer program 53, the processor 51 realizes each step of the speech differentiation method in the embodiment, such as steps S10, S20 and S30 shown in Fig. 2. Alternatively, when executing the computer program 53, the processor 51 realizes the functions of each module/unit of the speech differentiation device in the embodiment, such as the functions of the target to-be-distinguished speech data acquisition module 10, the speech feature acquisition module 20, the target differentiation result acquisition module 30 and the ASR-RNN model acquisition module 40 shown in Fig. 8.
It is apparent to those skilled in the art that, for convenience and brevity of description, only the division of the above functional units and modules is given as an example. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may physically exist alone, or two or more units may be integrated into one unit. The above integrated unit may be realized in the form of hardware or in the form of a software functional unit.
The embodiments described above are merely illustrative of the technical solutions of the present invention and are not intended to limit them. Although the invention has been explained in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions documented in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims (10)
1. A speech differentiation method, characterized by including:
processing original speech data to be distinguished based on a voice activity detection algorithm to obtain target speech data to be distinguished;
obtaining corresponding ASR speech features based on the target speech data to be distinguished;
inputting the ASR speech features into a pre-trained ASR-RNN model for differentiation to obtain a target differentiation result.
2. The speech differentiation method according to claim 1, characterized in that, before the step of inputting the ASR speech features into the pre-trained ASR-RNN model for differentiation to obtain the target differentiation result, the speech differentiation method further includes: obtaining the ASR-RNN model;
the step of obtaining the ASR-RNN model includes:
obtaining speech data to be trained, and extracting ASR speech features to be trained of the speech data to be trained;
initializing an RNN model;
inputting the ASR speech features to be trained into the RNN model, and obtaining an output value of the RNN model according to a forward propagation algorithm, the output value being expressed as $\hat{y}_{t}=\sigma(Vh_{t}+c)$, where σ denotes the activation function, V denotes the weight connecting the hidden layer and the output layer, h_t denotes the hidden state at moment t, and c denotes the bias between the hidden layer and the output layer;
performing error back-propagation based on the output value, updating the weights and biases of each layer of the RNN model, and obtaining the ASR-RNN model, wherein the formula for updating the weight V is $V'=V-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})(h_{t})^{T}$, where V denotes the weight connecting the hidden layer and the output layer before updating, V' denotes the weight connecting the hidden layer and the output layer after updating, α denotes the learning rate, t denotes moment t, τ denotes the total duration, $\hat{y}_{t}$ denotes the predicted output value, y_t denotes the true output, h_t denotes the hidden state at moment t, and T denotes the matrix transposition operation; the formula for updating the bias c is $c'=c-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})$, where c denotes the bias between the hidden layer and the output layer before updating, and c' denotes the bias between the hidden layer and the output layer after updating; the formula for updating the weight U is $U'=U-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(x_{t})^{T}$, where U denotes the weight connecting the input layer to the hidden layer before updating, U' denotes the weight connecting the input layer to the hidden layer after updating, diag(·) denotes the matrix operation that constructs a diagonal matrix or returns the diagonal entries of a matrix in vector form, δ_t denotes the gradient of the hidden layer state, and x_t denotes the ASR speech feature to be trained input at moment t; the formula for updating the weight W is $W'=W-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(h_{t-1})^{T}$, where W denotes the weight connecting the hidden layers before updating and W' denotes the weight connecting the hidden layers after updating; the formula for updating the bias b is $b'=b-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}$, where b denotes the bias between the input layer and the hidden layer before updating, and b' denotes the bias between the input layer and the hidden layer after updating.
3. The speech differentiation method according to claim 1, characterized in that processing the original speech data to be distinguished based on the voice activity detection algorithm to obtain the target speech data to be distinguished includes:
processing the original speech data to be distinguished according to a short-time energy characteristic value calculation formula to obtain a corresponding short-time energy characteristic value, retaining the original speech data to be distinguished whose short-time energy characteristic value is greater than a first threshold, and determining it as first original differentiation speech data, the short-time energy characteristic value calculation formula being $E=\sum_{n=0}^{N-1}s^{2}(n)$, wherein N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time;
processing the original speech data to be distinguished according to a zero-crossing rate characteristic value calculation formula to obtain a corresponding zero-crossing rate characteristic value, retaining the original speech data to be distinguished whose zero-crossing rate characteristic value is less than a second threshold, and determining it as second original differentiation speech data, the zero-crossing rate characteristic value calculation formula being $Z=\frac{1}{2}\sum_{n=1}^{N-1}\left|\mathrm{sgn}[s(n)]-\mathrm{sgn}[s(n-1)]\right|$, wherein N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time;
taking the first original differentiation speech data and the second original differentiation speech data as the target speech data to be distinguished.
4. The speech differentiation method according to claim 1, characterized in that obtaining the corresponding ASR speech features based on the target speech data to be distinguished includes:
preprocessing the target speech data to be distinguished to obtain preprocessed speech data;
performing a fast Fourier transform on the preprocessed speech data to obtain the frequency spectrum of the target speech data to be distinguished, and obtaining the power spectrum of the target speech data to be distinguished according to the frequency spectrum;
processing the power spectrum of the target speech data to be distinguished using a Mel-scale filter bank to obtain the Mel power spectrum of the target speech data to be distinguished;
performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target speech data to be distinguished.
5. The speech differentiation method according to claim 4, characterized in that preprocessing the target speech data to be distinguished to obtain the preprocessed speech data includes:
performing pre-emphasis processing on the target speech data to be distinguished, the calculation formula of the pre-emphasis processing being $s'_{n}=s_{n}-a\cdot s_{n-1}$, wherein s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous moment corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, a is the pre-emphasis coefficient, and the value range of a is 0.9 < a < 1.0;
performing framing processing on the target speech data to be distinguished after pre-emphasis;
performing windowing processing on the target speech data to be distinguished after framing to obtain the preprocessed speech data, the calculation formula of the windowing being $s'_{n}=s_{n}\times\left(0.54-0.46\cos\left(\frac{2\pi n}{N-1}\right)\right),\ 0\le n\le N-1$, wherein N is the window length, n is the time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
6. The speech differentiation method according to claim 4, characterized in that performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target speech data to be distinguished includes:
taking the logarithm of the Mel power spectrum to obtain a to-be-transformed Mel power spectrum;
performing a discrete cosine transform on the to-be-transformed Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target speech data to be distinguished.
7. A speech differentiation device, characterized by including:
a target to-be-distinguished speech data acquisition module, configured to process original speech data to be distinguished based on a voice activity detection algorithm to obtain target speech data to be distinguished;
a speech feature acquisition module, configured to obtain corresponding ASR speech features based on the target speech data to be distinguished;
a target differentiation result acquisition module, configured to input the ASR speech features into a pre-trained ASR-RNN model for differentiation to obtain a target differentiation result.
8. The speech differentiation device according to claim 7, characterized in that the speech differentiation device further includes an ASR-RNN model acquisition module, and the ASR-RNN model acquisition module includes:
a to-be-trained ASR speech feature acquisition unit, configured to obtain speech data to be trained and to extract ASR speech features to be trained of the speech data to be trained;
an initialization unit, configured to initialize an RNN model;
an output value acquisition unit, configured to input the ASR speech features to be trained into the RNN model and obtain an output value of the RNN model according to a forward propagation algorithm, the output value being expressed as $\hat{y}_{t}=\sigma(Vh_{t}+c)$, where σ denotes the activation function, V denotes the weight connecting the hidden layer and the output layer, h_t denotes the hidden state at moment t, and c denotes the bias between the hidden layer and the output layer;
an updating unit, configured to perform error back-propagation based on the output value, update the weights and biases of each layer of the RNN model, and obtain the ASR-RNN model, wherein the formula for updating the weight V is $V'=V-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})(h_{t})^{T}$, where V denotes the weight connecting the hidden layer and the output layer before updating, V' denotes the weight connecting the hidden layer and the output layer after updating, α denotes the learning rate, t denotes moment t, τ denotes the total duration, $\hat{y}_{t}$ denotes the predicted output value, y_t denotes the true output, h_t denotes the hidden state at moment t, and T denotes the matrix transposition operation; the formula for updating the bias c is $c'=c-\alpha\sum_{t=1}^{\tau}(\hat{y}_{t}-y_{t})$, where c denotes the bias between the hidden layer and the output layer before updating, and c' denotes the bias between the hidden layer and the output layer after updating; the formula for updating the weight U is $U'=U-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(x_{t})^{T}$, where U denotes the weight connecting the input layer to the hidden layer before updating, U' denotes the weight connecting the input layer to the hidden layer after updating, diag(·) denotes the matrix operation that constructs a diagonal matrix or returns the diagonal entries of a matrix in vector form, δ_t denotes the gradient of the hidden layer state, and x_t denotes the ASR speech feature to be trained input at moment t; the formula for updating the weight W is $W'=W-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}\,(h_{t-1})^{T}$, where W denotes the weight connecting the hidden layers before updating and W' denotes the weight connecting the hidden layers after updating; the formula for updating the bias b is $b'=b-\alpha\sum_{t=1}^{\tau}\mathrm{diag}(1-h_{t}\odot h_{t})\,\delta_{t}$, where b denotes the bias between the input layer and the hidden layer before updating, and b' denotes the bias between the input layer and the hidden layer after updating.
9. A computer device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, realizes the steps of the speech differentiation method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, realizes the steps of the speech differentiation method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561788.1A CN108922513B (en) | 2018-06-04 | 2018-06-04 | Voice distinguishing method and device, computer equipment and storage medium |
PCT/CN2018/094190 WO2019232846A1 (en) | 2018-06-04 | 2018-07-03 | Speech differentiation method and apparatus, and computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810561788.1A CN108922513B (en) | 2018-06-04 | 2018-06-04 | Voice distinguishing method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108922513A true CN108922513A (en) | 2018-11-30 |
CN108922513B CN108922513B (en) | 2023-03-17 |
Family
ID=64419509
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810561788.1A Active CN108922513B (en) | 2018-06-04 | 2018-06-04 | Voice distinguishing method and device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108922513B (en) |
WO (1) | WO2019232846A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109545192A (en) * | 2018-12-18 | 2019-03-29 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating model |
CN109545193A (en) * | 2018-12-18 | 2019-03-29 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating model |
CN109658920A (en) * | 2018-12-18 | 2019-04-19 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating model |
CN110148401A (en) * | 2019-07-02 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN110189747A (en) * | 2019-05-29 | 2019-08-30 | 大众问问(北京)信息科技有限公司 | Voice signal recognition methods, device and equipment |
CN110265065A (en) * | 2019-05-13 | 2019-09-20 | 厦门亿联网络技术股份有限公司 | A kind of method and speech terminals detection system constructing speech detection model |
CN110838307A (en) * | 2019-11-18 | 2020-02-25 | 苏州思必驰信息科技有限公司 | Voice message processing method and device |
CN112908303A (en) * | 2021-01-28 | 2021-06-04 | 广东优碧胜科技有限公司 | Audio signal processing method and device and electronic equipment |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223511B (en) * | 2020-01-21 | 2024-04-16 | 珠海市煊扬科技有限公司 | Audio processing device for speech recognition |
CN112581940A (en) * | 2020-09-17 | 2021-03-30 | 国网江苏省电力有限公司信息通信分公司 | Discharging sound detection method based on edge calculation and neural network |
CN112598114B (en) * | 2020-12-17 | 2023-11-03 | 海光信息技术股份有限公司 | Power consumption model construction method, power consumption measurement method, device and electronic equipment |
CN117648717B (en) * | 2024-01-29 | 2024-05-03 | 知学云(北京)科技股份有限公司 | Privacy protection method for artificial intelligent voice training |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1912993A (en) * | 2005-08-08 | 2007-02-14 | 中国科学院声学研究所 | Voice end detection method based on energy and harmonic |
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | 大连理工大学 | Speaker recognition method based on depth learning |
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
US9263036B1 (en) * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
CN107680597A (en) * | 2017-10-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Audio recognition method, device, equipment and computer-readable recording medium |
CN107871497A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102450853B1 (en) * | 2015-11-30 | 2022-10-04 | 삼성전자주식회사 | Apparatus and method for speech recognition |
CN107799126B (en) * | 2017-10-16 | 2020-10-16 | 苏州狗尾草智能科技有限公司 | Voice endpoint detection method and device based on supervised machine learning |
CN107731233B (en) * | 2017-11-03 | 2021-02-09 | 王华锋 | Voiceprint recognition method based on RNN |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1912993A (en) * | 2005-08-08 | 2007-02-14 | 中国科学院声学研究所 | Voice end detection method based on energy and harmonic |
US9263036B1 (en) * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
US20150149165A1 (en) * | 2013-11-27 | 2015-05-28 | International Business Machines Corporation | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors |
CN104157290A (en) * | 2014-08-19 | 2014-11-19 | 大连理工大学 | Speaker recognition method based on depth learning |
CN105139864A (en) * | 2015-08-17 | 2015-12-09 | 北京天诚盛业科技有限公司 | Voice recognition method and voice recognition device |
CN107871497A (en) * | 2016-09-23 | 2018-04-03 | 北京眼神科技有限公司 | Audio recognition method and device |
CN107680597A (en) * | 2017-10-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Audio recognition method, device, equipment and computer-readable recording medium |
Non-Patent Citations (1)
Title |
---|
宋知用 (Song Zhiyong): "MATLAB Speech Signal Analysis and Synthesis", 30 January 2018 *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109545193A (en) * | 2018-12-18 | 2019-03-29 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating model |
CN109658920A (en) * | 2018-12-18 | 2019-04-19 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating model |
CN109545192A (en) * | 2018-12-18 | 2019-03-29 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating model |
CN109545193B (en) * | 2018-12-18 | 2023-03-14 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating a model |
CN109658920B (en) * | 2018-12-18 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating a model |
US11189262B2 (en) | 2018-12-18 | 2021-11-30 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for generating model |
CN110265065B (en) * | 2019-05-13 | 2021-08-03 | 厦门亿联网络技术股份有限公司 | Method for constructing voice endpoint detection model and voice endpoint detection system |
CN110265065A (en) * | 2019-05-13 | 2019-09-20 | 厦门亿联网络技术股份有限公司 | A kind of method and speech terminals detection system constructing speech detection model |
CN110189747A (en) * | 2019-05-29 | 2019-08-30 | 大众问问(北京)信息科技有限公司 | Voice signal recognition methods, device and equipment |
CN110148401A (en) * | 2019-07-02 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN110148401B (en) * | 2019-07-02 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and storage medium |
CN110838307B (en) * | 2019-11-18 | 2022-02-25 | 思必驰科技股份有限公司 | Voice message processing method and device |
CN110838307A (en) * | 2019-11-18 | 2020-02-25 | 苏州思必驰信息科技有限公司 | Voice message processing method and device |
CN112908303A (en) * | 2021-01-28 | 2021-06-04 | 广东优碧胜科技有限公司 | Audio signal processing method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2019232846A1 (en) | 2019-12-12 |
CN108922513B (en) | 2023-03-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108922513A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
CN112509564B (en) | End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism | |
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN110459225B (en) | Speaker recognition system based on CNN fusion characteristics | |
CN110428842A (en) | Speech model training method, device, equipment and computer readable storage medium | |
CN107633842A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN109036470A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN110379412A (en) | Method, apparatus, electronic equipment and the computer readable storage medium of speech processes | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
CN108630209B (en) | Marine organism identification method based on feature fusion and deep confidence network | |
CN109256118B (en) | End-to-end Chinese dialect identification system and method based on generative auditory model | |
CN110148408A (en) | A kind of Chinese speech recognition method based on depth residual error | |
CN108922515A (en) | Speech model training method, audio recognition method, device, equipment and medium | |
CN106683666B (en) | A kind of domain-adaptive method based on deep neural network | |
CN108986798B (en) | Processing method, device and the equipment of voice data | |
CN113129908B (en) | End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN108922543A (en) | Model library method for building up, audio recognition method, device, equipment and medium | |
CN113111786B (en) | Underwater target identification method based on small sample training diagram convolutional network | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
CN117789699B (en) | Speech recognition method, device, electronic equipment and computer readable storage medium | |
CN110415685A (en) | A kind of audio recognition method | |
CN109767790A (en) | A kind of speech-emotion recognition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |