CN108806725A - Speech differentiation method, apparatus, computer equipment and storage medium - Google Patents

Speech differentiation method, apparatus, computer equipment and storage medium

Info

Publication number: CN108806725A
Application number: CN201810561789.6A
Authority: CN (China)
Prior art keywords: voice data, ASR, distinguished, target, biasing
Original language: Chinese (zh)
Inventor: 涂宏
Applicant and current assignee: Ping An Technology Shenzhen Co Ltd
Priority: CN201810561789.6A; PCT/CN2018/094342 (published as WO2019232867A1)
Legal status: Pending

Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING › G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/21: the extracted parameters being power information
    • G10L25/24: the extracted parameters being the cepstrum
    • G10L25/45: characterised by the type of analysis window
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision

Abstract

The invention discloses a speech differentiation method, apparatus, computer equipment and storage medium. The speech differentiation method includes: processing original voice data to be distinguished with a voice activity detection algorithm to obtain target voice data to be distinguished; obtaining corresponding ASR speech features from the target voice data to be distinguished; and inputting the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result. The method separates target speech from interfering speech reliably and remains accurate even when the voice data contains heavy noise interference.

Description

Speech differentiation method, apparatus, computer equipment and storage medium
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech differentiation method, apparatus, computer equipment and storage medium.
Background technology
Speech differentiation refers to screening silence out of input speech so that only the speech segments that are meaningful for recognition (i.e. the target speech) are retained. Existing speech differentiation methods have serious shortcomings, especially in the presence of noise: as the noise grows, differentiation becomes harder, target speech and interfering speech can no longer be separated accurately, and the differentiation result is unsatisfactory.
Invention content
Embodiments of the present invention provide a speech differentiation method, apparatus, computer equipment and storage medium to solve the problem of unsatisfactory speech differentiation.
An embodiment of the present invention provides a speech differentiation method, including:
processing original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
obtaining corresponding ASR speech features based on the target voice data to be distinguished;
inputting the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
An embodiment of the present invention provides a speech differentiation device, including:
a target voice data to be distinguished acquisition module, configured to process original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished;
a speech feature acquisition module, configured to obtain corresponding ASR speech features based on the target voice data to be distinguished;
a target differentiation result acquisition module, configured to input the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
An embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the speech differentiation method when executing the computer program.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program implements the steps of the speech differentiation method when executed by a processor.
In the speech differentiation method, apparatus, computer equipment and storage medium provided by the embodiments of the present invention, the original voice data to be distinguished is first processed with a voice activity detection algorithm to obtain target voice data to be distinguished; this first differentiation pass narrows the range of the data and preliminarily and effectively removes the non-speech content. ASR speech features are then extracted from the target voice data to be distinguished, providing the technical basis for the subsequent recognition of those features by the ASR-DBN model. Finally, the ASR speech features are input into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result. Because the ASR-DBN model is a recognition model trained specifically on ASR speech features to differentiate speech accurately, it can correctly separate target speech from interfering speech in the target voice data to be distinguished and thereby improve the accuracy of speech differentiation.
Description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is an application environment diagram of the speech differentiation method in an embodiment of the invention;
Fig. 2 is a flowchart of the speech differentiation method in an embodiment of the invention;
Fig. 3 is a detailed flowchart of step S10 in Fig. 2;
Fig. 4 is a detailed flowchart of step S20 in Fig. 2;
Fig. 5 is a detailed flowchart of step S21 in Fig. 4;
Fig. 6 is a detailed flowchart of step S24 in Fig. 4;
Fig. 7 is a detailed flowchart of the steps before step S30 in Fig. 2;
Fig. 8 is a schematic diagram of the speech differentiation device in an embodiment of the invention;
Fig. 9 is a schematic diagram of the computer device in an embodiment of the invention.
Specific implementation mode
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows the application environment of the speech differentiation method provided by an embodiment of the present invention. The application environment includes a server side and a client connected through a network. The client is a device capable of human-computer interaction with a user, including but not limited to computers, smartphones and tablets. The server side can be implemented as an independent server or as a server cluster composed of multiple servers. The speech differentiation method provided by the embodiment of the present invention is applied to the server side.
As shown in Fig. 2, which is a flowchart of the speech differentiation method in this embodiment, the speech differentiation method includes the following steps:
S10: Process the original voice data to be distinguished based on a voice activity detection algorithm to obtain target voice data to be distinguished.
Here, the aim of voice activity detection (Voice Activity Detection, hereinafter VAD) is to identify and eliminate long silent periods from the audio signal stream, so that traffic resources can be saved without degrading the quality of service, valuable bandwidth can be saved, end-to-end delay can be reduced and the user experience improved. A voice activity detection algorithm (VAD algorithm) is the specific algorithm used for voice activity detection, and many such algorithms exist. It will be appreciated that VAD can be applied to speech differentiation to separate target speech from interfering speech. Target speech refers to the portions of the voice data whose voiceprint changes continuously and noticeably; interfering speech can be the portions of the voice data that are silent because nobody is speaking, or environmental noise. The original voice data to be distinguished is the earliest voice data obtained for differentiation, i.e. the voice data on which the VAD algorithm performs the preliminary differentiation. The target voice data to be distinguished is the voice data obtained after the VAD algorithm has processed the original voice data to be distinguished, on which speech differentiation is then performed.
In this embodiment, the original voice data to be distinguished is processed with the VAD algorithm, target speech and interfering speech are preliminarily screened from it, and the preliminarily screened target speech is taken as the target voice data to be distinguished. It will be appreciated that the interfering speech screened out preliminarily does not need to be differentiated again, which improves the efficiency of speech differentiation. However, the target speech preliminarily screened from the original voice data to be distinguished still contains interfering content; in particular, when the noise in the original voice data to be distinguished is large, the preliminarily screened target speech contains more interfering speech (e.g. noise), and the VAD algorithm alone clearly cannot differentiate the speech effectively. The preliminarily screened target speech mixed with interfering speech is therefore taken as the target voice data to be distinguished, so that it can be differentiated more precisely afterwards. By performing preliminary speech differentiation on the original voice data to be distinguished with the VAD algorithm, the server can repartition the original voice data to be distinguished according to the preliminary screening and remove a large amount of interfering speech, which benefits the subsequent, finer speech differentiation.
In a specific embodiment, as shown in Fig. 3, step S10 of processing the original voice data to be distinguished based on the voice activity detection algorithm to obtain the target voice data to be distinguished includes the following steps:
S11: Process the original voice data to be distinguished according to the short-time energy characteristic value calculation formula to obtain the corresponding short-time energy characteristic values, retain the original data to be distinguished whose short-time energy characteristic value is greater than a first threshold, and determine it as the first original differentiation voice data, where the short-time energy characteristic value calculation formula is E = Σ_{n=1}^{N} s(n)², N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
Here, the short-time energy characteristic value describes the energy of one frame of speech (a frame is generally 10-30 ms) in its time domain, "short-time" being understood as the duration of one frame (the speech frame length). Because the short-time energy characteristic value of target speech is much higher than that of interfering speech (silence), the short-time energy characteristic value can be used to distinguish target speech from interfering speech.
In this embodiment, the original voice data to be distinguished is processed according to the short-time energy characteristic value calculation formula (the original voice data to be distinguished must first be divided into frames), the short-time energy characteristic value of each frame is calculated and compared with the preset first threshold, and the original voice data to be distinguished whose value is greater than the first threshold is retained and determined as the first original differentiation voice data. The first threshold is the cut-off value for judging whether a short-time energy characteristic value belongs to target speech or interfering speech. In this embodiment, the comparison of the short-time energy characteristic value with the first threshold preliminarily identifies, from the angle of short-time energy, the target speech in the original voice data to be distinguished and effectively removes a large amount of interfering speech from it.
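As an illustration of step S11, the following sketch frames a signal and keeps the frames whose short-time energy exceeds the first threshold. The 16 kHz sampling rate, 25 ms frame length, 10 ms frame shift and the threshold choice are assumptions for the example; the patent does not fix these values.

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (tail samples that do not fill a frame are dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    return np.stack([x[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    """E = sum of s(n)^2 over the frame, one value per frame."""
    return np.sum(frames ** 2, axis=1)

fs = 16000                                                   # assumed sampling rate
signal = np.random.randn(2 * fs)                             # stand-in for the original voice data to be distinguished
frames = frame_signal(signal, frame_len=400, hop_len=160)    # 25 ms frames, 10 ms shift
energy = short_time_energy(frames)
first_threshold = 0.5 * energy.mean()                        # illustrative threshold choice
first_original_frames = frames[energy > first_threshold]     # first original differentiation voice data
```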
S12: Process the original voice data to be distinguished according to the zero-crossing rate characteristic value calculation formula to obtain the corresponding zero-crossing rate characteristic values, retain the original voice data to be distinguished whose zero-crossing rate characteristic value is less than a second threshold, and determine it as the second original differentiation voice data, where the zero-crossing rate characteristic value calculation formula is Z = (1/2) Σ_{n=2}^{N} |sgn(s(n)) - sgn(s(n-1))|, N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time.
Here, the zero-crossing rate characteristic value describes the number of times the speech waveform crosses the horizontal axis (zero level) within one frame. Because the zero-crossing rate characteristic value of target speech is much lower than that of interfering speech, the zero-crossing rate characteristic value can be used to distinguish target speech from interfering speech.
In this embodiment, the original voice data to be distinguished is processed according to the zero-crossing rate characteristic value calculation formula, the zero-crossing rate characteristic value of each frame is calculated and compared with the preset second threshold, and the original voice data to be distinguished whose value is less than the second threshold is retained and determined as the second original differentiation voice data. The second threshold is the cut-off value for judging whether a zero-crossing rate characteristic value belongs to target speech or interfering speech. In this embodiment, the comparison of the zero-crossing rate characteristic value with the second threshold preliminarily identifies, from the angle of the zero-crossing rate, the target speech in the original voice data to be distinguished and effectively removes a large amount of interfering speech from it.
S13: Take the first original differentiation voice data and the second original differentiation voice data together as the target voice data to be distinguished.
In this embodiment, the first original differentiation voice data is obtained by differentiating the original voice data to be distinguished from the angle of the short-time energy characteristic value, and the second original differentiation voice data is obtained by differentiating it from the angle of the zero-crossing rate characteristic value. The two start from different angles of differentiation, and both angles differentiate speech well; the first original differentiation voice data and the second original differentiation voice data are therefore merged (by taking their intersection, see the sketch after this paragraph) as the target voice data to be distinguished.
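Continuing the step S11 sketch above, the zero-crossing rate test of step S12 and the merge by intersection of step S13 could look as follows; the second threshold choice is again only illustrative.

```python
import numpy as np

def zero_crossing_rate(frames):
    """Z = (1/2) * sum |sgn(s(n)) - sgn(s(n-1))| over the frame, one value per frame."""
    return 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)

# frames, energy and first_threshold come from the step S11 sketch above
# zcr = zero_crossing_rate(frames)
# second_threshold = zcr.mean()                                  # illustrative threshold choice
# keep = (energy > first_threshold) & (zcr < second_threshold)   # intersection of S11 and S12
# target_voice_data_to_distinguish = frames[keep]
```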
Steps S11-S13 preliminarily and effectively remove most of the interfering voice data from the original voice data to be distinguished, retaining the original voice data to be distinguished that mixes target speech with a small amount of interfering speech (e.g. noise); taking this data as the target voice data to be distinguished provides an effective preliminary speech differentiation of the original voice data to be distinguished.
S20: Obtain the corresponding ASR speech features based on the target voice data to be distinguished.
Here, ASR (Automatic Speech Recognition) is the technology of converting voice data into computer-readable input, for example converting voice data into key presses, binary codes or character strings. ASR can extract the speech features in the target voice data to be distinguished; the extracted features are the ASR speech features corresponding to that speech. It will be appreciated that ASR converts voice data that a computer cannot read directly into ASR speech features that the computer can read, and these ASR speech features can be represented as vectors.
In this embodiment, the target voice data to be distinguished is processed with ASR to obtain the corresponding ASR speech features. These ASR speech features reflect the latent characteristics of the target voice data to be distinguished well, so the target voice data to be distinguished can be differentiated according to the ASR speech features, which provides the technical premise for the subsequent recognition by the corresponding ASR-DBN model.
In a specific embodiment, as shown in Fig. 4, step S20 of obtaining the corresponding ASR speech features based on the target voice data to be distinguished includes the following steps:
S21: Pre-process the target voice data to be distinguished to obtain pre-processed voice data.
In this embodiment, the target voice data to be distinguished is pre-processed to obtain the corresponding pre-processed voice data. Pre-processing the target voice data to be distinguished allows its ASR speech features to be extracted better, so that the extracted ASR speech features represent the target voice data to be distinguished more faithfully and can be used for speech differentiation.
In a specific embodiment, as shown in Fig. 5, step S21 of pre-processing the target voice data to be distinguished to obtain pre-processed voice data includes the following steps:
S211: Apply pre-emphasis to the target voice data to be distinguished, the pre-emphasis calculation formula being s'_n = s_n - a·s_{n-1}, where s_n is the signal amplitude in the time domain, s_{n-1} is the signal amplitude at the previous moment corresponding to s_n, s'_n is the signal amplitude in the time domain after pre-emphasis, a is the pre-emphasis coefficient, and the value range of a is 0.9 < a < 1.0.
Here, pre-emphasis is a signal-processing technique that compensates the high-frequency components of the input signal at the transmitting end. As the signal rate increases, the signal is damaged heavily during transmission; to allow the receiving end to obtain a reasonably good waveform, the damaged signal needs to be compensated. The idea of pre-emphasis is to boost the high-frequency components of the signal at the transmitting end of the line, compensating their excessive attenuation during transmission so that the receiving end obtains a better waveform. Pre-emphasis has no effect on the noise and therefore effectively improves the output signal-to-noise ratio.
In this embodiment, pre-emphasis is applied to the target voice data to be distinguished with the formula s'_n = s_n - a·s_{n-1}, where s_n is the signal amplitude in the time domain (the amplitude of the speech expressed in the time domain), s_{n-1} is the signal amplitude at the previous moment, s'_n is the signal amplitude in the time domain after pre-emphasis, and a is the pre-emphasis coefficient with 0.9 < a < 1.0; taking a = 0.97 gives a comparatively good pre-emphasis effect here. Pre-emphasis removes the interference caused by the vocal cords and lips during voicing, effectively compensates the suppressed high-frequency part of the target voice data to be distinguished, highlights its high-frequency formants and reinforces its signal amplitude, which helps the extraction of the ASR speech features.
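A one-line sketch of the pre-emphasis formula s'_n = s_n - a·s_{n-1}, with a = 0.97 as suggested in the text:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """s'_n = s_n - a * s_{n-1}; the first sample is passed through unchanged."""
    return np.append(x[0], x[1:] - a * x[:-1])
```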
S212: Divide the pre-emphasised target voice data to be distinguished into frames.
In this embodiment, after pre-emphasising the target voice data to be distinguished, frame division is also performed. Framing is a speech-processing technique that cuts the whole signal into several segments; each frame is in the range of 10-30 ms, and generally 1/2 of the frame length is used as the frame shift. The frame shift is the overlap between two adjacent frames, which avoids excessive change between them. Framing divides the target voice data to be distinguished into several segments of voice data, subdividing it and making the extraction of the ASR speech features convenient.
S213: Apply windowing to the framed target voice data to be distinguished to obtain the pre-processed voice data, the windowing calculation formula being s'_n = s_n·(0.54 - 0.46·cos(2πn/(N-1))), where N is the window length, n is the time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing.
In this embodiment, after the target voice data to be distinguished has been divided into frames, discontinuities appear at the beginning and the end of each frame, so the more frames there are, the larger the error with respect to the target voice data to be distinguished. Windowing solves this problem: it makes the framed target voice data to be distinguished continuous and lets every frame exhibit the characteristics of a periodic function. Windowing specifically means processing the target voice data to be distinguished with a window function; a Hamming window can be chosen, in which case the windowing formula is s'_n = s_n·(0.54 - 0.46·cos(2πn/(N-1))), where N is the Hamming window length, n is the time, s_n is the signal amplitude in the time domain, and s'_n is the signal amplitude in the time domain after windowing. Windowing the target voice data to be distinguished yields the pre-processed voice data, makes the time-domain signal of the framed target voice data to be distinguished continuous, and helps the extraction of its ASR speech features.
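A sketch of steps S212-S213, framing the pre-emphasised signal with a frame shift of half the frame length and applying the Hamming window of the formula above; the 25 ms frame length is an assumption for the example.

```python
import numpy as np

def frame_and_window(x, frame_len=400, hop_len=200):
    """Frame the signal (frame shift = half the frame length) and apply a Hamming window to every frame."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    frames = np.stack([x[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window
```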
The pre-processing operations of steps S211-S213 on the target voice data to be distinguished lay the foundation for extracting its ASR speech features, so that the extracted ASR speech features represent the target voice data to be distinguished more faithfully and speech differentiation can be performed according to these ASR speech features.
S22: Apply a fast Fourier transform to the pre-processed voice data to obtain the spectrum of the target voice data to be distinguished, and obtain the power spectrum of the target voice data to be distinguished from the spectrum.
Here, the fast Fourier transform (Fast Fourier Transformation, FFT) is the general name for the efficient, fast computer algorithms for calculating the discrete Fourier transform. Using such an algorithm greatly reduces the number of multiplications a computer needs to calculate the discrete Fourier transform; the more sampling points are transformed, the more significant the saving of the FFT algorithm.
In this embodiment, the fast Fourier transform is applied to the pre-processed voice data to convert it from the signal amplitude in the time domain to the signal amplitude in the frequency domain (the spectrum). The formula for calculating the spectrum is s(k) = Σ_{n=1}^{N} s(n)·e^{-2πi·nk/N}, 1 ≤ k ≤ N, where N is the frame size, s(k) is the signal amplitude in the frequency domain, s(n) is the signal amplitude in the time domain, n is the time, and i is the imaginary unit. Once the spectrum of the pre-processed voice data is obtained, its power spectrum can be computed directly from it; hereinafter this power spectrum is called the power spectrum of the target voice data to be distinguished. The formula for calculating the power spectrum of the target voice data to be distinguished is P(k) = |s(k)|²/N, 1 ≤ k ≤ N, where N is the frame size and s(k) is the signal amplitude in the frequency domain. Converting the pre-processed voice data from the time-domain signal amplitude to the frequency-domain signal amplitude and then obtaining the power spectrum of the target voice data to be distinguished from it provides an important technical foundation for extracting the ASR speech features from that power spectrum.
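A sketch of step S22, computing the spectrum of each pre-processed frame with an FFT and then the power spectrum |s(k)|²/N; the 512-point FFT size is an assumption.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """FFT each windowed frame, then P(k) = |s(k)|^2 / N (one-sided spectrum)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)    # frequency-domain amplitudes s(k)
    return (np.abs(spectrum) ** 2) / n_fft
```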
S23: Process the power spectrum of the target voice data to be distinguished with a Mel-scale filter bank to obtain the Mel power spectrum of the target voice data to be distinguished.
Here, processing the power spectrum of the target voice data to be distinguished with a Mel-scale filter bank performs a Mel-frequency analysis of the power spectrum, and Mel-frequency analysis is based on human auditory perception. Observation shows that the human ear behaves like a filter bank that attends only to certain specific frequency components (human hearing is frequency-selective): it lets signals of certain frequencies pass and simply ignores frequency components it does not want to perceive. These filters, however, are not uniformly distributed on the frequency axis: there are many filters in the low-frequency region, densely distributed, while in the high-frequency region the filters become fewer and sparsely distributed. It will be appreciated that a Mel-scale filter bank has high resolution in the low-frequency part, consistent with the auditory characteristics of the human ear, which is the physical meaning of the Mel scale.
In this embodiment, the power spectrum of the target voice data to be distinguished is processed with the Mel-scale filter bank to obtain the Mel power spectrum of the target voice data to be distinguished. The frequency-domain signal is cut up by the Mel-scale filter bank so that each frequency band finally corresponds to one value; if the number of filters is 22, the 22 energy values corresponding to the Mel power spectrum of the target voice data to be distinguished are obtained. Performing Mel-frequency analysis on the power spectrum of the target voice data to be distinguished means that the resulting Mel power spectrum retains the frequency portions closely related to the characteristics of the human ear, so it reflects the characteristics of the target voice data to be distinguished well.
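A sketch of a 22-filter Mel-scale filter bank as described above: triangular filters whose centres are evenly spaced on the Mel scale, dense at low frequencies and sparse at high frequencies. The Mel-scale conversion constants are the usual ones; the sampling rate and FFT size are assumptions matching the earlier sketches.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=512, fs=16000):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

# Applying the filter bank to the power spectrum of the previous sketch gives one Mel energy
# value per frame and per filter (22 values per frame here):
# mel_power = power_spectrum(frames) @ mel_filterbank().T
```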
S24: Perform cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstrum coefficients of the target voice data to be distinguished.
Here, the cepstrum refers to the inverse Fourier transform of the logarithm of the Fourier transform spectrum of a signal; since the ordinary Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
In this embodiment, cepstral analysis is performed on the Mel power spectrum, and the Mel-frequency cepstrum coefficients of the target voice data to be distinguished are obtained by analysing the cepstrum result. Through this cepstral analysis, the features contained in the Mel power spectrum of the target voice data to be distinguished, which are originally of too high a dimensionality to use directly, are converted into easy-to-use features (Mel-frequency cepstrum coefficient feature vectors for training or recognition). The Mel-frequency cepstrum coefficients serve as the ASR speech features, coefficients able to distinguish different speech; these ASR speech features reflect the differences between speech signals and can be used to recognise and differentiate the target voice data to be distinguished.
In a specific embodiment, as shown in Fig. 6, step S24 of performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstrum coefficients of the target voice data to be distinguished includes the following steps:
S241: Take the logarithm of the Mel power spectrum to obtain the Mel power spectrum to be transformed.
In this embodiment, following the definition of the cepstrum, the logarithm log is taken of the Mel power spectrum to obtain the Mel power spectrum to be transformed, m.
S242: Apply a discrete cosine transform to the Mel power spectrum to be transformed to obtain the Mel-frequency cepstrum coefficients of the target voice data to be distinguished.
In this embodiment, a discrete cosine transform (Discrete Cosine Transform, DCT) is applied to the Mel power spectrum to be transformed, m, to obtain the corresponding Mel-frequency cepstrum coefficients of the target voice data to be distinguished; generally the 2nd to 13th coefficients are taken as the ASR speech features, and these ASR speech features reflect the differences between voice data. The formula for applying the discrete cosine transform to the Mel power spectrum to be transformed is C(i) = Σ_{j=1}^{N} m(j)·cos(π·i·(j - 0.5)/N), where N is the frame length, m is the Mel power spectrum to be transformed, and j is the independent variable of the Mel power spectrum to be transformed. Because the Mel filters overlap, the energy values obtained with the Mel-scale filter bank are correlated; the discrete cosine transform can reduce the dimensionality of the Mel power spectrum to be transformed, compress and abstract it, and yield the corresponding ASR speech features. Compared with the Fourier transform, the result of the discrete cosine transform has no imaginary part, which is a clear advantage for computation.
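A sketch of steps S241-S242, taking the log of the Mel power spectrum and applying the discrete cosine transform; keeping the 2nd to 13th coefficients follows the text, while the use of scipy's DCT and the small floor value to avoid log(0) are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_power(mel_power, n_coeffs=12):
    """log of the Mel power spectrum, then a DCT; keep the 2nd to 13th coefficients as ASR speech features."""
    log_mel = np.log(mel_power + 1e-10)              # Mel power spectrum to be transformed, m
    cepstrum = dct(log_mel, type=2, axis=1, norm='ortho')
    return cepstrum[:, 1:1 + n_coeffs]               # drop the 0th coefficient, keep the next 12
```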
Steps S21-S24 perform feature extraction on the target voice data to be distinguished based on ASR technology. The ASR speech features finally obtained represent the target voice data to be distinguished well, and training with these ASR speech features yields the corresponding ASR-DBN model, so that the trained ASR-DBN model gives more accurate results when differentiating speech and can separate noise from speech accurately even under very noisy conditions.
It should be noted that the feature extracted above is the Mel-frequency cepstrum coefficient, but ASR speech features should not be restricted here to Mel-frequency cepstrum coefficients alone; they should be understood as speech features obtained with ASR technology, and any feature that effectively reflects the characteristics of the voice data can serve as an ASR speech feature for recognition and model training.
S30: Input the ASR speech features into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result.
Here, the ASR-DBN model is a neural network model trained with ASR speech features, and DBN stands for Deep Belief Network. The ASR-DBN model is trained with the ASR speech features extracted from the voice data to be trained, so the model can recognise ASR speech features and thereby differentiate speech from them. In particular, unlike a traditional neural network, the ASR-DBN model is built by stacking several ASR-RBM models, where the ASR-RBM is the building block of the ASR-DBN and RBM stands for Restricted Boltzmann Machine. The voice data to be trained includes target speech and noise; when the ASR-DBN model is trained with the voice data to be trained, the ASR speech features of both the target speech and the noise are extracted, so that the ASR-DBN model can recognise target speech and the noise within the interfering speech from the ASR speech features (most of the interfering speech, namely the silent portions of the voice data and part of the noise, has already been removed when the original voice data to be distinguished was differentiated with VAD, so the interfering speech differentiated by the ASR-DBN model here specifically refers to the noise components), achieving the purpose of effectively differentiating target speech from interfering speech.
In this embodiment, the ASR speech features are input into the pre-trained ASR-DBN model for differentiation. Since the ASR speech features reflect the characteristics of the voice data, the ASR-DBN model can recognise the ASR speech features extracted from the target voice data to be distinguished and thereby make an accurate speech differentiation of the target voice data to be distinguished according to those features. This pre-trained ASR-DBN model combines ASR speech features with the deep feature extraction of a deep belief network, differentiating speech from the characteristics of the voice data itself, and it still achieves very high accuracy under very severe noise conditions. Specifically, because the features extracted by ASR also include the ASR speech features of the noise, the noise is also distinguished accurately by the ASR-DBN model, which effectively solves the problem that current speech differentiation methods (including but not limited to VAD) cannot differentiate speech effectively when the noise influence is large.
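A minimal sketch of how the differentiation of step S30 could be applied frame by frame: the ASR speech features are propagated through the stacked ASR-RBM layers and a logistic output unit, and each frame whose output probability exceeds 0.5 is labelled target speech. The parameter names, the single output unit and the 0.5 decision threshold are assumptions, not details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbn_forward(features, rbm_params, out_W, out_b):
    """Propagate ASR speech features through the stacked ASR-RBM layers, then a logistic output unit."""
    h = features
    for W, b in rbm_params:              # (weights, hidden bias) of each trained ASR-RBM, bottom to top
        h = sigmoid(h @ W + b)
    return sigmoid(h @ out_W + out_b)    # probability that each frame is target speech

# labels = (dbn_forward(asr_features, rbm_params, out_W, out_b) > 0.5).astype(int)
# 1 = target speech, 0 = interfering speech (noise)
```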
In a specific embodiment, before step S30 of inputting the ASR speech features into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result, the speech differentiation method further includes the following step: obtaining the ASR-DBN model.
As shown in Fig. 7, the step of obtaining the ASR-DBN model specifically includes:
S31: Obtain voice data to be trained and extract the to-be-trained ASR speech features of the voice data to be trained.
Here, the voice data to be trained is the training sample set of voice data required for training the ASR-DBN model, including target speech and noise. The voice data to be trained can be an open-source speech training set, or a speech training set built by collecting a large number of sample voice recordings. The voice data to be trained is divided into labelled voice data to be trained and unlabelled voice data to be trained. Labelled voice data to be trained is voice data in which speech and noise have been distinguished in advance by labels, for example target speech labelled "1" and noise labelled "0". Unlabelled voice data to be trained is the opposite: speech and noise are not distinguished by labels. The unlabelled voice data to be trained is used in the ASR-RBM training stage, and the labelled voice data to be trained is used in the ASR-DBN tuning stage; before each stage the corresponding to-be-trained ASR speech features should be extracted.
In this embodiment, the voice data to be trained is obtained and its features, the to-be-trained ASR speech features, are extracted; the extraction steps are the same as steps S21-S24 and are not repeated here. The voice data to be trained includes a target speech portion and a noise portion, each with its own ASR speech features, so the extracted to-be-trained ASR speech features can be used to train and obtain the ASR-DBN model, and the ASR-DBN model obtained by training with these features can accurately distinguish target speech from noise (noise being a kind of interfering speech).
S32: Train the ASR-RBM models one by one using the to-be-trained ASR speech features, updating the parameters according to the error produced during training, and obtain each ASR-RBM model, where the parameters include the weights and the biases, the biases comprising the bias of the visible-layer neurons and the bias of the hidden-layer neurons; the formula for updating the weights W is W' = W + λ(P(h1|v1)·v1 - P(h2|v2)·v2), where W is the weight before the update, W' is the updated weight, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer and h2 is the reconstructed hidden layer; the formula for updating the bias a is a' = a + λ(v1 - v2), where a is the bias of the visible-layer neurons before the update and a' is the updated bias of the visible-layer neurons; and the formula for updating the bias b is b' = b + λ(h1 - h2), where b is the bias of the hidden-layer neurons before the update and b' is the updated bias of the hidden-layer neurons.
The ASR-RBM is the building block of the ASR-DBN. Obtaining the ASR-DBN model requires first fully training the first ASR-RBM model and fixing its weights and biases; the ASR speech features output by the hidden layer of that ASR-RBM model are then used as the input ASR speech features of the second ASR-RBM model, which is then fully trained. After the second ASR-RBM model has been fully trained, it is stacked on top of the first ASR-RBM model, i.e. the hidden layer of the first ASR-RBM model becomes the visible layer of the second ASR-RBM model, and so on until a preset number of ASR-RBM models have been fully trained in turn.
In one embodiment, the preset number of ASR-RBM models is trained and obtained one by one using the to-be-trained ASR speech features. Training the ASR-RBM models is an unsupervised learning procedure; compared with supervised training (labelled training voice data), it retains the characteristics of the ASR speech features as much as possible while reducing the feature dimensionality, and when the sample data volume is large it avoids much of the time that would otherwise be spent in the data-acquisition phase on labelling, clearly improving training efficiency.
In one embodiment, the ASR-DBN model is obtained by stacking several ASR-RBM models, so each ASR-RBM model needs to be trained; stacking means that the hidden layer of the previous ASR-RBM model serves as the visible layer of the next ASR-RBM model, and the output value of the previous ASR-RBM model is the input value of the next. During training, the next ASR-RBM model can be trained only after the previous one has been fully trained, and so on until the last one. Each ASR-RBM model has two layers of neurons: a visible layer and a hidden layer. The visible layer is composed of visible units and is used to input the to-be-trained ASR speech features; correspondingly, the hidden layer is composed of hidden units and serves as feature detectors, producing the corresponding output values from the input values of the visible layer (the to-be-trained ASR speech features). The two layers of an ASR-RBM model (visible and hidden) are fully and bidirectionally connected, but the neurons within the visible layer, and those within the hidden layer, are not connected to each other; only neurons between the layers have symmetric connections. Therefore, given the output values of all visible units, the values taken by the hidden units are mutually independent; likewise, given the hidden layer, the values of all visible units are mutually independent. Consequently, it is not necessary to compute one neuron's value at a time: a whole layer of visible neurons, or a whole layer of hidden neurons, can be computed in parallel. In an ASR-RBM model there is a weight W between any two connected neurons expressing the connection strength, each visible-layer neuron itself has a bias a, and each hidden-layer neuron itself has a bias b.
Training an ASR-RBM model means updating the parameters W, a and b of the model. Specifically, in an ASR-RBM model the probability that a hidden neuron h_j is activated is:
P(h_j = 1 | v) = σ(b_j + Σ_i W_{i,j}·v_i) -- formula (1)
Because of the fully connected bidirectional relation, a visible-layer neuron v_i can likewise be activated by the hidden neurons:
P(v_i = 1 | h) = σ(a_i + Σ_j W_{i,j}·h_j) -- formula (2)
In the above formulas, h denotes a hidden neuron, v a visible-layer neuron, i the i-th visible-layer neuron, j the j-th hidden neuron, σ the activation function (specifically the sigmoid function), a the bias of the visible-layer neurons, b the bias of the hidden neurons, W the weights connecting the visible-layer and hidden neurons, and W_{i,j} the weight connecting the i-th visible-layer neuron to the j-th hidden neuron. After a to-be-trained ASR speech feature x is input to the visible layer, the ASR-RBM model calculates from formula (1) the probability P(h_j | x), j = 1, 2, ..., N_h, that each hidden neuron is activated; a random number μ in (0,1) is taken as a threshold (e.g. 0.5), and a hidden neuron is activated if its activation probability is greater than this threshold and not activated otherwise, expressed as: h_j = 1 if P(h_j = 1 | x) > μ, otherwise h_j = 0 -- formula (3)
In this way, whether each neuron of the hidden layer is activated can be obtained. Similarly, given the hidden layer, the visible layer is computed in the same way, expressed as: v_i = 1 if P(v_i = 1 | h) > μ, otherwise v_i = 0 -- formula (4)
Since the training process produces errors, the weights and biases of the ASR-RBM need to be updated according to the output values. Specifically, an error function is first built from the output values; this error function can specifically be a log error function. The weights and biases are then updated optimally based on the error function. For one sample of data, i.e. one to-be-trained ASR speech feature x, the weights and biases can be updated with the following steps:
(1) Input the to-be-trained ASR speech feature x to the original visible layer v1 and compute, using formula (1), the probability P(h1 = 1 | v1) that each neuron in the original hidden layer h1 is activated;
(2) Use Gibbs sampling to draw a sample from the computed probability distribution: h1 ~ P(h1 | v1);
(3) Use h1 to reconstruct the visible layer v2, i.e. infer the visible layer back from the hidden layer, computing with formula (2) the probability P(v2 | h1) that each neuron of the visible layer is activated;
(4) Likewise, use Gibbs sampling to draw a sample from the computed probability distribution: v2 ~ P(v2 | h1);
(5) Use v2 to reconstruct the hidden layer h2, i.e. infer the hidden layer back from the visible layer, computing with formula (1) the probability that each neuron of the hidden layer is activated, giving the probability distribution P(h2 | v2).
According to this method, the formula for updating the weights W is W' = W + λ(P(h1|v1)·v1 - P(h2|v2)·v2), where W is the weight before the update, W' is the updated weight and λ is the learning rate; the formula for updating the bias a is a' = a + λ(v1 - v2), where a is the bias of the visible-layer neurons before the update and a' is the updated bias; and the formula for updating the bias b is b' = b + λ(h1 - h2), where b is the bias of the hidden neurons before the update and b' is the updated bias. After several rounds of training, the hidden layer not only represents the features of the visible layer accurately but can also restore the visible layer; update training stops when the maximum number of training iterations is reached or the gradient change rate falls below a preset threshold. Each ASR-RBM model can be fully trained according to the above weight and bias update formulas; each trained ASR-RBM model retains the to-be-trained ASR speech features as far as possible while effectively reducing the feature dimensionality, so that the to-be-trained ASR speech features are represented in a lower dimension and effective speech differentiation can be achieved from the features extracted by the ASR-RBM models.
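A compact sketch of the CD-1 procedure of steps (1)-(5) and the update formulas above, for one ASR-RBM with binary units; real-valued ASR speech features would normally use a Gaussian-Bernoulli visible layer, so this is only an illustration of the update rule, with the learning rate and epoch count as assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, lr=0.1, epochs=10, seed=0):
    """One-step contrastive divergence (CD-1) following the update formulas in the text."""
    rng = np.random.default_rng(seed)
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)                              # visible-layer bias
    b = np.zeros(n_hidden)                               # hidden-layer bias
    for _ in range(epochs):
        for v1 in X:                                     # v1: original visible layer
            p_h1 = sigmoid(b + v1 @ W)                   # formula (1)
            h1 = (rng.random(n_hidden) < p_h1) * 1.0     # Gibbs sample of the hidden layer
            p_v2 = sigmoid(a + h1 @ W.T)                 # formula (2): reconstruct the visible layer
            v2 = (rng.random(n_visible) < p_v2) * 1.0
            p_h2 = sigmoid(b + v2 @ W)
            W += lr * (np.outer(v1, p_h1) - np.outer(v2, p_h2))  # W' = W + λ(P(h1|v1)v1 - P(h2|v2)v2)
            a += lr * (v1 - v2)                                   # a' = a + λ(v1 - v2)
            b += lr * (p_h1 - p_h2)                               # b' = b + λ(h1 - h2), probabilities used
    return W, a, b
```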
S33: Obtain the output value of the last ASR-RBM model, perform parameter tuning based on the output value, update the weights and biases of each ASR-RBM model, and obtain the ASR-DBN model.
In one embodiment, after each ASR-RBM model has been trained, a back-propagation layer can be added on top of the last ASR-RBM model. This back-propagation layer performs supervised training, tuning the parameters (weights and biases) using the output value of the last ASR-RBM model. Supervised training reduces the error produced during training and improves the recognition accuracy of the resulting ASR-DBN model. Specifically, the labelled to-be-trained ASR speech features can be input at the visible layer of the last ASR-RBM model, and the corresponding hidden-layer output value (the output value of the last ASR-RBM model) can be computed from formula (3). A suitable error function is built from this output value, and parameter tuning is achieved with the BP (Back Propagation) algorithm according to the error function. The BP algorithm is the common algorithm for updating neural network parameters and is not described here. It will be appreciated that each ASR-RBM model can only guarantee that the weights and biases within its own layer are optimal for that layer's feature-vector mapping, not for the feature-vector mapping of the whole ASR-DBN model; the back-propagation layer propagates the error produced during supervised training downward to every ASR-RBM model and fine-tunes the whole ASR-DBN model. Adding a back-propagation layer on the last ASR-RBM model allows the parameters (weights and biases) to be tuned effectively and further optimised, so that the trained ASR-DBN model achieves higher recognition accuracy.
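A minimal sketch of the supervised tuning of step S33, assuming a single logistic output unit stacked on the last ASR-RBM and a cross-entropy error; full BP fine-tuning would also propagate the gradient down into every ASR-RBM's weights and biases, which is omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fine_tune_output_layer(top_features, labels, out_W, out_b, lr=0.01, epochs=20):
    """Gradient steps on the back-propagation (output) layer using labelled ASR speech features."""
    for _ in range(epochs):
        p = sigmoid(top_features @ out_W + out_b)   # predicted probability of target speech
        grad = p - labels                           # gradient of the cross-entropy (log) error
        out_W -= lr * top_features.T @ grad / len(labels)
        out_b -= lr * grad.mean()
    return out_W, out_b
```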
Steps S31-S33 extract the to-be-trained ASR speech features and train and obtain the ASR-DBN model from them. Using unsupervised training for each ASR-RBM model during this process retains the to-be-trained ASR speech features well while effectively reducing the feature dimensionality, and when the sample data volume is large it avoids much of the labelling time in the data-acquisition phase, clearly improving training efficiency. A back-propagation layer is then added on the last ASR-RBM model and the parameters are tuned there with supervised training, reducing the error produced during training, improving the recognition accuracy of the ASR-DBN model and achieving accurate speech differentiation.
In the speech differentiation method provided by this embodiment, the original voice data to be distinguished is first processed with the voice activity detection algorithm (VAD) to obtain the target voice data to be distinguished: a first differentiation pass over the original voice data to be distinguished with the VAD algorithm yields target voice data to be distinguished of a smaller range, preliminarily and effectively removing the interfering speech in the original voice data to be distinguished while retaining the data that mixes target speech and interfering speech; taking this data as the target voice data to be distinguished provides an effective preliminary speech differentiation of the original voice data to be distinguished and removes a large amount of interfering speech. Corresponding ASR speech features are then obtained from the target voice data to be distinguished; these ASR speech features make the differentiation result more accurate, so that even under heavy noise the interfering speech (e.g. noise) and the target speech can be separated accurately, and they provide an important technical premise for the subsequent recognition by the corresponding ASR-DBN model. Finally, the ASR speech features are input into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result; since the ASR-DBN model is a recognition model trained specifically on ASR speech features to differentiate speech effectively, it can correctly distinguish target speech from interfering speech (which, because VAD has already been applied, here specifically means noise) in the target voice data to be distinguished, improving the accuracy of speech differentiation.
It should be understood that the size of the sequence numbers of the steps in the above embodiment does not imply an execution order; the execution order of the processes should be determined by their function and internal logic and does not constitute any limitation on the implementation of the embodiments of the present invention.
Fig. 8 shows the functional block diagram with the one-to-one speech differentiation device of speech differentiation method in embodiment.Such as Fig. 8 institutes Show, which includes target voice data acquisition module 10 to be distinguished, phonetic feature acquisition module 20 and target area Divide result acquisition module 30.Wherein, target voice data acquisition module 10 to be distinguished, phonetic feature acquisition module 20 and target area The realization function of result acquisition module 30 step corresponding with speech differentiation method in embodiment is divided to correspond, to avoid going to live in the household of one's in-laws on getting married It states, the present embodiment is not described in detail one by one.
The target to-be-distinguished voice data acquisition module 10 is configured to process the original to-be-distinguished voice data based on a voice activity detection algorithm to obtain the target to-be-distinguished voice data.
The speech feature acquisition module 20 is configured to obtain the corresponding ASR speech features based on the target to-be-distinguished voice data.
The target differentiation result acquisition module 30 is configured to input the ASR speech features into the pre-trained ASR-DBN model for differentiation to obtain the target differentiation result.
Preferably, the target to-be-distinguished voice data acquisition module 10 includes a first original differentiation voice data acquiring unit 11, a second original differentiation voice data acquiring unit 12 and a target to-be-distinguished voice data acquiring unit 13.
The first original differentiation voice data acquiring unit 11 is configured to process the original to-be-distinguished voice data according to the short-time energy feature value calculation formula to obtain the corresponding short-time energy feature values, retain the original to-be-distinguished voice data whose short-time energy feature value exceeds a first threshold, and determine it as the first original differentiation voice data, where the short-time energy feature value calculation formula is E = Σ_{n=1}^{N} s(n)^2, N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time index.
The second original differentiation voice data acquiring unit 12 is configured to process the original to-be-distinguished voice data according to the zero-crossing rate feature value calculation formula to obtain the corresponding zero-crossing rate feature values, retain the original to-be-distinguished voice data whose zero-crossing rate feature value is below a second threshold, and determine it as the second original differentiation voice data, where the zero-crossing rate feature value calculation formula is Z = (1/2) Σ_{n=2}^{N} |sgn(s(n)) − sgn(s(n−1))|, N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time index.
The target to-be-distinguished voice data acquiring unit 13 is configured to take the first original differentiation voice data and the second original differentiation voice data together as the target to-be-distinguished voice data.
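As an illustrative aid (not part of the patented embodiment itself), the two screening rules of units 11-13 can be sketched as follows; the helper names and the concrete threshold values are assumptions, and in practice the thresholds depend on the signal scale and would be tuned empirically:

```python
import numpy as np

def short_time_energy(frame):
    # E = sum over the frame of s(n)^2
    return np.sum(frame.astype(np.float64) ** 2)

def zero_crossing_rate(frame):
    # Z = (1/2) * sum |sgn(s(n)) - sgn(s(n-1))|  (a raw count per frame)
    signs = np.sign(frame)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

def vad_select(frames, energy_threshold=1e-3, zcr_threshold=30):
    """Keep frames whose short-time energy exceeds the first threshold (unit 11)
    or whose zero-crossing rate falls below the second threshold (unit 12);
    the union of the two retained sets plays the role of the target
    to-be-distinguished voice data (unit 13)."""
    selected = []
    for frame in frames:
        if short_time_energy(frame) > energy_threshold or \
           zero_crossing_rate(frame) < zcr_threshold:
            selected.append(frame)
    return selected
```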
Preferably, the speech feature acquisition module 20 includes a pre-processing voice data acquiring unit 21, a power spectrum acquiring unit 22, a Mel power spectrum acquiring unit 23 and a Mel-frequency cepstral coefficient unit 24.
The pre-processing unit 21 is configured to pre-process the target to-be-distinguished voice data to obtain pre-processed voice data.
The power spectrum acquiring unit 22 is configured to apply a fast Fourier transform to the pre-processed voice data to obtain the spectrum of the target to-be-distinguished voice data, and to obtain the power spectrum of the target to-be-distinguished voice data from that spectrum.
The Mel power spectrum acquiring unit 23 is configured to process the power spectrum of the target to-be-distinguished voice data with a Mel-scale filter bank to obtain the Mel power spectrum of the target to-be-distinguished voice data.
The Mel-frequency cepstral coefficient unit 24 is configured to perform cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target to-be-distinguished voice data.
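A minimal numerical sketch of units 22 and 23, assuming the pre-processed frames are already available; the FFT size of 512 and the use of 26 triangular Mel filters are common defaults assumed for illustration, not values fixed by this embodiment:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale between 0 Hz and Nyquist
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mel_power_spectrum(frames, sample_rate, n_fft=512, num_filters=26):
    # Unit 22: FFT of each windowed frame, then the power spectrum |X|^2 / n_fft
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    power = (np.abs(spectrum) ** 2) / n_fft
    # Unit 23: pass the power spectrum through the Mel-scale filter bank
    return power @ mel_filterbank(num_filters, n_fft, sample_rate).T
```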
Preferably, the pre-processing unit 21 includes a pre-emphasis subunit 211, a framing subunit 212 and a windowing subunit 213.
The pre-emphasis subunit 211 is configured to apply pre-emphasis processing to the target to-be-distinguished voice data. The pre-emphasis calculation formula is s'_n = s_n − a·s_{n−1}, where s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the immediately preceding moment, s'_n is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with value range 0.9 < a < 1.0.
The framing subunit 212 is configured to frame the pre-emphasized target to-be-distinguished voice data.
The windowing subunit 213 is configured to window the framed target to-be-distinguished voice data to obtain the pre-processed voice data. The windowing calculation formula is s'_n = s_n × (0.54 − 0.46·cos(2πn/(N−1))), where N is the window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the windowed signal amplitude in the time domain.
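A compact sketch of subunits 211-213 under common assumptions: 25 ms frames with a 10 ms shift and a = 0.97 are conventional defaults (the embodiment only requires 0.9 < a < 1.0), and the window applied is the standard Hamming window consistent with the windowing formula above:

```python
import numpy as np

def preprocess(signal, sample_rate, a=0.97, frame_ms=25, shift_ms=10):
    # Subunit 211: pre-emphasis  s'_n = s_n - a * s_{n-1}
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Subunit 212: framing into overlapping frames
    # (assumes the signal is at least one frame long)
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])

    # Subunit 213: windowing with a Hamming window
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window
```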
Preferably, the Mel-frequency cepstral coefficient unit 24 includes a to-be-transformed Mel power spectrum acquiring subunit 241 and a Mel-frequency cepstral coefficient subunit 242.
The to-be-transformed Mel power spectrum acquiring subunit 241 is configured to take the logarithm of the Mel power spectrum to obtain the to-be-transformed Mel power spectrum.
The Mel-frequency cepstral coefficient subunit 242 is configured to apply a discrete cosine transform to the to-be-transformed Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target to-be-distinguished voice data.
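Continuing from the Mel power spectrum, subunits 241 and 242 reduce to a logarithm followed by a discrete cosine transform; keeping the first 13 coefficients is a conventional choice assumed here for illustration, not a requirement of the embodiment:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel_power(mel_power, num_ceps=13):
    # Subunit 241: take the logarithm of the Mel power spectrum
    log_mel = np.log(mel_power + 1e-10)   # small floor avoids log(0)
    # Subunit 242: discrete cosine transform along the filter axis
    return dct(log_mel, type=2, axis=-1, norm='ortho')[..., :num_ceps]
```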
Preferably, the speech differentiation device further includes an ASR-DBN model acquisition module 40, which is configured to obtain the ASR-DBN model.
The ASR-DBN model acquisition module 40 includes a to-be-trained ASR speech feature acquiring unit 41, an ASR-RBM model acquiring unit 42 and a tuning unit 43.
The to-be-trained ASR speech feature acquiring unit 41 is configured to obtain to-be-trained voice data and extract the to-be-trained ASR speech features of the to-be-trained voice data.
The ASR-RBM model acquiring unit 42 is configured to train the ASR-RBM models one after another using the to-be-trained ASR speech features, and to update the parameters according to the error produced during training to obtain each ASR-RBM model. The parameters include weights and biases, and the biases comprise the biases of the visible-layer neurons and the biases of the hidden-layer neurons. The weight update formula is W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weight before the update, W' is the updated weight, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer. The update formula for bias a is a' = a + λ(v1 − v2), where a is the visible-layer neuron bias before the update and a' is the updated visible-layer neuron bias. The update formula for bias b is b' = b + λ(h1 − h2), where b is the hidden-layer neuron bias before the update and b' is the updated hidden-layer neuron bias.
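A minimal sketch of one contrastive-divergence update implementing the three formulas above; the sigmoid activations, the sampling of h1, the mean-field reconstruction of v2, and the use of hidden-unit probabilities in place of the sampled h1 and h2 in the update terms are standard RBM practice assumed here rather than spelled out in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_cd1_update(W, a, b, v1, rng, lr=0.01):
    """One contrastive-divergence step for a single training vector v1.
    W: (n_visible, n_hidden) weights, a: visible biases, b: hidden biases."""
    # P(h1|v1): hidden-unit probabilities given the original visible layer
    p_h1 = sigmoid(v1 @ W + b)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)   # sampled hidden states
    # Reconstruction: v2 from h1, then P(h2|v2) from the reconstructed visible layer
    v2 = sigmoid(h1 @ W.T + a)
    p_h2 = sigmoid(v2 @ W + b)
    # W' = W + lambda * (P(h1|v1) v1 - P(h2|v2) v2)   (outer products)
    W += lr * (np.outer(v1, p_h1) - np.outer(v2, p_h2))
    # a' = a + lambda * (v1 - v2),  b' = b + lambda * (h1 - h2)
    a += lr * (v1 - v2)
    b += lr * (p_h1 - p_h2)
    return W, a, b
```

W, a and b must be float arrays (for example small random weights and zero biases) so that the in-place updates take effect; rng can be created with np.random.default_rng().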
The tuning unit 43 is configured to obtain the output value of the last ASR-RBM model, perform parameter tuning based on that output value, and update the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
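A sketch of the supervised tuning step, assuming the pre-trained ASR-RBM weights initialise a feed-forward sigmoid stack to which a single logistic output layer (the back-propagation layer) is appended; the cross-entropy loss, layer shapes and learning rate are illustrative assumptions, not prescribed by the embodiment:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_step(layers, W_out, b_out, x, y, lr=0.01):
    """One supervised gradient step through the whole stack.
    layers: list of (W, b) pairs taken from the pre-trained ASR-RBM models;
    W_out (n_last, 1), b_out (1,): the added back-propagation layer; y in {0, 1}."""
    # Forward pass through the pre-trained sigmoid layers
    acts = [x]
    for W, b in layers:
        acts.append(sigmoid(acts[-1] @ W + b))
    p = sigmoid(acts[-1] @ W_out + b_out)          # output probability, shape (1,)

    # Backward pass: all deltas are computed before any weights are changed
    delta = p - y                                  # cross-entropy + sigmoid output error
    deltas = []
    d = (delta @ W_out.T) * acts[-1] * (1.0 - acts[-1])
    deltas.append(d)
    for i in range(len(layers) - 1, 0, -1):
        W, _ = layers[i]
        d = (d @ W.T) * acts[i] * (1.0 - acts[i])
        deltas.append(d)
    deltas.reverse()                               # deltas[i] belongs to layers[i]

    # Parameter updates: the output layer and every pre-trained ASR-RBM layer
    W_out -= lr * np.outer(acts[-1], delta)
    b_out -= lr * delta
    for i, (W, b) in enumerate(layers):
        W -= lr * np.outer(acts[i], deltas[i])
        b -= lr * deltas[i]
```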
This embodiment provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the speech differentiation method in the embodiment is implemented; to avoid repetition, it is not described again here. Alternatively, when the computer program is executed by a processor, the functions of each module/unit of the speech differentiation device in the embodiment are implemented; to avoid repetition, they are not described again here.
It should be understood that the computer-readable storage medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, and the like.
Fig. 9 is a schematic diagram of the computer device in this embodiment. As shown in Fig. 9, the computer device 50 includes a processor 51, a memory 52 and a computer program 53 that is stored in the memory 52 and can run on the processor 51. When the processor 51 executes the computer program 53, each step of the speech differentiation method in the embodiment is implemented, for example steps S10, S20 and S30 shown in Fig. 2. Alternatively, when the processor 51 executes the computer program 53, the functions of each module/unit of the speech differentiation device in the embodiment are implemented, for example the functions of the target to-be-distinguished voice data acquisition module 10, the speech feature acquisition module 20 and the target differentiation result acquisition module 30 shown in Fig. 8.
It will be clear to those skilled in the art that, for convenience and brevity of description, the above division into functional units and modules is used only as an example. In practical applications, the above functions may be assigned to different functional units or modules as required; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims (10)

1. A speech differentiation method, characterized by comprising:
processing original to-be-distinguished voice data based on a voice activity detection algorithm to obtain target to-be-distinguished voice data;
obtaining corresponding ASR speech features based on the target to-be-distinguished voice data; and
inputting the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
2. The speech differentiation method according to claim 1, characterized in that, before the step of inputting the ASR speech features into the pre-trained ASR-DBN model for differentiation to obtain the differentiation result, the speech differentiation method further comprises: obtaining the ASR-DBN model;
the step of obtaining the ASR-DBN model comprises:
obtaining to-be-trained voice data, and extracting to-be-trained ASR speech features of the to-be-trained voice data;
training ASR-RBM models one after another using the to-be-trained ASR speech features, and updating parameters according to the error produced during training to obtain each ASR-RBM model, wherein the parameters include weights and biases, the biases comprise biases of visible-layer neurons and biases of hidden-layer neurons, the weight update formula is W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weight before the update, W' is the updated weight, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer; the update formula for bias a is a' = a + λ(v1 − v2), where a is the visible-layer neuron bias before the update and a' is the updated visible-layer neuron bias; and the update formula for bias b is b' = b + λ(h1 − h2), where b is the hidden-layer neuron bias before the update and b' is the updated hidden-layer neuron bias; and
obtaining an output value of the last ASR-RBM model, performing parameter tuning based on the output value, and updating the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
3. The speech differentiation method according to claim 1, characterized in that processing the original to-be-distinguished voice data based on the voice activity detection algorithm to obtain the target to-be-distinguished voice data comprises:
processing the original to-be-distinguished voice data according to a short-time energy feature value calculation formula to obtain corresponding short-time energy feature values, retaining the original to-be-distinguished voice data whose short-time energy feature value exceeds a first threshold, and determining it as first original differentiation voice data, the short-time energy feature value calculation formula being E = Σ_{n=1}^{N} s(n)^2, wherein N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time index;
processing the original to-be-distinguished voice data according to a zero-crossing rate feature value calculation formula to obtain corresponding zero-crossing rate feature values, retaining the original to-be-distinguished voice data whose zero-crossing rate feature value is below a second threshold, and determining it as second original differentiation voice data, the zero-crossing rate feature value calculation formula being Z = (1/2) Σ_{n=2}^{N} |sgn(s(n)) − sgn(s(n−1))|, wherein N is the speech frame length, s(n) is the signal amplitude in the time domain, and n is the time index; and
taking the first original differentiation voice data and the second original differentiation voice data as the target to-be-distinguished voice data.
4. The speech differentiation method according to claim 1, characterized in that obtaining the corresponding ASR speech features based on the target to-be-distinguished voice data comprises:
pre-processing the target to-be-distinguished voice data to obtain pre-processed voice data;
applying a fast Fourier transform to the pre-processed voice data to obtain the spectrum of the target to-be-distinguished voice data, and obtaining the power spectrum of the target to-be-distinguished voice data from the spectrum;
processing the power spectrum of the target to-be-distinguished voice data with a Mel-scale filter bank to obtain the Mel power spectrum of the target to-be-distinguished voice data; and
performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target to-be-distinguished voice data.
5. The speech differentiation method according to claim 4, characterized in that pre-processing the target to-be-distinguished voice data to obtain the pre-processed voice data comprises:
applying pre-emphasis processing to the target to-be-distinguished voice data, the pre-emphasis calculation formula being s'_n = s_n − a·s_{n−1}, wherein s_n is the signal amplitude in the time domain, s_{n−1} is the signal amplitude at the immediately preceding moment, s'_n is the pre-emphasized signal amplitude in the time domain, and a is the pre-emphasis coefficient with value range 0.9 < a < 1.0;
framing the pre-emphasized target to-be-distinguished voice data; and
windowing the framed target to-be-distinguished voice data to obtain the pre-processed voice data, the windowing calculation formula being s'_n = s_n × (0.54 − 0.46·cos(2πn/(N−1))), wherein N is the window length, n is the time index, s_n is the signal amplitude in the time domain, and s'_n is the windowed signal amplitude in the time domain.
6. The speech differentiation method according to claim 4, characterized in that performing cepstral analysis on the Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target to-be-distinguished voice data comprises:
taking the logarithm of the Mel power spectrum to obtain a to-be-transformed Mel power spectrum; and
applying a discrete cosine transform to the to-be-transformed Mel power spectrum to obtain the Mel-frequency cepstral coefficients of the target to-be-distinguished voice data.
7. A speech differentiation device, characterized by comprising:
a target to-be-distinguished voice data acquisition module, configured to process original to-be-distinguished voice data based on a voice activity detection algorithm to obtain target to-be-distinguished voice data;
a speech feature acquisition module, configured to obtain corresponding ASR speech features based on the target to-be-distinguished voice data; and
a target differentiation result acquisition module, configured to input the ASR speech features into a pre-trained ASR-DBN model for differentiation to obtain a target differentiation result.
8. The speech differentiation device according to claim 7, characterized in that the speech differentiation device further comprises an ASR-DBN model acquisition module, the ASR-DBN model acquisition module comprising:
a to-be-trained ASR speech feature acquiring unit, configured to obtain to-be-trained voice data and extract to-be-trained ASR speech features of the to-be-trained voice data;
an ASR-RBM model acquiring unit, configured to train ASR-RBM models one after another using the to-be-trained ASR speech features, and to update parameters according to the error produced during training to obtain each ASR-RBM model, wherein the parameters include weights and biases, the biases comprise biases of visible-layer neurons and biases of hidden-layer neurons, the weight update formula is W' = W + λ(P(h1|v1)v1 − P(h2|v2)v2), where W is the weight before the update, W' is the updated weight, λ is the learning rate, v1 is the original visible layer, v2 is the reconstructed visible layer, h1 is the original hidden layer, and h2 is the reconstructed hidden layer; the update formula for bias a is a' = a + λ(v1 − v2), where a is the visible-layer neuron bias before the update and a' is the updated visible-layer neuron bias; and the update formula for bias b is b' = b + λ(h1 − h2), where b is the hidden-layer neuron bias before the update and b' is the updated hidden-layer neuron bias; and
a tuning unit, configured to obtain an output value of the last ASR-RBM model, perform parameter tuning based on the output value, and update the weights and biases of each ASR-RBM model to obtain the ASR-DBN model.
9. A computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the speech differentiation method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the speech differentiation method according to any one of claims 1 to 6.
CN201810561789.6A 2018-06-04 2018-06-04 Speech differentiation method, apparatus, computer equipment and storage medium Pending CN108806725A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810561789.6A CN108806725A (en) 2018-06-04 2018-06-04 Speech differentiation method, apparatus, computer equipment and storage medium
PCT/CN2018/094342 WO2019232867A1 (en) 2018-06-04 2018-07-03 Voice discrimination method and apparatus, and computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810561789.6A CN108806725A (en) 2018-06-04 2018-06-04 Speech differentiation method, apparatus, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN108806725A true CN108806725A (en) 2018-11-13

Family

ID=64090244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810561789.6A Pending CN108806725A (en) 2018-06-04 2018-06-04 Speech differentiation method, apparatus, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN108806725A (en)
WO (1) WO2019232867A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109739112A (en) * 2018-12-29 2019-05-10 张卫校 A kind of wobble objects control method and wobble objects
CN110047510A (en) * 2019-04-15 2019-07-23 北京达佳互联信息技术有限公司 Audio identification methods, device, computer equipment and storage medium
WO2019232867A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Voice discrimination method and apparatus, and computer device, and storage medium
CN112652324A (en) * 2020-12-28 2021-04-13 深圳万兴软件有限公司 Speech enhancement optimization method, speech enhancement optimization system and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104157290A (en) * 2014-08-19 2014-11-19 大连理工大学 Speaker recognition method based on depth learning
KR101561651B1 (en) * 2014-05-23 2015-11-02 서강대학교산학협력단 Interest detecting method and apparatus based feature data of voice signal using Deep Belief Network, recording medium recording program of the method
CN107358966A (en) * 2017-06-27 2017-11-17 北京理工大学 Based on deep learning speech enhan-cement without reference voice quality objective evaluation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100580770C (en) * 2005-08-08 2010-01-13 中国科学院声学研究所 Voice end detection method based on energy and harmonic
US10229700B2 (en) * 2015-09-24 2019-03-12 Google Llc Voice activity detection
CN108806725A (en) * 2018-06-04 2018-11-13 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIAO-LEI ZHANG: "Deep Belief Networks Based Voice Activity Detection", IEEE Transactions on Audio, Speech, and Language Processing, 30 April 2013, pages 697-709 *
Song Zhiyong: "MATLAB Speech Signal Analysis and Synthesis", 30 January 2018, page 118 *
Zhang Xueying: "Fundamentals of Digital Signal Processing", Xidian University Press, pages 212-214 *
He Hongbing: "Principles and Experimental Techniques of Radar Target Recognition", 30 December 2017, page 132 *


Also Published As

Publication number Publication date
WO2019232867A1 (en) 2019-12-12

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN110428842A (en) Speech model training method, device, equipment and computer readable storage medium
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN113012720B (en) Depression detection method by multi-voice feature fusion under spectral subtraction noise reduction
CN111292762A (en) Single-channel voice separation method based on deep learning
CN105023580A (en) Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN111341319B (en) Audio scene identification method and system based on local texture features
Mallidi et al. Novel neural network based fusion for multistream ASR
CN111128209A (en) Speech enhancement method based on mixed masking learning target
CN111951796A (en) Voice recognition method and device, electronic equipment and storage medium
CN111883181A (en) Audio detection method and device, storage medium and electronic device
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
Thomas et al. Acoustic and data-driven features for robust speech activity detection
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
Lim et al. Harmonic and percussive source separation using a convolutional auto encoder
Sekkate et al. Speaker identification for OFDM-based aeronautical communication system
CN112735466A (en) Audio detection method and device
KR20190135916A (en) Apparatus and method for determining user stress using speech signal
CN114613387A (en) Voice separation method and device, electronic equipment and storage medium
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20181113)