CN109584896A - Speech chip and electronic device - Google Patents

Speech chip and electronic device

Info

Publication number
CN109584896A
Authority
CN
China
Prior art keywords
voice signal
signal
voice
module
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811293499.4A
Other languages
Chinese (zh)
Inventor
肖佳林
王欢良
唐浩元
王佳珺
吴洪宇
马殿昌
李志�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Qdreamer Network Science And Technology Co Ltd
Original Assignee
Suzhou Qdreamer Network Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Qdreamer Network Science And Technology Co Ltd
Priority to CN201811293499.4A
Publication of CN109584896A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 — Speech classification or search
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 — Processing in the time domain
    • G10L21/0232 — Processing in the frequency domain
    • G10L21/0272 — Voice signal separating
    • G10L21/028 — Voice signal separating using properties of sound source
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 — The extracted parameters being the cepstrum
    • G10L2015/088 — Word spotting
    • G10L2021/02082 — Noise filtering, the noise being echo, reverberation of the speech

Abstract

The present invention relates to a speech chip and an electronic device using it, comprising: an audio collection module for collecting a voice signal; a front-end array processing module, connected to the audio collection module, for processing the voice signal; a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the start endpoint and the end endpoint of the voice signal, and each speech endpoint corresponds to at least one frame of the voice signal; a voice wake-up module, connected to the voice activity detection module, for waking the electronic device when it determines, based on the speech endpoints, that the voice signal contains a preset wake-up phrase; and a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction. The present invention can improve the accuracy of speech recognition.

Description

Speech chip and electronic device
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech chip and an electronic device.
Background technique
The intelligent voice functions of an electronic device aim to realize full human-machine voice interaction: the ultimate goal is for the device to understand human language and execute the corresponding function. At present, the chips used in electronic devices mostly lack such intelligent functions, and those that have them are expensive and power-hungry. Chips that currently support voice interaction base their speech recognition on deep-learning neural network algorithms, but these algorithms are computationally intensive, so such chips compute slowly and consume much power. To keep the computation tractable, chips supporting voice interaction often simplify the deep-learning neural network, which degrades recognition performance and, in turn, leaves users with a poor experience when interacting with the device by voice.
Summary of the invention
In view of the low accuracy of current speech recognition, it is therefore necessary to provide a speech chip and an electronic device.
A speech chip, applied to an electronic device, comprising:
an audio collection module for collecting a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include a start endpoint and an end endpoint of the voice signal, and each speech endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking the electronic device when it is determined, based on the speech endpoints, that the voice signal contains a preset wake-up phrase;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
Preferably, the front-end array processing module comprises:
an echo cancellation unit for performing echo cancellation on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
Preferably, the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component, and apply pre-emphasis, forming a first input signal;
filter the first input signal with the foreground filter and store the resulting error-signal variance as Sff;
filter the first input signal with the background filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
Preferably, the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal from the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert the result back to the time domain.
Preferably, the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the differing propagation distances;
multiply the time differences by the speed of sound to obtain distance differences;
compute a family of hyperboloids from the geometry and the distance differences, and obtain the sound source position from the intersection of the hyperboloids.
Preferably, the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the result back to the time domain.
Preferably, the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
feed the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of each phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech frame;
after the last frame has been judged, apply smoothing filtering to the earlier decisions to obtain the final speech-endpoint decision.
Preferably, the voice wake-up module is specifically configured to:
perform speech feature extraction on the voice signal to obtain corresponding speech feature vectors;
feed the speech feature vectors into a DNN model to obtain the posterior probability that the voice signal corresponding to each feature vector is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence score;
if the confidence score exceeds a preset value, judge that the voice signal corresponding to the feature vector contains a keyword;
if the keywords occur in the set order, wake the electronic device.
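A common realization of the smoothing and confidence steps in DNN-based keyword spotting is a moving average of the keyword posterior over recent frames followed by peak picking; the sketch below follows that pattern (the exact smoothing rule and the threshold are illustrative assumptions, as the text does not give them):

```python
import numpy as np

def keyword_confidence(posteriors, kw_id, win=3):
    """Smooth per-frame keyword posteriors and return a confidence score.

    posteriors: (frames, classes) DNN outputs; kw_id indexes the keyword class.
    """
    p = posteriors[:, kw_id]
    # Moving-average smoothing over `win` consecutive frames
    smoothed = np.convolve(p, np.ones(win) / win, mode="valid")
    return smoothed.max()              # confidence = peak smoothed posterior

# Toy posteriors over two classes: column 0 = non-keyword, column 1 = keyword
post = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.8, 0.2],
])
conf = keyword_confidence(post, kw_id=1, win=3)
assert np.isclose(conf, 0.8)           # peak of the 3-frame averages
assert conf > 0.6                      # exceeds the preset value: keyword judged present
```

In a multi-keyword wake-up phrase, this score would be computed per keyword and the device woken only when the keywords fire in the set order, as the claim states.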
Preferably, the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction represented by the text, and cause the electronic device to execute the interactive instruction.
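The text does not specify the decoder that produces the optimal output word sequence; as a minimal stand-in, a greedy best-path decode over per-frame output probabilities with repeat collapsing (a CTC-style assumption, not the patent's decoder) looks like this:

```python
import numpy as np

def greedy_decode(frame_probs, vocab, blank=0):
    """Greedy best-path decode: argmax per frame, collapse repeats, drop blanks."""
    path = frame_probs.argmax(axis=1)
    words, prev = [], blank
    for idx in path:
        if idx != prev and idx != blank:
            words.append(vocab[idx])
        prev = idx
    return " ".join(words)

# Hypothetical 4-entry vocabulary and per-frame output probabilities
vocab = {0: "<blank>", 1: "turn", 2: "on", 3: "light"}
probs = np.array([
    [0.1, 0.8, 0.05, 0.05],   # "turn"
    [0.1, 0.7, 0.1, 0.1],     # "turn" again (collapsed as a repeat)
    [0.8, 0.1, 0.05, 0.05],   # blank
    [0.1, 0.1, 0.7, 0.1],     # "on"
    [0.1, 0.1, 0.1, 0.7],     # "light"
])
assert greedy_decode(probs, vocab) == "turn on light"
```

The decoded text would then be mapped to the interactive instruction, e.g. a device-control command.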
An electronic device, the electronic device comprising the speech chip described above.
The speech chip described above comprises an audio collection module, a front-end array processing module, a voice activity detection (VAD) module, a voice wake-up module and a speech recognition module. The audio collection module first collects the voice signal and passes it to the front-end array processing module for processing. The VAD module, connected to the front-end array processing module, detects the beginning and end of the user's speech in the processed voice signal. The voice wake-up module compares the user's speech against the preset wake-up word by similarity; if they match, the device is woken from its dormant state. After a successful wake-up, the front-end array algorithms localize the speaker's direction and perform directional speech enhancement, and the enhanced speech is fed to the speech recognition module; the electronic device then performs the action corresponding to the recognized instruction, realizing human-machine voice interaction. Thus, after waking the electronic device, the present invention feeds the localized, directionally enhanced voice signal into the speech recognition module, improving the accuracy of speech recognition.
Detailed description of the invention
Fig. 1 is a structural diagram of the speech chip of one embodiment.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Fig. 1 is a structural diagram of the speech chip of one embodiment. As shown in Fig. 1, the speech chip is applied to an electronic device and comprises:
an audio collection module for collecting a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include a start endpoint and an end endpoint of the voice signal, and each speech endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking the electronic device when it is determined, based on the speech endpoints, that the voice signal contains a preset wake-up phrase;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
The speech chip described above comprises an audio collection module, a front-end array processing module, a voice activity detection (VAD; in this embodiment the voice activity detection module is abbreviated as the VAD module) module, a voice wake-up module and a speech recognition module. The audio collection module first collects the voice signal and passes it to the front-end array processing module for processing. The VAD module, connected to the front-end array processing module, detects the beginning and end of the user's speech in the processed voice signal. The voice wake-up module compares the user's speech against the preset wake-up word by similarity; if they match, the device is woken from its dormant state. After a successful wake-up, the front-end array algorithms localize the speaker's direction and perform directional speech enhancement, and the enhanced speech is fed to the speech recognition module; the electronic device then performs the action corresponding to the recognized instruction, realizing human-machine voice interaction. Thus, after waking the electronic device, the present invention feeds the localized, directionally enhanced voice signal into the speech recognition module, improving the accuracy of speech recognition.
In this embodiment, the front-end array processing module, the VAD module, the voice wake-up module and the speech recognition module are implemented with hardware accelerators to improve execution speed. This speech chip can be applied to various smart devices to realize human-machine voice interaction.
In this embodiment, the audio collection module may be a microphone (Mic), which passes the received voice signal to the front-end array processing module.
In this embodiment, the front-end array processing module can perform echo cancellation, beamforming, sound source localization, dereverberation and other processing operations, all of which can be hardware-accelerated. The following hardware accelerators are mainly used: an FFT/IFFT accelerator, a matrix-multiply accelerator, a matrix-inverse accelerator, a determinant accelerator, an eigenvalue/eigenvector accelerator, a SIMD accelerator, a mathematical-operation accelerator (DMA accelerator), and a Cholesky-product accelerator. The mathematical-operation accelerator mainly provides: trigonometric functions, logarithmic functions, exponential functions, summation, differencing, division, square roots, powers, absolute value, floating-point/integer conversion, etc.
The following terms are used in this embodiment: FFT (Fast Fourier Transform), IFFT (Inverse Fast Fourier Transform), SIMD (Single Instruction Multiple Data), DMA (Direct Memory Access); the Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower triangular matrix L and its transpose.
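As a quick illustration of the Cholesky decomposition defined above (a NumPy sketch, not the chip's accelerator implementation):

```python
import numpy as np

# A symmetric positive-definite matrix
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# Cholesky decomposition: A = L @ L.T with L lower triangular
L = np.linalg.cholesky(A)

assert np.allclose(L @ L.T, A)     # reconstruction of A
assert np.allclose(L, np.tril(L))  # L is lower triangular
```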
In one embodiment of the invention, the front-end array processing module comprises:
an echo cancellation unit for performing echo cancellation on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
In an embodiment of the present invention, the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component, and apply pre-emphasis, forming a first input signal;
filter the first input signal with the foreground filter and store the resulting error-signal variance as Sff;
filter the first input signal with the background filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
In this embodiment, echo cancellation is implemented as follows. First, a notch filter removes the DC component from the voice signal received by the microphone array, and pre-emphasis is applied, forming the first input signal. The first input signal is then filtered by the foreground filter, with the computation accelerated by the SIMD and DMA accelerators; the filtered output is stored in the latter half of e[], and the error-signal variance is stored as Sff. Next, the background filter tap coefficients W are computed, based on the normalized least-mean-square (NLMS) adaptive filter, realized in the frequency domain as a multi-delay block frequency-domain adaptive filter (MDF); the optimal step size is derived as the ratio of the residual-echo variance to the error-signal variance. The residual-echo variance is computed, using the SIMD accelerator, by defining a leakage coefficient, which is obtained from the recursively averaged autocorrelation of each frequency bin and the cross-correlation of the input and error signals. The first input signal is then filtered again by the background filter, again accelerated by the SIMD and DMA accelerators; the resulting error-signal variance is stored as See, and the error is stored in the first half of e[]. See and Sff are then jointly examined to decide whether the foreground filter coefficients need updating or the background filter needs resetting; if so, adaptive filtering and weight updates are performed, and the time-domain energy of the error signal is updated, using the SIMD accelerator. Finally, the filter output out = input − filtered output (the latter half of e[]) is computed with the SIMD accelerator, de-emphasis is applied, and echo cancellation is complete.
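The foreground/background structure and the MDF frequency-domain realization are chip-specific, but the NLMS update at the core of the description above can be sketched in a few lines. This is a time-domain toy with a simulated echo path (an illustrative assumption; the patent's filter runs in the frequency domain):

```python
import numpy as np

rng = np.random.default_rng(0)

# Far-end reference signal and a short simulated echo path
x = rng.standard_normal(20000)
h_true = np.array([0.5, -0.3, 0.2, 0.1])       # unknown echo path
d = np.convolve(x, h_true)[: len(x)]           # microphone picks up the echo

# Time-domain NLMS adaptive filter
L = len(h_true)
w = np.zeros(L)
mu, eps = 0.5, 1e-8
err = np.zeros(len(x))
for n in range(L, len(x)):
    u = x[n - L + 1 : n + 1][::-1]             # most recent L input samples
    y = w @ u                                  # echo estimate
    e = d[n] - y                               # error = residual echo
    w += mu * e * u / (u @ u + eps)            # normalized LMS update
    err[n] = e

# After convergence the filter matches the echo path and the residual is tiny
assert np.allclose(w, h_true, atol=1e-3)
assert abs(err[-1]) < 1e-3
```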
It should be pointed out that the specific implementation of the above unit is only one preferred choice within this embodiment; the embodiment is not limited to it, and appropriate variations or modifications of the above process or method that still realize the technical solution of the present invention fall within the protection scope of the present invention.
In one implementation of this embodiment, the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal from the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert the result back to the time domain.
In this embodiment, beamforming is implemented as follows. First, the FFT accelerator applies a short-time Fourier transform (STFT) to the signal y_t received by the microphone array. The SIMD accelerator then accelerates the initialization of the model parameters and R_f(v). Next, following the CGMM (complex Gaussian mixture model) principle, the matrix-multiply and DMA accelerators are used for parameter estimation, after which the noise covariance Rn(f), the noisy-speech covariance Rk+n(f) and the speech covariance Rk(f) are estimated, with the DMA, matrix-multiply and SIMD accelerators accelerating the computation. The eigenvalue/eigenvector accelerator then performs an eigenvalue decomposition of the matrix Rk(f); the eigenvector corresponding to the maximum eigenvalue is the target-speech direction vector r_f. From Rn(f) and r_f the beamforming weights are computed, with the matrix-multiply and SIMD accelerators accelerating the calculation. Finally, the enhanced signal is computed with the SIMD accelerator, the IFFT accelerator applies an inverse short-time Fourier transform (ISTFT) to it, and the signal is converted back to the time domain. This completes beamforming.
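The patent does not spell out the weight formula; one common way to combine the noise covariance Rn(f) with the steering vector r_f obtained from the eigenvalue decomposition is the MVDR beamformer. A minimal single-frequency-bin sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4                                                # microphones

# Simulated spatial covariances at one frequency bin
r_true = np.exp(1j * rng.uniform(0, 2 * np.pi, M))   # true steering vector
Rk = np.outer(r_true, r_true.conj())                 # rank-1 speech covariance
Rn = np.eye(M) + 0.1 * np.diag(rng.uniform(size=M))  # noise covariance

# Steering vector estimate = principal eigenvector of the speech covariance
vals, vecs = np.linalg.eigh(Rk)
r_f = vecs[:, np.argmax(vals)]

# MVDR weights: w = Rn^{-1} r_f / (r_f^H Rn^{-1} r_f)
Rn_inv_r = np.linalg.solve(Rn, r_f)
w = Rn_inv_r / (r_f.conj() @ Rn_inv_r)

# Distortionless constraint: the target direction is passed with unit gain
assert np.isclose(w.conj() @ r_f, 1.0)
```

Per-bin weights like `w` would then be applied to each STFT frame before the ISTFT brings the enhanced signal back to the time domain.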
It should be pointed out that the specific implementation of the above unit is only one preferred choice within this embodiment; the embodiment is not limited to it, and appropriate variations or modifications of the above process or method that still realize the technical solution of the present invention fall within the protection scope of the present invention.
In this embodiment, the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the differing propagation distances;
multiply the time differences by the speed of sound to obtain distance differences;
compute a family of hyperboloids from the geometry and the distance differences, and obtain the sound source position from the intersection of the hyperboloids.
In this embodiment, sound source localization is implemented as follows. First, the time difference of arrival (TDOA) of the voice signal between the different microphones of the array, caused by the differing propagation distances, is estimated; that is, time delay estimation (TDE) is performed. The generalized cross-correlation (GCC) method is used for time delay estimation: the FFT accelerator first applies a fast Fourier transform (FFT) to the audio signals received by the different microphones. The generalized cross-correlation function is then defined; a weighting function in the frequency domain enhances the direct-path part of the speech signal, suppresses noise and reverberation, and sharpens the correlation peak, with the SIMD accelerator accelerating the computation. The IFFT accelerator then applies an inverse fast Fourier transform (IFFT) to the weighted signal. The peak of the generalized cross-correlation function is detected to obtain the TDOA. The TDOA is then multiplied by the speed of sound to obtain the distance difference; from the geometry and the distance differences a family of hyperboloids is obtained, and the sound source position is given by their intersection. This completes sound source localization.
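The frequency-domain weighting function is not named in the text; the PHAT weighting is a common choice for GCC time delay estimation, and a sketch under that assumption looks like this:

```python
import numpy as np

def gcc_phat(sig, ref):
    """Estimate the delay (in samples) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)               # generalized cross-correlation
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[: max_shift + 1]))
    return np.argmax(np.abs(cc)) - max_shift  # lag of the correlation peak

rng = np.random.default_rng(2)
s = rng.standard_normal(4096)
delay = 7
mic1 = s
mic2 = np.concatenate((np.zeros(delay), s[:-delay]))  # same source, 7 samples later

assert gcc_phat(mic2, mic1) == delay
assert gcc_phat(mic1, mic2) == -delay
```

Multiplying the estimated lag by the sample period and the speed of sound gives the distance difference used in the hyperboloid intersection.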
It should be pointed out that the specific implementation of the above unit is only one preferred choice within this embodiment; the embodiment is not limited to it, and appropriate variations or modifications of the above process or method that still realize the technical solution of the present invention fall within the protection scope of the present invention.
In this embodiment, the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the result back to the time domain.
In one implementation of this embodiment, dereverberation is implemented as follows. At initialization, the FFT accelerator applies a short-time Fourier transform (STFT) to the voice signal y_t received by the microphone array. The matrix-multiply accelerator then computes the required statistics and filter coefficients, after which it computes the dereverberated frequency-domain signal; the IFFT accelerator applies an inverse short-time Fourier transform (ISTFT), and the signal is converted back to the time domain, completing dereverberation. When updating, the FFT accelerator likewise first applies an STFT to the array signal y_t; the matrix-multiply accelerator recomputes the statistics, updating each frame of data at each frequency bin, and the previously updated quantities are used to compute the updated filter. Finally, the matrix-multiply accelerator computes the dereverberated frequency-domain signal, the IFFT accelerator applies an ISTFT, and the signal is converted back to the time domain, completing this round of dereverberation.
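The per-bin dereverberation math is delegated to the accelerators above; the surrounding STFT → per-bin processing → inverse STFT pipeline can be sketched as follows (non-overlapping rectangular frames for simplicity, and an identity "filter" so the round trip is checkable; a real implementation would use overlapping windowed frames and replace the marked step with, e.g., delayed linear prediction):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(16384)          # time-domain signal, 32 frames of 512
frame = 512

# STFT stand-in: time domain -> (frame index, frequency bin) grid
frames = y.reshape(-1, frame)
Y = np.fft.rfft(frames, axis=1)

# Per-bin dereverberation processing would go here; identity keeps it checkable.
Y_derev = Y

# Inverse transform: back to the time domain
y_out = np.fft.irfft(Y_derev, n=frame, axis=1).reshape(-1)

assert Y.shape == (32, 257)             # 32 frames, 257 one-sided bins
assert np.allclose(y_out, y)            # identity processing reconstructs exactly
```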
It should be pointed out that the specific implementation of the above unit is only one preferred choice within this embodiment; the embodiment is not limited to it, and appropriate variations or modifications of the above process or method that still realize the technical solution of the present invention fall within the protection scope of the present invention.
In one implementation of this embodiment, the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
feed the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of each phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech frame;
after the last frame has been judged, apply smoothing filtering to the earlier decisions to obtain the final speech-endpoint decision.
It should be pointed out that the VAD module, the voice wake-up module and the speech recognition module in this embodiment mainly use the following hardware accelerators: the SIMD accelerator, the mathematical-operation accelerator (DMA), the FFT/IFFT accelerator, and a neural-network accelerator (neural-network processing unit, NPU). The NPU can flexibly support various neural network models, mainly: deep neural networks (DNN), recurrent neural networks (RNN), convolutional neural networks (CNN), time-delay neural networks (TDNN), etc.
The VAD module in the present embodiment is realized as follows. The incoming voice signal is first pre-processed, including framing and pre-filtering. FBank features are then extracted frame by frame from the pre-processed signal using a filter-bank algorithm. Endpoint judgment is performed next: a deep neural network (DNN) model trained to classify phonemes receives the FBank features of each frame and outputs, for each frame, the posterior probability (also called the output probability) of every phoneme in the phone set. The output probabilities of all non-noise, non-silence phonemes are then summed; if the sum exceeds the set threshold, the frame is considered speech. After the last frame has passed endpoint judgment, a post-processing step applies smoothing filtering to the preceding decisions to obtain the final voice-endpoint decision, completing voice activity detection. The VAD module is accelerated using the neural-network accelerator (neural-network process units, NPU).
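The posterior-summing and smoothing steps of the VAD decision can be sketched as follows. The threshold, smoothing window and index layout are illustrative assumptions, not values from the patent.

```python
import numpy as np

def vad_decisions(posteriors, speech_idx, threshold=0.5, smooth=5):
    """posteriors: (n_frames, n_phones) per-frame phoneme posteriors.
    speech_idx: indices of the non-noise, non-silence phonemes.
    Sum the speech-phoneme posteriors, threshold per frame, then
    smooth the binary decisions with a moving average."""
    speech_prob = posteriors[:, speech_idx].sum(axis=1)   # per-frame sum
    raw = (speech_prob > threshold).astype(float)         # frame decision
    kernel = np.ones(smooth) / smooth
    smoothed = np.convolve(raw, kernel, mode="same")      # smoothing filter
    return smoothed > 0.5
```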
It should be pointed out that the specific implementation process of the above module is only one preferred option of the present embodiment. The present embodiment is not limited thereto; appropriate variations or changes may be made to the above specific processes and methods to realize the technical solution of the present invention, and such variations fall within the protection scope of the present invention.
In one implementation of the present embodiment, the voice wake-up module is specifically configured to:
perform speech feature extraction on the voice signal to obtain a corresponding speech feature vector;
input the speech feature vector into a DNN model to obtain the posterior probability that the voice signal corresponding to the feature vector is a keyword or a non-keyword;
smooth the posterior probability to obtain a corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vector contains a keyword;
if the keywords occur in the set order, judge that the electronic equipment should be woken up.
In one implementation of the present embodiment, voice wake-up is realized end to end: the input is the voice signal and the output is directly the keyword. Speech features are first extracted from the input voice signal using the MFCC (Mel-frequency cepstral coefficients) algorithm. Before MFCC feature extraction, the incoming voice signal is pre-processed, including analog-to-digital conversion, pre-emphasis, framing and windowing. A fast discrete Fourier transform and Mel filtering are then applied, and finally cepstrum, energy and difference computation yields the MFCC parameter vector. The resulting speech feature vector is input to a DNN (deep neural network) model; the trained DNN predicts whether the input speech features correspond to a keyword or a non-keyword and outputs the posterior probability. The posterior values then pass through a post-processing model: because they are output frame by frame, they must be smoothed over a window of a certain length, and smoothing the posterior values yields the keyword confidence. If the confidence exceeds the set threshold, the keyword is considered present. If the keywords occur in the set order, wake-up is triggered; a series of parameters is also configured to limit possible false wake-ups, after which wake-up for this utterance is finished. The voice wake-up module is accelerated using the neural-network accelerator (NPU).
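The frame-wise posterior smoothing and confidence computation described above can be sketched as follows. A trailing moving average and a geometric-mean confidence are one common formulation; the window sizes are assumptions, and the patent's keyword-order constraint is noted but omitted for brevity.

```python
import numpy as np

def keyword_confidence(posteriors, kw_idx, w_smooth=10, w_max=40):
    """posteriors: (n_frames, n_labels) DNN outputs per frame.
    kw_idx: label indices of the keyword units.
    Smooth posteriors with a trailing moving average, then take the
    geometric mean of each unit's max smoothed posterior over the
    last `w_max` frames as the keyword confidence."""
    n = len(posteriors)
    p = np.zeros_like(posteriors)
    for j in range(n):
        lo = max(0, j - w_smooth + 1)
        p[j] = posteriors[lo:j + 1].mean(axis=0)      # trailing smoothing
    lo = max(0, n - w_max)
    peaks = [p[lo:, i].max() for i in kw_idx]
    # (a full system would also enforce that the units peak in order)
    return float(np.prod(peaks) ** (1.0 / len(kw_idx)))
```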
It should be pointed out that the specific implementation process of the above module is only one preferred option of the present embodiment. The present embodiment is not limited thereto; appropriate variations or changes may be made to the above specific processes and methods to realize the technical solution of the present invention, and such variations fall within the protection scope of the present invention.
In one implementation of the present embodiment, the speech recognition module is specifically configured to:
extract the speech feature vector from the voice signal;
perform speech decoding on the speech feature vector to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction represented by the text, and cause the electronic equipment to execute the interactive instruction.
In one implementation of the present embodiment, one realization of speech recognition is as follows. The speech feature vector is first extracted using the MFCC algorithm. The extracted speech feature vector is then decoded; speech decoding produces text output from the feature-extracted voice data by means of an acoustic model, a pronunciation dictionary and a language model. The acoustic model is a TDNN-HMM model, where TDNN is a time-delay deep neural network and HMM is a hidden Markov model; the acoustic-model parameters are trained on the characteristic parameters of a speech database. At recognition time, the extracted speech feature vector input to the model is matched against the acoustic model to obtain the recognition result, i.e., phoneme information. The TDNN model fits the probability density functions and performs the state modeling of the HMM. In the HMM, the forward and backward algorithms solve the probability-computation problem, the Baum-Welch algorithm solves the learning problem, and triphone HMM models with decision trees improve the training load of each class. The pronunciation dictionary looks up the corresponding characters or words from the phoneme information recognized by the acoustic model, linking the acoustic model with the language model. The language model is trained on a large amount of text, describing the internal relations between words by combining syntactic and semantic knowledge, and yields the word sequence of maximum probability for the characters or words found in the pronunciation dictionary. The trained acoustic model, pronunciation dictionary and language model are then compiled into a single state network. Decoding is performed with the Viterbi algorithm, i.e., the path best matching the speech is found in the constructed state network, giving the optimal output word sequence. The final text is then output, completing the recognition process. The speech recognition module is accelerated using the neural-network accelerator (NPU).
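The Viterbi search over the compiled state network can be sketched in its textbook form. The uniform initial distribution and dense transition matrix below are simplifying assumptions; a real decoding network is sparse and weighted by dictionary and language-model scores.

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Find the best state path through a decoding network.

    log_trans: (S, S) log transition scores of the state network.
    log_emit:  (T, S) per-frame log acoustic scores.
    Assumes a uniform initial state distribution."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)     # best score ending in each state
    psi = np.zeros((T, S), dtype=int)    # backpointers
    delta[0] = log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):        # backtrace
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```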
Another realization of speech recognition is as follows. The acoustic model is built with RNN-CTC, where RNN is a recurrent neural network and CTC (connectionist temporal classification) serves as the loss function for acoustic-model training, saving data alignment and labeling effort, while Chinese initials and finals, phonemes, states and other multilingual units are analyzed and modeled. The method is trained by the back-propagation (BP) algorithm; the final output is a sequence of spikes, with non-speech portions left blank. Because the output spike sequence corresponds to multiple paths, the forward-backward algorithm is used to simplify the computation. The pronunciation dictionary looks up the corresponding characters or words from the phoneme information recognized by the acoustic model, linking the acoustic model with the language model. The language model is modeled with N-gram+LSTM. N-gram is a statistical language model that predicts the n-th item from the preceding (n-1) items, where the items may be phonemes, characters, words and so on; it is the most common language model. LSTM (long short-term memory networks) is a special kind of recurrent neural network (RNN) that can learn long-term dependencies through its cell-state structure. The N-gram+LSTM model overcomes the failure of a standalone N-gram model on long-range dependencies; it is trained on a large amount of text, describing the internal relations between words by combining syntactic and semantic knowledge, and yields the word sequence of maximum probability for the characters or words found in the pronunciation dictionary. The trained acoustic model, pronunciation dictionary and language model are then compiled into a single state network. Decoding is performed with the Viterbi algorithm, i.e., the path best matching the speech is found in the constructed state network, giving the optimal output word sequence. The final text is then output, completing the recognition process. Local speech recognition is accelerated using the neural-network accelerator (NPU).
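The "spike sequence with blank non-speech portions" that CTC outputs is conventionally collapsed into a label sequence by merging consecutive repeats and then dropping blanks. A minimal greedy (best-path) version, with an assumed blank index of 0:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame best-label ("spike") sequence as CTC
    prescribes: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)       # keep a new non-blank label
        prev = lab                # remember for repeat merging
    return out
```

Blanks between two identical labels keep them distinct, which is why the full forward-backward computation over all paths is needed during training.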
It should be pointed out that the specific implementation processes of the modules described above are only one preferred option of the present embodiment. The present embodiment is not limited thereto; appropriate variations or changes may be made to the above specific processes and methods to realize the technical solution of the present invention, and such variations fall within the protection scope of the present invention.
The present embodiment provides an intelligent speech chip composed of a basic audio collection module, a front-end array signal processing module, a VAD module, a voice wake-up module and a speech recognition module. On this basis, the front-end array processing module is realized with the FFT/IFFT accelerator, the matrix-multiplication accelerator, the matrix-inversion accelerator, the determinant accelerator, the eigenvalue/eigenvector accelerator, the SIMD accelerator, the DMA accelerator and the Cholesky decomposition accelerator; the VAD module is realized with the neural-network accelerator; the voice wake-up module is realized with the neural-network accelerator; and the local speech recognition module is realized with the neural-network accelerator.
The present embodiment further provides an electronic equipment, the electronic equipment comprising the speech chip of any one of claims 1-9.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present invention, and these all belong to the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A speech chip, applied to an electronic equipment, characterized by comprising:
an audio collection module, configured to collect a voice signal;
a front-end array processing module, connected with the audio collection module and configured to process the voice signal;
a voice activity detection module, connected with the front-end array processing module and configured to determine the voice endpoints of the voice signal processed by the front-end array processing module, wherein the voice endpoints comprise a beginning endpoint and an ending endpoint of the voice signal, and each voice endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected with the voice activity detection module and configured to wake up the electronic equipment when it is determined, based on the voice endpoints, that the voice signal contains a preset wake-up voice;
a speech recognition module, connected with the front-end array processing module and configured to recognize, after the electronic equipment is woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and to cause the electronic equipment to execute the interactive instruction.
2. The speech chip according to claim 1, characterized in that the front-end array processing module comprises:
an echo cancellation unit, configured to perform echo cancellation on the voice signal;
a beam forming unit, configured to perform beam-forming processing on the voice signal;
a sound source localization unit, configured to perform sound source localization on the voice signal;
a dereverberation processing unit, configured to remove reverberation from the voice signal.
3. The speech chip according to claim 2, characterized in that the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component, and perform pre-emphasis processing, to form a first input signal;
filter the first input signal with the first-stage filter, and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter, and store the resulting error-signal variance as See;
output the final filtered signal based on the Sff and the See.
4. The speech chip according to claim 2, characterized in that the beam forming unit is specifically configured to:
perform a Fourier transform on the voice signal, and compute the signal covariance of the voice signal;
perform eigenvalue decomposition on the signal covariance to determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the signal covariance and the eigenvector;
perform an inverse Fourier transform on the enhanced signal, and convert the transformed signal to the time domain.
5. The speech chip according to claim 2, characterized in that the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a series of hyperboloids from the geometric relations and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
6. The speech chip according to claim 2, characterized in that the dereverberation processing unit is specifically configured to:
perform a Short-Time Fourier Transform on the voice signal;
compute the dereverberated frequency-domain signal based on the transformed signal;
perform an inverse Short-Time Fourier Transform on the frequency-domain signal, and convert the transformed voice signal to the time domain.
7. The speech chip according to claim 1, characterized in that the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
for each frame, sum the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a voice endpoint;
after the last frame has been judged, apply smoothing filtering to the preceding decisions to obtain the final voice-endpoint decision.
8. The speech chip according to claim 1, characterized in that the voice wake-up module is specifically configured to:
perform speech feature extraction on the voice signal to obtain a corresponding speech feature vector;
input the speech feature vector into a DNN model to obtain the posterior probability that the voice signal corresponding to the feature vector is a keyword or a non-keyword;
smooth the posterior probability to obtain a corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vector contains a keyword;
if the keywords occur in the set order, judge that the electronic equipment should be woken up.
9. The speech chip according to claim 1, characterized in that the speech recognition module is specifically configured to:
extract the speech feature vector from the voice signal;
perform speech decoding on the speech feature vector to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction represented by the text, and cause the electronic equipment to execute the interactive instruction.
10. An electronic equipment, characterized in that the electronic equipment comprises the speech chip of any one of claims 1-9.
CN201811293499.4A 2018-11-01 2018-11-01 A kind of speech chip and electronic equipment Pending CN109584896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811293499.4A CN109584896A (en) 2018-11-01 2018-11-01 A kind of speech chip and electronic equipment


Publications (1)

Publication Number Publication Date
CN109584896A true CN109584896A (en) 2019-04-05

Family

ID=65921441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811293499.4A Pending CN109584896A (en) 2018-11-01 2018-11-01 A kind of speech chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN109584896A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110265029A (en) * 2019-06-21 2019-09-20 百度在线网络技术(北京)有限公司 Speech chip and electronic equipment
CN110634483A (en) * 2019-09-03 2019-12-31 北京达佳互联信息技术有限公司 Man-machine interaction method and device, electronic equipment and storage medium
CN110830866A (en) * 2019-10-31 2020-02-21 歌尔科技有限公司 Voice assistant awakening method and device, wireless earphone and storage medium
CN110930979A (en) * 2019-11-29 2020-03-27 百度在线网络技术(北京)有限公司 Speech recognition model training method and device and electronic equipment
CN111392532A (en) * 2020-04-07 2020-07-10 上海爱登堡电梯集团股份有限公司 Elevator outbound call device with voice parameter setting function, elevator parameter debugging method and elevator
WO2020228270A1 (en) * 2019-05-10 2020-11-19 平安科技(深圳)有限公司 Speech processing method and device, computer device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101888455A (en) * 2010-04-09 2010-11-17 熔点网讯(北京)科技有限公司 Self-adaptive echo counteracting method for frequency domain
CN102750956A (en) * 2012-06-18 2012-10-24 歌尔声学股份有限公司 Method and device for removing reverberation of single channel voice
CN103259563A (en) * 2012-02-16 2013-08-21 联芯科技有限公司 Self-adapting filter divergence detection method and echo cancellation system
CN105632486A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Voice wake-up method and device of intelligent hardware
CN107316649A (en) * 2017-05-15 2017-11-03 百度在线网络技术(北京)有限公司 Audio recognition method and device based on artificial intelligence
US20180293998A1 (en) * 2017-04-11 2018-10-11 Texas Instruments Incorporated Methods and apparatus for low cost voice activity detector
CN108648769A (en) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 Voice activity detection method, apparatus and equipment
CN108665895A (en) * 2018-05-03 2018-10-16 百度在线网络技术(北京)有限公司 Methods, devices and systems for handling information




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination