CN109584896A - Speech chip and electronic device - Google Patents
Speech chip and electronic device
- Publication number
- CN109584896A (application number CN201811293499.4A)
- Authority
- CN
- China
- Prior art keywords
- voice signal
- signal
- voice
- module
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0208: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L21/0224: Noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L21/0232: Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
- G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech
- G10L21/028: Voice signal separating using properties of the sound source
- G10L15/06: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/08: Speech recognition; speech classification or search
- G10L2015/088: Word spotting
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The present invention relates to a speech chip and an electronic device using it, comprising: an audio collection module for acquiring a voice signal; a front-end array processing module, connected to the audio collection module, for processing the voice signal; a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal; a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech; and a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module and causing the electronic device to execute that instruction. The present invention can improve the accuracy of speech recognition.
Description
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech chip and an electronic device.
Background art
The intelligent voice functions of an electronic device can realize full human-machine voice interaction; the ultimate goal is for the device to understand human language and execute the corresponding functions. At present, the chips used in electronic devices lack such intelligent functions, and they are expensive and power-hungry. For the chips that currently do support voice interaction, the supported speech recognition is based on deep-learning neural network algorithms; these algorithms are computationally heavy, so such chips compute slowly and consume considerable power. To keep the computation within budget, chips that support voice interaction simplify the deep-learning neural network algorithms, which degrades recognition performance and in turn leaves users with a very poor experience when interacting with the electronic device by voice.
Summary of the invention
Based on this, it is necessary to provide a speech chip and an electronic device that address the problem of low accuracy in current speech recognition.
A speech chip, applied to an electronic device, comprising:
an audio collection module for acquiring a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
Preferably, the front-end array processing module includes:
an echo cancellation unit for performing echo removal processing on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
Preferably, the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component and apply pre-emphasis, forming a first input signal;
filter the first input signal with the first-stage filter and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
Preferably, the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert it back to the time domain.
Preferably, the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a family of hyperboloids from the geometry and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
Preferably, the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the transformed voice signal back to the time domain.
Preferably, the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank (FilterBank) algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech endpoint;
after the last frame has been judged, apply smoothing filtering to the previous decisions to obtain the final speech endpoint decisions.
Preferably, the voice wake-up module is specifically configured to:
extract speech features from the voice signal to obtain the corresponding speech feature vectors;
input the speech feature vectors into a DNN model to obtain the posterior probability that the corresponding voice signal is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vectors contains the keyword;
wake up the electronic device if the keywords occur in the set order.
Preferably, the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction expressed by the text, and cause the electronic device to execute the interactive instruction.
An electronic device, the electronic device including the speech chip described above.
The speech chip described above includes the audio collection module, the front-end array processing module, the voice activity detection (VAD) module, the voice wake-up module, and the speech recognition module. The audio collection module first collects the voice signal and passes it to the front-end array processing module for processing. The VAD module, connected to the front-end array processing module, detects the beginning and end of the user's speech in the processed signal. The voice wake-up module compares the similarity between the user's speech and the preset wake-up phrase and, on a match, wakes the device from its dormant state. After a successful wake-up, the front-end array algorithms locate the speaker's direction and perform directional speech enhancement, and the enhanced speech is fed into the speech recognition module; the electronic device performs the action corresponding to the recognized instruction, realizing human-machine voice interaction. Thus, after the electronic device is woken up, the present invention feeds speech that has been localized and directionally enhanced into the speech recognition module, which improves the accuracy of speech recognition.
Brief description of the drawings
Fig. 1 is a structural diagram of the speech chip of one embodiment.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Fig. 1 is a structural diagram of the speech chip of one embodiment. As shown in Fig. 1, the speech chip is applied to an electronic device and includes:
an audio collection module for acquiring a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
The speech chip described above includes the audio collection module, the front-end array processing module, the voice activity detection module (abbreviated in this embodiment as the VAD module), the voice wake-up module, and the speech recognition module. The audio collection module first collects the voice signal and passes it to the front-end array processing module for processing. The VAD module, connected to the front-end array processing module, detects the beginning and end of the user's speech in the processed signal. The voice wake-up module compares the similarity between the user's speech and the preset wake-up phrase and, on a match, wakes the device from its dormant state. After a successful wake-up, the front-end array algorithms locate the speaker's direction and perform directional speech enhancement, and the enhanced speech is fed into the speech recognition module; the electronic device performs the action corresponding to the recognized instruction, realizing human-machine voice interaction. Thus, after the electronic device is woken up, the present invention feeds speech that has been localized and directionally enhanced into the speech recognition module, which improves the accuracy of speech recognition.
In this embodiment, the front-end array processing module, the VAD module, the voice wake-up module, and the speech recognition module are implemented with hardware accelerators, which increases processing speed. The speech chip can be applied to a wide range of smart devices to realize human-machine voice interaction.
In this embodiment, the audio collection module may be a microphone (Mic), which passes the received voice signal to the front-end array processing module.
In this embodiment, the front-end array processing module performs echo cancellation, beamforming, sound source localization, dereverberation, and similar operations, all of which are hardware-accelerated, mainly by the following accelerators: an FFT/IFFT accelerator, a matrix-multiplication accelerator, a matrix-inversion accelerator, a determinant accelerator, an eigenvalue/eigenvector accelerator, a SIMD accelerator, a mathematical-operation accelerator (DMA accelerator), and a Cholesky-product accelerator. The functions of the mathematical-operation accelerator mainly include trigonometric functions, logarithmic functions, exponential functions, summation, reciprocals, division, roots, powers, absolute values, and floating-point/integer conversion.
The following terms are used above: FFT (Fast Fourier Transform), IFFT (Inverse Fast Fourier Transform), SIMD (Single Instruction Multiple Data), and DMA (Direct Memory Access); the Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower-triangular matrix L and its transpose.
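For concreteness, the last definition can be illustrated with a minimal Python/NumPy sketch (illustrative only, not the accelerator's implementation):

```python
import numpy as np

# Build a symmetric positive-definite matrix A.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)

# Cholesky decomposition: A = L @ L.T with L lower-triangular.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)
```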
In one embodiment of the invention, the front-end array processing module includes:
an echo cancellation unit for performing echo removal processing on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
In one embodiment of the invention, the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component and apply pre-emphasis, forming a first input signal;
filter the first input signal with the first-stage filter and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
In this embodiment, echo cancellation is implemented as follows. The voice signal received by the microphone array is first notch-filtered to remove the DC component and pre-emphasized, forming the first input signal. The first input signal is then filtered by the first-stage filter, with the SIMD and DMA accelerators speeding up the computation; the filtered output is stored in the second half of the error buffer e[], and the resulting error-signal variance is stored as Sff. Next, the second-stage filter tap coefficients W are computed, using a frequency-domain realization of the multi-delay block adaptive frequency-domain filter (MDF) built on the normalized least mean squares (NLMS) adaptive filter; the optimal step size is derived as the ratio of the residual-echo variance to the error-signal variance. The residual-echo variance is computed, with the SIMD accelerator, from a defined leakage coefficient, which is in turn obtained from recursively averaged estimates of the autocorrelation of the input signal at each frequency bin and its cross-correlation with the error signal. The first input signal is then filtered again by the second-stage filter, again accelerated by the SIMD and DMA accelerators; the resulting error-signal variance is stored as See, and the error is stored in the first half of e[]. Combining See and Sff, the unit decides whether the first-stage filter coefficients need to be updated or the second-stage filter reset, performing adaptive filtering and weight updates where necessary and updating the time-domain energy of the error signal, all with the SIMD accelerator. Finally, the SIMD accelerator computes the final filtered output as out = input minus the filtering output in the second half of e[], de-emphasis is applied, and this round of echo cancellation is complete.
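For orientation, the sketch below shows the two-filter error-variance logic in plain Python/NumPy. It is a simplified time-domain NLMS stand-in for the frequency-domain MDF filter described above; the tap count, step sizes, smoothing constants, and the copy rule are illustrative assumptions, not the chip's parameters:

```python
import numpy as np

def nlms_aec(x, d, taps=128, mu_fg=0.5, mu_bg=0.1, eps=1e-6):
    """Two-filter echo-canceller sketch (time-domain NLMS).

    x: far-end reference signal; d: microphone signal.
    A first-stage and a second-stage filter run in parallel; their
    smoothed error variances (the Sff and See of the text) decide
    which output is used and when coefficients are copied across.
    """
    w_fg = np.zeros(taps)              # first-stage filter weights
    w_bg = np.zeros(taps)              # second-stage filter weights
    e_out = np.zeros(len(d))
    sff = see = 1.0                    # smoothed error variances
    for n in range(taps, len(d)):
        xv = x[n - taps:n][::-1]       # most recent reference samples
        norm = xv @ xv + eps
        e_fg = d[n] - w_fg @ xv        # first-stage error
        e_bg = d[n] - w_bg @ xv        # second-stage error
        w_fg += mu_fg * e_fg * xv / norm   # NLMS weight updates
        w_bg += mu_bg * e_bg * xv / norm
        sff = 0.99 * sff + 0.01 * e_fg ** 2    # recursive averaging
        see = 0.99 * see + 0.01 * e_bg ** 2
        if sff < 0.5 * see:            # first stage clearly better:
            w_bg[:] = w_fg             # copy it into the second stage
        e_out[n] = e_fg if sff <= see else e_bg
    return e_out
```

The comparison of the two error variances plays the role the text assigns to Sff and See: it decides which filter supplies the echo-cancelled output and when one filter's state should overwrite the other's.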
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert it back to the time domain.
In this embodiment, beamforming is realized as follows. The FFT accelerator first applies a short-time Fourier transform (STFT) to the signal y_t received by the microphone array. The SIMD accelerator then accelerates the initialization of the model parameters and of the class covariance matrices R_f^(v). Following the complex Gaussian mixture model (CGMM) principle, the matrix-multiplication and DMA accelerators estimate the model posteriors, after which the noise covariance R_n(f), the noisy-speech covariance R_{k+n}(f), and the speech covariance R_k(f) are estimated, with the DMA, matrix-multiplication, and SIMD accelerators speeding up the computation. The eigenvalue/eigenvector accelerator then performs an eigenvalue decomposition of the matrix R_k(f); the eigenvector corresponding to the maximum eigenvalue is the steering vector r_f of the target speech direction. The beamforming weights w_f are computed from the obtained R_n(f) and r_f, with the matrix-multiplication and SIMD accelerators accelerating the computation. Finally, the SIMD accelerator computes the desired enhanced signal, the IFFT accelerator applies an inverse short-time Fourier transform (ISTFT) to it, and the signal is converted back to the time domain. This completes beamforming.
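A minimal Python/NumPy sketch of the per-bin weight computation follows. It assumes the standard MVDR formula w = R_n^{-1} r_f / (r_f^H R_n^{-1} r_f), with the steering vector taken as the principal eigenvector of the speech covariance as in the text; the CGMM mask estimation itself is omitted, and the input data are random placeholders:

```python
import numpy as np

def mvdr_weights(R_noise, R_speech):
    """Per-frequency-bin beamforming weight sketch.

    The steering vector r_f is the principal eigenvector of the
    speech covariance R_k(f); the weights follow the standard MVDR
    formula w = R_n^{-1} r / (r^H R_n^{-1} r).
    """
    _, vecs = np.linalg.eigh(R_speech)        # eigenvalues ascending
    r_f = vecs[:, -1]                         # max-eigenvalue vector
    rn_inv_r = np.linalg.solve(R_noise, r_f)
    return rn_inv_r / (r_f.conj() @ rn_inv_r)

# Usage on one frequency bin of an STFT: Y is (mics, frames).
rng = np.random.default_rng(1)
M, T = 4, 200
Y = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
R = (Y @ Y.conj().T) / T                      # sample covariance
w = mvdr_weights(R + 1e-3 * np.eye(M), R)     # random data, demo only
enhanced = w.conj() @ Y                       # beamformed bin (frames,)
```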
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In this embodiment, the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a family of hyperboloids from the geometry and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
In this embodiment, sound source localization is realized as follows. The unit first estimates the time difference of arrival (TDOA) of the speech signal between the different microphones of the array, caused by the different propagation distances, i.e. it performs time delay estimation (TDE). The generalized cross-correlation (GCC) method is used for this: the FFT accelerator first applies a fast Fourier transform (FFT) to the audio signals received by the different microphones. A generalized cross-correlation function is then defined; a frequency-domain weighting function enhances the direct-path component of the speech signal, suppresses noise and reverberation, and sharpens the corresponding correlation peak, with the SIMD accelerator providing acceleration. The IFFT accelerator then applies an inverse fast Fourier transform (IFFT) to the weighted signal, and the peak of the generalized cross-correlation function yields the TDOA. The obtained TDOA is multiplied by the speed of sound to give the range difference; from the geometry and the range differences a family of hyperboloids is obtained, and their intersection gives the sound source position. This completes sound source localization.
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In this embodiment, the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the transformed voice signal back to the time domain.
In one implementation of this embodiment, dereverberation is realized as follows. At initialization, the FFT accelerator applies a short-time Fourier transform (STFT) to the voice signal y_t received by the microphone array. The matrix-multiplication accelerator then computes the required correlation statistics and accelerates the computation of the dereverberated frequency-domain signal, after which the IFFT accelerator applies an inverse short-time Fourier transform (ISTFT), converting the signal to the time domain and completing dereverberation. During updating, the FFT accelerator likewise first applies the STFT to the signal y_t received by the array; the matrix-multiplication accelerator recomputes the statistics, updating each frame of data at each frequency bin, with the updated estimates computed from the quantities of the previous update. Finally, the matrix-multiplication accelerator computes the dereverberated frequency-domain signal and the IFFT accelerator applies the ISTFT, converting the signal back to the time domain and completing this round of dereverberation.
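Since the formulas themselves are not reproduced above, the sketch below shows one common form of this kind of processing: delayed-linear-prediction (WPE-style) dereverberation of a single frequency bin in Python/NumPy. The prediction order, delay, and iteration count are illustrative assumptions:

```python
import numpy as np

def dlp_dereverb_bin(Y, order=10, delay=3, iters=3, eps=1e-8):
    """Delayed-linear-prediction dereverberation of one frequency bin.

    Y: complex STFT frames of a single bin, shape (frames,).
    Late reverberation is predicted from frames at least `delay`
    frames in the past (protecting the direct sound) and subtracted;
    the per-frame variance weighting is re-estimated a few times,
    a WPE-style iteration.
    """
    T = len(Y)
    X = Y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(X) ** 2, eps)       # frame variances
        rows = [Y[t - delay - order + 1:t - delay + 1][::-1]
                for t in range(delay + order, T)]
        Phi = np.array(rows)                        # (T', order) regressors
        y = Y[delay + order:]
        w = 1.0 / lam[delay + order:]               # variance weights
        R = (Phi.conj().T * w) @ Phi                # weighted normal equations
        r = (Phi.conj().T * w) @ y
        g = np.linalg.solve(R + eps * np.eye(order), r)
        X = Y.copy()
        X[delay + order:] = y - Phi @ g             # subtract predicted tail
    return X
```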
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech endpoint;
after the last frame has been judged, apply smoothing filtering to the previous decisions to obtain the final speech endpoint decisions.
It should be noted that the VAD module, the voice wake-up module, and the speech recognition module in this embodiment mainly use the following hardware accelerators: the SIMD accelerator, the mathematical-operation accelerator (DMA), the FFT/IFFT accelerator, and the neural-network processing unit (NPU). The NPU can flexibly support the common network models, chiefly the deep neural network (DNN), the recurrent neural network (RNN), the convolutional neural network (CNN), and the time-delay neural network (TDNN).
The VAD module in this embodiment is realized as follows. The incoming voice signal is first pre-processed, including framing and pre-filtering. FBank features are then extracted frame by frame from the pre-processed signal with a filter-bank algorithm. Endpoint judgment follows: the FBank features of each frame are fed into a trained deep neural network (DNN) model that classifies phonemes, and the model outputs, for each frame, the posterior probability (i.e. the output probability) of every phoneme in the phone set. The output probabilities of all non-noise, non-silence phonemes are then summed; if the sum exceeds the set threshold, the frame is considered speech. After the last frame has been judged, a post-processing step applies smoothing filtering to the previous decisions, yielding the final speech endpoint decisions and completing voice activity detection. The VAD module is accelerated with the neural-network processing unit (NPU).
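A compact Python/NumPy sketch of the decision step follows; the threshold value, the window length, and the moving-average smoother are illustrative assumptions (the DNN that produces the posteriors is assumed given):

```python
import numpy as np

def vad_decide(posteriors, speech_idx, threshold=0.5, smooth=5):
    """Frame-level endpoint decision sketch.

    posteriors: (frames, phones) DNN output probabilities.
    speech_idx: indices of the non-noise, non-silence phonemes.
    Frames whose summed speech posterior exceeds the threshold are
    marked speech; a moving average then smooths the raw decisions,
    standing in for the text's smoothing filter.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    speech_prob = posteriors[:, speech_idx].sum(axis=1)
    raw = (speech_prob > threshold).astype(float)
    kernel = np.ones(smooth) / smooth
    return np.convolve(raw, kernel, mode="same") > 0.5
```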
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the voice wake-up module is specifically configured to:
extract speech features from the voice signal to obtain the corresponding speech feature vectors;
input the speech feature vectors into a DNN model to obtain the posterior probability that the corresponding voice signal is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vectors contains the keyword;
wake up the electronic device if the keywords occur in the set order.
In one implementation of this embodiment, voice wake-up works end to end: the input is the voice signal, and the output is directly the keyword. Speech features are first extracted from the input signal with the MFCC (Mel-frequency cepstral coefficients) algorithm; before extraction, the incoming voice signal is pre-processed with analog-to-digital conversion, pre-emphasis, framing, and windowing. A fast discrete Fourier transform and Mel filtering are then applied, and cepstral, energy, and difference computations yield the MFCC parameter vectors. The resulting speech feature vectors are fed into a DNN model (deep neural network); the trained DNN predicts, and outputs, the posterior probability that the input speech features correspond to a keyword or a non-keyword. The posterior values then pass through a post-processing model: because they are output frame by frame, they are smoothed over a window of a certain length, and smoothing the posteriors yields the keyword confidence. If the confidence exceeds the set threshold, the keyword is considered present; if the keywords occur in the set order, the device is considered woken, while a set of configured parameters limits possible false wake-ups, ending this wake-up pass. The voice wake-up module is accelerated with the neural-network processing unit (NPU).
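The smoothing-and-confidence step can be sketched in Python/NumPy as below. The window length, the geometric-mean confidence, and the omission of the sub-word ordering check are illustrative simplifications:

```python
import numpy as np

def keyword_confidence(posteriors, kw_idx, win=30):
    """Wake-word confidence sketch via posterior smoothing.

    posteriors: (frames, units) DNN outputs; kw_idx: output units of
    the keyword's sub-words. Posteriors are averaged over a sliding
    window, and the confidence is the geometric mean of each
    sub-word's best smoothed posterior (ordering check omitted).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    T = posteriors.shape[0]
    smoothed = np.empty_like(posteriors)
    for t in range(T):                        # causal moving average
        lo = max(0, t - win + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)
    best = smoothed[:, kw_idx].max(axis=0)    # best score per sub-word
    return float(np.prod(best) ** (1.0 / len(kw_idx)))

# A wake-up would fire when the confidence exceeds the preset value:
# awake = keyword_confidence(post, kw_idx=[3, 7, 12]) > 0.8
```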
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction expressed by the text, and cause the electronic device to execute the interactive instruction.
In one implementation of this embodiment, speech recognition is realized as follows. The speech feature vectors are first extracted with the MFCC algorithm. The extracted feature vectors are then decoded: speech decoding uses an acoustic model, a pronunciation dictionary, and a language model to turn the feature-extracted speech data into text output. The acoustic model is a TDNN-HMM model, where the TDNN is a time-delay deep neural network and the HMM is a hidden Markov model; its parameters are trained from the characteristic parameters of a speech database, and at recognition time the extracted speech feature vectors input to the model are matched against the acoustic model to obtain the recognition result, i.e. the phoneme information. The TDNN fits the probability density functions used for HMM state modelling. Within the HMM, the forward and backward algorithms solve the probability-computation problem and the Baum-Welch algorithm solves the learning problem; triphone HMMs are used, with decision trees employed to reduce the training burden of each class. The pronunciation dictionary maps the phoneme information recognized by the acoustic model to the corresponding characters or words, tying the acoustic model to the language model. The language model is trained on a large amount of text and, combining syntactic and semantic knowledge of the internal relations between words, assigns the maximum-probability word sequence to the characters or words found by the pronunciation dictionary. The trained acoustic model, pronunciation dictionary, and language model are then assembled into a single state network; decoding searches this network with the Viterbi algorithm for the path that best matches the speech, giving the optimal output word sequence. Outputting the final text completes the speech recognition process. The speech recognition module is accelerated with the neural-network processing unit (NPU).
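For reference, a minimal Python/NumPy sketch of Viterbi decoding over an HMM state network follows; the log-domain formulation and the dense transition matrix are illustrative assumptions (a real decoder searches a much larger composed network):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most-likely state path through an HMM state network.

    log_emit: (frames, states) frame log-likelihoods (e.g. from the
    acoustic model); log_trans: (states, states) transition
    log-probabilities; log_init: (states,) initial log-probabilities.
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans # (from, to) candidate scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # trace back the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```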
Another implementation of speech recognition builds the acoustic model with RNN-CTC, where the RNN is a recurrent neural network and CTC (connectionist temporal classification) serves as the loss function for acoustic-model training, avoiding data alignment and labelling while modelling multiple linguistic units such as Chinese initials and finals, phonemes, and states. The model is trained with the error back-propagation (BP) algorithm; the final speech output is a sequence of spikes, with the non-speech portions left blank. Because an output spike sequence corresponds to multiple paths, the forward-backward algorithm is used to simplify the computation. The pronunciation dictionary maps the phoneme information recognized by the acoustic model to the corresponding characters or words, tying the acoustic model to the language model. The language model is modelled with N-gram + LSTM: the N-gram, the most common statistical language model, predicts the n-th item from the preceding (n-1) items, where the items may be phonemes, characters, words, and so on; the LSTM (long short-term memory network) is a special kind of recurrent neural network (RNN) whose cell-state structure lets it learn long-term dependencies. The N-gram + LSTM model overcomes the failure of a standalone N-gram model on long-range dependencies; it is trained on a large amount of text and, combining syntactic and semantic knowledge of the internal relations between words, assigns the maximum-probability word sequence to the characters or words found by the pronunciation dictionary. The trained acoustic model, pronunciation dictionary, and language model are then assembled into a single state network; decoding searches this network with the Viterbi algorithm for the path that best matches the speech, giving the optimal output word sequence. Outputting the final text completes the speech recognition process. Local speech recognition is accelerated with the neural-network processing unit (NPU).
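The spike-sequence behaviour described above can be illustrated with a small Python sketch of the standard CTC output collapse (greedy decoding is assumed; the blank index is a placeholder):

```python
def ctc_collapse(frame_labels, blank=0):
    """Greedy CTC post-processing: merge repeats, drop blanks.

    The network's per-frame output is a "spike sequence" in which
    non-speech frames carry the blank label; collapsing it yields
    the decoded label sequence.
    """
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Example: ctc_collapse([0, 3, 3, 0, 0, 5, 5, 5, 0]) == [3, 5]
```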
It should be noted that the specific implementations of each module described above are only preferred options within this embodiment; the embodiment is not limited to these specific processes, and appropriate variations or modifications that realize the technical solution of the present invention fall within the scope of protection of the present invention.
This embodiment thus provides an intelligent speech chip composed of the audio collection module, the front-end array signal processing module, the VAD module, the voice wake-up module, and the speech recognition module. On this basis, the front-end array processing module is realized with the FFT/IFFT accelerator, the matrix-multiplication accelerator, the matrix-inversion accelerator, the determinant accelerator, the eigenvalue/eigenvector accelerator, the SIMD accelerator, the DMA accelerator, and the Cholesky-product accelerator; the VAD module, the voice wake-up module, and the local speech recognition module are each realized with the neural-network accelerator.
This embodiment also provides an electronic device, the electronic device including the speech chip of any one of claims 1-9.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; nevertheless, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The embodiments described above express only several implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (10)
1. A speech chip, applied to an electronic device, characterized by comprising:
an audio collection module for acquiring a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
2. The speech chip according to claim 1, characterized in that the front-end array processing module comprises:
an echo cancellation unit for performing echo removal processing on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
3. The speech chip according to claim 2, characterized in that the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component and apply pre-emphasis, forming a first input signal;
filter the first input signal with the first-stage filter and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
4. The speech chip according to claim 2, characterized in that the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert it back to the time domain.
5. The speech chip according to claim 2, characterized in that the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a family of hyperboloids from the geometry and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
6. The speech chip according to claim 2, characterized in that the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the transformed voice signal back to the time domain.
7. The speech chip according to claim 1, characterized in that the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech endpoint;
after the last frame has been judged, apply smoothing filtering to the previous decisions to obtain the final speech endpoint decisions.
8. The speech chip according to claim 1, characterized in that the voice wake-up module is specifically configured to:
extract speech features from the voice signal to obtain the corresponding speech feature vectors;
input the speech feature vectors into a DNN model to obtain the posterior probability that the corresponding voice signal is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vectors contains the keyword;
wake up the electronic device if the keywords occur in the set order.
9. The speech chip according to claim 1, characterized in that the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction expressed by the text, and cause the electronic device to execute the interactive instruction.
10. An electronic device, characterized in that the electronic device comprises the speech chip of any one of claims 1-9.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811293499.4A | 2018-11-01 | 2018-11-01 | Speech chip and electronic device |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811293499.4A | 2018-11-01 | 2018-11-01 | Speech chip and electronic device |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109584896A | 2019-04-05 |
Family
ID=65921441

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811293499.4A (status: Pending) | Speech chip and electronic device | 2018-11-01 | 2018-11-01 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN109584896A (en) |
Cited By

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020228270A1 | 2019-05-10 | 2020-11-19 | 平安科技(深圳)有限公司 | Speech processing method and device, computer device and storage medium |
| CN110265029A | 2019-06-21 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | Speech chip and electronic equipment |
| CN111785289A | 2019-07-31 | 2020-10-16 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
| CN111785289B | 2019-07-31 | 2023-12-05 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
| CN110634483A | 2019-09-03 | 2019-12-31 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
| CN110634483B | 2019-09-03 | 2021-06-18 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
| US11620984B2 | 2019-09-03 | 2023-04-04 | Beijing Dajia Internet Information Technology Co., Ltd. | Human-computer interaction method, and electronic device and storage medium thereof |
| CN112599132A | 2019-09-16 | 2021-04-02 | 北京知存科技有限公司 | Voice processing device and method based on storage and calculation integrated chip and electronic equipment |
| CN110738991A | 2019-10-11 | 2020-01-31 | 东南大学 | Speech recognition equipment based on flexible wearable sensor |
| CN112672120A | 2019-10-15 | 2021-04-16 | 许桂林 | Projector with voice analysis function and personal health data generation method |
| CN112672120B | 2019-10-15 | 2022-09-09 | 许桂林 | Projector with voice analysis function and personal health data generation method |
| CN110830866A | 2019-10-31 | 2020-02-21 | 歌尔科技有限公司 | Voice assistant awakening method and device, wireless earphone and storage medium |
| CN110930979A | 2019-11-29 | 2020-03-27 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
| CN110930979B | 2019-11-29 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
| CN111048061A | 2019-12-27 | 2020-04-21 | 西安讯飞超脑信息科技有限公司 | Method, device and equipment for obtaining step length of echo cancellation filter |
| CN111392532A | 2020-04-07 | 2020-07-10 | 上海爱登堡电梯集团股份有限公司 | Elevator outbound call device with voice parameter setting function, elevator parameter debugging method and elevator |
| CN111508498A | 2020-04-09 | 2020-08-07 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, system, electronic device and storage medium |
| CN111724769A | 2020-04-22 | 2020-09-29 | 深圳市伟文无线通讯技术有限公司 | Production method of intelligent household voice recognition model |
| CN111599371A | 2020-05-19 | 2020-08-28 | 苏州奇梦者网络科技有限公司 | Voice adding method, system, device and storage medium |
| WO2022105861A1 | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Method and apparatus for recognizing voice, electronic device and medium |
| CN112599151A | 2020-12-07 | 2021-04-02 | 携程旅游信息技术(上海)有限公司 | Speech rate evaluation method, system, device and storage medium |
| CN114360517A | 2021-12-17 | 2022-04-15 | 天翼爱音乐文化科技有限公司 | Audio processing method and device in complex environment and storage medium |
| CN114944153A | 2022-07-26 | 2022-08-26 | 中诚华隆计算机技术有限公司 | Enhanced awakening method and device for terminal of Internet of things and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101888455A (en) * | 2010-04-09 | 2010-11-17 | 熔点网讯(北京)科技有限公司 | Self-adaptive echo counteracting method for frequency domain |
CN102750956A (en) * | 2012-06-18 | 2012-10-24 | 歌尔声学股份有限公司 | Method and device for removing reverberation of single channel voice |
CN103259563A (en) * | 2012-02-16 | 2013-08-21 | 联芯科技有限公司 | Self-adapting filter divergence detection method and echo cancellation system |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
CN107316649A (en) * | 2017-05-15 | 2017-11-03 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on artificial intelligence |
US20180293998A1 (en) * | 2017-04-11 | 2018-10-11 | Texas Instruments Incorporated | Methods and apparatus for low cost voice activity detector |
CN108648769A (en) * | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment |
CN108665895A (en) * | 2018-05-03 | 2018-10-16 | 百度在线网络技术(北京)有限公司 | Methods, devices and systems for handling information |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101888455A (en) * | 2010-04-09 | 2010-11-17 | 熔点网讯(北京)科技有限公司 | Self-adaptive echo counteracting method for frequency domain |
CN103259563A (en) * | 2012-02-16 | 2013-08-21 | 联芯科技有限公司 | Self-adapting filter divergence detection method and echo cancellation system |
CN102750956A (en) * | 2012-06-18 | 2012-10-24 | 歌尔声学股份有限公司 | Method and device for removing reverberation of single channel voice |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
US20180293998A1 (en) * | 2017-04-11 | 2018-10-11 | Texas Instruments Incorporated | Methods and apparatus for low cost voice activity detector |
CN107316649A (en) * | 2017-05-15 | 2017-11-03 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on artificial intelligence |
CN108648769A (en) * | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment |
CN108665895A (en) * | 2018-05-03 | 2018-10-16 | 百度在线网络技术(北京)有限公司 | Methods, devices and systems for handling information |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020228270A1 (en) * | 2019-05-10 | 2020-11-19 | 平安科技(深圳)有限公司 | Speech processing method and device, computer device and storage medium |
CN110265029A (en) * | 2019-06-21 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | Speech chip and electronic equipment |
CN111785289A (en) * | 2019-07-31 | 2020-10-16 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
CN111785289B (en) * | 2019-07-31 | 2023-12-05 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
CN110634483A (en) * | 2019-09-03 | 2019-12-31 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
CN110634483B (en) * | 2019-09-03 | 2021-06-18 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
US11620984B2 (en) | 2019-09-03 | 2023-04-04 | Beijing Dajia Internet Information Technology Co., Ltd. | Human-computer interaction method, and electronic device and storage medium thereof |
CN112599132A (en) * | 2019-09-16 | 2021-04-02 | 北京知存科技有限公司 | Voice processing device and method based on storage and calculation integrated chip and electronic equipment |
CN110738991A (en) * | 2019-10-11 | 2020-01-31 | 东南大学 | Speech recognition equipment based on flexible wearable sensor |
CN112672120B (en) * | 2019-10-15 | 2022-09-09 | 许桂林 | Projector with voice analysis function and personal health data generation method |
CN112672120A (en) * | 2019-10-15 | 2021-04-16 | 许桂林 | Projector with voice analysis function and personal health data generation method |
CN110830866A (en) * | 2019-10-31 | 2020-02-21 | 歌尔科技有限公司 | Voice assistant awakening method and device, wireless earphone and storage medium |
CN110930979B (en) * | 2019-11-29 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
CN110930979A (en) * | 2019-11-29 | 2020-03-27 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
CN111048061A (en) * | 2019-12-27 | 2020-04-21 | 西安讯飞超脑信息科技有限公司 | Method, device and equipment for obtaining step length of echo cancellation filter |
CN111392532A (en) * | 2020-04-07 | 2020-07-10 | 上海爱登堡电梯集团股份有限公司 | Elevator outbound call device with voice parameter setting function, elevator parameter debugging method and elevator |
CN111508498A (en) * | 2020-04-09 | 2020-08-07 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, system, electronic device and storage medium |
CN111508498B (en) * | 2020-04-09 | 2024-01-30 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium |
CN111724769A (en) * | 2020-04-22 | 2020-09-29 | 深圳市伟文无线通讯技术有限公司 | Production method of intelligent household voice recognition model |
CN111599371A (en) * | 2020-05-19 | 2020-08-28 | 苏州奇梦者网络科技有限公司 | Voice adding method, system, device and storage medium |
CN111599371B (en) * | 2020-05-19 | 2023-10-20 | 苏州奇梦者网络科技有限公司 | Voice adding method, system, device and storage medium |
WO2022105861A1 (en) * | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Method and apparatus for recognizing voice, electronic device and medium |
CN112599151A (en) * | 2020-12-07 | 2021-04-02 | 携程旅游信息技术(上海)有限公司 | Speech rate evaluation method, system, device and storage medium |
CN112599151B (en) * | 2020-12-07 | 2023-07-21 | 携程旅游信息技术(上海)有限公司 | Speech rate evaluation method, system, device and storage medium |
CN114360517A (en) * | 2021-12-17 | 2022-04-15 | 天翼爱音乐文化科技有限公司 | Audio processing method and device in complex environment and storage medium |
CN114360517B (en) * | 2021-12-17 | 2023-04-18 | 天翼爱音乐文化科技有限公司 | Audio processing method and device in complex environment and storage medium |
CN114944153A (en) * | 2022-07-26 | 2022-08-26 | 中诚华隆计算机技术有限公司 | Enhanced awakening method and device for terminal of Internet of things and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109584896A (en) | A kind of speech chip and electronic equipment | |
Zhang et al. | A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR |
US10373609B2 (en) | Voice recognition method and apparatus | |
US10304440B1 (en) | Keyword spotting using multi-task configuration | |
CN109427328B (en) | Multichannel voice recognition method based on filter network acoustic model | |
US20170154640A1 (en) | Method and electronic device for voice recognition based on dynamic voice model selection | |
CN109192200B (en) | Speech recognition method | |
US5594834A (en) | Method and system for recognizing a boundary between sounds in continuous speech | |
WO2015047517A1 (en) | Keyword detection | |
CN108962237A | Mixed speech recognition method, device and computer-readable storage medium |
US10460729B1 | Binary target acoustic trigger detection |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
AU684214B2 (en) | System for recognizing spoken sounds from continuous speech and method of using same | |
CN106548775A (en) | A kind of audio recognition method and system | |
Todkar et al. | Speaker recognition techniques: A review | |
CN110268471A (en) | The method and apparatus of ASR with embedded noise reduction | |
Ceolini et al. | Event-driven pipeline for low-latency low-compute keyword spotting and speaker verification system | |
CN111785302A (en) | Speaker separation method and device and electronic equipment | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
Wang et al. | A fusion model for robust voice activity detection | |
Kamble et al. | Teager energy subband filtered features for near and far-field automatic speech recognition | |
Nakamura et al. | Robot audition based acoustic event identification using a Bayesian model considering spectral and temporal uncertainties |
Agrawal et al. | Deep variational filter learning models for speech recognition | |
Nidhyananthan et al. | A review on speech enhancement algorithms and why to combine with environment classification | |
Pan et al. | Application of hidden Markov models in speech command recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | |
Application publication date: 2019-04-05