CN109584896A - Speech chip and electronic device - Google Patents
Speech chip and electronic device
- Publication number
- CN109584896A (application number CN201811293499.4A)
- Authority
- CN
- China
- Prior art keywords
- voice signal
- signal
- voice
- module
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L21/0208: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L21/0224: Noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L21/0232: Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
- G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech
- G10L21/028: Voice signal separating using properties of the sound source
- G10L15/06: Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/08: Speech recognition; speech classification or search
- G10L2015/088: Word spotting
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The present invention relates to a speech chip and an electronic device using it, comprising: an audio collection module for acquiring a voice signal; a front-end array processing module, connected to the audio collection module, for processing the voice signal; a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal; a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech; and a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module and causing the electronic device to execute that instruction. The present invention can improve the accuracy of speech recognition.
Description
Technical field
The present invention relates to the field of speech processing, and more particularly to a speech chip and an electronic device.
Background art
The intelligent voice functions of an electronic device can realize full human-machine voice interaction; the ultimate goal is for the device to understand human language and execute the corresponding functions. At present, the chips used in electronic devices lack such intelligent functions, and they are expensive and power-hungry. For the chips that currently do support voice interaction, the supported speech recognition is based on deep-learning neural network algorithms; these algorithms are computationally heavy, so such chips compute slowly and consume considerable power. To keep the computation within budget, chips that support voice interaction simplify the deep-learning neural network algorithms, which degrades recognition performance and in turn leaves users with a very poor experience when interacting with the electronic device by voice.
Summary of the invention
Based on this, it is necessary to provide a speech chip and an electronic device that address the problem of low accuracy in current speech recognition.
A speech chip, applied to an electronic device, comprising:
an audio collection module for acquiring a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
Preferably, the front-end array processing module includes:
an echo cancellation unit for performing echo removal processing on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
Preferably, the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component and apply pre-emphasis, forming a first input signal;
filter the first input signal with the first-stage filter and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
Preferably, the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert it back to the time domain.
Preferably, the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a family of hyperboloids from the geometry and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
Preferably, the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the transformed voice signal back to the time domain.
Preferably, the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank (FilterBank) algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech endpoint;
after the last frame has been judged, apply smoothing filtering to the previous decisions to obtain the final speech endpoint decisions.
Preferably, the voice wake-up module is specifically configured to:
extract speech features from the voice signal to obtain the corresponding speech feature vectors;
input the speech feature vectors into a DNN model to obtain the posterior probability that the corresponding voice signal is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vectors contains the keyword;
wake up the electronic device if the keywords occur in the set order.
Preferably, the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction expressed by the text, and cause the electronic device to execute the interactive instruction.
An electronic device, the electronic device including the speech chip described above.
The speech chip described above includes the audio collection module, the front-end array processing module, the voice activity detection (VAD) module, the voice wake-up module, and the speech recognition module. The audio collection module first collects the voice signal and passes it to the front-end array processing module for processing. The VAD module, connected to the front-end array processing module, detects the beginning and end of the user's speech in the processed signal. The voice wake-up module compares the similarity between the user's speech and the preset wake-up phrase and, on a match, wakes the device from its dormant state. After a successful wake-up, the front-end array algorithms locate the speaker's direction and perform directional speech enhancement, and the enhanced speech is fed into the speech recognition module; the electronic device performs the action corresponding to the recognized instruction, realizing human-machine voice interaction. Thus, after the electronic device is woken up, the present invention feeds speech that has been localized and directionally enhanced into the speech recognition module, which improves the accuracy of speech recognition.
Brief description of the drawings
Fig. 1 is a structural diagram of the speech chip of one embodiment.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Fig. 1 is a structural diagram of the speech chip of one embodiment. As shown in Fig. 1, the speech chip is applied to an electronic device and includes:
an audio collection module for acquiring a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
The speech chip described above includes the audio collection module, the front-end array processing module, the voice activity detection module (abbreviated in this embodiment as the VAD module), the voice wake-up module, and the speech recognition module. The audio collection module first collects the voice signal and passes it to the front-end array processing module for processing. The VAD module, connected to the front-end array processing module, detects the beginning and end of the user's speech in the processed signal. The voice wake-up module compares the similarity between the user's speech and the preset wake-up phrase and, on a match, wakes the device from its dormant state. After a successful wake-up, the front-end array algorithms locate the speaker's direction and perform directional speech enhancement, and the enhanced speech is fed into the speech recognition module; the electronic device performs the action corresponding to the recognized instruction, realizing human-machine voice interaction. Thus, after the electronic device is woken up, the present invention feeds speech that has been localized and directionally enhanced into the speech recognition module, which improves the accuracy of speech recognition.
In this embodiment, the front-end array processing module, the VAD module, the voice wake-up module, and the speech recognition module are implemented with hardware accelerators, which increases processing speed. The speech chip can be applied to a wide range of smart devices to realize human-machine voice interaction.
In this embodiment, the audio collection module may be a microphone (Mic), which passes the received voice signal to the front-end array processing module.
In this embodiment, the front-end array processing module performs echo cancellation, beamforming, sound source localization, dereverberation, and similar operations, all of which are hardware-accelerated, mainly by the following accelerators: an FFT/IFFT accelerator, a matrix-multiplication accelerator, a matrix-inversion accelerator, a determinant accelerator, an eigenvalue/eigenvector accelerator, a SIMD accelerator, a mathematical-operation accelerator (DMA accelerator), and a Cholesky-product accelerator. The functions of the mathematical-operation accelerator mainly include trigonometric functions, logarithmic functions, exponential functions, summation, reciprocals, division, roots, powers, absolute values, and floating-point/integer conversion.
The following terms are used above: FFT (Fast Fourier Transform), IFFT (Inverse Fast Fourier Transform), SIMD (Single Instruction Multiple Data), and DMA (Direct Memory Access); the Cholesky decomposition expresses a symmetric positive-definite matrix as the product of a lower-triangular matrix L and its transpose.
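For concreteness, the last definition can be illustrated with a minimal Python/NumPy sketch (illustrative only, not the accelerator's implementation):

```python
import numpy as np

# Build a symmetric positive-definite matrix A.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4.0 * np.eye(4)

# Cholesky decomposition: A = L @ L.T with L lower-triangular.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)
```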
In one embodiment of the invention, the front-end array processing module includes:
an echo cancellation unit for performing echo removal processing on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
In one embodiment of the invention, the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component and apply pre-emphasis, forming a first input signal;
filter the first input signal with the first-stage filter and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
In this embodiment, echo cancellation is implemented as follows. The voice signal received by the microphone array is first notch-filtered to remove the DC component and pre-emphasized, forming the first input signal. The first input signal is then filtered by the first-stage filter, with the SIMD and DMA accelerators speeding up the computation; the filtered output is stored in the second half of the error buffer e[], and the resulting error-signal variance is stored as Sff. Next, the second-stage filter tap coefficients W are computed, using a frequency-domain realization of the multi-delay block adaptive frequency-domain filter (MDF) built on the normalized least mean squares (NLMS) adaptive filter; the optimal step size is derived as the ratio of the residual-echo variance to the error-signal variance. The residual-echo variance is computed, with the SIMD accelerator, from a defined leakage coefficient, which is in turn obtained from recursively averaged estimates of the autocorrelation of the input signal at each frequency bin and its cross-correlation with the error signal. The first input signal is then filtered again by the second-stage filter, again accelerated by the SIMD and DMA accelerators; the resulting error-signal variance is stored as See, and the error is stored in the first half of e[]. Combining See and Sff, the unit decides whether the first-stage filter coefficients need to be updated or the second-stage filter reset, performing adaptive filtering and weight updates where necessary and updating the time-domain energy of the error signal, all with the SIMD accelerator. Finally, the SIMD accelerator computes the final filtered output as out = input minus the filtering output in the second half of e[], de-emphasis is applied, and this round of echo cancellation is complete.
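For orientation, the sketch below shows the two-filter error-variance logic in plain Python/NumPy. It is a simplified time-domain NLMS stand-in for the frequency-domain MDF filter described above; the tap count, step sizes, smoothing constants, and the copy rule are illustrative assumptions, not the chip's parameters:

```python
import numpy as np

def nlms_aec(x, d, taps=128, mu_fg=0.5, mu_bg=0.1, eps=1e-6):
    """Two-filter echo-canceller sketch (time-domain NLMS).

    x: far-end reference signal; d: microphone signal.
    A first-stage and a second-stage filter run in parallel; their
    smoothed error variances (the Sff and See of the text) decide
    which output is used and when coefficients are copied across.
    """
    w_fg = np.zeros(taps)              # first-stage filter weights
    w_bg = np.zeros(taps)              # second-stage filter weights
    e_out = np.zeros(len(d))
    sff = see = 1.0                    # smoothed error variances
    for n in range(taps, len(d)):
        xv = x[n - taps:n][::-1]       # most recent reference samples
        norm = xv @ xv + eps
        e_fg = d[n] - w_fg @ xv        # first-stage error
        e_bg = d[n] - w_bg @ xv        # second-stage error
        w_fg += mu_fg * e_fg * xv / norm   # NLMS weight updates
        w_bg += mu_bg * e_bg * xv / norm
        sff = 0.99 * sff + 0.01 * e_fg ** 2    # recursive averaging
        see = 0.99 * see + 0.01 * e_bg ** 2
        if sff < 0.5 * see:            # first stage clearly better:
            w_bg[:] = w_fg             # copy it into the second stage
        e_out[n] = e_fg if sff <= see else e_bg
    return e_out
```

The comparison of the two error variances plays the role the text assigns to Sff and See: it decides which filter supplies the echo-cancelled output and when one filter's state should overwrite the other's.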
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert it back to the time domain.
In this embodiment, beamforming is realized as follows. The FFT accelerator first applies a short-time Fourier transform (STFT) to the signal y_t received by the microphone array. The SIMD accelerator then accelerates the initialization of the model parameters and of the class covariance matrices R_f^(v). Following the complex Gaussian mixture model (CGMM) principle, the matrix-multiplication and DMA accelerators estimate the model posteriors, after which the noise covariance R_n(f), the noisy-speech covariance R_{k+n}(f), and the speech covariance R_k(f) are estimated, with the DMA, matrix-multiplication, and SIMD accelerators speeding up the computation. The eigenvalue/eigenvector accelerator then performs an eigenvalue decomposition of the matrix R_k(f); the eigenvector corresponding to the maximum eigenvalue is the steering vector r_f of the target speech direction. The beamforming weights w_f are computed from the obtained R_n(f) and r_f, with the matrix-multiplication and SIMD accelerators accelerating the computation. Finally, the SIMD accelerator computes the desired enhanced signal, the IFFT accelerator applies an inverse short-time Fourier transform (ISTFT) to it, and the signal is converted back to the time domain. This completes beamforming.
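A minimal Python/NumPy sketch of the per-bin weight computation follows. It assumes the standard MVDR formula w = R_n^{-1} r_f / (r_f^H R_n^{-1} r_f), with the steering vector taken as the principal eigenvector of the speech covariance as in the text; the CGMM mask estimation itself is omitted, and the input data are random placeholders:

```python
import numpy as np

def mvdr_weights(R_noise, R_speech):
    """Per-frequency-bin beamforming weight sketch.

    The steering vector r_f is the principal eigenvector of the
    speech covariance R_k(f); the weights follow the standard MVDR
    formula w = R_n^{-1} r / (r^H R_n^{-1} r).
    """
    _, vecs = np.linalg.eigh(R_speech)        # eigenvalues ascending
    r_f = vecs[:, -1]                         # max-eigenvalue vector
    rn_inv_r = np.linalg.solve(R_noise, r_f)
    return rn_inv_r / (r_f.conj() @ rn_inv_r)

# Usage on one frequency bin of an STFT: Y is (mics, frames).
rng = np.random.default_rng(1)
M, T = 4, 200
Y = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
R = (Y @ Y.conj().T) / T                      # sample covariance
w = mvdr_weights(R + 1e-3 * np.eye(M), R)     # random data, demo only
enhanced = w.conj() @ Y                       # beamformed bin (frames,)
```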
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In this embodiment, the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a family of hyperboloids from the geometry and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
In this embodiment, sound source localization is realized as follows. The unit first estimates the time difference of arrival (TDOA) of the speech signal between the different microphones of the array, caused by the different propagation distances, i.e. it performs time delay estimation (TDE). The generalized cross-correlation (GCC) method is used for this: the FFT accelerator first applies a fast Fourier transform (FFT) to the audio signals received by the different microphones. A generalized cross-correlation function is then defined; a frequency-domain weighting function enhances the direct-path component of the speech signal, suppresses noise and reverberation, and sharpens the corresponding correlation peak, with the SIMD accelerator providing acceleration. The IFFT accelerator then applies an inverse fast Fourier transform (IFFT) to the weighted signal, and the peak of the generalized cross-correlation function yields the TDOA. The obtained TDOA is multiplied by the speed of sound to give the range difference; from the geometry and the range differences a family of hyperboloids is obtained, and their intersection gives the sound source position. This completes sound source localization.
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In this embodiment, the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the transformed voice signal back to the time domain.
In one implementation of this embodiment, dereverberation is realized as follows. At initialization, the FFT accelerator applies a short-time Fourier transform (STFT) to the voice signal y_t received by the microphone array. The matrix-multiplication accelerator then computes the required correlation statistics and accelerates the computation of the dereverberated frequency-domain signal, after which the IFFT accelerator applies an inverse short-time Fourier transform (ISTFT), converting the signal to the time domain and completing dereverberation. During updating, the FFT accelerator likewise first applies the STFT to the signal y_t received by the array; the matrix-multiplication accelerator recomputes the statistics, updating each frame of data at each frequency bin, with the updated estimates computed from the quantities of the previous update. Finally, the matrix-multiplication accelerator computes the dereverberated frequency-domain signal and the IFFT accelerator applies the ISTFT, converting the signal back to the time domain and completing this round of dereverberation.
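Since the formulas themselves are not reproduced above, the sketch below shows one common form of this kind of processing: delayed-linear-prediction (WPE-style) dereverberation of a single frequency bin in Python/NumPy. The prediction order, delay, and iteration count are illustrative assumptions:

```python
import numpy as np

def dlp_dereverb_bin(Y, order=10, delay=3, iters=3, eps=1e-8):
    """Delayed-linear-prediction dereverberation of one frequency bin.

    Y: complex STFT frames of a single bin, shape (frames,).
    Late reverberation is predicted from frames at least `delay`
    frames in the past (protecting the direct sound) and subtracted;
    the per-frame variance weighting is re-estimated a few times,
    a WPE-style iteration.
    """
    T = len(Y)
    X = Y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(X) ** 2, eps)       # frame variances
        rows = [Y[t - delay - order + 1:t - delay + 1][::-1]
                for t in range(delay + order, T)]
        Phi = np.array(rows)                        # (T', order) regressors
        y = Y[delay + order:]
        w = 1.0 / lam[delay + order:]               # variance weights
        R = (Phi.conj().T * w) @ Phi                # weighted normal equations
        r = (Phi.conj().T * w) @ y
        g = np.linalg.solve(R + eps * np.eye(order), r)
        X = Y.copy()
        X[delay + order:] = y - Phi @ g             # subtract predicted tail
    return X
```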
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech endpoint;
after the last frame has been judged, apply smoothing filtering to the previous decisions to obtain the final speech endpoint decisions.
It should be noted that the VAD module, the voice wake-up module, and the speech recognition module in this embodiment mainly use the following hardware accelerators: the SIMD accelerator, the mathematical-operation accelerator (DMA), the FFT/IFFT accelerator, and the neural-network processing unit (NPU). The NPU can flexibly support the common network models, chiefly the deep neural network (DNN), the recurrent neural network (RNN), the convolutional neural network (CNN), and the time-delay neural network (TDNN).
The VAD module in this embodiment is realized as follows. The incoming voice signal is first pre-processed, including framing and pre-filtering. FBank features are then extracted frame by frame from the pre-processed signal with a filter-bank algorithm. Endpoint judgment follows: the FBank features of each frame are fed into a trained deep neural network (DNN) model that classifies phonemes, and the model outputs, for each frame, the posterior probability (i.e. the output probability) of every phoneme in the phone set. The output probabilities of all non-noise, non-silence phonemes are then summed; if the sum exceeds the set threshold, the frame is considered speech. After the last frame has been judged, a post-processing step applies smoothing filtering to the previous decisions, yielding the final speech endpoint decisions and completing voice activity detection. The VAD module is accelerated with the neural-network processing unit (NPU).
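A compact Python/NumPy sketch of the decision step follows; the threshold value, the window length, and the moving-average smoother are illustrative assumptions (the DNN that produces the posteriors is assumed given):

```python
import numpy as np

def vad_decide(posteriors, speech_idx, threshold=0.5, smooth=5):
    """Frame-level endpoint decision sketch.

    posteriors: (frames, phones) DNN output probabilities.
    speech_idx: indices of the non-noise, non-silence phonemes.
    Frames whose summed speech posterior exceeds the threshold are
    marked speech; a moving average then smooths the raw decisions,
    standing in for the text's smoothing filter.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    speech_prob = posteriors[:, speech_idx].sum(axis=1)
    raw = (speech_prob > threshold).astype(float)
    kernel = np.ones(smooth) / smooth
    return np.convolve(raw, kernel, mode="same") > 0.5
```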
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the voice wake-up module is specifically configured to:
extract speech features from the voice signal to obtain the corresponding speech feature vectors;
input the speech feature vectors into a DNN model to obtain the posterior probability that the corresponding voice signal is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vectors contains the keyword;
wake up the electronic device if the keywords occur in the set order.
In one implementation of this embodiment, voice wake-up works end to end: the input is the voice signal, and the output is directly the keyword. Speech features are first extracted from the input signal with the MFCC (Mel-frequency cepstral coefficients) algorithm; before extraction, the incoming voice signal is pre-processed with analog-to-digital conversion, pre-emphasis, framing, and windowing. A fast discrete Fourier transform and Mel filtering are then applied, and cepstral, energy, and difference computations yield the MFCC parameter vectors. The resulting speech feature vectors are fed into a DNN model (deep neural network); the trained DNN predicts, and outputs, the posterior probability that the input speech features correspond to a keyword or a non-keyword. The posterior values then pass through a post-processing model: because they are output frame by frame, they are smoothed over a window of a certain length, and smoothing the posteriors yields the keyword confidence. If the confidence exceeds the set threshold, the keyword is considered present; if the keywords occur in the set order, the device is considered woken, while a set of configured parameters limits possible false wake-ups, ending this wake-up pass. The voice wake-up module is accelerated with the neural-network processing unit (NPU).
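The smoothing-and-confidence step can be sketched in Python/NumPy as below. The window length, the geometric-mean confidence, and the omission of the sub-word ordering check are illustrative simplifications:

```python
import numpy as np

def keyword_confidence(posteriors, kw_idx, win=30):
    """Wake-word confidence sketch via posterior smoothing.

    posteriors: (frames, units) DNN outputs; kw_idx: output units of
    the keyword's sub-words. Posteriors are averaged over a sliding
    window, and the confidence is the geometric mean of each
    sub-word's best smoothed posterior (ordering check omitted).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    T = posteriors.shape[0]
    smoothed = np.empty_like(posteriors)
    for t in range(T):                        # causal moving average
        lo = max(0, t - win + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)
    best = smoothed[:, kw_idx].max(axis=0)    # best score per sub-word
    return float(np.prod(best) ** (1.0 / len(kw_idx)))

# A wake-up would fire when the confidence exceeds the preset value:
# awake = keyword_confidence(post, kw_idx=[3, 7, 12]) > 0.8
```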
It should be noted that the specific implementation described above is only one preferred option within this embodiment; the embodiment is not limited to this specific process, and appropriate variations or modifications of the above method that realize the technical solution of the present invention fall within the scope of protection of the present invention.
In one implementation of this embodiment, the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction expressed by the text, and cause the electronic device to execute the interactive instruction.
In one implementation of this embodiment, speech recognition is realized as follows. The speech feature vectors are first extracted with the MFCC algorithm. The extracted feature vectors are then decoded: speech decoding uses an acoustic model, a pronunciation dictionary, and a language model to turn the feature-extracted speech data into text output. The acoustic model is a TDNN-HMM model, where the TDNN is a time-delay deep neural network and the HMM is a hidden Markov model; its parameters are trained from the characteristic parameters of a speech database, and at recognition time the extracted speech feature vectors input to the model are matched against the acoustic model to obtain the recognition result, i.e. the phoneme information. The TDNN fits the probability density functions used for HMM state modelling. Within the HMM, the forward and backward algorithms solve the probability-computation problem and the Baum-Welch algorithm solves the learning problem; triphone HMMs are used, with decision trees employed to reduce the training burden of each class. The pronunciation dictionary maps the phoneme information recognized by the acoustic model to the corresponding characters or words, tying the acoustic model to the language model. The language model is trained on a large amount of text and, combining syntactic and semantic knowledge of the internal relations between words, assigns the maximum-probability word sequence to the characters or words found by the pronunciation dictionary. The trained acoustic model, pronunciation dictionary, and language model are then assembled into a single state network; decoding searches this network with the Viterbi algorithm for the path that best matches the speech, giving the optimal output word sequence. Outputting the final text completes the speech recognition process. The speech recognition module is accelerated with the neural-network processing unit (NPU).
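For reference, a minimal Python/NumPy sketch of Viterbi decoding over an HMM state network follows; the log-domain formulation and the dense transition matrix are illustrative assumptions (a real decoder searches a much larger composed network):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most-likely state path through an HMM state network.

    log_emit: (frames, states) frame log-likelihoods (e.g. from the
    acoustic model); log_trans: (states, states) transition
    log-probabilities; log_init: (states,) initial log-probabilities.
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans # (from, to) candidate scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # trace back the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```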
Another implementation of speech recognition builds the acoustic model with RNN-CTC, where the RNN is a recurrent neural network and CTC (connectionist temporal classification) serves as the loss function for acoustic-model training, avoiding data alignment and labelling while modelling multiple linguistic units such as Chinese initials and finals, phonemes, and states. The model is trained with the error back-propagation (BP) algorithm; the final speech output is a sequence of spikes, with the non-speech portions left blank. Because an output spike sequence corresponds to multiple paths, the forward-backward algorithm is used to simplify the computation. The pronunciation dictionary maps the phoneme information recognized by the acoustic model to the corresponding characters or words, tying the acoustic model to the language model. The language model is modelled with N-gram + LSTM: the N-gram, the most common statistical language model, predicts the n-th item from the preceding (n-1) items, where the items may be phonemes, characters, words, and so on; the LSTM (long short-term memory network) is a special kind of recurrent neural network (RNN) whose cell-state structure lets it learn long-term dependencies. The N-gram + LSTM model overcomes the failure of a standalone N-gram model on long-range dependencies; it is trained on a large amount of text and, combining syntactic and semantic knowledge of the internal relations between words, assigns the maximum-probability word sequence to the characters or words found by the pronunciation dictionary. The trained acoustic model, pronunciation dictionary, and language model are then assembled into a single state network; decoding searches this network with the Viterbi algorithm for the path that best matches the speech, giving the optimal output word sequence. Outputting the final text completes the speech recognition process. Local speech recognition is accelerated with the neural-network processing unit (NPU).
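The spike-sequence behaviour described above can be illustrated with a small Python sketch of the standard CTC output collapse (greedy decoding is assumed; the blank index is a placeholder):

```python
def ctc_collapse(frame_labels, blank=0):
    """Greedy CTC post-processing: merge repeats, drop blanks.

    The network's per-frame output is a "spike sequence" in which
    non-speech frames carry the blank label; collapsing it yields
    the decoded label sequence.
    """
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Example: ctc_collapse([0, 3, 3, 0, 0, 5, 5, 5, 0]) == [3, 5]
```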
It should be noted that the specific implementations of each module described above are only preferred options within this embodiment; the embodiment is not limited to these specific processes, and appropriate variations or modifications that realize the technical solution of the present invention fall within the scope of protection of the present invention.
This embodiment thus provides an intelligent speech chip composed of the audio collection module, the front-end array signal processing module, the VAD module, the voice wake-up module, and the speech recognition module. On this basis, the front-end array processing module is realized with the FFT/IFFT accelerator, the matrix-multiplication accelerator, the matrix-inversion accelerator, the determinant accelerator, the eigenvalue/eigenvector accelerator, the SIMD accelerator, the DMA accelerator, and the Cholesky-product accelerator; the VAD module, the voice wake-up module, and the local speech recognition module are each realized with the neural-network accelerator.
This embodiment also provides an electronic device, the electronic device including the speech chip of any one of claims 1-9.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; nevertheless, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The embodiments described above express only several implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (10)
1. A speech chip, applied to an electronic device, characterized by comprising:
an audio collection module for acquiring a voice signal;
a front-end array processing module, connected to the audio collection module, for processing the voice signal;
a voice activity detection module, connected to the front-end array processing module, for determining the speech endpoints of the voice signal after processing by the front-end array processing module, wherein the speech endpoints include the beginning endpoint and the end endpoint of the voice signal, and each endpoint corresponds to at least one frame of the voice signal;
a voice wake-up module, connected to the voice activity detection module, for waking up the electronic device when it determines, based on the speech endpoints, that the voice signal contains preset wake-up speech;
a speech recognition module, connected to the front-end array processing module, for recognizing, after the electronic device has been woken up, the interactive instruction in the voice signal processed by the front-end array processing module, and causing the electronic device to execute the interactive instruction.
2. The speech chip according to claim 1, characterized in that the front-end array processing module comprises:
an echo cancellation unit for performing echo removal processing on the voice signal;
a beamforming unit for performing beamforming on the voice signal;
a sound source localization unit for performing sound source localization on the voice signal;
a dereverberation unit for removing reverberation from the voice signal.
3. The speech chip according to claim 2, characterized in that the echo cancellation unit is specifically configured to:
apply a notch filter to the voice signal to remove the DC component and apply pre-emphasis, forming a first input signal;
filter the first input signal with the first-stage filter and store the resulting error-signal variance as Sff;
filter the first input signal with the second-stage filter and store the resulting error-signal variance as See;
output the final filtered signal based on Sff and See.
4. The speech chip according to claim 2, characterized in that the beamforming unit is specifically configured to:
apply a Fourier transform to the voice signal and compute the covariance of the voice signal;
perform an eigenvalue decomposition of the covariance and determine the maximum eigenvalue;
determine the eigenvector corresponding to the maximum eigenvalue;
compute the final enhanced signal based on the covariance and the eigenvector;
apply an inverse Fourier transform to the enhanced signal and convert it back to the time domain.
5. The speech chip according to claim 2, characterized in that the sound source localization unit is specifically configured to:
compute the time differences with which the voice signal arrives at the different microphones of the array, caused by the different propagation distances of the voice signal;
multiply the time differences by the speed of sound to obtain range differences;
compute a family of hyperboloids from the geometry and the range differences, and obtain the sound source position from the intersection of the hyperboloids.
6. The speech chip according to claim 2, characterized in that the dereverberation unit is specifically configured to:
apply a short-time Fourier transform to the voice signal;
compute the dereverberated frequency-domain signal from the transformed signal;
apply an inverse short-time Fourier transform to the frequency-domain signal and convert the transformed voice signal back to the time domain.
7. The speech chip according to claim 1, characterized in that the voice activity detection module is specifically configured to:
pre-process the voice signal;
extract FBank features frame by frame from the pre-processed signal using a filter-bank algorithm;
input the FBank features of each frame into a deep neural network model, which outputs, for each frame, the output probability of every phoneme in the phone set;
sum, for each frame, the output probabilities of all non-noise, non-silence phonemes;
if the sum exceeds a preset threshold, judge the corresponding frame to be a speech endpoint;
after the last frame has been judged, apply smoothing filtering to the previous decisions to obtain the final speech endpoint decisions.
8. The speech chip according to claim 1, characterized in that the voice wake-up module is specifically configured to:
extract speech features from the voice signal to obtain the corresponding speech feature vectors;
input the speech feature vectors into a DNN model to obtain the posterior probability that the corresponding voice signal is a keyword or a non-keyword;
smooth the posterior probabilities to obtain the corresponding confidence;
if the confidence exceeds a preset value, judge that the voice signal corresponding to the speech feature vectors contains the keyword;
wake up the electronic device if the keywords occur in the set order.
9. The speech chip according to claim 1, characterized in that the speech recognition module is specifically configured to:
extract the speech feature vectors from the voice signal;
decode the speech feature vectors to obtain the optimal output word sequence;
output the corresponding text based on the output word sequence;
determine the interactive instruction expressed by the text, and cause the electronic device to execute the interactive instruction.
10. An electronic device, characterized in that the electronic device comprises the speech chip of any one of claims 1-9.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811293499.4A | 2018-11-01 | 2018-11-01 | Speech chip and electronic device |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811293499.4A | 2018-11-01 | 2018-11-01 | Speech chip and electronic device |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109584896A | 2019-04-05 |
Family
ID=65921441

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811293499.4A (status: Pending) | Speech chip and electronic device | 2018-11-01 | 2018-11-01 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN109584896A (en) |
Cited By

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020228270A1 | 2019-05-10 | 2020-11-19 | 平安科技(深圳)有限公司 | Speech processing method and device, computer device and storage medium |
| CN110265029A | 2019-06-21 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | Speech chip and electronic equipment |
| CN111785289A | 2019-07-31 | 2020-10-16 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
| CN111785289B | 2019-07-31 | 2023-12-05 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
| CN110634483A | 2019-09-03 | 2019-12-31 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
| CN110634483B | 2019-09-03 | 2021-06-18 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
| US11620984B2 | 2019-09-03 | 2023-04-04 | Beijing Dajia Internet Information Technology Co., Ltd. | Human-computer interaction method, and electronic device and storage medium thereof |
| CN112599132A | 2019-09-16 | 2021-04-02 | 北京知存科技有限公司 | Voice processing device and method based on storage and calculation integrated chip and electronic equipment |
| CN110738991A | 2019-10-11 | 2020-01-31 | 东南大学 | Speech recognition equipment based on flexible wearable sensor |
| CN112672120A | 2019-10-15 | 2021-04-16 | 许桂林 | Projector with voice analysis function and personal health data generation method |
| CN112672120B | 2019-10-15 | 2022-09-09 | 许桂林 | Projector with voice analysis function and personal health data generation method |
| CN110830866A | 2019-10-31 | 2020-02-21 | 歌尔科技有限公司 | Voice assistant awakening method and device, wireless earphone and storage medium |
| CN110930979A | 2019-11-29 | 2020-03-27 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
| CN110930979B | 2019-11-29 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
| CN111048061A | 2019-12-27 | 2020-04-21 | 西安讯飞超脑信息科技有限公司 | Method, device and equipment for obtaining step length of echo cancellation filter |
| CN111392532A | 2020-04-07 | 2020-07-10 | 上海爱登堡电梯集团股份有限公司 | Elevator outbound call device with voice parameter setting function, elevator parameter debugging method and elevator |
| CN111508498A | 2020-04-09 | 2020-08-07 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, system, electronic device and storage medium |
| CN111724769A | 2020-04-22 | 2020-09-29 | 深圳市伟文无线通讯技术有限公司 | Production method of intelligent household voice recognition model |
| CN111599371A | 2020-05-19 | 2020-08-28 | 苏州奇梦者网络科技有限公司 | Voice adding method, system, device and storage medium |
| WO2022105861A1 | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Method and apparatus for recognizing voice, electronic device and medium |
| CN112599151A | 2020-12-07 | 2021-04-02 | 携程旅游信息技术(上海)有限公司 | Speech rate evaluation method, system, device and storage medium |
| CN114360517A | 2021-12-17 | 2022-04-15 | 天翼爱音乐文化科技有限公司 | Audio processing method and device in complex environment and storage medium |
| CN114944153A | 2022-07-26 | 2022-08-26 | 中诚华隆计算机技术有限公司 | Enhanced awakening method and device for terminal of Internet of things and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101888455A (en) * | 2010-04-09 | 2010-11-17 | 熔点网讯(北京)科技有限公司 | Self-adaptive echo counteracting method for frequency domain |
CN102750956A (en) * | 2012-06-18 | 2012-10-24 | 歌尔声学股份有限公司 | Method and device for removing reverberation of single channel voice |
CN103259563A (en) * | 2012-02-16 | 2013-08-21 | 联芯科技有限公司 | Self-adapting filter divergence detection method and echo cancellation system |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
CN107316649A (en) * | 2017-05-15 | 2017-11-03 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on artificial intelligence |
US20180293998A1 (en) * | 2017-04-11 | 2018-10-11 | Texas Instruments Incorporated | Methods and apparatus for low cost voice activity detector |
CN108648769A (en) * | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment |
CN108665895A (en) * | 2018-05-03 | 2018-10-16 | 百度在线网络技术(北京)有限公司 | Methods, devices and systems for handling information |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101888455A (en) * | 2010-04-09 | 2010-11-17 | 熔点网讯(北京)科技有限公司 | Self-adaptive echo counteracting method for frequency domain |
CN103259563A (en) * | 2012-02-16 | 2013-08-21 | 联芯科技有限公司 | Self-adapting filter divergence detection method and echo cancellation system |
CN102750956A (en) * | 2012-06-18 | 2012-10-24 | 歌尔声学股份有限公司 | Method and device for removing reverberation of single channel voice |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
US20180293998A1 (en) * | 2017-04-11 | 2018-10-11 | Texas Instruments Incorporated | Methods and apparatus for low cost voice activity detector |
CN107316649A (en) * | 2017-05-15 | 2017-11-03 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on artificial intelligence |
CN108648769A (en) * | 2018-04-20 | 2018-10-12 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and equipment |
CN108665895A (en) * | 2018-05-03 | 2018-10-16 | 百度在线网络技术(北京)有限公司 | Methods, devices and systems for handling information |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020228270A1 (en) * | 2019-05-10 | 2020-11-19 | 平安科技(深圳)有限公司 | Speech processing method and device, computer device and storage medium |
CN110265029A (en) * | 2019-06-21 | 2019-09-20 | 百度在线网络技术(北京)有限公司 | Speech chip and electronic equipment |
CN111785289A (en) * | 2019-07-31 | 2020-10-16 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
CN111785289B (en) * | 2019-07-31 | 2023-12-05 | 北京京东尚科信息技术有限公司 | Residual echo cancellation method and device |
CN110634483A (en) * | 2019-09-03 | 2019-12-31 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
CN110634483B (en) * | 2019-09-03 | 2021-06-18 | 北京达佳互联信息技术有限公司 | Man-machine interaction method and device, electronic equipment and storage medium |
US11620984B2 (en) | 2019-09-03 | 2023-04-04 | Beijing Dajia Internet Information Technology Co., Ltd. | Human-computer interaction method, and electronic device and storage medium thereof |
CN112599132A (en) * | 2019-09-16 | 2021-04-02 | 北京知存科技有限公司 | Voice processing device and method based on storage and calculation integrated chip and electronic equipment |
CN110738991A (en) * | 2019-10-11 | 2020-01-31 | 东南大学 | Speech recognition equipment based on flexible wearable sensor |
CN112672120B (en) * | 2019-10-15 | 2022-09-09 | 许桂林 | Projector with voice analysis function and personal health data generation method |
CN112672120A (en) * | 2019-10-15 | 2021-04-16 | 许桂林 | Projector with voice analysis function and personal health data generation method |
CN110830866A (en) * | 2019-10-31 | 2020-02-21 | 歌尔科技有限公司 | Voice assistant awakening method and device, wireless earphone and storage medium |
CN110930979B (en) * | 2019-11-29 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
CN110930979A (en) * | 2019-11-29 | 2020-03-27 | 百度在线网络技术(北京)有限公司 | Speech recognition model training method and device and electronic equipment |
CN111048061A (en) * | 2019-12-27 | 2020-04-21 | 西安讯飞超脑信息科技有限公司 | Method, device and equipment for obtaining step length of echo cancellation filter |
CN111392532A (en) * | 2020-04-07 | 2020-07-10 | 上海爱登堡电梯集团股份有限公司 | Elevator outbound call device with voice parameter setting function, elevator parameter debugging method and elevator |
CN111508498A (en) * | 2020-04-09 | 2020-08-07 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, system, electronic device and storage medium |
CN111508498B (en) * | 2020-04-09 | 2024-01-30 | 携程计算机技术(上海)有限公司 | Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium |
CN111724769A (en) * | 2020-04-22 | 2020-09-29 | 深圳市伟文无线通讯技术有限公司 | Production method of intelligent household voice recognition model |
CN111599371A (en) * | 2020-05-19 | 2020-08-28 | 苏州奇梦者网络科技有限公司 | Voice adding method, system, device and storage medium |
CN111599371B (en) * | 2020-05-19 | 2023-10-20 | 苏州奇梦者网络科技有限公司 | Voice adding method, system, device and storage medium |
WO2022105861A1 (en) * | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Method and apparatus for recognizing voice, electronic device and medium |
CN112599151A (en) * | 2020-12-07 | 2021-04-02 | 携程旅游信息技术(上海)有限公司 | Speech rate evaluation method, system, device and storage medium |
CN112599151B (en) * | 2020-12-07 | 2023-07-21 | 携程旅游信息技术(上海)有限公司 | Speech rate evaluation method, system, device and storage medium |
CN114360517A (en) * | 2021-12-17 | 2022-04-15 | 天翼爱音乐文化科技有限公司 | Audio processing method and device in complex environment and storage medium |
CN114360517B (en) * | 2021-12-17 | 2023-04-18 | 天翼爱音乐文化科技有限公司 | Audio processing method and device in complex environment and storage medium |
CN114944153A (en) * | 2022-07-26 | 2022-08-26 | 中诚华隆计算机技术有限公司 | Enhanced awakening method and device for terminal of Internet of things and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109584896A (en) | A kind of speech chip and electronic equipment | |
Zhang et al. | A speech enhancement algorithm by iterating single- and multi-microphone processing and its application to robust ASR |
US10373609B2 (en) | Voice recognition method and apparatus | |
US10304440B1 (en) | Keyword spotting using multi-task configuration | |
CN109427328B (en) | Multichannel voice recognition method based on filter network acoustic model | |
US20170154640A1 (en) | Method and electronic device for voice recognition based on dynamic voice model selection | |
CN109192200B (en) | Speech recognition method | |
US5594834A (en) | Method and system for recognizing a boundary between sounds in continuous speech | |
WO2015047517A1 (en) | Keyword detection | |
CN108962237A | Mixed speech recognition method, device and computer-readable storage medium |
US10460729B1 | Binary target acoustic trigger detection |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
AU684214B2 (en) | System for recognizing spoken sounds from continuous speech and method of using same | |
CN106548775A (en) | A kind of audio recognition method and system | |
Todkar et al. | Speaker recognition techniques: A review | |
CN110268471A (en) | The method and apparatus of ASR with embedded noise reduction | |
Ceolini et al. | Event-driven pipeline for low-latency low-compute keyword spotting and speaker verification system | |
CN111785302A (en) | Speaker separation method and device and electronic equipment | |
Li et al. | A Convolutional Neural Network with Non-Local Module for Speech Enhancement. | |
Wang et al. | A fusion model for robust voice activity detection | |
Kamble et al. | Teager energy subband filtered features for near and far-field automatic speech recognition | |
Nakamura et al. | Robot audition based acoustic event identification using a Bayesian model considering spectral and temporal uncertainties |
Agrawal et al. | Deep variational filter learning models for speech recognition | |
Nidhyananthan et al. | A review on speech enhancement algorithms and why to combine with environment classification | |
Pan et al. | Application of hidden Markov models in speech command recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | |
Application publication date: 2019-04-05