CN109272989B - Voice wake-up method, apparatus and computer readable storage medium - Google Patents
- Publication number: CN109272989B
- Application number: CN201810992991.4A
- Authority
- CN
- China
- Prior art keywords
- sound source
- beams
- voice
- wake
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications (CPC, all under G — Physics; G10 — Musical instruments, acoustics; G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
- G10L15/063 — Training of speech recognition systems; creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/0216 — Speech enhancement; noise filtering characterised by the method used for estimating noise
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; beamforming
Abstract
The disclosure relates to a voice wake-up method, apparatus, and computer-readable storage medium in the field of computer technology. The method comprises: beamforming a voice signal in a plurality of predetermined directions to obtain a plurality of beams; inputting the beams into a pre-trained keyword recognition model to obtain, for each beam, the probability that it contains the keyword; determining the beam pointing in the sound source direction (the sound source beam) according to the keyword probability and the signal quality of each beam; and determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments. Unlike existing pipelines, which localize the sound source first and then beamform toward it, this method decouples the beamforming algorithm from sound source localization, so that localization accuracy no longer constrains the beamforming direction. This improves the wake-up accuracy of the voice system and the user experience.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a voice wake-up method and apparatus, and a computer-readable storage medium.
Background
With the development of computer technology, the need for information exchange between humans and machines has become increasingly pressing. Voice, one of the most natural forms of human interaction, is also one of the principal ways people hope to communicate with computers in place of mouse and keyboard. As intelligent terminals such as smart homes, smart vehicles, and smart conference systems develop rapidly, voice wake-up technology, the entry point to these terminals, is receiving growing attention.
Speech communication is disturbed by the surrounding environment and the propagation medium (echo, reverberation, interfering sound sources, etc.), which sharply degrades a computer's ability to understand speech. Because noise arrives from all directions, capturing clean speech with a single microphone is very difficult. Current voice wake-up systems are therefore mostly based on microphone arrays, applying time-space domain processing to the speech collected by multiple microphones to suppress noise and enhance the speech.
The voice wake-up method known to the inventors generally comprises: collecting voice signals through a microphone array; preprocessing the signals; determining the angle and direction of the sound source by sound source localization and tracking; generating a beam pointing toward that angle and direction by beamforming; and passing the formed beam to a speech recognition system, which decides whether to wake up the system.
Disclosure of Invention
The inventors observe that current sound source localization methods fall broadly into three categories: steerable beamforming based on maximum output power, techniques based on time difference of arrival, and localization based on high-resolution spectral estimation. All three degrade sharply under heavy reverberation and noise interference and cannot accurately determine the angle and direction of the sound source, which directly harms subsequent speech recognition and thus the wake-up result.
One technical problem to be solved by the present disclosure is: how to improve the accuracy of voice awakening and improve the user experience.
According to some embodiments of the present disclosure, there is provided a voice wake-up method, including: beamforming a voice signal in a plurality of predetermined directions to obtain a plurality of beams; inputting the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains the keyword; determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments.
In some embodiments, inputting the beams into the pre-trained keyword recognition model comprises: selecting a subset of the beams according to their signal quality and inputting that subset into the pre-trained keyword recognition model.
In some embodiments, selecting the subset of beams according to their signal quality includes: determining the signal quality of each beam from at least one of its energy and signal-to-noise ratio within a fixed time window; and selecting the beams whose signal quality exceeds a signal quality threshold.
In some embodiments, determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and its signal quality comprises: weighting and summing the probability that a beam contains the keyword and the signal quality of the beam to obtain the importance of the beam; selecting the beam with the highest importance as the sound source beam; and determining the direction in which the sound source beam points as the sound source direction.
In some embodiments, determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments comprises: matching the sound source directions pointed to by the sound source beams at the consecutive moments and determining whether those sound source beams all contain the keyword; and waking up the system when the sound source directions pointed to by the sound source beams at the consecutive moments are consistent and the sound source beams at those moments all contain the keyword.
In some embodiments, beamforming the voice signal in the plurality of predetermined directions to obtain the plurality of beams comprises: determining the weight of each voice signal received by the microphones relative to a predetermined direction according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the steering vector of the predetermined direction; and, according to those weights, weighting and summing the voice signals received by the microphones to determine the beam in the predetermined direction.
In some embodiments, the weight of each voice signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k) = \frac{R_{nn,m}^{-1}(k)\, d_m(k)}{d_m^{H}(k)\, R_{nn,m}^{-1}(k)\, d_m(k)}, \qquad R_{nn,m}(k) = \alpha_{psn}\, d_{psn,m}(k)\, d_{psn,m}^{H}(k) + (1-\alpha_{psn})\, I$$

where $W_m(k)$ is the weight vector of the microphone signals relative to the predetermined direction during the $m$-th beam processing; $k$ indexes the frequency bands of the signals received by the microphones; $R_{nn,m}(k)$ is the noise covariance matrix during the $m$-th beam processing and $R_{nn,m}^{-1}(k)$ its inverse; $d_m(k)$ is the microphone-array steering vector of the predetermined direction during the $m$-th beam processing and $d_m^{H}(k)$ its conjugate transpose; $\alpha_{psn}$ is the proportion of point-source interference noise at the predetermined azimuth within the noise and $1-\alpha_{psn}$ the proportion of white noise; and $d_{psn,m}(k)$ is the steering vector of the point-source interference noise at the predetermined azimuth during the $m$-th beam processing, with $d_{psn,m}^{H}(k)$ its conjugate transpose.
In some embodiments, the method further comprises: performing the beamforming process on a voice signal in a plurality of predetermined directions to obtain a plurality of beams; labeling the beams with whether they contain the keyword to serve as training beams; and inputting the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, before beamforming the voice signal in the plurality of predetermined directions, the method further comprises: performing echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden markov model.
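The method embodiments above can be sketched end to end as a single decision function. The sketch below is illustrative only: `beamform`, `keyword_prob`, and `signal_quality` are injected placeholder callables standing in for the components described above, and the 0.5 keyword threshold and equal importance weighting are assumed values, not fixed by the disclosure.

```python
import numpy as np

def wake_decision(frames, beamform, keyword_prob, signal_quality,
                  n_consistent=3, w_quality=0.5):
    """Sketch of the disclosed flow.  For each time frame: beamform in M
    fixed directions, score each beam by keyword probability and signal
    quality, pick the most important beam as the sound-source beam, then
    wake only if the last n_consistent frames agree on the direction and
    all contain the keyword."""
    history = []  # per frame: (sound-source beam index, contains-keyword flag)
    for frame in frames:
        beams = beamform(frame)  # M enhanced signals, one per direction
        probs = np.array([keyword_prob(b) for b in beams])
        quality = np.array([signal_quality(b) for b in beams])
        # importance = weighted sum of signal quality and keyword probability
        importance = w_quality * quality + (1 - w_quality) * probs
        src = int(np.argmax(importance))
        history.append((src, bool(probs[src] > 0.5)))
    recent = history[-n_consistent:]
    if len(recent) < n_consistent:
        return False
    directions = {d for d, _ in recent}
    return len(directions) == 1 and all(k for _, k in recent)
```

Because the components are injected, the same skeleton works whichever beamformer or keyword model is plugged in.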
According to other embodiments of the present disclosure, there is provided a voice wake-up apparatus including: a beamforming module configured to beamform a voice signal in a plurality of predetermined directions to obtain a plurality of beams; a keyword recognition module configured to input the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains the keyword; a sound source determining module configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and its signal quality; and a voice wake-up module configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments.
In some embodiments, the apparatus further comprises a beam selection module configured to select a subset of the beams according to their signal quality and send them to the keyword recognition module, so that the keyword recognition module inputs the received beams into the pre-trained keyword recognition model.
In some embodiments, the beam selection module is configured to determine the signal quality of each beam from at least one of its energy and signal-to-noise ratio within a fixed time window, and to select the beams whose signal quality exceeds a signal quality threshold.
In some embodiments, the sound source determining module is configured to perform weighted summation on the probability that the beam includes the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed by the sound source beam as the sound source direction.
In some embodiments, the voice wake-up module is configured to match the sound source directions pointed to by the sound source beams at a plurality of consecutive moments, determine whether the sound source beams at those moments all contain the keyword, and wake up the system when the directions are consistent and all of those beams contain the keyword.
In some embodiments, the beam forming module is configured to determine a weight of each path of voice signals received by the microphone with respect to a predetermined direction according to a direction of the point source noise, a ratio of the point source noise to the white noise, and a directional vector of the predetermined direction, and perform weighted summation on each path of voice signals received by the microphone according to the weight of each path of voice signals received by the microphone with respect to the predetermined direction to determine a beam in the predetermined direction.
In some embodiments, the weight of each voice signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k) = \frac{R_{nn,m}^{-1}(k)\, d_m(k)}{d_m^{H}(k)\, R_{nn,m}^{-1}(k)\, d_m(k)}, \qquad R_{nn,m}(k) = \alpha_{psn}\, d_{psn,m}(k)\, d_{psn,m}^{H}(k) + (1-\alpha_{psn})\, I$$

where $W_m(k)$ is the weight vector of the microphone signals relative to the predetermined direction during the $m$-th beam processing; $k$ indexes the frequency bands of the signals received by the microphones; $R_{nn,m}(k)$ is the noise covariance matrix during the $m$-th beam processing and $R_{nn,m}^{-1}(k)$ its inverse; $d_m(k)$ is the microphone-array steering vector of the predetermined direction during the $m$-th beam processing and $d_m^{H}(k)$ its conjugate transpose; $\alpha_{psn}$ is the proportion of point-source interference noise at the predetermined azimuth within the noise and $1-\alpha_{psn}$ the proportion of white noise; and $d_{psn,m}(k)$ is the steering vector of the point-source interference noise at the predetermined azimuth during the $m$-th beam processing, with $d_{psn,m}^{H}(k)$ its conjugate transpose.
In some embodiments, the apparatus further comprises: and the model training module is used for carrying out a beam forming process on the voice signal in a plurality of preset directions to obtain a plurality of beams, carrying out keyword labeling on the beams to be used as training beams, and inputting the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
In some embodiments, the apparatus further comprises: and the echo cancellation module is used for carrying out echo cancellation on the voice signal received by the microphone.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden markov model.
According to still other embodiments of the present disclosure, there is provided a voice wake-up apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform the voice wake-up method of any of the preceding embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the preceding embodiments.
In the method above, the voice signal is beamformed in multiple directions to obtain multiple beams; the beams are input into the keyword recognition model to obtain the probability that each contains the keyword; the sound source beam is then selected according to the keyword probability and the signal quality of each beam; and whether to wake the system is decided from the feature matching results of the sound source beams at multiple consecutive moments. Instead of the existing localize-then-beamform wake-up process, the beamforming algorithm is decoupled from sound source localization, so that localization accuracy no longer affects the beamforming direction; this improves the wake-up accuracy of the voice system and the user experience.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
To illustrate the embodiments of the present disclosure more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present disclosure; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 illustrates a flow diagram of a voice wake-up method of some embodiments of the present disclosure.
Fig. 2 shows a flow diagram of a voice wake-up method of further embodiments of the present disclosure.
Fig. 3 shows a schematic structural diagram of a voice wake-up apparatus according to some embodiments of the present disclosure.
Fig. 4 shows a schematic structural diagram of a voice wake-up apparatus according to another embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a voice wake-up apparatus according to still other embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of a voice wake-up apparatus according to still other embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present disclosure, not all of them; the following description of at least one exemplary embodiment is merely illustrative and in no way limits the disclosure, its application, or its uses. All other embodiments obtained by those skilled in the art from the disclosed embodiments without creative effort fall within the protection scope of the present disclosure.
The present disclosure provides a voice wake-up method, and some embodiments of the voice wake-up method of the present disclosure are described below in conjunction with fig. 1.
Fig. 1 is a flow chart of some embodiments of a voice wake-up method of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S108.
In step S102, a voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
A plurality of microphones, i.e., a microphone array, can be arranged on the voice recognition system to receive the user's voice signals for wake-up. The speech signal may first be preprocessed; for example, the signals received by the microphone array may undergo echo cancellation using an acoustic echo cancellation (AEC) algorithm.
The preprocessed voice signals may be beamformed in a plurality of predetermined directions using a phased-array beamforming algorithm: M directions are preset (for example, evenly distributed on a circle), and the multichannel signals received by the microphone array are weighted and summed M times to form M output signals, each enhanced toward its own specific direction, i.e., the formed beams point in the M predetermined directions. The beamforming algorithm may be, for example, MVDR (Minimum Variance Distortionless Response), GSC (Generalized Sidelobe Canceller), or TF-GSC (Transfer-Function Generalized Sidelobe Canceller). Beamforming in a plurality of predetermined directions can be achieved with existing algorithms and is not described further here.
The present disclosure also provides an improved beamforming algorithm, described below.
In some embodiments, the weights of the voice signals received by the microphones relative to the predetermined direction are determined according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the steering vector of the predetermined direction; the beam in the predetermined direction is then determined by weighting and summing the microphone signals with those weights. Beamforming may be performed according to the following equations.
$$X_n(k,l) = \mathrm{FFT}\big(x_n(t)\big) \tag{1}$$

In formula (1), $x_n(t)$ is the speech signal received by the $n$-th microphone, and $\mathrm{FFT}(\cdot)$ denotes the fast Fourier transform. $X_n(k,l)$ is the short-time FFT value of $x_n(t)$ in the $k$-th frequency band of the $l$-th time segment, where $l$ indicates that the speech signal is windowed into $l$ segments processed separately, and $k$ indexes the frequency bands of each signal after the FFT.

$$Y_m(k,l) = \sum_{n=1}^{N} W_{m,n}(k)\, X_n(k,l) \tag{2}$$

$$y_m(t) = \mathrm{IFFT}\big(Y_m(k,l)\big) \tag{3}$$

In formulas (2) and (3), $y_m(t)$ is the output signal of the phased-array beam formed toward the $m$-th predetermined azimuth, $\mathrm{IFFT}(\cdot)$ denotes the inverse fast Fourier transform, $Y_m(k,l)$ is the short-time FFT value of $y_m(t)$ in the $k$-th band of the $l$-th segment, $N$ is the number of microphones, and $W_{m,n}(k)$ is the weight of the signal received by the $n$-th microphone in the $k$-th band during the $m$-th beam processing. The weights are taken to be the same for every time segment, so once $W_{m,n}(k)$ is determined, the signal of beam $m$ in the predetermined direction can be computed.

$$W_m(k) = \big[W_{m,1}(k), \ldots, W_{m,N}(k)\big]^{T} = \frac{R_{nn,m}^{-1}(k)\, d_m(k)}{d_m^{H}(k)\, R_{nn,m}^{-1}(k)\, d_m(k)} \tag{4}$$

In formula (4), $W_m(k)$ is the $N$-dimensional weight vector of the received signals relative to the predetermined direction during the $m$-th beam processing; $R_{nn,m}(k)$ is the noise covariance matrix during the $m$-th beam processing and $R_{nn,m}^{-1}(k)$ its inverse; $d_m(k)$ is the $N$-dimensional microphone-array steering (column) vector of the azimuth to be enhanced (i.e., the predetermined azimuth) during the $m$-th beam processing, set by choosing the predetermined direction, and $d_m^{H}(k)$ its conjugate transpose.

$$R_{nn,m}(k) = \alpha_{psn}\, d_{psn,m}(k)\, d_{psn,m}^{H}(k) + (1-\alpha_{psn})\, I \tag{5}$$

In formula (5), $\alpha_{psn}$ is the proportion of fixed-azimuth point-source interference noise within the noise and $1-\alpha_{psn}$ the proportion of white noise; $\alpha_{psn}$ may be obtained from testing or experience. $d_{psn,m}(k)$ is the steering vector of the fixed-azimuth point-source interference noise during the $m$-th beam processing and $d_{psn,m}^{H}(k)$ its conjugate transpose; it may likewise be obtained from testing or experience.
The beam signals in each predetermined direction can be calculated by the above formulas, and a plurality of beam forming processes can be executed in parallel to obtain a plurality of beams.
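The per-band weight computation of formulas (4) and (5) can be sketched in numpy as below. The uniform linear array geometry in `steering_vector` is an illustrative assumption; the patent does not fix the array shape, and real steering vectors would come from the actual microphone layout.

```python
import numpy as np

def steering_vector(n_mics, spacing_m, freq_hz, azimuth_rad, c=343.0):
    """Far-field steering vector of a uniform linear array (illustrative
    geometry; any array layout yields some steering vector d)."""
    delays = np.arange(n_mics) * spacing_m * np.cos(azimuth_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(d_look, d_noise, alpha_psn):
    """Weights for one frequency band following formulas (4)-(5):
    R_nn = a * d_n d_n^H + (1 - a) I,  W = R^-1 d / (d^H R^-1 d)."""
    n = len(d_look)
    r_nn = (alpha_psn * np.outer(d_noise, d_noise.conj())
            + (1 - alpha_psn) * np.eye(n))
    r_inv = np.linalg.inv(r_nn)
    num = r_inv @ d_look  # R^-1 d
    return num / (d_look.conj() @ num)
```

By construction the weights satisfy the distortionless constraint $W^H d = 1$ toward the predetermined direction while attenuating the point-source interference direction; running this for each band $k$ and each of the $M$ directions (which is trivially parallel) yields the full weight set.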
In step S104, the beam is input to a keyword recognition model trained in advance, and the probability that the beam includes the keyword is obtained.
The voice system decides whether to record subsequent speech and perform speech recognition by detecting keywords in the voice; that is, whether to wake the system is determined by retrieving keywords from the speech. The keyword recognition model is, for example, a deep learning model or a hidden Markov model. Deep learning models include DNNs (deep neural networks), RNNs (recurrent neural networks), CRNNs (convolutional recurrent neural networks), and so on; these are existing models and are not described further here. To train the keyword recognition model, a plurality of beams may be generated as in step S102 and labeled with whether they contain the keyword, forming training beams; the training beams are fed to the keyword recognition model for offline training, yielding the pre-trained model. At inference time, inputting a beam into the pre-trained model gives the probability that the beam contains the keyword.
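The label-then-train-offline loop can be sketched with a stand-in model. Everything here is an assumption for illustration: a logistic regression on two toy features replaces the deep learning or hidden Markov model, and `beam_features` is a hypothetical helper, not part of the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def beam_features(beam):
    """Toy per-beam features (log energy, zero-crossing rate, bias); a
    real system would use spectral features such as filterbank energies."""
    energy = np.log(np.mean(np.asarray(beam) ** 2) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(beam)))) / 2
    return np.array([energy, zcr, 1.0])

def train_keyword_model(beams, labels, lr=0.05, epochs=1000):
    """Offline training sketch: keyword-labeled training beams fit a
    logistic regression standing in for the keyword recognition model."""
    x = np.stack([beam_features(b) for b in beams])
    y = np.asarray(labels, dtype=float)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-x @ w))
        w -= lr * x.T @ (p - y) / len(y)  # gradient step on the log loss
    return w

def keyword_probability(w, beam):
    """Probability that a beam contains the keyword."""
    return float(1.0 / (1.0 + np.exp(-beam_features(beam) @ w)))
```

Swapping the logistic regression for a DNN/RNN/CRNN changes only the model, not the label-train-score pipeline around it.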
In step S106, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the signal quality of a beam is determined from at least one of its energy and signal-to-noise ratio within a fixed time window; the higher the energy and the signal-to-noise ratio within the window, the better the signal quality. For example, both quantities may be computed and their weighted sum used as the beam's signal quality, with the weights set according to actual requirements; the energy and signal-to-noise ratio may be normalized before weighting.
In some embodiments, the probability that a beam contains the keyword and the signal quality of the beam are weighted and summed to obtain the beam's importance; the beam with the highest importance is selected as the sound source beam, and the direction it points in is taken as the sound source direction. The beam in the sound source direction has better signal quality, and the recognized probability that it contains the keyword is higher, so the sound source beam can be chosen from both quantities. For example, the energy $P_k$ and signal-to-noise ratio $\mathrm{SNR}_k$ of the $k$-th of the $K$ candidate beams within a fixed time window are computed and normalized to $\bar P_k$ and $\overline{\mathrm{SNR}}_k$; the keyword recognition probability output by the keyword recognition model for the $k$-th beam is $\mathrm{NNscore}_k$; and the importance of the $k$-th beam is then the weighted sum of its normalized signal quality and $\mathrm{NNscore}_k$.
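The importance computation and sound-source selection reduce to a few lines; the equal 0.5/0.5 splits between energy and SNR and between quality and $\mathrm{NNscore}_k$ are illustrative choices, since the patent leaves the weights configurable.

```python
import numpy as np

def _normalize(v):
    """Min-max normalize to [0, 1]; a constant vector maps to zeros."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def pick_sound_source(powers, snrs, nn_scores, w_quality=0.5):
    """Importance = weighted sum of normalized signal quality (here an
    equal-weight mix of energy and SNR) and the keyword probability
    NNscore_k; returns the index of the sound-source beam."""
    quality = 0.5 * _normalize(powers) + 0.5 * _normalize(snrs)
    importance = (w_quality * quality
                  + (1 - w_quality) * np.asarray(nn_scores, dtype=float))
    return int(np.argmax(importance))
```

Note that a quiet beam with a high keyword score can still lose to a strong beam, which is exactly the intended joint criterion.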
in step S108, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
Whether to wake the system could be decided directly from whether the keyword probability of the sound source beam exceeds a threshold, but matching the features of the sound source beams over several consecutive moments further improves wake-up accuracy.
In some embodiments, the sound source directions pointed to by the sound source beams at the current moment and a preset number of preceding consecutive moments are matched, and it is determined whether the sound source beams at those moments all contain the keyword; when the directions are consistent and all of the beams contain the keyword, the system is woken up, and otherwise it is not. That is, wake-up is confirmed by the consistency of the keyword recognition and localization results at moments $t-p, t-p+1, \ldots, t-1, t$: if the results agree across these moments, the system is woken up; otherwise it is not.
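The consecutive-moment check can be kept as streaming state; a minimal sketch, with the window length as a tunable parameter rather than a value fixed by the disclosure:

```python
from collections import deque

class WakeMatcher:
    """Streaming form of the wake condition: wake only when the
    sound-source direction is consistent and the keyword is detected at
    moments t-p, ..., t (a window of p+1 frames)."""

    def __init__(self, window=4):
        self.history = deque(maxlen=window)  # (direction, has_keyword) pairs

    def update(self, direction, has_keyword):
        """Feed one frame's result; returns True when wake-up fires."""
        self.history.append((direction, bool(has_keyword)))
        if len(self.history) < self.history.maxlen:
            return False  # not enough history yet
        directions = {d for d, _ in self.history}
        return len(directions) == 1 and all(k for _, k in self.history)
```

The bounded `deque` drops the oldest frame automatically, so a direction change simply delays wake-up until the window refills with consistent results.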
In the method of the above embodiment, the voice signal is beamformed in multiple directions to obtain multiple beams; the beams are input into the keyword recognition model, which outputs the probability that each beam contains the keyword; a sound source beam is selected based on that probability and on the beam's signal quality; and whether to wake up the system is determined from the feature matching results of the sound source beam at multiple instants. Unlike the existing pipeline, which first localizes the sound source and then performs voice wake-up, this method decouples the beamforming algorithm from the sound source localization algorithm, so that localization accuracy no longer affects the beamforming direction. This improves the wake-up accuracy of the voice system and the user experience.
Further embodiments of the disclosed voice wake-up method are described below in conjunction with fig. 2.
Fig. 2 is a flowchart of another embodiment of a voice wake-up method according to the present disclosure. As shown in fig. 2, the method of this embodiment includes: steps S202 to S214.
In step S202, a speech signal of a user is received through a microphone array.
In step S204, echo cancellation is performed on the multi-path speech signals received by the microphone array.
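The text does not fix the echo-cancellation algorithm for step S204; one common choice is an NLMS adaptive filter, sketched here with an assumed filter length and step size:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """NLMS adaptive echo canceller (an illustrative choice, not the
    patent's): estimate the echo of the loudspeaker reference `ref`
    in the microphone signal `mic` and subtract it."""
    w = np.zeros(taps)           # adaptive filter coefficients
    buf = np.zeros(taps)         # most recent reference samples
    out = np.zeros(len(mic))
    for t in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[t]
        echo_hat = w @ buf                      # estimated echo sample
        e = mic[t] - echo_hat                   # error = echo-free output
        w += mu * e * buf / (buf @ buf + eps)   # normalized LMS update
        out[t] = e
    return out
```

After convergence the output contains the near-end speech with the loudspeaker echo largely removed.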
In step S206, the received voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
In step S208, a subset of the beams is selected according to their signal quality.
In some embodiments, the signal quality of a beam is determined from at least one of its energy and its signal-to-noise ratio within a fixed time window, and the beams whose signal quality exceeds a signal quality threshold are selected. For example, the signal quality of a beam is a weighted combination of its energy and signal-to-noise ratio within a fixed time window, with the weights set according to actual requirements. Concretely, the energy power_k and the signal-to-noise ratio SNR_k of each beam within the fixed time window are computed and normalized to obtain norm_power_k and norm_SNR_k, and a signal quality score is computed for each beam, e.g.

score_k = w1 · norm_power_k + w2 · norm_SNR_k,   k = 1, 2, …, M.

The beams with score_k above the signal quality threshold are selected, or alternatively the beams whose signal quality ranks above a predetermined rank.
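A sketch of the pre-selection under the same assumptions (the weights and the rank cut are mine; the text allows either a threshold or a rank criterion):

```python
import numpy as np

def preselect_beams(powers, snrs, w_power=0.5, w_snr=0.5, top_n=4):
    """Score each beam from its normalized energy and SNR over a fixed
    time window and keep the best top_n beams."""
    p = np.asarray(powers, dtype=float)
    s = np.asarray(snrs, dtype=float)
    p_norm = (p - p.min()) / (np.ptp(p) + 1e-12)
    s_norm = (s - s.min()) / (np.ptp(s) + 1e-12)
    score = w_power * p_norm + w_snr * s_norm
    order = np.argsort(score)[::-1]          # best first
    return sorted(order[:top_n].tolist()), score
```

Only the returned beam indices are passed on to the keyword recognition model, which is where the computational saving comes from.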
Selecting the better-quality beams in this way reduces the computation of the subsequent steps and improves both system efficiency and wake-up accuracy.
In step S210, the selected partial beams are input into a pre-trained keyword recognition model, so as to obtain the probability of the beams including the keywords.
In step S212, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In step S214, whether to wake up the system is determined from the feature matching results of the sound source beam at a plurality of consecutive time instants.
The present disclosure also provides a voice wake-up apparatus, which is described below with reference to fig. 3.
Fig. 3 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 3, the apparatus 30 of this embodiment includes: a beam forming module 302, a keyword recognition module 304, a sound source determination module 306, and a voice wake-up module 308.
The beam forming module 302 is configured to perform beam forming on the voice signal in a plurality of predetermined directions, so as to obtain a plurality of beams.
In some embodiments, the beam forming module 302 is configured to determine a weight of each voice signal received by the microphone with respect to a predetermined direction according to a direction of the point source noise, a ratio of the point source noise to the white noise, and a directional vector of the predetermined direction, and perform weighted summation on each voice signal received by the microphone according to the weight of each voice signal received by the microphone with respect to the predetermined direction to determine a beam in the predetermined direction.
In some embodiments, beamforming may be performed according to the following formulas, the same as in the method embodiments above.
X_n(k, l) = fft(x_n(t))    (1)

wherein x_n(t) is the speech signal received by the n-th microphone, fft(·) denotes the Fast Fourier Transform (FFT), and X_n(k, l) is the STFT magnitude of x_n(t) in the k-th frequency band of the l-th time frame; l indicates that the speech signal is windowed into frames that are processed separately, and k indexes the frequency bands of each speech signal after the FFT.

Y_m(k, l) = Σ_{n=1…N} W_{m,n}(k) · X_n(k, l)    (2)

wherein W_{m,n}(k) is the weight of the voice signal received by the n-th microphone in the k-th frequency band during the m-th beam processing (the weights can be taken as the same for every time frame l).

y_m(t) = ifft(Y_m(k, l))    (3)

wherein y_m(t) is the output signal of the beam formed toward the m-th predetermined azimuth, ifft(·) denotes the inverse fast Fourier transform, and Y_m(k, l) is the STFT magnitude of the k-th band of the l-th frame of y_m(t).

As these formulas show, once the weights W_{m,n}(k) are determined, the signal Y_m of the beam in the m-th predetermined direction can be determined.

The weight vector is computed as

W_m(k) = R_NN^{-1}(k) a_m(k) / (a_m^H(k) R_NN^{-1}(k) a_m(k))

wherein W_m(k) = [W_{m,1}(k), …, W_{m,N}(k)]^T is the N-dimensional weight vector, relative to the predetermined direction, of the voice signals received by the microphones during the m-th beam processing; R_NN(k) is the covariance matrix of the noise during the m-th beam processing, and R_NN^{-1}(k) is its inverse; a_m(k) is the N-dimensional microphone-array steering vector of the azimuth to be enhanced (i.e. the predetermined azimuth) during the m-th beam processing, and a_m^H(k) is its conjugate transpose. a_m(k) is set by the predetermined direction.

The noise covariance is modeled as

R_NN(k) = α_psn · a_psn(k) a_psn^H(k) + (1 − α_psn) · I

wherein α_psn is the proportion of fixed-azimuth point-source interference noise in the noise and 1 − α_psn is the proportion of white noise; α_psn may be obtained from testing or experience. a_psn(k) is the steering vector of the fixed-azimuth point-source interference noise during the m-th beam processing, and a_psn^H(k) is its conjugate transpose; a_psn(k) may also be obtained from testing or experience.
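The weight computation can be sketched in a few lines (the steering vectors in the test are toy values; only the formula W = R⁻¹a / (aᴴR⁻¹a) and the point-source-plus-white-noise covariance model come from the text):

```python
import numpy as np

def mvdr_weights(a_m, a_psn, alpha_psn):
    """Weight vector for one frequency band and one predetermined
    direction: R_NN is a point-source term plus a white-noise identity,
    and W_m = R_NN^-1 a_m / (a_m^H R_NN^-1 a_m)."""
    n = len(a_m)
    r_nn = alpha_psn * np.outer(a_psn, a_psn.conj()) + (1 - alpha_psn) * np.eye(n)
    r_inv = np.linalg.inv(r_nn)
    num = r_inv @ a_m
    den = a_m.conj() @ r_inv @ a_m  # a^H R^-1 a (real for Hermitian R)
    return num / den
```

A quick sanity check of an implementation is the distortionless property W_m^H a_m = 1, which holds by construction.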
The keyword recognition module 304 is configured to input the beam into a pre-trained keyword recognition model to obtain a probability that the beam includes the keyword.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden markov model.
The sound source determining module 306 is configured to determine a beam pointing to a sound source direction as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the sound source determining module 306 is configured to perform weighted summation on the probability that the beam includes the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed by the sound source beam as the sound source direction.
The voice wake-up module 308 is configured to determine whether to wake up the system according to the feature matching results of the sound source beam at multiple consecutive time instants.
In some embodiments, the voice wake-up module 308 is configured to match the sound source directions pointed to by the sound source beams at multiple consecutive instants and to determine whether those beams all contain the keyword, and to wake up the system when the sound source directions at the consecutive instants are consistent and the sound source beams at all of them contain the keyword.
Further embodiments of the disclosed voice wake-up apparatus are described below in conjunction with fig. 4.
Fig. 4 is a block diagram of another embodiment of a voice wakeup device according to the present disclosure. As shown in fig. 4, the apparatus 40 of this embodiment includes: an echo cancellation module 402, a beam forming module 404, a beam selection module 406, a keyword recognition module 408, a sound source determination module 410, a voice wake-up module 412, and a model training module 414.
The echo cancellation module 402 is used for performing echo cancellation on a voice signal received through a microphone.
The beam forming module 404 is configured to perform beam forming on the voice signal in a predetermined plurality of directions, so as to obtain a plurality of beams. The beamforming module 404 functions the same as the beamforming module 302.
The beam selection module 406 is configured to select a part of the beams according to the signal quality of the beams, and send the part of the beams to the keyword recognition module, so that the keyword recognition module 408 inputs the received beams into a keyword recognition model trained in advance.
In some embodiments, the beam selection module 406 is configured to determine a signal quality of a beam based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window; and selecting partial beams with the signal quality higher than the signal quality threshold.
The keyword recognition module 408 is configured to input the beam into a pre-trained keyword recognition model to obtain a probability that the beam contains the keyword. The keyword recognition module 408 functions the same as the keyword recognition module 304.
The sound source determining module 410 is configured to determine a beam pointing to a sound source direction as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam. The sound source determination module 410 is functionally identical to the sound source determination module 306.
The voice wake-up module 412 is configured to determine whether to wake up the system according to the feature matching results of the sound source beam at multiple consecutive time instants. The voice wake-up module 412 functions the same as the voice wake-up module 308.
The model training module 414 is configured to perform a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams, perform keyword labeling on the plurality of beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
The model training module 414 may also be configured to receive the multiple beams obtained by the beam forming module 404 or the multiple beams obtained by the beam selecting module 406, perform keyword labeling on the multiple beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
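The labeling step can be illustrated with a toy sketch (the feature extraction below is a placeholder of my own; the patent leaves the input representation to the keyword recognition model, which may be a deep learning model or a hidden Markov model):

```python
import numpy as np

def make_training_set(beams, keyword_present):
    """Hypothetical labeling: pair each beam's feature vector with a
    0/1 keyword label. The log-magnitude histogram used as a feature
    here is purely a placeholder."""
    X, y = [], []
    for beam, label in zip(beams, keyword_present):
        feats = np.log(np.histogram(np.abs(beam), bins=8)[0] + 1.0)
        X.append(feats)
        y.append(int(label))
    return np.array(X), np.array(y)
```

The resulting (X, y) pairs would then be fed to whatever keyword recognition model is being trained.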
The voice wake-up apparatuses in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, which are described below in conjunction with fig. 5 and fig. 6.
Fig. 5 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 5, the apparatus 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 configured to perform a voice wake-up method in any of the embodiments of the present disclosure based on instructions stored in the memory 510.
Memory 510 may include, for example, system memory and fixed non-volatile storage media. The system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader), a database, and other programs.
Fig. 6 is a block diagram of another embodiment of a voice wake-up apparatus according to the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes: a memory 610 and a processor 620, similar to the memory 510 and the processor 520, respectively. It may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices, such as a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
The present disclosure also provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.
Claims (20)
1. A voice wake-up method, comprising:
carrying out beam forming on the voice signals in a plurality of preset directions to obtain a plurality of beams;
inputting the wave beam into a pre-trained keyword recognition model to obtain the probability of the wave beam containing the keyword;
determining a beam pointing to the direction of a sound source as a sound source beam according to the probability that the beam contains the keywords and the signal quality of the beam;
determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of continuous moments;
wherein, the determining whether to wake up the system according to the feature matching result of the sound source beams at a plurality of continuous moments comprises:
matching sound source directions pointed by sound source beams at a plurality of continuous moments, and determining whether the sound source beams at the plurality of continuous moments contain keywords or not;
and when the sound source directions pointed by the sound source beams at a plurality of continuous moments are consistent and the sound source beams at the plurality of continuous moments all contain keywords, waking up the system.
2. The voice wake-up method of claim 1, wherein,
the inputting the beam into a pre-trained keyword recognition model comprises:
and selecting partial beams to input a pre-trained keyword recognition model according to the signal quality of the beams.
3. The voice wake-up method of claim 2, wherein,
said selecting a portion of beams based on the signal quality of the beams comprises:
determining a signal quality of the beam based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window;
and selecting partial beams with the signal quality higher than the signal quality threshold.
4. The voice wake-up method of claim 1, wherein,
determining a beam pointing to a sound source direction according to the probability that the beam contains the keyword and the signal quality of the beam, wherein the determining the beam as the sound source beam comprises:
carrying out weighted summation on the probability that the beam contains the keywords and the signal quality of the beam to obtain the importance degree of the beam;
and selecting the wave beam with the highest importance degree as a sound source wave beam, and determining the direction pointed by the sound source wave beam as the sound source direction.
5. The voice wake-up method of claim 1, wherein,
the beamforming the voice signal in a plurality of predetermined directions to obtain a plurality of beams includes:
determining the weight of each path of voice signals received by a microphone relative to a preset direction according to the direction of point source noise, the proportion of the point source noise to the white noise and a directional vector of the preset direction;
and according to the weight of each path of voice signal received by the microphone relative to the preset direction, carrying out weighted summation on each path of voice signal received by the microphone, and determining the wave beam in the preset direction.
6. The voice wake-up method of claim 5, wherein,
the weight of each path of voice signals received by the microphone relative to the preset direction is calculated according to the following formula:
wherein W_m(k) = R_NN^{-1}(k) a_m(k) / (a_m^H(k) R_NN^{-1}(k) a_m(k)) is the weight vector, relative to the predetermined direction, of each path of voice signal received by the microphones during the m-th beam processing; k is the index of the frequency bands of the signals received by the microphones; R_NN(k) is the covariance matrix of the noise during the m-th beam processing, and R_NN^{-1}(k) is the inverse of that matrix; a_m(k) is the microphone-array steering vector of the predetermined direction during the m-th beam processing, and a_m^H(k) is the conjugate transpose of a_m(k); α_psn is the proportion, in the noise, of the point-source interference noise at the predetermined azimuth, and 1 − α_psn is the proportion of white noise in the noise; a_psn(k) is the steering vector of the predetermined-azimuth point-source interference noise during the m-th beam processing, and a_psn^H(k) is the conjugate transpose of a_psn(k), the noise covariance being R_NN(k) = α_psn · a_psn(k) a_psn^H(k) + (1 − α_psn) · I.
7. The voice wake-up method of claim 1 further comprising:
performing a beam forming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams;
labeling keywords of the multiple beams to serve as training beams;
and inputting the training wave beam into a keyword recognition model for training to obtain a pre-trained keyword recognition model.
8. The voice wake-up method of claim 1, wherein,
before the beamforming the voice signal in a predetermined plurality of directions, the method further comprises:
the voice signal received through the microphone is subjected to echo cancellation.
9. Voice wake-up method according to any of the claims 1 to 8,
the keyword recognition model includes: a deep learning model or a hidden markov model.
10. A voice wake-up apparatus comprising:
the device comprises a beam forming module, a processing module and a processing module, wherein the beam forming module is used for carrying out beam forming on a voice signal in a plurality of preset directions to obtain a plurality of beams;
the keyword identification module is used for inputting the beam into a keyword identification model trained in advance to obtain the probability of the beam containing the keyword;
the sound source determining module is used for determining a wave beam pointing to the sound source direction as a sound source wave beam according to the probability that the wave beam contains the keywords and the signal quality of the wave beam;
the voice awakening module is used for determining whether to awaken the system or not according to the feature matching results of the sound source wave beams at a plurality of continuous moments;
the voice awakening module is used for matching the sound source directions pointed by the sound source beams at a plurality of continuous moments and determining whether the sound source beams at the plurality of continuous moments contain keywords, and awakening the system under the condition that the sound source directions pointed by the sound source beams at the plurality of continuous moments are consistent and the sound source beams at the plurality of continuous moments contain the keywords.
11. The voice wake-up apparatus according to claim 10, further comprising:
and the beam selection module is used for selecting partial beams according to the signal quality of the beams and sending the partial beams to the keyword recognition module so that the keyword recognition module can input the received beams into a keyword recognition model trained in advance.
12. The voice wake-up device of claim 11, wherein,
the beam selection module is used for determining the signal quality of the beam according to at least one of the energy and the signal-to-noise ratio of the beam in a fixed time window; and selecting partial beams with the signal quality higher than the signal quality threshold.
13. The voice wake-up device of claim 10, wherein,
the sound source determining module is used for weighting and summing the probability that the wave beam contains the keywords and the signal quality of the wave beam to obtain the importance degree of the wave beam, selecting the wave beam with the highest importance degree as a sound source wave beam, and determining the direction pointed by the sound source wave beam as the sound source direction.
14. The voice wake-up device of claim 10, wherein,
the beam forming module is used for determining the weight of each path of voice signals received by the microphone relative to the preset direction according to the direction of point source noise, the proportion of the point source noise and white noise and the directional vector of the preset direction, and carrying out weighted summation on each path of voice signals received by the microphone according to the weight of each path of voice signals received by the microphone relative to the preset direction to determine the beam of the preset direction.
15. The voice wake-up device of claim 14, wherein,
the weight of each path of voice signals received by the microphone relative to the preset direction is calculated according to the following formula:
wherein W_m(k) = R_NN^{-1}(k) a_m(k) / (a_m^H(k) R_NN^{-1}(k) a_m(k)) is the weight vector, relative to the predetermined direction, of each path of voice signal received by the microphones during the m-th beam processing; k is the index of the frequency bands of the signals received by the microphones; R_NN(k) is the covariance matrix of the noise during the m-th beam processing, and R_NN^{-1}(k) is the inverse of that matrix; a_m(k) is the microphone-array steering vector of the predetermined direction during the m-th beam processing, and a_m^H(k) is the conjugate transpose of a_m(k); α_psn is the proportion, in the noise, of the point-source interference noise at the predetermined azimuth, and 1 − α_psn is the proportion of white noise in the noise; a_psn(k) is the steering vector of the predetermined-azimuth point-source interference noise during the m-th beam processing, and a_psn^H(k) is the conjugate transpose of a_psn(k), the noise covariance being R_NN(k) = α_psn · a_psn(k) a_psn^H(k) + (1 − α_psn) · I.
16. The voice wake-up apparatus according to claim 10, further comprising:
and the model training module is used for carrying out a beam forming process on the voice signal in a plurality of preset directions to obtain a plurality of beams, carrying out keyword labeling on the beams to be used as training beams, and inputting the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
17. The voice wake-up apparatus according to claim 10, further comprising:
and the echo cancellation module is used for carrying out echo cancellation on the voice signal received by the microphone.
18. Voice wake-up device according to any of the claims 10 to 17,
the keyword recognition model includes: a deep learning model or a hidden markov model.
19. A voice wake-up apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the voice wake-up method of any of claims 1-9 based on instructions stored in the memory.
20. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201810992991.4A | 2018-08-29 | 2018-08-29 | Voice wake-up method, apparatus and computer readable storage medium
Publications (2)

Publication Number | Publication Date
---|---
CN109272989A (en) | 2019-01-25
CN109272989B (en) | 2021-08-10
Also Published As
Publication number | Publication date |
---|---|
CN109272989A (en) | 2019-01-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109272989B (en) | Voice wake-up method, apparatus and computer readable storage medium | |
Huang et al. | Source localization using deep neural networks in a shallow water environment | |
CN107102296B (en) | Sound source positioning system based on distributed microphone array | |
CN110556103B (en) | Audio signal processing method, device, system, equipment and storage medium | |
Nguyen et al. | Robust source counting and DOA estimation using spatial pseudo-spectrum and convolutional neural network | |
CN109712611B (en) | Joint model training method and system | |
Takeda et al. | Discriminative multiple sound source localization based on deep neural networks using independent location model | |
Varanasi et al. | A deep learning framework for robust DOA estimation using spherical harmonic decomposition | |
Salvati et al. | Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions | |
CN110503969A (en) | Audio data processing method, device and storage medium
CN109509465B (en) | Voice signal processing method, assembly, equipment and medium | |
WO2019080551A1 (en) | Target voice detection method and apparatus | |
CN110610718B (en) | Method and device for extracting expected sound source voice signal | |
Yu et al. | Adversarial network bottleneck features for noise robust speaker verification | |
CN112349297A (en) | Depression detection method based on microphone array | |
WO2022218134A1 (en) | Multi-channel speech detection system and method | |
CN108549052A (en) | Circular harmonic domain pseudo-intensity sound source localization method with joint time-frequency-spatial domain weighting
CN115775564B (en) | Audio processing method, device, storage medium and intelligent glasses | |
CN112712818A (en) | Voice enhancement method, device and equipment | |
CN106019230B (en) | Sound source localization method based on i-vector speaker identification
CN113314127A (en) | Space orientation-based bird song recognition method, system, computer device and medium | |
CN118053443A (en) | Target speaker tracking method and system with selective hearing | |
CN116559778B (en) | Vehicle whistle positioning method and system based on deep learning | |
Feng et al. | Soft label coding for end-to-end sound source localization with ad-hoc microphone arrays | |
Girin et al. | Audio source separation into the wild |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||