CN109272989B - Voice wake-up method, apparatus and computer readable storage medium - Google Patents


Info

Publication number
CN109272989B
CN109272989B (Application No. CN201810992991.4A)
Authority
CN
China
Prior art keywords
sound source
beams
voice
wake
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810992991.4A
Other languages
Chinese (zh)
Other versions
CN109272989A (en)
Inventor
徐晴晴
陈宇
杨楠
耿岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201810992991.4A
Publication of CN109272989A
Application granted
Publication of CN109272989B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to a voice wake-up method and apparatus and a computer-readable storage medium, in the field of computer technology. The method of the present disclosure comprises: beamforming a voice signal in a plurality of predetermined directions to obtain a plurality of beams; inputting the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; determining the beam pointing in the direction of the sound source as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments. The method does not follow the existing localize-then-beamform wake-up process; instead, the beamforming algorithm is decoupled from the sound source localization algorithm, so that localization accuracy no longer affects the beamforming direction, the wake-up accuracy of the voice system is improved, and the user experience is improved.

Description

Voice wake-up method, apparatus and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a voice wake-up method and apparatus, and a computer-readable storage medium.
Background
With the development of computer technology, the need for information exchange between humans and machines has become increasingly urgent. Voice, one of the most natural means of human interaction, is also one of the main ways people hope to communicate with computers in place of the mouse and keyboard. With the rapid development of intelligent terminals such as smart homes, smart vehicles, and intelligent conference systems, voice wake-up technology, as the entrance to these intelligent terminals, is receiving more and more attention.
The voice communication process is subject to interference from the surrounding environment and the propagation medium (such as echo, reverberation, and interfering sound sources), which sharply degrades the computer's comprehension of speech. Since noise interference comes from all directions, capturing clean speech with a single microphone is very difficult. Current voice wake-up systems are therefore mainly based on microphone arrays, performing time-space domain processing on the speech collected by multiple microphones to achieve noise suppression and speech enhancement.
The voice wake-up method known to the inventors generally comprises the following steps: collecting voice signals through a microphone array; preprocessing the voice signals; determining the angle and direction of the sound source through sound source localization and tracking; generating a beam pointing toward the sound source through beamforming; and passing the formed beam to a speech recognition system to determine whether to wake up the system.
Disclosure of Invention
The inventor finds that current sound source localization can be broadly divided into three categories according to the localization principle: steerable beamforming based on maximum output power, techniques based on time difference of arrival, and localization based on high-resolution spectral estimation. The performance of all three classes of localization algorithms degrades sharply in environments with severe reverberation and noise interference, and the angle and direction of the sound source cannot be located accurately, which directly affects subsequent speech recognition and thus the wake-up result.
One technical problem to be solved by the present disclosure is: how to improve the accuracy of voice awakening and improve the user experience.
According to some embodiments of the present disclosure, there is provided a voice wake-up method, including: beamforming a voice signal in a plurality of predetermined directions to obtain a plurality of beams; inputting the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments.
In some embodiments, inputting the beams into the pre-trained keyword recognition model comprises: selecting some of the beams according to their signal quality and inputting the selected beams into the pre-trained keyword recognition model.
In some embodiments, selecting some of the beams according to their signal quality includes: determining the signal quality of a beam according to at least one of its energy and signal-to-noise ratio within a fixed time window; and selecting the beams whose signal quality is above a signal quality threshold.
In some embodiments, determining the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and its signal quality comprises: weighting and summing the probability that a beam contains the keyword and the signal quality of the beam to obtain the importance of the beam; and selecting the beam with the highest importance as the sound source beam and determining the direction it points to as the sound source direction.
In some embodiments, determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments comprises: matching the sound source directions pointed to by the sound source beams at the consecutive moments, and determining whether the sound source beams at those moments all contain the keyword; and waking up the system when the sound source directions at the consecutive moments are consistent and the sound source beams at all of those moments contain the keyword.
In some embodiments, beamforming the voice signal in the predetermined plurality of directions to obtain the plurality of beams comprises: determining the weight of each path of voice signal received by the microphones relative to a predetermined direction according to the direction of the point source noise, the proportions of the point source noise and the white noise, and the steering vector of the predetermined direction; and weighting and summing the paths of voice signals received by the microphones according to those weights to determine the beam in the predetermined direction.
In some embodiments, the weight of each path of speech signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k)=\left[w_m^{1}(k),\,w_m^{2}(k),\,\dots,\,w_m^{N}(k)\right]^T$$

$$W_m(k)=\frac{R_m^{-1}(k)\,d_m(k)}{d_m^{H}(k)\,R_m^{-1}(k)\,d_m(k)}$$

$$R_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^{H}(k)+(1-\alpha_{psn})\,I$$

where $W_m(k)$ is the weight vector of the paths of speech signal received by the microphones relative to the predetermined direction during the m-th beamforming process, $k$ is the index of the frequency bands of the signals received by the microphones, $R_m(k)$ is the covariance matrix of the noise during the m-th beamforming process, $R_m^{-1}(k)$ is its matrix inverse, $d_m(k)$ is the microphone array steering vector of the predetermined direction during the m-th beamforming process, $d_m^{H}(k)$ is its conjugate transpose, $\alpha_{psn}$ is the proportion of point-source interference noise at the predetermined azimuth in the noise, $1-\alpha_{psn}$ is the proportion of white noise in the noise, $d_{psn,m}(k)$ is the steering vector of the point-source interference noise at the predetermined azimuth during the m-th beamforming process, and $d_{psn,m}^{H}(k)$ is its conjugate transpose.
In some embodiments, the method further comprises: performing the beamforming process on a voice signal in the plurality of predetermined directions to obtain a plurality of beams; labeling the beams with keywords to serve as training beams; and inputting the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, before beamforming the voice signal in the predetermined plurality of directions, the method further comprises: performing echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden markov model.
According to other embodiments of the present disclosure, there is provided a voice wake-up apparatus, including: a beamforming module configured to beamform a voice signal in a plurality of predetermined directions to obtain a plurality of beams; a keyword recognition module configured to input the beams into a pre-trained keyword recognition model to obtain the probability that each beam contains a keyword; a sound source determination module configured to determine the beam pointing in the sound source direction as the sound source beam according to the probability that each beam contains the keyword and the signal quality of each beam; and a voice wake-up module configured to determine whether to wake up the system according to the feature matching results of the sound source beams at a plurality of consecutive moments.
In some embodiments, the apparatus further comprises: a beam selection module configured to select some of the beams according to their signal quality and send them to the keyword recognition module, so that the keyword recognition module inputs the received beams into the pre-trained keyword recognition model.
In some embodiments, the beam selection module is configured to determine a signal quality of a beam based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window; and selecting partial beams with the signal quality higher than the signal quality threshold.
In some embodiments, the sound source determining module is configured to perform weighted summation on the probability that the beam includes the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed by the sound source beam as the sound source direction.
In some embodiments, the voice wake-up module is configured to match the sound source directions pointed to by the sound source beams at a plurality of consecutive moments and determine whether the sound source beams at those moments all contain the keyword, and to wake up the system if the sound source directions at the consecutive moments are consistent and the sound source beams at all of those moments contain the keyword.
In some embodiments, the beam forming module is configured to determine a weight of each path of voice signals received by the microphone with respect to a predetermined direction according to a direction of the point source noise, a ratio of the point source noise to the white noise, and a directional vector of the predetermined direction, and perform weighted summation on each path of voice signals received by the microphone according to the weight of each path of voice signals received by the microphone with respect to the predetermined direction to determine a beam in the predetermined direction.
In some embodiments, the weight of each path of speech signal received by the microphones relative to the predetermined direction is calculated according to the following formulas:

$$W_m(k)=\left[w_m^{1}(k),\,w_m^{2}(k),\,\dots,\,w_m^{N}(k)\right]^T$$

$$W_m(k)=\frac{R_m^{-1}(k)\,d_m(k)}{d_m^{H}(k)\,R_m^{-1}(k)\,d_m(k)}$$

$$R_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^{H}(k)+(1-\alpha_{psn})\,I$$

where $W_m(k)$ is the weight vector of the paths of speech signal received by the microphones relative to the predetermined direction during the m-th beamforming process, $k$ is the index of the frequency bands of the signals received by the microphones, $R_m(k)$ is the covariance matrix of the noise during the m-th beamforming process, $R_m^{-1}(k)$ is its matrix inverse, $d_m(k)$ is the microphone array steering vector of the predetermined direction during the m-th beamforming process, $d_m^{H}(k)$ is its conjugate transpose, $\alpha_{psn}$ is the proportion of point-source interference noise at the predetermined azimuth in the noise, $1-\alpha_{psn}$ is the proportion of white noise in the noise, $d_{psn,m}(k)$ is the steering vector of the point-source interference noise at the predetermined azimuth during the m-th beamforming process, and $d_{psn,m}^{H}(k)$ is its conjugate transpose.
In some embodiments, the apparatus further comprises: a model training module configured to perform the beamforming process on a voice signal in the plurality of predetermined directions to obtain a plurality of beams, label the beams with keywords as training beams, and input the training beams into the keyword recognition model for training to obtain the pre-trained keyword recognition model.
In some embodiments, the apparatus further comprises: an echo cancellation module configured to perform echo cancellation on the voice signal received through the microphones.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden markov model.
According to still other embodiments of the present disclosure, there is provided a voice wake-up apparatus, including: a memory; and a processor coupled to the memory, the processor configured to perform the voice wake-up method of any of the preceding embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the preceding embodiments.
In the disclosed method, the voice signal is beamformed in multiple directions to obtain multiple beams; the beams are input into the keyword recognition model to obtain the probability that each contains the keyword; a sound source beam is then selected based on that probability and the signal quality of the beams; and whether to wake up the system is determined from the feature matching results of the sound source beams at multiple consecutive moments. The method does not follow the existing localize-then-beamform wake-up process; instead, the beamforming algorithm is decoupled from the sound source localization algorithm, so that localization accuracy no longer affects the beamforming direction, the wake-up accuracy of the voice system is improved, and the user experience is improved.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 illustrates a flow diagram of a voice wake-up method of some embodiments of the present disclosure.
Fig. 2 shows a flow diagram of a voice wake-up method of further embodiments of the present disclosure.
Fig. 3 shows a schematic structural diagram of a voice wake-up apparatus according to some embodiments of the present disclosure.
Fig. 4 shows a schematic structural diagram of a voice wake-up apparatus according to another embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a voice wake-up apparatus according to still other embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of a voice wake-up apparatus according to still other embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the disclosure, its application, or uses. All other embodiments obtained by those skilled in the art from the embodiments disclosed herein without creative effort fall within the protection scope of the present disclosure.
The present disclosure provides a voice wake-up method, and some embodiments of the voice wake-up method of the present disclosure are described below in conjunction with fig. 1.
Fig. 1 is a flow chart of some embodiments of a voice wake-up method of the present disclosure. As shown in fig. 1, the method of this embodiment includes: steps S102 to S108.
In step S102, a voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
A plurality of microphones, i.e., a microphone array, may be arranged on the voice recognition system to be woken up, to receive the user's voice signals. The voice signal may first be preprocessed; for example, echo cancellation may be performed on the signal received by the microphone array, e.g., with an Acoustic Echo Cancellation (AEC) algorithm.
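The disclosure names AEC only generically. As one illustration, a normalized-LMS (NLMS) adaptive filter is a common way to cancel the echo of a known playback (reference) signal from the microphone signal; the sketch below is a minimal example under that assumption, not the specific algorithm used here:

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Cancel the echo of a known reference (loudspeaker) signal from the
    microphone signal with a normalized-LMS adaptive filter.
    mic, ref: 1-D float arrays of equal length."""
    w = np.zeros(taps)            # adaptive echo-path estimate
    out = np.zeros_like(mic)      # echo-cancelled output
    for i in range(taps, len(mic)):
        x = ref[i - taps + 1:i + 1][::-1]  # most recent reference samples
        echo_est = w @ x                   # estimated echo at sample i
        e = mic[i] - echo_est              # residual = near-end speech + noise
        out[i] = e
        w += mu * e * x / (x @ x + eps)    # NLMS coefficient update
    return out
```

With a stationary echo path and a broadband reference, the filter converges quickly and the residual contains mainly the near-end speech.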
The preprocessed voice signal may be beamformed in a predetermined plurality of directions using a phased-array beamforming algorithm. Phased array here means that M directions are preset (evenly distributed on a circle) and the multi-channel voice signals received by the microphone array are weighted and summed M times, forming M voice signals each enhanced toward its own specific direction. For example, M directions uniformly distributed on a predetermined circle may be used as the beamforming directions, i.e., the formed beams point in the M predetermined directions. The beamforming algorithm may be, for example, MVDR (Minimum Variance Distortionless Response), GSC (Generalized Sidelobe Canceller), TF-GSC (Transfer Function Generalized Sidelobe Canceller), and the like. Beamforming in a predetermined plurality of directions can be achieved with existing algorithms and is not described further here.
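To make the phased-array idea concrete (M preset directions evenly spaced on a circle, one weighted combination of the microphone channels per direction), the sketch below uses simple frequency-domain delay-and-sum weights as a stand-in for the MVDR/GSC weights mentioned above. The array geometry and sampling details are illustrative assumptions:

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def steering_delays(mic_xy, azimuth):
    """Per-microphone delays (s) for a far-field source at `azimuth` (rad)."""
    direction = np.array([np.cos(azimuth), np.sin(azimuth)])
    return -(mic_xy @ direction) / C  # mics closer to the source hear it earlier

def delay_and_sum(frames_fft, freqs, mic_xy, azimuth):
    """Phase-align the channels toward one direction and average them.
    frames_fft: (n_mics, n_bins) FFT of one frame per channel."""
    tau = steering_delays(mic_xy, azimuth)
    phase = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])  # undo the delays
    return (frames_fft * phase).mean(axis=0)

def form_beams(frames_fft, freqs, mic_xy, n_beams=6):
    """One enhanced output per preset direction, evenly spaced on a circle."""
    azimuths = 2 * np.pi * np.arange(n_beams) / n_beams
    return [delay_and_sum(frames_fft, freqs, mic_xy, a) for a in azimuths], azimuths
```

The beam steered toward the true source direction adds the channels coherently and so has the highest output power; beams steered elsewhere partially cancel.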
The present disclosure also provides an improved beamforming algorithm, described below.
In some embodiments, the weights of the paths of voice signals received by the microphone relative to the predetermined direction are determined according to the direction of the point source noise, the proportion of the point source noise to the white noise and the directional vector of the predetermined direction; and according to the weight of each path of voice signal received by the microphone relative to the preset direction, carrying out weighted summation on each path of voice signal received by the microphone, and determining the wave beam in the preset direction. Beamforming may be performed according to the following equation.
$$X_n(k,l)=\mathrm{fft}(x_n(t)) \tag{1}$$

In formula (1), $x_n(t)$ is the speech signal received by the n-th microphone, and $\mathrm{fft}(\cdot)$ denotes the Fast Fourier Transform (FFT). $X_n(k,l)$ is the short-time FFT amplitude of $x_n(t)$ in the k-th frequency band of the l-th time period, where $l$ indicates that the speech signal is windowed and split into time periods that are processed separately, and $k$ is the frequency band index of each speech signal after the FFT.

$$Y_m(k,l)=\sum_{n=1}^{N}w_m^{n}(k)\,X_n(k,l) \tag{2}$$

$$y_m(t)=\mathrm{ifft}(Y_m(k,l)) \tag{3}$$

In formulas (2) and (3), $y_m(t)$ is the output signal of the beam formed toward the m-th predetermined azimuth by the phased array, $\mathrm{ifft}(\cdot)$ denotes the inverse FFT, and $Y_m(k,l)$ is the short-time FFT amplitude of $y_m(t)$ in the k-th frequency band of the l-th time period. $w_m^{n}(k,l)$ is the weight of the speech signal received by the n-th microphone in the k-th frequency band of the l-th time period during the m-th beamforming process. As the formulas above show, once $w_m^{n}(k,l)$ is determined, the signal of beam m in the predetermined direction can be determined.

$$W_m(k)=\frac{R_m^{-1}(k)\,d_m(k)}{d_m^{H}(k)\,R_m^{-1}(k)\,d_m(k)} \tag{4}$$

In formula (4), $W_m(k)=\left[w_m^{1}(k),\dots,w_m^{N}(k)\right]^T$ is the weight vector of the received speech signals relative to the predetermined direction during the m-th beamforming process. It is an N-dimensional vector, and the weight vector can be considered the same for every time period, so once $W_m(k)$ is known, $w_m^{n}(k)$, the weight of the speech signal received by the n-th microphone in the k-th frequency band during the m-th beamforming process, is obtained. $R_m(k)$ is the covariance matrix of the noise during the m-th beamforming process, and $R_m^{-1}(k)$ is its matrix inverse. $d_m(k)$ is the microphone array steering vector of the azimuth expected to be enhanced (i.e., the predetermined azimuth) during the m-th beamforming process; it is an N-dimensional column vector set by the predetermined direction, and $d_m^{H}(k)$ is its conjugate transpose.

Further, $R_m(k)$ can be obtained according to formula (5):

$$R_m(k)=\alpha_{psn}\,d_{psn,m}(k)\,d_{psn,m}^{H}(k)+(1-\alpha_{psn})\,I \tag{5}$$

where $\alpha_{psn}$ is the proportion of fixed-azimuth point-source interference noise in the noise, and $1-\alpha_{psn}$ is the proportion of white noise in the noise; $\alpha_{psn}$ may be obtained from testing or experience. $d_{psn,m}(k)$ is the steering vector of the fixed-azimuth point-source interference noise during the m-th beamforming process, and $d_{psn,m}^{H}(k)$ is its conjugate transpose; it may also be obtained from testing or experience.

The beam signal in each predetermined direction can be calculated with the formulas above, and multiple beamforming processes can be executed in parallel to obtain multiple beams.
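The weight computation of formulas (4) and (5) can be sketched directly in NumPy for one frequency band of one beam; the steering vectors and the noise ratio are supplied by the caller (obtained, per the text above, from the array geometry, testing, or experience):

```python
import numpy as np

def mvdr_weights(d_look, d_noise, alpha_psn):
    """Formulas (4)-(5) for one frequency band k of beam m:
    build the noise covariance from a point-source part plus a white-noise
    part, then compute distortionless minimum-variance weights toward the
    look direction. d_look, d_noise: complex steering vectors of length N."""
    N = len(d_look)
    R = alpha_psn * np.outer(d_noise, d_noise.conj()) \
        + (1.0 - alpha_psn) * np.eye(N)            # formula (5)
    Rinv = np.linalg.inv(R)
    num = Rinv @ d_look
    return num / (d_look.conj() @ num)             # formula (4)
```

The resulting weights satisfy the distortionless constraint $d_m^{H}(k)\,W_m(k)=1$ while attenuating the point-source interference direction.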
In step S104, the beam is input to a keyword recognition model trained in advance, and the probability that the beam includes the keyword is obtained.
The voice system decides whether to record subsequent speech and perform speech recognition by recognizing keywords in the voice; that is, whether the voice system is subsequently woken up is determined by detecting keywords in the speech. The keyword recognition model is, for example, a deep learning model or a hidden Markov model. Examples of deep learning models include DNN (Deep Neural Networks), RNN (Recurrent Neural Networks), CRNN (Convolutional Recurrent Neural Networks), and the like. These are all existing models and are not described in detail here. For training, a plurality of beams may be generated according to the embodiment of step S102 and labeled with whether they contain the keyword, to serve as training beams; the training beams are then input into the keyword recognition model for offline training to obtain the pre-trained model. Inputting a beam into the pre-trained keyword recognition model then yields the probability that the beam contains the keyword.
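The trained model (DNN/RNN/CRNN or HMM) can be treated as a black box that maps a beam's features to a keyword probability. The sketch below shows only that interface, with a toy logistic scorer standing in for the real model; the scorer, its features, and its parameters are placeholders, not the model of this disclosure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class KeywordScorer:
    """Stand-in for the pre-trained keyword recognition model: any model
    mapping a beam's feature vector to P(keyword) fits this interface."""
    def __init__(self, w, b):
        self.w, self.b = w, b                 # toy logistic parameters

    def predict_proba(self, features):
        return float(sigmoid(self.w @ features + self.b))

def score_beams(beams_features, model):
    """Probability that each of the M beams contains the wake keyword."""
    return [model.predict_proba(f) for f in beams_features]
```

Each beam is scored independently, so the M scores can also be computed in parallel alongside the M beamforming processes.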
In step S106, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the signal quality of a beam is determined based on at least one of its energy and signal-to-noise ratio within a fixed time window. The higher the energy and the signal-to-noise ratio of a beam within the window, the better its signal quality. For example, the energy and signal-to-noise ratio of a beam within a fixed time window may be calculated, and a weighted sum of the two may be used as the signal quality of the beam. The weights of energy and signal-to-noise ratio can be set according to actual requirements, and the two quantities may be normalized before weighting.
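A minimal sketch of this quality measure, assuming min-max normalization and equal default weights for energy and SNR (both weights are tunable, as the text notes):

```python
import numpy as np

def beam_quality(beams, noise_floor, w_energy=0.5, w_snr=0.5):
    """Signal quality of each beam over a fixed time window: normalized
    energy and SNR combined by a weighted sum.
    beams: (M, T) array of beam samples; noise_floor: per-beam noise power."""
    energy = np.sum(beams ** 2, axis=1)                       # window energy
    snr = 10 * np.log10(np.mean(beams ** 2, axis=1) / noise_floor)

    def norm(v):                                              # min-max normalize
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)

    return w_energy * norm(energy) + w_snr * norm(snr)
```

The returned qualities lie in [0, 1], which makes them directly comparable with the keyword probabilities in the weighted importance sum of the next step.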
In some embodiments, the probability that a beam contains the keyword and the signal quality of the beam are weighted and summed to obtain the importance of the beam; the beam with the highest importance is selected as the sound source beam, and the direction it points to is determined as the sound source direction. The beam in the sound source direction has better signal quality, and the recognized probability that it contains the keyword is higher, so the sound source beam can be selected from the two together. For example, the energy $\mathrm{power}_k'$ and signal-to-noise ratio $\mathrm{SNR}_k'$ of each of the K beams within a fixed time window are calculated and normalized to obtain $\mathrm{power}_k$ and $\mathrm{SNR}_k$; the keyword recognition probability output by the keyword recognition model for the k-th beam is $\mathrm{NNscore}_k$; and the importance of the k-th beam is the weighted sum

$$\mathrm{Importance}_k=\alpha_1\,\mathrm{power}_k+\alpha_2\,\mathrm{SNR}_k+\alpha_3\,\mathrm{NNscore}_k$$

where the weights $\alpha_1,\alpha_2,\alpha_3$ can be set according to actual requirements.
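Selecting the sound source beam then reduces to a weighted sum and an argmax. In the sketch below the qualities are the already-combined energy/SNR scores, and the weight `alpha` balancing quality against the keyword score is an assumed parameter:

```python
import numpy as np

def select_source_beam(nn_scores, qualities, azimuths, alpha=0.5):
    """Weighted sum of keyword probability and signal quality per beam;
    the beam with the highest importance is the sound source beam."""
    importance = alpha * np.asarray(qualities) + (1 - alpha) * np.asarray(nn_scores)
    k = int(np.argmax(importance))
    return k, azimuths[k], importance   # index, sound source direction, scores
```

The direction pointed to by the selected beam is taken as the sound source direction, with no separate localization step.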
in step S108, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
Whether to wake up the system could be determined directly from whether the keyword probability of the sound source beam exceeds a threshold. However, the wake-up accuracy can be further improved through feature matching of the sound source beams at a plurality of consecutive times.
In some embodiments, the sound source directions pointed to by the sound source beams at the current time and a preset number of consecutive preceding times are matched, and it is determined whether the sound source beams at these consecutive times all contain the keyword. When the sound source directions pointed to by the sound source beams at the consecutive times are consistent and the sound source beams at those times all contain the keyword, the system is woken up; otherwise, it is not. That is, whether to wake up the system is confirmed according to the consistency of the results of the keyword recognition and localization decision module at times t-p, t-p+1, ..., t-1, t. If the keyword recognition and localization results at these successive times are consistent, the system is woken up; otherwise it is not.
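A minimal sketch of this consistency check, assuming the per-time results are kept as (direction, contains-keyword) records; the record format and the helper name are assumptions:

```python
def should_wake(history, p):
    """Decide wake-up from the last p + 1 records (times t-p .. t).

    history: list of (direction, contains_keyword) tuples, oldest first.
    Wake only if the sound source direction is consistent across all of
    the last p + 1 records and every one of them contains the keyword.
    """
    if len(history) < p + 1:
        return False
    window = history[-(p + 1):]
    directions = [d for d, _ in window]
    keywords = [k for _, k in window]
    return len(set(directions)) == 1 and all(keywords)
```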
In the method of the above embodiment, the voice signal is beamformed in a plurality of directions to obtain a plurality of beams; the beams are input into the keyword recognition model to recognize the probability that each beam contains the keyword; a sound source beam is selected based on that probability and the signal quality of the beams; and whether to wake up the system is determined according to the feature matching results of the sound source beams at a plurality of consecutive times. Unlike existing methods that first localize the sound source and then perform voice wake-up, this method decouples the beamforming algorithm from the sound source localization algorithm, so that the accuracy of sound source localization no longer affects the direction used by the beamforming algorithm. This improves the wake-up accuracy of the voice system and the user experience.
Further embodiments of the disclosed voice wake-up method are described below in conjunction with fig. 2.
Fig. 2 is a flowchart of another embodiment of a voice wake-up method according to the present disclosure. As shown in fig. 2, the method of this embodiment includes: steps S202 to S214.
In step S202, a speech signal of a user is received through a microphone array.
In step S204, echo cancellation is performed on the multi-path speech signals received by the microphone array.
In step S206, the received voice signal is beamformed in a plurality of predetermined directions, resulting in a plurality of beams.
In step S208, a partial beam is selected according to the signal quality of the beam.
In some embodiments, the signal quality of a beam is determined based on at least one of the energy and the signal-to-noise ratio of the beam within a fixed time window, and the partial beams whose signal quality is higher than a signal quality threshold are selected. For example, a weighted sum of the energy and the signal-to-noise ratio of the beam within a fixed time window determines the signal quality of the beam; the weights can be set according to actual requirements. Concretely, the energy power_k and the signal-to-noise ratio SNR_k of each of the M beams within a fixed time window are calculated and normalized, for example by

power'_k = power_k / max_j power_j,  SNR'_k = SNR_k / max_j SNR_j,  k = 1, 2, ..., M

and the signal quality score of each beam is then computed as a weighted sum, for example

score_k = mu_1 · power'_k + mu_2 · SNR'_k,  k = 1, 2, ..., M

The beams whose score_k is higher than the signal quality threshold, or the beams whose signal quality ranks within a predetermined number of top positions, are then selected.
Selecting the beams with better quality in this way reduces the amount of computation in the subsequent process and improves the system efficiency and the wake-up accuracy.
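The pre-selection step can be sketched as follows; the score weights and the max-normalization are illustrative assumptions, and both selection options described above (score threshold and top-rank cut) are shown:

```python
def select_beams_by_quality(powers, snrs, threshold=None, top_n=None,
                            w_power=0.5, w_snr=0.5):
    """Return indices of beams whose quality score passes the filter.

    Scores are a weighted sum of max-normalized energy and SNR; either a
    score threshold or a top-n rank cut can be used. The weights are
    illustrative defaults, not values fixed by the patent.
    """
    max_p, max_s = max(powers), max(snrs)
    scores = [w_power * p / max_p + w_snr * s / max_s
              for p, s in zip(powers, snrs)]
    if top_n is not None:
        # Keep the top_n best-scoring beams, returned in index order.
        ranked = sorted(range(len(scores)), key=scores.__getitem__,
                        reverse=True)
        return sorted(ranked[:top_n])
    return [i for i, sc in enumerate(scores) if sc > threshold]
```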
In step S210, the selected partial beams are input into a pre-trained keyword recognition model, so as to obtain the probability of the beams including the keywords.
In step S212, a beam pointing to the sound source direction is determined as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In step S214, it is determined whether to wake up the system according to the result of feature matching of the sound source beam at a plurality of consecutive time instances.
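Glued together, steps S204 to S214 form a per-frame pipeline. The callables below are hypothetical stand-ins for the operations described above, used only to show the data flow:

```python
def wake_pipeline(frames, echo_cancel, beamform, quality, keyword_prob,
                  select_source, wake_decision):
    """Run steps S204-S214 over successive frames.

    All callables are hypothetical placeholders for the operations
    described in the corresponding steps.
    """
    history = []
    for frame in frames:
        clean = echo_cancel(frame)                      # S204 echo cancellation
        beams = beamform(clean)                         # S206 beamforming
        kept = [b for b in beams if quality(b)]         # S208 beam pre-selection
        probs = [keyword_prob(b) for b in kept]         # S210 keyword probabilities
        direction, has_kw = select_source(kept, probs)  # S212 sound source beam
        history.append((direction, has_kw))
        if wake_decision(history):                      # S214 wake decision
            return True
    return False
```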
The present disclosure also provides a voice wake-up apparatus, which is described below with reference to fig. 3.
Fig. 3 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 3, the apparatus 30 of this embodiment includes: a beam forming module 302, a keyword recognition module 304, a sound source determination module 306, and a voice wake-up module 308.
The beam forming module 302 is configured to perform beam forming on the voice signal in a plurality of predetermined directions, so as to obtain a plurality of beams.
In some embodiments, the beam forming module 302 is configured to determine the weight of each voice signal received by the microphones with respect to a predetermined direction according to the direction of the point source noise, the ratio of the point source noise to the white noise, and the pointing vector of the predetermined direction; it then performs a weighted summation of the voice signals received by the microphones according to those weights to determine the beam in the predetermined direction.
In some embodiments, beamforming may be performed according to the following formulas, the same as in the previous embodiment.

X_n(k,l) = FFT(x_n(t))    (1)

wherein x_n(t) is the speech signal received by the n-th microphone, and FFT(·) denotes the fast Fourier transform of the speech signal. X_n(k,l) is the short-time spectrum of x_n(t) in the k-th frequency band of the l-th time period, where l indicates that the speech signal is windowed and divided into l time periods that are processed separately, and k is the index of the frequency band of each speech signal after the FFT.

Y_m(k,l) = W_m^H(k) X(k,l)    (2)

y_m(t) = IFFT(Y_m(k,l))    (3)

wherein y_m(t) is the output signal of the beam formed in the m-th predetermined direction by phased-array beamforming, IFFT(·) denotes the inverse fast Fourier transform, Y_m(k,l) is the short-time spectrum of y_m(t) in the k-th frequency band of the l-th time period, and X(k,l) = [X_1(k,l), ..., X_N(k,l)]^T. w_{m,n}(k) is the weight of the voice signal received by the n-th microphone in the k-th frequency band during the m-th beam processing; the weight vector of each time period can be considered the same, so the time index is dropped. As can be seen from the above formulas, once the weight vector W_m(k) = [w_{m,1}(k), ..., w_{m,N}(k)]^T, the N-dimensional vector of the weights of the voice signals received by the microphones relative to the predetermined direction in the m-th beam processing, is determined, the signal of beam m in the predetermined direction can be determined. W_m(k) is calculated as

W_m(k) = R_m^{-1}(k) d_m(k) / (d_m^H(k) R_m^{-1}(k) d_m(k))    (4)

R_m(k) = alpha_psn · d_psn,m(k) d_psn,m^H(k) + (1 - alpha_psn) · I    (5)

wherein R_m(k) is the covariance matrix of the noise during the m-th beam processing, and R_m^{-1}(k) is the inverse of that matrix. d_m(k) is the microphone-array pointing vector of the direction expected to be enhanced (i.e., the predetermined direction) during the m-th beam processing; it is an N-dimensional column vector set by the predetermined direction, and d_m^H(k) is the conjugate transpose of d_m(k). alpha_psn is the proportion of the fixed-azimuth point source interference noise in the noise, and 1 - alpha_psn is the proportion of white noise in the noise; alpha_psn may be obtained from testing or experience. d_psn,m(k) is the pointing vector of the fixed-azimuth point source interference noise during the m-th beam processing, and d_psn,m^H(k) is the conjugate transpose of d_psn,m(k); it may likewise be obtained from testing or experience. I is the identity matrix.
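A sketch of formulas (4) and (5): computing the fixed beamforming weights for one frequency band from the steering vectors and the noise model. The uniform-linear-array geometry, angle sampling, and all parameter values here are assumptions for illustration; the patent obtains alpha_psn and the interference pointing vector from testing or experience:

```python
import numpy as np

def steering_vector(n_mics, spacing_m, freq_hz, angle_deg, c=343.0):
    """Far-field steering vector of a uniform linear array (the ULA
    geometry is an assumption; the patent does not fix an array shape)."""
    theta = np.deg2rad(angle_deg)
    delays = np.arange(n_mics) * spacing_m * np.cos(theta) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def beam_weights(d_m, d_psn, alpha_psn):
    """Formulas (4)-(5): W = R^-1 d / (d^H R^-1 d), with the noise model
    R = alpha_psn * d_psn d_psn^H + (1 - alpha_psn) * I."""
    n = len(d_m)
    R = (alpha_psn * np.outer(d_psn, d_psn.conj())
         + (1 - alpha_psn) * np.eye(n))
    R_inv = np.linalg.inv(R)
    num = R_inv @ d_m
    return num / (d_m.conj() @ R_inv @ d_m)
```

The resulting weights satisfy the distortionless constraint W^H d_m = 1 in the look direction while attenuating the point-source interference direction.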
The keyword recognition module 304 is configured to input the beam into a pre-trained keyword recognition model to obtain a probability that the beam includes the keyword.
In some embodiments, the keyword recognition model comprises: a deep learning model or a hidden markov model.
The sound source determining module 306 is configured to determine a beam pointing to a sound source direction as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam.
In some embodiments, the sound source determining module 306 is configured to perform weighted summation on the probability that the beam includes the keyword and the signal quality of the beam to obtain the importance degree of the beam, select the beam with the highest importance degree as the sound source beam, and determine the direction pointed by the sound source beam as the sound source direction.
The voice wake-up module 308 is configured to determine whether to wake up the system according to the feature matching result of the sound source beam at multiple consecutive time instances.
In some embodiments, the voice wake-up module 308 is configured to match sound source directions pointed by sound source beams at multiple consecutive time instances, and determine whether the sound source beams at the multiple consecutive time instances all contain a keyword, and wake up the system if the sound source directions pointed by the sound source beams at the multiple consecutive time instances are consistent, and the sound source beams at the multiple consecutive time instances all contain the keyword.
Further embodiments of the disclosed voice wake-up apparatus are described below in conjunction with fig. 4.
Fig. 4 is a block diagram of another embodiment of a voice wakeup device according to the present disclosure. As shown in fig. 4, the apparatus 40 of this embodiment includes: an echo cancellation module 402, a beam forming module 404, a beam selection module 406, a keyword recognition module 408, a sound source determination module 410, a voice wake-up module 412, and a model training module 414.
The echo cancellation module 402 is used for performing echo cancellation on a voice signal received through a microphone.
The beam forming module 404 is configured to perform beam forming on the voice signal in a predetermined plurality of directions, so as to obtain a plurality of beams. The beamforming module 404 functions the same as the beamforming module 302.
The beam selection module 406 is configured to select a part of the beams according to the signal quality of the beams, and send the part of the beams to the keyword recognition module, so that the keyword recognition module 408 inputs the received beams into a keyword recognition model trained in advance.
In some embodiments, the beam selection module 406 is configured to determine a signal quality of a beam based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window; and selecting partial beams with the signal quality higher than the signal quality threshold.
The keyword recognition module 408 is configured to input the beam into a pre-trained keyword recognition model to obtain a probability that the beam contains the keyword. The keyword recognition module 408 functions the same as the keyword recognition module 304.
The sound source determining module 410 is configured to determine a beam pointing to a sound source direction as a sound source beam according to the probability that the beam contains the keyword and the signal quality of the beam. The sound source determination module 410 is functionally identical to the sound source determination module 306.
The voice wake-up module 412 is configured to determine whether to wake up the system according to the feature matching result of the sound source beam at multiple consecutive time instances. The voice wakeup module 412 is functionally identical to the voice wakeup module 308.
The model training module 414 is configured to perform a beamforming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams, perform keyword labeling on the plurality of beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
The model training module 414 may also be configured to receive the multiple beams obtained by the beam forming module 404 or the multiple beams obtained by the beam selecting module 406, perform keyword labeling on the multiple beams to obtain training beams, and input the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
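The data-preparation part of the training module can be sketched as follows; `beamform` and `contains_keyword` are hypothetical callables standing in for the beam forming module and the keyword labeling described above:

```python
def build_training_set(utterances, beamform, contains_keyword):
    """Turn raw multi-channel utterances into keyword-labeled training beams.

    beamform(utterance) -> list of beams, one per predetermined direction;
    contains_keyword(utterance) -> bool label from the transcript.
    Both callables are assumptions standing in for the modules above.
    """
    dataset = []
    for utt in utterances:
        label = 1 if contains_keyword(utt) else 0
        for beam in beamform(utt):
            dataset.append((beam, label))  # keyword-labeled training beam
    return dataset
```

The resulting (beam, label) pairs would then be fed to the keyword recognition model for training.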
The voice wake-up apparatuses in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, which are described below in conjunction with fig. 5 and 6.
Fig. 5 is a block diagram of some embodiments of the disclosed voice wake-up apparatus. As shown in fig. 5, the apparatus 50 of this embodiment includes: a memory 510 and a processor 520 coupled to the memory 510, the processor 520 configured to perform a voice wake-up method in any of the embodiments of the present disclosure based on instructions stored in the memory 510.
The memory 510 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader, a database, and other programs.
Fig. 6 is a block diagram of another embodiment of a voice wake-up device according to the present disclosure. As shown in fig. 6, the apparatus 60 of this embodiment includes: a memory 610 and a processor 620, which are similar to the memory 510 and the processor 520, respectively. The apparatus may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices, such as a database server or a cloud storage server. The storage interface 650 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
The present disclosure also provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice wake-up method of any of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (20)

1. A voice wake-up method, comprising:
carrying out beam forming on the voice signals in a plurality of preset directions to obtain a plurality of beams;
inputting the wave beam into a pre-trained keyword recognition model to obtain the probability of the wave beam containing the keyword;
determining a beam pointing to the direction of a sound source as a sound source beam according to the probability that the beam contains the keywords and the signal quality of the beam;
determining whether to wake up the system according to the feature matching results of the sound source beams at a plurality of continuous moments;
wherein, the determining whether to wake up the system according to the feature matching result of the sound source beams at a plurality of continuous moments comprises:
matching sound source directions pointed by sound source beams at a plurality of continuous moments, and determining whether the sound source beams at the plurality of continuous moments contain keywords or not;
and when the sound source directions pointed by the sound source beams at a plurality of continuous moments are consistent and the sound source beams at the plurality of continuous moments all contain keywords, waking up the system.
2. The voice wake-up method of claim 1, wherein,
the inputting the beam into a pre-trained keyword recognition model comprises:
and selecting partial beams to input a pre-trained keyword recognition model according to the signal quality of the beams.
3. The voice wake-up method of claim 2, wherein,
said selecting a portion of beams based on the signal quality of the beams comprises:
determining a signal quality of the beam based on at least one of an energy and a signal-to-noise ratio of the beam within a fixed time window;
and selecting partial beams with the signal quality higher than the signal quality threshold.
4. The voice wake-up method of claim 1, wherein,
determining a beam pointing to a sound source direction according to the probability that the beam contains the keyword and the signal quality of the beam, wherein the determining the beam as the sound source beam comprises:
carrying out weighted summation on the probability that the beam contains the keywords and the signal quality of the beam to obtain the importance degree of the beam;
and selecting the wave beam with the highest importance degree as a sound source wave beam, and determining the direction pointed by the sound source wave beam as the sound source direction.
5. The voice wake-up method of claim 1, wherein,
the beamforming the voice signal in a plurality of predetermined directions to obtain a plurality of beams includes:
determining the weight of each path of voice signals received by a microphone relative to a preset direction according to the direction of point source noise, the proportion of the point source noise to the white noise and a directional vector of the preset direction;
and according to the weight of each path of voice signal received by the microphone relative to the preset direction, carrying out weighted summation on each path of voice signal received by the microphone, and determining the wave beam in the preset direction.
6. The voice wake-up method of claim 5, wherein,
the weight of each path of voice signals received by the microphone relative to the preset direction is calculated according to the following formulas:

W_m(k) = R_m^{-1}(k) d_m(k) / (d_m^H(k) R_m^{-1}(k) d_m(k))

R_m(k) = alpha_psn · d_psn,m(k) d_psn,m^H(k) + (1 - alpha_psn) · I

wherein W_m(k) is the weight vector of each path of voice signal received by the microphone in the m-th beam processing process relative to the preset direction, k is the number of the frequency band of the signals received by the microphone, R_m(k) is the covariance matrix of the noise during the m-th beam processing, R_m^{-1}(k) is the inverse of the matrix, d_m(k) is the microphone array pointing vector of the predetermined direction during the m-th beam processing, d_m^H(k) is the conjugate transpose of d_m(k), alpha_psn is the proportion of the predetermined-azimuth point source interference noise in the noise, 1 - alpha_psn is the proportion of white noise in the noise, d_psn,m(k) is the pointing vector of the predetermined-azimuth point source interference noise during the m-th beam processing, d_psn,m^H(k) is the conjugate transpose of d_psn,m(k), and I is the identity matrix.
7. The voice wake-up method of claim 1 further comprising:
performing a beam forming process on the voice signal in a plurality of predetermined directions to obtain a plurality of beams;
labeling keywords of the multiple beams to serve as training beams;
and inputting the training wave beam into a keyword recognition model for training to obtain a pre-trained keyword recognition model.
8. The voice wake-up method of claim 1, wherein,
before the beamforming the voice signal in a predetermined plurality of directions, the method further comprises:
the voice signal received through the microphone is subjected to echo cancellation.
9. Voice wake-up method according to any of the claims 1 to 8,
the keyword recognition model includes: a deep learning model or a hidden markov model.
10. A voice wake-up apparatus comprising:
the device comprises a beam forming module, a processing module and a processing module, wherein the beam forming module is used for carrying out beam forming on a voice signal in a plurality of preset directions to obtain a plurality of beams;
the keyword identification module is used for inputting the beam into a keyword identification model trained in advance to obtain the probability of the beam containing the keyword;
the sound source determining module is used for determining a wave beam pointing to the sound source direction as a sound source wave beam according to the probability that the wave beam contains the keywords and the signal quality of the wave beam;
the voice awakening module is used for determining whether to awaken the system or not according to the feature matching results of the sound source wave beams at a plurality of continuous moments;
the voice awakening module is used for matching the sound source directions pointed by the sound source beams at a plurality of continuous moments and determining whether the sound source beams at the plurality of continuous moments contain keywords, and awakening the system under the condition that the sound source directions pointed by the sound source beams at the plurality of continuous moments are consistent and the sound source beams at the plurality of continuous moments contain the keywords.
11. The voice wake-up apparatus according to claim 10, further comprising:
and the beam selection module is used for selecting partial beams according to the signal quality of the beams and sending the partial beams to the keyword recognition module so that the keyword recognition module can input the received beams into a keyword recognition model trained in advance.
12. The voice wake-up device of claim 11, wherein,
the beam selection module is used for determining the signal quality of the beam according to at least one of the energy and the signal-to-noise ratio of the beam in a fixed time window; and selecting partial beams with the signal quality higher than the signal quality threshold.
13. The voice wake-up device of claim 10, wherein,
the sound source determining module is used for weighting and summing the probability that the wave beam contains the keywords and the signal quality of the wave beam to obtain the importance degree of the wave beam, selecting the wave beam with the highest importance degree as a sound source wave beam, and determining the direction pointed by the sound source wave beam as the sound source direction.
14. The voice wake-up device of claim 10, wherein,
the beam forming module is used for determining the weight of each path of voice signals received by the microphone relative to the preset direction according to the direction of point source noise, the proportion of the point source noise and white noise and the directional vector of the preset direction, and carrying out weighted summation on each path of voice signals received by the microphone according to the weight of each path of voice signals received by the microphone relative to the preset direction to determine the beam of the preset direction.
15. The voice wake-up device of claim 14, wherein,
the weight of each path of voice signals received by the microphone relative to the preset direction is calculated according to the following formulas:

W_m(k) = R_m^{-1}(k) d_m(k) / (d_m^H(k) R_m^{-1}(k) d_m(k))

R_m(k) = alpha_psn · d_psn,m(k) d_psn,m^H(k) + (1 - alpha_psn) · I

wherein W_m(k) is the weight vector of each path of voice signal received by the microphone in the m-th beam processing process relative to the preset direction, k is the number of the frequency band of the signals received by the microphone, R_m(k) is the covariance matrix of the noise during the m-th beam processing, R_m^{-1}(k) is the inverse of the matrix, d_m(k) is the microphone array pointing vector of the predetermined direction during the m-th beam processing, d_m^H(k) is the conjugate transpose of d_m(k), alpha_psn is the proportion of the predetermined-azimuth point source interference noise in the noise, 1 - alpha_psn is the proportion of white noise in the noise, d_psn,m(k) is the pointing vector of the predetermined-azimuth point source interference noise during the m-th beam processing, d_psn,m^H(k) is the conjugate transpose of d_psn,m(k), and I is the identity matrix.
16. The voice wake-up apparatus according to claim 10, further comprising:
and the model training module is used for carrying out a beam forming process on the voice signal in a plurality of preset directions to obtain a plurality of beams, carrying out keyword labeling on the beams to be used as training beams, and inputting the training beams into the keyword recognition model for training to obtain a pre-trained keyword recognition model.
17. The voice wake-up apparatus according to claim 10, further comprising:
and the echo cancellation module is used for carrying out echo cancellation on the voice signal received by the microphone.
18. Voice wake-up device according to any of the claims 10 to 17,
the keyword recognition model includes: a deep learning model or a hidden markov model.
19. A voice wake-up apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the voice wake-up method of any of claims 1-9 based on instructions stored in the memory.
20. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201810992991.4A 2018-08-29 2018-08-29 Voice wake-up method, apparatus and computer readable storage medium Active CN109272989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810992991.4A CN109272989B (en) 2018-08-29 2018-08-29 Voice wake-up method, apparatus and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810992991.4A CN109272989B (en) 2018-08-29 2018-08-29 Voice wake-up method, apparatus and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109272989A CN109272989A (en) 2019-01-25
CN109272989B true CN109272989B (en) 2021-08-10

Family

ID=65154643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810992991.4A Active CN109272989B (en) 2018-08-29 2018-08-29 Voice wake-up method, apparatus and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109272989B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164446B * 2018-06-28 2023-06-30 Tencent Technology (Shenzhen) Co., Ltd. Speech signal recognition method and device, computer equipment and electronic equipment
CN111627425B * 2019-02-12 2023-11-28 Alibaba Group Holding Ltd. Voice recognition method and system
CN111667843B * 2019-03-05 2021-12-31 Beijing Jingdong Shangke Information Technology Co., Ltd. Voice wake-up method and system for terminal equipment, electronic equipment and storage medium
CN109920433B * 2019-03-19 2021-08-20 Shanghai Huazhen Electronic Technology Co., Ltd. Voice wake-up method for electronic equipment in a noisy environment
CN109949810B * 2019-03-28 2021-09-07 Honor Device Co., Ltd. Voice wake-up method, device, equipment and medium
CN111755021B * 2019-04-01 2023-09-01 Beijing Jingdong Shangke Information Technology Co., Ltd. Voice enhancement method and device based on a dual-microphone array
CN111833901B * 2019-04-23 2024-04-05 Beijing Jingdong Shangke Information Technology Co., Ltd. Audio processing method, device, system and medium
CN112216295B * 2019-06-25 2024-04-26 Dazhong Wenwen (Beijing) Information Technology Co., Ltd. Sound source positioning method, device and equipment
CN110265020B * 2019-07-12 2021-07-06 Elevoc Technology (Shenzhen) Co., Ltd. Voice wake-up method and device, electronic equipment and storage medium
CN110277093B * 2019-07-30 2021-10-26 Tencent Technology (Shenzhen) Co., Ltd. Audio signal detection method and device
CN110517682B * 2019-09-02 2022-08-30 Tencent Technology (Shenzhen) Co., Ltd. Voice recognition method, device, equipment and storage medium
CN110797051A * 2019-10-28 2020-02-14 Xingluo Intelligent Technology Co., Ltd. Wake-up threshold setting method and device, smart speaker and storage medium
CN111276143B * 2020-01-21 2023-04-25 Beijing Yuante Technology Co., Ltd. Sound source positioning method and device, voice recognition control method and terminal equipment
CN111883162B * 2020-07-24 2021-03-23 Yang Handan Wake-up method and device, and computer equipment
CN113257269A * 2021-04-21 2021-08-13 Rockchip Electronics Co., Ltd. Beamforming method based on deep learning, and storage device
CN113284505A * 2021-04-21 2021-08-20 Rockchip Electronics Co., Ltd. Adaptive beamforming method and storage device
CN113782009A * 2021-11-10 2021-12-10 Zhongke Nanjing Intelligent Technology Research Institute Voice wake-up system based on Savitzky-Golay filter smoothing
CN114257684A * 2021-12-17 2022-03-29 GoerTek Technology Co., Ltd. Voice processing method, system and device, and electronic equipment
CN116504264B * 2023-06-30 2023-10-31 Xiaomi Automobile Technology Co., Ltd. Audio processing method, device, equipment and storage medium
CN118151548B * 2023-12-22 2024-09-17 Guangdong Jiayingfeng Intelligent Technology Co., Ltd. Smart home control system and power supply control board

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
WO2013137900A1 * 2012-03-16 2013-09-19 Nuance Communications, Inc. User dedicated automatic speech recognition
CN104936091A * 2015-05-14 2015-09-23 iFLYTEK Co., Ltd. Intelligent interaction method and system based on a circular microphone array
CN106483502A * 2016-09-23 2017-03-08 iFLYTEK Co., Ltd. Sound source localization method and device

Non-Patent Citations (1)

Title
Chao Pan, Jingdong Chen, Jacob Benesty, "Performance Study of the MVDR Beamformer as a Function of the Source Incidence Angle," IEEE Transactions on Audio, Speech and Language Processing, Jan. 2014 *

Also Published As

Publication number Publication date
CN109272989A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109272989B (en) Voice wake-up method, apparatus and computer readable storage medium
Huang et al. Source localization using deep neural networks in a shallow water environment
CN107102296B (en) Sound source positioning system based on distributed microphone array
CN110556103B (en) Audio signal processing method, device, system, equipment and storage medium
Nguyen et al. Robust source counting and DOA estimation using spatial pseudo-spectrum and convolutional neural network
CN109712611B (en) Joint model training method and system
Takeda et al. Discriminative multiple sound source localization based on deep neural networks using independent location model
Varanasi et al. A deep learning framework for robust DOA estimation using spherical harmonic decomposition
Salvati et al. Exploiting CNNs for improving acoustic source localization in noisy and reverberant conditions
CN110503969A Audio data processing method, device and storage medium
CN109509465B (en) Voice signal processing method, assembly, equipment and medium
WO2019080551A1 (en) Target voice detection method and apparatus
CN110610718B (en) Method and device for extracting expected sound source voice signal
Yu et al. Adversarial network bottleneck features for noise robust speaker verification
CN112349297A (en) Depression detection method based on microphone array
WO2022218134A1 (en) Multi-channel speech detection system and method
CN108549052A Circular harmonic domain pseudo sound intensity sound source localization method with joint time-frequency-spatial domain weighting
CN115775564B (en) Audio processing method, device, storage medium and intelligent glasses
CN112712818A (en) Voice enhancement method, device and equipment
CN106019230B Sound source localization method based on i-vector speaker identification
CN113314127A (en) Space orientation-based bird song recognition method, system, computer device and medium
CN118053443A (en) Target speaker tracking method and system with selective hearing
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
Feng et al. Soft label coding for end-to-end sound source localization with ad-hoc microphone arrays
Girin et al. Audio source separation into the wild

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant