WO2005048239A1 - Speech Recognition Device - Google Patents
- Publication number
- WO2005048239A1 (PCT/JP2004/016883)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sound source
- unit
- acoustic model
- sound
- acoustic
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Definitions
- the present invention relates to a speech recognition device, and more particularly, to a speech recognition device capable of recognizing speech with high accuracy even when a speaker or a moving object equipped with the speech recognition device moves.
- Non-Patent Document 1 discloses a technique for performing localization, separation, and recognition of a plurality of sound sources by active audition.
- in this technique, two microphones are placed at positions corresponding to human ears, and when multiple speakers speak at the same time, each spoken word is recognized.
- the position of the speaker is localized from the audio signals input from the two microphones, and the voice of each speaker is separated, and then the voice is recognized.
- an acoustic model of each speaker is created in advance for every 10° direction from −90° to +90° as seen from the moving object (a robot equipped with the speech recognition device). Then, at the time of speech recognition, the recognition process is executed in parallel using these acoustic models.
- Non-Patent Document 1: Kazuhiro Nakadai et al., "A Humanoid Listens to Three Simultaneous Talkers by Integrating Active Audition and Face Recognition," IJCAI-03
- the present invention has been made in view of such a background, and it is an object of the present invention to provide a speech recognition device capable of recognizing with high accuracy even if a speaker or a moving body moves.
- a voice recognition device of the present invention is a voice recognition device that recognizes voice from sound signals detected by a plurality of microphones and converts the voice into character information.
- the device includes: a sound source localization unit that specifies the sound source direction of the specific speaker based on the detected sound signals; a feature extraction unit that extracts features of the speech contained in the sound signals, based on one or more of the sound signals detected by the plurality of microphones; an acoustic model storage unit that stores direction-dependent acoustic models corresponding to a plurality of discrete directions; an acoustic model synthesis unit that synthesizes an acoustic model for the sound source direction identified by the sound source localization unit based on the direction-dependent acoustic models and stores it in the acoustic model storage unit; and a speech recognition unit that performs speech recognition on the features extracted by the feature extraction unit using the acoustic model synthesized by the acoustic model synthesis unit.
- according to this configuration, the sound source localization unit specifies the sound source direction, the acoustic model synthesis unit synthesizes an acoustic model suitable for that direction based on the sound source direction and the direction-dependent acoustic models, and the speech recognition unit performs speech recognition using the synthesized acoustic model.
- the above-described speech recognition device includes a sound source separation unit that separates the sound signal of the specific speaker from the acoustic signal based on the sound source direction specified by the sound source localization unit.
- the feature extraction unit may be configured to extract the features of the audio signal based on the audio signal separated by the sound source separation unit.
- the sound source localization unit specifies the sound source direction
- the sound source separation unit separates only the sound in the sound source direction specified by the sound source localization unit.
- the acoustic model synthesis unit synthesizes an acoustic model suitable for the direction based on the sound source direction and the direction-dependent acoustic model, and the speech recognition unit performs speech recognition using the acoustic model.
- the audio signal output by the sound source separation unit is not limited to the analog audio signal itself; as long as it carries information meaningful as speech, it may also be a digitized or encoded signal, or data of a frequency-analyzed spectrum.
- the sound source localization unit frequency-analyzes the sound signals detected by the microphones, extracts a harmonic structure, and obtains the sound pressure difference and the phase difference of the harmonic structures extracted from the plurality of microphones; the likelihood of each sound source direction is obtained from both the sound pressure difference and the phase difference, and the direction with the highest confidence is determined as the sound source direction.
- when the sound source localization unit uses the sound pressure difference and phase difference of the sound signals detected by the plurality of microphones to specify the sound source direction of the specific speaker, it is possible to use scattering theory, in which the acoustic signals scattered on the surface of the member provided with the microphones are modeled for each sound source direction.
- the sound source separation unit may perform the separation using an active direction-pass filter that passes a narrow directional band when the sound source direction specified by the sound source localization unit is close to the front determined by the arrangement of the plurality of microphones, and a wide directional band when the direction is far from the front.
- the acoustic model synthesis unit is configured to synthesize the acoustic model in the sound source direction by a weighted linear sum of the direction-dependent acoustic models of the acoustic model storage unit.
- the weights used for the linear sum are determined by learning.
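The weighted linear sum above can be sketched as follows; the triangular interpolation weights merely stand in for the learned weights described in the text, and the per-direction (mean, std) pairs are toy values, not the patent's models.

```python
import math

# Hypothetical sketch: synthesize a Gaussian output distribution for an
# arbitrary direction theta as a weighted linear sum of direction-dependent
# models stored at discrete directions (every 30 degrees here).
MODEL_DIRECTIONS = [-90, -60, -30, 0, 30, 60, 90]

# Toy "direction-dependent acoustic models": one (mean, std) pair per direction.
direction_models = {d: (0.01 * d, 1.0 + 0.001 * abs(d)) for d in MODEL_DIRECTIONS}

def weights(theta, width=30.0):
    """Triangular interpolation weights; the patent obtains the weights by
    learning, so this fixed scheme is only a placeholder."""
    w = {d: max(0.0, 1.0 - abs(theta - d) / width) for d in MODEL_DIRECTIONS}
    total = sum(w.values())
    return {d: v / total for d, v in w.items()}

def synthesize(theta):
    w = weights(theta)
    mean = sum(w[d] * direction_models[d][0] for d in MODEL_DIRECTIONS)
    std = sum(w[d] * direction_models[d][1] for d in MODEL_DIRECTIONS)
    return mean, std

mean, std = synthesize(15.0)   # halfway between the 0° and 30° models
print(round(mean, 4), round(std, 4))
```

At 15° the result is the average of the 0° and 30° model parameters, which is the intended interpolation behaviour.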
- the above-described speech recognition device preferably further includes a speaker identification unit that identifies the speaker, wherein the acoustic model storage unit holds a direction-dependent acoustic model for each speaker, and the acoustic model synthesis unit synthesizes the acoustic model for the sound source direction based on the direction-dependent acoustic model of the speaker identified by the speaker identification unit and the sound source direction specified by the sound source localization unit, and stores it in the acoustic model storage unit.
- it is preferable to further include a masking unit that compares the features extracted by the feature extraction unit or the audio signal separated by the sound source separation unit with a template prepared in advance, identifies a region (for example, a frequency domain or a subband) where the difference from the template is larger than a preset threshold, and outputs to the speech recognition unit an index indicating low reliability as a feature of the identified region.
- Another voice recognition device of the present invention is a voice recognition device that recognizes a specific speaker's voice from sound signals detected by a plurality of microphones and converts the voice into character information.
- this device includes: a sound source localization unit that specifies the sound source direction of the specific speaker based on the acoustic signals detected by the microphones; a stream tracking unit that stores the sound source directions specified by the sound source localization unit, estimates the moving direction of the specific speaker, and estimates the current speaker position from the estimated direction; a sound source separation unit that separates the speaker's voice signal from the sound signals based on the current speaker position estimated by the stream tracking unit; a feature extraction unit that extracts features of the sound signal separated by the sound source separation unit; an acoustic model storage unit that stores direction-dependent acoustic models corresponding to a plurality of discrete directions; and an acoustic model synthesis unit that synthesizes an acoustic model for the sound source direction specified by the sound source localization unit based on the direction-dependent acoustic models in the acoustic model storage unit and stores the synthesized acoustic model in the acoustic model storage unit;
- and a speech recognition unit that performs speech recognition on the features extracted by the feature extraction unit using the acoustic model synthesized by the acoustic model synthesis unit and converts them into character information.
- the sound source direction of an acoustic signal emitted from an arbitrary direction is specified, and speech recognition is performed using an acoustic model suitable for the sound source direction.
- FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.
- FIG. 2 is a block diagram illustrating an example of a sound source localization unit.
- FIG. 3 is a diagram illustrating the operation of a sound source localization unit.
- FIG. 4 is a diagram illustrating the operation of a sound source localization unit.
- FIG. 5 is a diagram for describing auditory epipolar geometry.
- FIG. 6 is a graph showing a relationship between a phase difference ⁇ and a frequency f.
- FIG. 7 is a graph showing an example of a head-related transfer function.
- FIG. 8 is a block diagram illustrating an example of a sound source separation unit.
- FIG. 9 is a graph showing an example of a passband function.
- FIG. 10 is a diagram for explaining the operation of a subband selection unit.
- FIG. 11 is a plan view illustrating an example of a pass band.
- FIG. 12 (a) and (b) are block diagrams each showing an example of a feature extraction unit.
- FIG. 13 is a block diagram illustrating an example of an acoustic model synthesis unit.
- FIG. 14 is a diagram showing recognition units and submodels of a direction-dependent acoustic model.
- FIG. 15 is a diagram illustrating the operation of a parameter synthesizing unit.
- FIG. 16 (a) and (b) are graphs each showing an example of the weight W.
- FIG. 17 is a diagram illustrating a learning method of weight W.
- FIG. 18 is a block diagram of a speech recognition device according to a second embodiment.
- FIG. 20 is a block diagram of a speech recognition device according to a third embodiment.
- FIG. 21 is a block diagram of a stream tracking unit.
- FIG. 22 is a graph illustrating a history of sound source directions.
- FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.
- the speech recognition device 1 includes: two microphones M_L and M_R; a sound source localization unit 10 that identifies the direction of the speaker (sound source) from the acoustic signals detected by the microphones; a sound source separation unit 20 that separates the sound coming from the sound source in a specific direction; an acoustic model storage unit 49 storing a plurality of acoustic models for a plurality of directions; an acoustic model synthesis unit 40 that synthesizes an acoustic model for the sound source direction from the acoustic models in the acoustic model storage unit 49 and the sound source direction; a feature extraction unit 30 that extracts the spectral features of the specific sound source separated by the sound source separation unit 20; and a speech recognition unit 50 that performs speech recognition based on the acoustic model synthesized by the acoustic model synthesis unit 40 and the acoustic features extracted by the feature extraction unit 30.
- the sound source separation unit 20 is optionally used.
- the speech recognition unit 50 performs speech recognition using the acoustic model generated by the acoustic model synthesis unit 40 and suitable for the direction of the sound source, so that a high recognition rate is realized.
- the microphones M_L and M_R, the sound source localization unit 10, the sound source separation unit 20, the feature extraction unit 30, the acoustic model synthesis unit 40, and the speech recognition unit 50, which are the components of the speech recognition device 1 according to the embodiment, will each be described below.
- the microphones M_L and M_R are general microphones that detect sound and output it as an electrical signal (acoustic signal). The number of microphones is not limited to two; any two or more may be used, for example three or four.
- the microphones M_L and M_R are mounted on, for example, a robot RB, and their arrangement may be any arrangement in which the speech recognition device 1 can collect sound signals.
- FIG. 2 is a block diagram illustrating an example of the sound source localization unit
- FIGS. 3 and 4 are diagrams illustrating the operation of the sound source localization unit.
- the sound source localization unit 10 localizes each sound source from the two acoustic signals input from the two microphones M_L and M_R.
- the sound source localization method used here relies on the phase difference and the sound pressure difference between the acoustic signals input to the microphones M_L and M_R.
- the sound source localization unit 10 includes a frequency analysis unit 11, a peak extraction unit 12, a harmonic structure extraction unit 13, an IPD calculation unit 14, an IID calculation unit 15, auditory epipolar geometry hypothesis data 16, a confidence calculation unit 17, and a confidence integration unit 18.
- the frequency analysis unit 11 frequency-analyzes the acoustic signals detected by the left and right microphones M_L and M_R.
- the analysis result obtained from the acoustic signal CR1 of the right microphone M_R is the spectrum CR2, and the analysis result obtained from the acoustic signal CL1 of the left microphone M_L is the spectrum CL2.
- the peak extracting unit 12 extracts a series of peaks from the spectra CR2 and CL2 for each of the left and right channels.
- the peaks are extracted either by taking the local peaks of the spectrum as they are or by a method based on spectral subtraction (S. F. Boll, "A spectral subtraction algorithm for suppression of acoustic noise in speech," Proc. ICASSP 1979).
- the harmonic structure extraction unit 13 groups the peaks for each of the left and right channels based on the harmonic structure of each sound source. For example, a specific human voice is composed of a sound at a fundamental frequency and harmonics of that fundamental frequency, so peaks can be grouped per voice. Peaks placed in the same group based on the harmonic structure can be presumed to be signals emitted from the same sound source. For example, if multiple speakers are speaking simultaneously, multiple harmonic structures are extracted.
- peaks P1, P3, and P5 of the peak spectra CR3 and CL3 are grouped into one group to form the harmonic structures CR41 and CL41, and the peaks P2, P4, and P6 are grouped into another group.
- the IPD calculation unit 14 calculates the IPD for the harmonic structures CR41, CR42, and CL41 extracted by the harmonic structure extraction unit 13.
- with the set of peak frequencies included in the harmonic structure corresponding to speaker HMj (for example, the harmonic structure CR41) denoted (f_1, ..., f_K), the corresponding subbands are selected from both the right and left channels (for example, the harmonic structures CR41 and CL41), and the IPD Δφ(f_k) is calculated by the following equation (1).
- the IPD Δφ calculated from the harmonic structures CR41 and CL41 is, for example, the interaural phase difference C51 shown in FIG.
- here, Δφ(f_k) is the IPD of the k-th harmonic f_k included in a certain harmonic structure, and K is the number of harmonics included in the harmonic structure.
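Equation (1) itself is not reproduced in this text, but the IPD of a harmonic is conventionally the difference of the two channels' spectral phases at that subband; the sketch below assumes that definition, including the left-minus-right sign convention.

```python
import cmath

# Hedged sketch of equation (1): the IPD at each harmonic peak f_k is the
# difference of the spectral phases of the two channels at that subband.
# The sign convention (left minus right) is an assumption.
def ipd(spec_left, spec_right, peak_bins):
    """spec_left/spec_right: complex spectra; peak_bins: harmonic bin indices."""
    out = {}
    for k in peak_bins:
        dphi = cmath.phase(spec_left[k]) - cmath.phase(spec_right[k])
        # wrap into (-pi, pi] so phase differences are comparable across bins
        out[k] = cmath.phase(cmath.exp(1j * dphi))
    return out

# A 0.5 rad lag applied to the right channel shows up as an IPD of 0.5
# at every harmonic bin.
left = [cmath.exp(1j * 0.3 * k) for k in range(8)]
right = [cmath.exp(1j * (0.3 * k - 0.5)) for k in range(8)]
print(ipd(left, right, [1, 2, 3]))
```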
- the IID calculation unit 15 calculates, for each harmonic in each harmonic structure, the sound pressure difference between the inputs of the left and right microphones M_L and M_R.
- specifically, for the spectrum subbands corresponding to the harmonics of the peak frequencies f_k included in the harmonic structures corresponding to speaker HMj (for example, the harmonic structures CR41 and CL41), the sound pressure difference Δp(f_k) between the right and left channels is calculated by the following equation (2).
- Δp(f) is, for example, the interaural sound pressure difference C61 shown in FIG.
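Equation (2) is likewise not reproduced here; a common definition of the IID is the log power ratio of the two channels in the corresponding subband, and the sketch below assumes that form.

```python
import math

# Hedged sketch of equation (2): the IID at each harmonic is taken here as the
# log magnitude ratio (in dB) of the two channels in the corresponding
# subband; the exact form of equation (2) is not shown in the text, so this
# definition is an assumption.
def iid(spec_left, spec_right, peak_bins):
    return {k: 20.0 * math.log10(abs(spec_right[k]) / abs(spec_left[k]))
            for k in peak_bins}

# Right channel at half the left channel's magnitude gives about -6 dB.
left = [1.0, 2.0, 4.0, 8.0]
right = [1.0, 1.0, 2.0, 8.0]
print(iid(left, right, [1, 2]))
```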
- the auditory epipolar geometry hypothesis data 16 represent, with the sphere assumed for the head of the robot RB viewed from above, the difference in distance from the sound source S to the microphones M_L and M_R at both ears; assuming the head shape is a sphere, the phase difference Δφ is obtained by the following equation (3).
- ⁇ is the interaural phase difference (IPD)
- V is the speed of sound
- f is the frequency
- r is half the inter-ear distance 2r
- ⁇ indicates the sound source direction.
- according to equation (3), the relationship between the frequency f of an acoustic signal emitted from each sound source direction and the phase difference Δφ is as shown in FIG. 6.
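Equation (3) is garbled out of this text; a common spherical-head form of the auditory epipolar geometry uses the path difference r(θ + sin θ), and the sketch below assumes that form together with illustrative values for v and r.

```python
import math

# Hedged sketch of equation (3): for a spherical head with half inter-ear
# distance r, the auditory-epipolar-geometry path difference is often modeled
# as r * (theta + sin(theta)); the exact form used in the patent may differ,
# so treat this as an assumption.
V_SOUND = 340.0   # speed of sound v [m/s] (assumed value)
R_HEAD = 0.09     # half the inter-ear distance r [m] (assumed value)

def ipd_hypothesis(theta_deg, f):
    theta = math.radians(theta_deg)
    path_diff = R_HEAD * (theta + math.sin(theta))
    return 2.0 * math.pi * f * path_diff / V_SOUND

# As in FIG. 6, for a fixed direction the hypothesis grows linearly with
# frequency, and it vanishes for a source straight ahead.
print(ipd_hypothesis(30.0, 1000.0), ipd_hypothesis(30.0, 2000.0))
```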
- the certainty factor calculator 17 calculates each certainty factor of the IPD and the IID.
- the confidence of the IPD is calculated as a function of θ by evaluating from which direction the harmonics f_k included in the harmonic structure corresponding to speaker HMj (for example, the harmonic structures CR41 and CL41) are likely to have come; the IPD hypothesis (expected value) for f_k is calculated by the following equation (4).
- Δφ_h(θ, f_k) denotes the IPD hypothesis (expected value) for the k-th harmonic f_k in a certain harmonic structure when the sound source direction is θ.
- the sound source direction θ is varied every 5° within a range of ±90°, so a total of 37 hypotheses are calculated; the calculation may instead be performed at finer or coarser angular steps.
- X(θ) = (d(θ) − m) / √(s / n), where m is the mean of d(θ), s is the variance of d(θ), and n is the number of IPD hypotheses (37 in this embodiment).
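The steps above can be sketched as follows; the mapping from the standardized score X(θ) to the confidence B(θ) is not reproduced in this text, so the normal tail probability used below (small distance, high confidence) is an assumption, as are the toy hypotheses.

```python
import math

# Hedged sketch: the distance d(theta) between the measured IPDs and each
# direction hypothesis is standardized as X(theta) = (d(theta) - m)/sqrt(s/n)
# and mapped to a confidence via a normal tail probability (assumed mapping).
THETAS = list(range(-90, 91, 5))   # 37 hypotheses, one every 5 degrees

def distances(measured, hypotheses):
    """measured: {f_k: ipd}; hypotheses: {theta: {f_k: expected ipd}}."""
    return {t: sum((measured[f] - h[f]) ** 2 for f in measured)
            for t, h in hypotheses.items()}

def confidences(d):
    n = len(d)
    m = sum(d.values()) / n
    s = sum((v - m) ** 2 for v in d.values()) / n
    if s == 0.0:
        return {t: 1.0 / n for t in d}
    # smaller distance -> more negative X -> larger confidence
    return {t: 0.5 * math.erfc(((v - m) / math.sqrt(s / n)) / math.sqrt(2.0))
            for t, v in d.items()}

# Toy hypotheses: IPD grows linearly with direction; the input matches 30°.
hyps = {t: {1: 0.01 * t, 2: 0.02 * t} for t in THETAS}
measured = {1: 0.3, 2: 0.6}
conf = confidences(distances(measured, hyps))
print(max(conf, key=conf.get))
```

The hypothesis closest to the measured IPDs receives the highest confidence, which is the behaviour the text describes.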
- the IID confidence is obtained as follows: first, the sum of the sound pressure differences of the harmonics included in the harmonic structure corresponding to speaker HMj is calculated by the following equation (7).
- K indicates the number of harmonics included in the harmonic structure
- ⁇ p (f) is the IID obtained by the IID calculation unit 15 .
- Table 1 shows the values obtained experimentally.
- the reliability B ( ⁇ ) is 0.35 with reference to the upper left column.
- the confidence integration unit 18 integrates the confidence B_IPD(θ) of the IPD and the confidence B_IID(θ) of the IID.
- the direction of the sound source ⁇ ⁇ at which the confidence factor B ( ⁇ ) becomes the largest is defined as the direction in which the speaker HMj is located.
- B_IPD+IID(θ) = 1 − (1 − B_IPD(θ))(1 − B_IID(θ))
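The integration rule quoted above combines the IPD and IID confidences like independent pieces of evidence, so the joint confidence is never lower than either input:

```python
# Sketch of the integration rule: joint confidence for a direction from the
# IPD and IID confidences, combined as independent evidence.
def integrate(b_ipd, b_iid):
    return 1.0 - (1.0 - b_ipd) * (1.0 - b_iid)

print(integrate(0.6, 0.35))   # higher than either input alone
```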
- instead of the hypothesis data using the auditory epipolar geometry described above, hypothesis data using a head-related transfer function or hypothesis data based on scattering theory can be used.
- the head-related transfer function hypothesis data are the phase differences and sound pressure differences between the sounds detected by the microphones M_L and M_R for impulses emitted from positions around the robot.
- the head-related transfer function hypothesis data are measured at an appropriate interval (for example, 5°) between −90° and 90°: the impulse emitted from each direction is detected by the microphones M_L and M_R, and the phase difference and sound pressure difference are obtained.
- the obtained head-related transfer function hypothesis data are as shown for the IPD in FIG. 7(a) and for the IID in FIG. 7(b).
- not only for the IPD but also for the IID, the relationship between the frequency and the IID of sound coming from a given sound source direction is obtained, and distance data d(θ) are created for both the IPD and the IID; the confidences are then found from them.
- the method of creating hypothesis data is the same for IPD and IID.
- d(θ), the distance between each hypothesis and the input, is calculated directly from the measured values shown in FIGS. 7(a) and 7(b).
- Scattering theory is a theory that computes and estimates both IPD and IID, taking into account scattered waves from an object that scatters sound, for example, the head of a robot.
- the object that mainly affects the microphone input among the objects that scatter sound is the robot's head, and this is assumed to be a sphere of radius a.
- the coordinates of the center of the head are set as polar coordinates origins.
- V_s is the potential due to the scattered sound.
- Δφ_S(θ, f) = arg(S_L(θ, f)) − arg(S_R(θ, f))    (13)
- for the IID as well, d(θ) and B(θ) are calculated in the same manner as for the IPD; specifically, Δφ_S is replaced by Δp_S.
- because the effects of sound scattered along the surface of the robot head (for example, sound diffracting around the back of the head) are taken into account, the relationships between the sound source direction and the phase difference and between the sound source direction and the sound pressure difference can be modeled, and the estimation accuracy of the sound source direction improves. In particular, when the sound source is to the side, the power of the sound that travels around the occiput and reaches the microphone on the opposite side of the sound source is relatively large, so the accuracy improves especially in such cases.
- the sound source separation unit 20 separates the sound (voice) signal of each speaker HMj based on the information on each sound source direction localized by the sound source localization unit 10 and the spectrum (for example, the spectrum CR2) calculated by the sound source localization unit. Conventional methods such as beamforming, null forming, peak tracking, directional microphones, and ICA (Independent Component Analysis) can be used as the sound source separation method; here, a method using an active direction-pass filter is described.
- the passband is actively controlled so that the range of directions passed is narrow for a sound source near the front and wide for a sound source far from the front.
- the sound source separation section 20 includes a passband function 21 and a subband selection section 22.
- the passband function 21 is a function relating the sound source direction to the pass bandwidth; since the accuracy of the direction information degrades as the sound source direction moves away from the front (0°), the function is set in advance so that the passband becomes wider as the direction moves away from the front.
- the subband selection unit 22 selects, based on the value of each frequency bin of the spectra CR2 and CL2 (each bin is referred to here as a "subband"), the subbands presumed to have come from the specified direction. As shown in FIG. 10, the subband selection unit 22 takes the spectra CR2 and CL2 of the left and right input sounds generated by the sound source localization unit 10 and, for each subband, calculates the IPD Δφ(f) and the IID Δp(f) using the above equations (1) and (2) (see the interaural phase difference C52 and the interaural sound pressure difference C62 in FIG. 10).
- θ obtained by the sound source localization unit 10 is set as the sound source direction to be extracted, and the pass band B is obtained from the passband function 21.
- the pass band B is illustrated as a direction range in a plan view, for example, as shown in FIG. 11.
- the IPD and IID corresponding to the two edges of the pass band are estimated. For these estimations, a transfer function obtained by measurement or computed using the above-mentioned epipolar geometry, head-related transfer function, scattering theory, or the like is used; the transfer function relates the frequency f to the IPD and IID of a signal coming from a given sound source direction. The estimated IPDs appear, for example, as the two curves in the interaural phase difference C53 in FIG. 10, and the estimated IIDs likewise in the interaural sound pressure difference in FIG. 10.
- the spectrum CR2 is calculated using the transfer function of the robot RB.
- the subscripts attached to the estimated IPD and IID indicate the corresponding pass-band edges.
- f_th is a threshold frequency for determining whether the IPD or the IID is used as the filtering criterion.
- the subbands (shaded areas) whose IID lies between the two estimated IID values for that frequency are selected.
- the spectrum consisting of the selected subbands is referred to herein as the "selected spectrum”.
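Putting the pieces above together, the following sketch selects subbands with an active direction-pass filter; the linear passband function, the spherical-head IPD model, and the IPD-only selection rule are illustrative assumptions (the text also switches to the IID above a threshold frequency).

```python
import math

# Hedged sketch of the active direction-pass filter: the pass range widens
# away from the front, IPDs expected at the pass-band edges bound the
# per-subband decision, and subbands outside the bounds are dropped.
V_SOUND, R_HEAD = 340.0, 0.09   # assumed values

def passband(theta_deg):
    """Pass width in degrees: narrow at the front, wider to the side."""
    return 10.0 + 20.0 * abs(theta_deg) / 90.0

def expected_ipd(theta_deg, f):
    th = math.radians(theta_deg)
    return 2.0 * math.pi * f * R_HEAD * (th + math.sin(th)) / V_SOUND

def select_subbands(measured_ipd, freqs, theta_deg):
    """Keep subband k only if its measured IPD lies between the IPDs expected
    at the two edges of the pass band around theta."""
    half = passband(theta_deg) / 2.0
    kept = []
    for k, f in enumerate(freqs):
        lo = expected_ipd(theta_deg - half, f)
        hi = expected_ipd(theta_deg + half, f)
        if min(lo, hi) <= measured_ipd[k] <= max(lo, hi):
            kept.append(k)
    return kept

freqs = [500.0, 1000.0, 1500.0]
ipds = [expected_ipd(20.0, f) for f in freqs]   # sound truly from 20°
print(select_subbands(ipds, freqs, 20.0))       # all subbands kept
print(select_subbands(ipds, freqs, -60.0))      # none kept
```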
- the sound source separation unit 20 of the present embodiment has been described above.
- as another sound source separation method, a directional microphone can be used: a microphone having narrow directivity is provided on the robot RB, and the directional microphone is pointed in the sound source direction obtained by the sound source localization unit 10.
- the feature extraction unit 30 extracts, from the speech spectrum separated by the sound source separation unit 20 or from the unseparated spectrum CR2 (or CL2) (hereinafter referred to as the "recognition spectrum" when used for speech recognition), the features required for speech recognition.
- as the features, a linear spectrum obtained by frequency-analyzing the voice, a mel frequency spectrum, or mel frequency cepstrum coefficients (MFCC) can be used; in the present embodiment, the case where the MFCC is used is described.
- when the linear spectrum is used, the feature extraction unit 30 performs no processing; when the mel frequency spectrum is used, the cosine transform (described later) is not performed.
- the feature extraction unit 30 includes a logarithmic conversion unit 31, a mel frequency conversion unit 32, and a cosine conversion unit 33.
- the logarithmic converter 31 converts the amplitude of the recognition spectrum selected by the subband selector 22 (see FIG. 8) into a logarithm to obtain a logarithmic spectrum.
- the mel frequency conversion unit 32 passes the logarithmic spectrum generated by the logarithmic conversion unit 31 through a mel-frequency band-pass filter bank to obtain a mel frequency logarithmic spectrum, in which the frequency axis has been converted to the mel scale.
- the cosine transform unit 33 performs a cosine transform on the mel frequency logarithmic spectrum generated by the mel frequency transform unit 32.
- the coefficient obtained by this cosine transform is the MFCC.
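The chain of units 31 to 33 (logarithm, mel filter bank, cosine transform) can be sketched as below; the crude triangular filter bank and the plain DCT-II are common simplifications, not the patent's exact definitions.

```python
import math

# Minimal MFCC sketch: power spectrum -> mel filter bank energies -> log ->
# DCT-II. All sizes and the filter bank construction are illustrative.
def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    max_mel = hz_to_mel(sample_rate / 2.0)
    centers = [mel_to_hz(max_mel * (i + 1) / (n_filters + 1))
               for i in range(-1, n_filters + 1)]
    bins = [int(c * (n_bins - 1) / (sample_rate / 2.0)) for c in centers]
    bank = []
    for i in range(1, n_filters + 1):
        filt = [0.0] * n_bins
        for b in range(bins[i - 1], bins[i + 1] + 1):
            if b <= bins[i]:
                filt[b] = (b - bins[i - 1]) / max(1, bins[i] - bins[i - 1])
            else:
                filt[b] = (bins[i + 1] - b) / max(1, bins[i + 1] - bins[i])
        bank.append(filt)
    return bank

def mfcc(power_spectrum, sample_rate, n_filters=8, n_coeffs=5):
    bank = mel_filterbank(n_filters, len(power_spectrum), sample_rate)
    # unit 32: mel filter bank; unit 31: logarithm
    logmel = [math.log(max(1e-10, sum(f * p for f, p in zip(filt, power_spectrum))))
              for filt in bank]
    # unit 33: DCT-II of the log mel energies yields the MFCCs
    return [sum(logmel[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j in range(n_filters)) for i in range(n_coeffs)]

coeffs = mfcc([1.0] * 64, 16000.0)
print(len(coeffs))
```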
- a masking unit 34 that assigns an index (0 or 1) may optionally be added in or after the feature extraction unit 30.
- the word dictionary 59 holds, for each word, a time-series spectrum corresponding to that word.
- this time-series spectrum is referred to as “word speech spectrum”.
- the word speech spectrum is obtained by performing frequency analysis on the speech that uttered the word in a noisy environment.
- the recognition spectrum is input to the feature extraction unit 30, the word speech spectrum of the word estimated to be included in the input speech is selected from the word dictionary as the assumed speech spectrum.
- the word speech spectrum whose time length is closest to that of the recognition spectrum is taken as the assumed speech spectrum.
- the recognition spectrum and the assumed speech spectrum are converted to MFCC through a logarithmic converter 31, a mel frequency converter 32, and a cosine converter 33, respectively.
- the MFCC of the recognition spectrum is referred to as “recognition MFCC”
- the MFCC of the assumed speech spectrum is referred to as the “assumed MFCC”.
- the masking unit 34 calculates the difference between the recognition MFCC and the assumed MFCC, and assigns, to each element of the MFCC feature vector, 0 if the difference is larger than a threshold assumed in advance and 1 if it is smaller; this is output as an index to the speech recognition unit 50 together with the recognition MFCC.
- multiple assumed speech spectra may be selected instead of only one; alternatively, all word speech spectra may be used without distinction, in which case the index is obtained for every assumed speech spectrum and output to the speech recognition unit 50.
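A minimal sketch of the per-dimension index computed by the masking unit follows, assuming a simple absolute-difference test against a single scalar threshold (the actual threshold and distance measure are not specified in this text):

```python
# Hedged sketch of masking unit 34: per-dimension reliability index from the
# difference between the recognition MFCC and the assumed MFCC. The threshold
# value is illustrative.
def mask_index(recognition_mfcc, assumed_mfcc, threshold=1.0):
    """1 marks a reliable feature dimension, 0 an unreliable one."""
    return [0 if abs(r - a) > threshold else 1
            for r, a in zip(recognition_mfcc, assumed_mfcc)]

print(mask_index([0.2, 3.0, -1.1], [0.1, 0.5, -1.0]))   # prints [1, 0, 1]
```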
- when a directional microphone is used for separation, the spectrum of the separated voice obtained from the directional microphone is obtained by analyzing it with a general frequency analysis method such as an FFT or a band-pass filter bank.
- the acoustic model synthesizing unit 40 is a unit that synthesizes an acoustic model corresponding to each localized sound source direction from the direction-dependent acoustic model stored in the acoustic model storage unit 49.
- the acoustic model synthesis unit 40 includes an inverse cosine conversion unit 41, a linear conversion unit 42, an exponential conversion unit 43, a parameter synthesis unit 44, a logarithmic conversion unit 45, a mel frequency conversion unit 46, and a cosine conversion unit 47, and synthesizes an acoustic model for the direction θ by referring to the direction-dependent acoustic models H(θ_i) stored in the acoustic model storage unit 49.
- the acoustic model storage unit 49 stores, for each direction θ_i relative to the front of the robot RB, a direction-dependent acoustic model H(θ_i), an acoustic model suited to that direction.
- the direction-dependent acoustic model H(θ_i) is obtained by training a hidden Markov model (HMM) on the features of a person's voice uttered from the specific direction θ_i.
- each direction-dependent acoustic model H(θ_i) uses, for example, a phoneme as the recognition unit and stores a submodel h(m, θ_i) corresponding to each phoneme m; the submodels may also be created with other recognition units such as monophones, PTMs, biphones, and triphones.
- the submodel h (m, ⁇ ) has parameters of the number of states, probability density distribution of each state, and state transition probability.
- the number of states of each phoneme is fixed to three: a front part (state 1), an intermediate part (state 2), and a rear part (state 3).
- the probability density distribution is fixed here to a normal distribution, but it may also be a normal distribution or a mixture of one or more other distributions; accordingly, in the present embodiment, the state transition probability P and the normal distribution parameters, that is, the mean μ and the standard deviation σ, are trained.
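The three-state submodel h(m, θ) described above can be represented by a small data structure; the field names and default values below are illustrative, not the patent's.

```python
import dataclasses

# Sketch of the submodel h(m, theta): three states, each with a Gaussian
# output distribution (mean mu, std sigma) and a transition probability P.
@dataclasses.dataclass
class State:
    mu: list          # mean vector of the normal output distribution
    sigma: list       # standard deviation vector
    p_stay: float     # state transition probability P (self-loop)

@dataclasses.dataclass
class SubModel:
    phoneme: str      # recognition unit m (a phoneme in this embodiment)
    theta: float      # direction the model depends on
    states: list      # fixed to three: front, intermediate, rear part

def make_submodel(phoneme, theta, dim=2):
    states = [State([0.0] * dim, [1.0] * dim, 0.5) for _ in range(3)]
    return SubModel(phoneme, theta, states)

m = make_submodel("a", 30.0)
print(m.phoneme, len(m.states))
```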
- the learning data of the sub model h (m, ⁇ ) is created as follows.
- a voice signal consisting of a specific phoneme is emitted toward the robot RB from a loudspeaker (not shown) placed in the direction for which an acoustic model is to be created. The detected acoustic signal is converted into an MFCC by the feature extraction unit 30, and the speech is recognized by the speech recognition unit 50 described later, yielding a probability for each phoneme. The acoustic model is then adaptively trained by giving this result a teacher signal indicating that it is the specific phoneme from the specific direction. This is repeated with sufficiently many kinds of phonemes and words (e.g., from different speakers) to train the submodel.
- when the learning voice is emitted, another voice may be emitted as noise from a direction different from the direction for which the acoustic model is to be created.
- in that case, only the sound from the target direction is separated by the sound source separation unit 20 described above and then converted into an MFCC by the feature extraction unit 30.
- if the acoustic model is intended as a model of an unspecified speaker, it is sufficient to train it with the voices of unspecified speakers; if a model is provided for each specific speaker, training must be performed for each specific speaker.
- the inverse cosine transform unit 41 through the exponential transform unit 43 convert the MFCC of the probability density distribution back into a linear spectrum. In other words, they perform the inverse of the operations of the feature extraction unit 30 on the probability density distribution.
- the inverse cosine transform unit 41 performs an inverse cosine transform on the MFCCs included in the direction-dependent acoustic models H(θ) stored in the acoustic model storage unit 49 to generate a mel logarithmic spectrum.
- the linear transform unit 42 converts the frequency axis of the mel logarithmic spectrum generated by the inverse cosine transform unit 41 into a linear frequency axis to generate a logarithmic spectrum.
- the exponential transform unit 43 applies an exponential transform to the intensity of the logarithmic spectrum generated by the linear transform unit 42 to generate a linear spectrum.
- the linear spectrum is thus obtained as a probability density distribution with mean μ and standard deviation σ.
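The chain of units 41 to 43 (inverse cosine transform, mel-to-linear frequency conversion, exponential transform) can be sketched roughly as below. The mel-scale formula, the band layout, and all parameter values are assumptions, since this passage does not fix them.

```python
import numpy as np

def mel_to_hz(m):
    # Common mel-scale formula (an assumption; the text does not fix one).
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_to_linear_spectrum(mfcc, n_mel=24, fmax_hz=8000.0, n_fft_bins=128):
    """Sketch of units 41-43: MFCC -> mel log spectrum (inverse cosine
    transform) -> log spectrum on a linear frequency axis -> linear
    spectrum (exponential transform)."""
    n_c = len(mfcc)
    # Unit 41: inverse cosine transform (DCT-III synthesis) to a mel log spectrum.
    k = np.arange(n_mel)
    basis = np.cos(np.pi * np.outer(np.arange(n_c), k + 0.5) / n_mel)
    mel_log = mfcc @ basis  # length n_mel

    # Unit 42: resample from mel-spaced to linearly spaced frequencies.
    mel_max = 2595.0 * np.log10(1.0 + fmax_hz / 700.0)
    mel_centers_hz = mel_to_hz((k + 0.5) / n_mel * mel_max)
    lin_hz = np.linspace(mel_centers_hz[0], mel_centers_hz[-1], n_fft_bins)
    log_spec = np.interp(lin_hz, mel_centers_hz, mel_log)

    # Unit 43: exponential transform back to a linear spectrum.
    return np.exp(log_spec)
```

Applying the same chain to the mean μ and standard deviation σ of each state gives the linear-spectrum probability density distribution used below.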
- the parameter synthesizing unit 44 weights the direction-dependent acoustic models H(θ_n) and sums them, thereby synthesizing the acoustic model H(θ) for the sound source direction θ_HMj.
- each submodel in a direction-dependent acoustic model H(θ_n) is converted by the inverse cosine transform unit 41 through the exponential transform unit 43 into a probability density distribution over the linear spectrum, with mean μ_n and standard deviation σ_n.
- the parameter synthesizing unit 44 synthesizes the acoustic model for the sound source direction θ_HMj as a weighted linear sum of the direction-dependent acoustic models H(θ_n): the mean is obtained as μ = Σ_n W_n μ_n and the standard deviation as σ = Σ_n W_n σ_n, where W_n is the weight for direction θ_n. The probability density distribution of the synthesized model is thereby obtained.
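The weighted linear sum performed by the parameter synthesizing unit 44 can be illustrated as follows; normalizing the weights to sum to one is an assumption.

```python
import numpy as np

def synthesize_gaussian_params(weights, means, sigmas):
    """Sketch of the parameter synthesizing unit 44: the mean and the
    standard deviation of the model for the sound source direction are
    formed as the weighted linear sum of the per-direction parameters.

    weights: (n_dirs,) weights W_n for each direction-dependent model
    means, sigmas: (n_dirs, dim) linear-spectrum Gaussian parameters
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one (an assumption)
    mu = w @ np.asarray(means)      # mu    = sum_n W_n * mu_n
    sigma = w @ np.asarray(sigmas)  # sigma = sum_n W_n * sigma_n
    return mu, sigma
```

The synthesized (μ, σ) pair is then converted back into an MFCC by units 45 to 47.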
- the logarithmic transform unit 45 through the cosine transform unit 47 convert the probability density distribution back from the linear spectrum into an MFCC. That is, the logarithmic transform unit 45 performs the same processing as the logarithmic transform unit 31, the mel frequency transform unit 46 the same as the mel frequency transform unit 32, and the cosine transform unit 47 the same as the cosine transform unit 33.
- alternatively, a probability density distribution f(x) may be obtained by the following equation (20) instead of the above-described calculation of the mean μ and the standard deviation σ.
- the parameter synthesizing unit 44 stores the acoustic model thus obtained in the acoustic model storage unit 49.
- the weight W_n depends on each direction θ_n for which a direction-dependent acoustic model H(θ_n) exists; a single weight may be set for all submodels h(m, θ_n) included in H(θ_n), or a separate weight may be set for each submodel h(m, θ_n).
- a function f(θ) that determines the weights W_n when the sound source is in a given direction is set in advance, and the weights for the localized sound source direction θ_HMj are obtained from it.
- as shown in FIG. 16(a), the weights are obtained by shifting f(θ) along the θ axis by the sound source direction θ_HMj.
- let W be the weight of an arbitrary phoneme m when the sound source is in front.
- a phoneme train is emitted from a loudspeaker installed in front, and this is recognized.
- the learning data may be a single phoneme m by itself, but since training with a phoneme sequence in which a plurality of phonemes are connected gives a better learning result, a phoneme sequence is used here.
- FIG. 17 shows the recognition results obtained using the initial values of W: the first line is the recognition result of the synthesized acoustic model H(θ), and the second and subsequent lines are the recognition results of the individual direction-dependent acoustic models H(θ_n); each recognition result is a phoneme sequence.
- for example, the recognition result for the front direction was a phoneme sequence such as [/ /y/m].
- Δd is obtained experimentally and is set to, for example, 0.05. If a model does not recognize the matching phoneme, the weight W of the model corresponding to that direction is reduced by Δd/(n−k). That is,
- the weight of a direction-dependent acoustic model that gives a correct answer is increased, and the weight of one that gives an incorrect answer is decreased.
- whether or not a model is dominant is determined by whether its weight is greater than a predetermined threshold (here, 0.8). If there is no dominant direction-dependent acoustic model H(θ), only the maximum weight is reduced by Δd, and the weights of the other direction-dependent acoustic models H(θ) are increased by Δd/(n−1).
- the process then proceeds to recognition and learning of the next phoneme m′, or learning ends.
- the weight W obtained here becomes f (0).
- f(θ) is obtained by performing this learning for all phonemes and averaging the results.
- if the recognition result does not reach the correct answer even after the procedure is repeated a predetermined number of times (for example, 0.5/Δd times), for example if phoneme m is never recognized successfully, the process moves on to learning the next phoneme, and the weight is updated with the same value as the weight distribution of the phoneme (m′, for example) that was last recognized successfully.
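One plausible reading of the weight-learning step can be sketched as below. Because the passage is partly garbled, the exact redistribution amounts and the clipping to [0, 1] are assumptions.

```python
import numpy as np

DELTA_D = 0.05       # step size, set experimentally in the text
DOMINANT_TH = 0.8    # threshold above which a weight counts as dominant

def update_weights(w, correct):
    """Sketch of one weight-learning step.

    w: (n,) current weights of the direction-dependent models
    correct: (n,) bool, True where the model recognized the phoneme
    """
    w = np.asarray(w, dtype=float).copy()
    correct = np.asarray(correct, dtype=bool)
    n, k = len(w), int(correct.sum())
    if 0 < k < n:
        # Raise the models that answered correctly, lower the rest.
        w[correct] += DELTA_D / k
        w[~correct] -= DELTA_D / (n - k)
    elif not np.any(w > DOMINANT_TH):
        # No dominant model: lower only the largest weight and
        # spread the mass over the others.
        i = int(np.argmax(w))
        w[i] -= DELTA_D
        w[np.arange(n) != i] += DELTA_D / (n - 1)
    return np.clip(w, 0.0, 1.0)
```

Each call keeps the total weight mass constant, which matches the give-and-take update described above.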
- the weight obtained by learning in this manner is stored in the acoustic model storage unit 49.
- the speech recognition unit 50 recognizes the separated input speech of each speaker HMj using the acoustic model H(θ_HMj) synthesized according to the sound source direction θ_HMj; from the extracted character information it recognizes words with reference to the word dictionary 59 and outputs a recognition result. Since this speech recognition method uses a general hidden Markov model, a detailed description is omitted.
- however, the speech recognition unit 50 performs recognition on features containing unreliable components of x, as shown in the following equation (21); recognition using the obtained output probabilities and state transition probabilities is performed in the same way as a general recognition method using a hidden Markov model.
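The idea of recognizing features that contain unreliable components can be illustrated with a missing-feature style likelihood that simply excludes the unreliable dimensions. Equation (21) itself is not reproduced in this text, so this exclusion is only an assumed approximation of it.

```python
import numpy as np

def masked_output_probability(x, reliable, mu, sigma):
    """Sketch of an output probability that ignores unreliable components:
    the Gaussian state likelihood is formed from the reliable dimensions
    of the feature vector x only (an assumed stand-in for equation (21))."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(reliable, dtype=bool)
    mu_r = np.asarray(mu, dtype=float)[r]
    sg_r = np.asarray(sigma, dtype=float)[r]
    z = (x[r] - mu_r) / sg_r
    log_p = -0.5 * np.sum(z * z) - np.sum(np.log(sg_r * np.sqrt(2.0 * np.pi)))
    return float(np.exp(log_p))
```

The resulting output probability is then combined with the state transition probabilities exactly as in ordinary HMM decoding.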
- as shown in FIG. 1, the sounds of the multiple speakers HMj (see FIG. 3) reach the microphones M, M of the robot RB.
- the sound source directions of the acoustic signals detected by the microphones M, M are localized by the sound source localization unit 10.
- the sound source localization unit performs frequency analysis, peak extraction, extraction of the harmonic structure, and calculation of the IPD and IID, and then calculates confidences using hypothesis data based on auditory epipolar geometry. The confidences of the IPD and IID are integrated, and the most probable direction θ is taken as the sound source direction.
- sound source separation is then performed with a pass band corresponding to that direction.
- the feature extraction unit 30 converts the spectrum selected and separated by the sound source separation unit 20 into an MFCC using the logarithmic transform unit 31, the mel frequency transform unit 32, and the cosine transform unit 33.
- the acoustic model synthesizing unit 40 synthesizes an acoustic model considered appropriate for the sound source direction θ_HMj from the direction-dependent acoustic models H(θ_n) stored in the acoustic model storage unit 49 and the sound source direction θ_HMj localized by the sound source localization unit 10. That is, the acoustic model synthesizing unit 40 converts the direction-dependent acoustic models H(θ_n) into linear spectra with the inverse cosine transform unit 41, the linear transform unit 42, and the exponential transform unit 43. The parameter synthesizing unit 44 then synthesizes the acoustic model H(θ_HMj) using the weights W for the sound source direction θ_HMj stored in the acoustic model storage unit 49. The acoustic model H(θ_HMj) represented by this linear spectrum is converted back into an MFCC by the logarithmic transform unit 45, the mel frequency transform unit 46, and the cosine transform unit 47.
- the speech recognition unit 50 performs speech recognition with a hidden Markov model using the acoustic model H(θ_HMj) synthesized by the acoustic model synthesizing unit 40.
- Table 4 shows examples of the results of performing speech recognition in this manner.
- with the speech recognition device 1 of the present embodiment, even when speech is emitted from an arbitrary direction, an acoustic model suited to that direction is synthesized each time, so a high recognition rate can be realized. In addition, since speech from any direction can be recognized, speech recognition of a moving sound source with a high recognition rate is possible even when the mobile object (the robot RB) itself is moving.
- a sound source localization unit 110 that localizes a sound source direction using a peak of a cross-correlation is provided instead of the sound source localization unit 10 of the first embodiment.
- the other parts are the same as in the first embodiment.
- the sound source localization unit 110 includes a frame cutout unit 111, a cross-correlation calculation unit 112, a peak extraction unit 113, and a direction estimation unit 114, as shown in FIG.
- the frame cutout unit 111 cuts out the audio signals input to the left and right microphones M, M at a predetermined time length, for example, 100 msec.
- the clipping process is performed at appropriate time intervals, for example, every 30 msec.
- the cross-correlation calculator 112 calculates the cross-correlation between the sound of the right microphone M and the sound of the left microphone M extracted by the frame cutout unit 111.
- the peak extracting unit 113 extracts peaks from the obtained cross-correlation result. If the number of sound sources is known in advance, that number of peaks is selected from the largest peaks; if the number of sound sources is unknown, either all peaks exceeding a predetermined threshold are extracted, or a predetermined number of peaks is selected in descending order.
- the direction estimation unit 114 calculates the sound source direction θ from the peaks obtained from the sounds input to the right microphone M and the left microphone M.
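The cross-correlation localization of units 112 to 114 can be sketched as follows. The microphone spacing, the speed of sound, and the far-field angle formula sin θ = c·τ/d are assumptions not stated in this passage.

```python
import numpy as np

SOUND_SPEED = 343.0   # m/s (assumed)
MIC_DISTANCE = 0.2    # m between the two microphones (assumed)

def localize_by_cross_correlation(left, right, sample_rate):
    """Sketch of units 112-114: cross-correlate the two microphone
    frames, take the peak lag as the inter-channel time difference,
    and convert it to a direction via the microphone geometry."""
    left = left - np.mean(left)
    right = right - np.mean(right)
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # peak lag in samples
    tau = lag / sample_rate                    # time difference in seconds
    # Far-field assumption: sin(theta) = c * tau / d.
    s = np.clip(SOUND_SPEED * tau / MIC_DISTANCE, -1.0, 1.0)
    return np.degrees(np.arcsin(s))
```

Selecting several of the largest correlation peaks, as unit 113 does, would yield one direction estimate per sound source.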
- the sound source localization unit 110 using such a cross-correlation can likewise estimate the sound source direction θ_HMj, and the acoustic model synthesizing unit 40 described above synthesizes an acoustic model suited to the sound source direction θ_HMj, so the recognition rate can be improved.
- a function of performing speech recognition while confirming that sound continues to come from the same sound source is added. Note that the same parts as those in the first embodiment are denoted by the same reference numerals, and description thereof is omitted.
- in addition to the configuration of the voice recognition device 1 of the first embodiment, the voice recognition device 100 has a stream tracking unit 60 that receives the sound source direction localized by the sound source localization unit 10, tracks the sound source, checks whether sound continues to come from the same sound source, and, if this can be confirmed, outputs the sound source direction to the sound source separation unit 20.
- the stream tracking section 60 has a sound source direction history storage section 61, a prediction section 62, and a comparison section 63.
- the sound source direction history storage unit 61 stores, as shown in FIG. 22, the time, the sound source direction, and the pitch of the sound source recognized at that time (the fundamental frequency f of the harmonic structure of the sound source).
- the prediction unit 62 reads out from the sound source direction history storage unit 61 the history of the sound source direction of the sound source tracked up to just before, and from this history predicts, using a Kalman filter or the like, the stream feature vector (θ, f) consisting of the sound source direction θ and the fundamental frequency f at the current time t1.
- the comparison unit 63 receives from the sound source localization unit 10 the sound source direction θ of each speaker HMj localized at the current time t1 and the fundamental frequency f of the sound source, and compares them with the predicted stream feature vector.
- if they do not match, the localized sound source direction θ is not output to the sound source separation unit 20, so no speech recognition is performed.
- the sound source direction is localized by the sound source localization unit 10, and the sound source direction and the pitch are input to the stream tracking unit 60.
- the prediction unit 62 reads the history of the sound source direction stored in the sound source direction history storage unit 61 and predicts the stream feature vector (θ, f) at the current time t1.
- the comparison unit 63 compares the stream feature vector (θ, f) predicted by the prediction unit 62 with the one obtained by the sound source localization unit 10, and if they match, the sound source direction is output to the sound source separation unit 20.
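The stream tracking check can be sketched as below. A linear extrapolation stands in for the Kalman filter mentioned in the text, and the matching tolerances are assumptions.

```python
# Tolerances for deciding that a localized stream matches the prediction
# (concrete values are assumptions; the text only requires a comparison).
DIR_TOL_DEG = 10.0
F0_TOL_HZ = 20.0

def predict_next(history):
    """Sketch of the prediction unit 62: linearly extrapolate the stream
    feature vector (theta, f) from the last two history entries, assuming
    a uniform frame interval (a simple stand-in for a Kalman filter)."""
    (th0, f0), (th1, f1) = history[-2], history[-1]
    return (2.0 * th1 - th0, 2.0 * f1 - f0)

def matches(predicted, localized):
    """Sketch of the comparison unit 63: pass the localized direction on
    to sound source separation only when it is close to the prediction."""
    dth = abs(predicted[0] - localized[0])
    df = abs(predicted[1] - localized[1])
    return dth <= DIR_TOL_DEG and df <= F0_TOL_HZ
```

Only directions for which `matches` returns True would be forwarded to the sound source separation unit 20.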
- the sound source separation unit 20 performs the same processing as in the first embodiment based on the spectrum data input from the sound source localization unit 10 and the sound source direction θ data output by the stream tracking unit 60.
- the feature extracting unit 30, the acoustic model synthesizing unit 40, and the speech recognizing unit 50 also perform processing in the same manner as in the first embodiment.
- since the voice recognition device 100 of the present embodiment performs speech recognition after confirming that the sound source can be tracked, speech continuously emitted from the same sound source is continuously recognized even when the sound source is moving, and the possibility of erroneous recognition can be reduced. It is particularly suitable when there are a plurality of moving sound sources whose paths cross. In addition, since the sound source direction is stored and predicted, the processing can be reduced by searching for a sound source only within a predetermined range around that direction.
- if the voice recognition device 1 is provided with a camera and a known image recognition device that recognizes a speaker's face and identifies from a database who is talking, and a direction-dependent acoustic model is provided for each speaker, an acoustic model suited to that speaker can be synthesized, so the recognition rate can be further increased.
- alternatively, without using a camera, the speech of speakers registered in advance may be quantized with vector quantization (VQ), the speech separated by the sound source separation unit 20 compared against these, and the speaker at the closest distance output as the identification result.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE602004021716T DE602004021716D1 (de) | 2003-11-12 | 2004-11-12 | Spracherkennungssystem |
JP2005515466A JP4516527B2 (ja) | 2003-11-12 | 2004-11-12 | 音声認識装置 |
US10/579,235 US20090018828A1 (en) | 2003-11-12 | 2004-11-12 | Automatic Speech Recognition System |
EP04818533A EP1691344B1 (en) | 2003-11-12 | 2004-11-12 | Speech recognition system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003-383072 | 2003-11-12 | ||
JP2003383072 | 2003-11-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005048239A1 true WO2005048239A1 (ja) | 2005-05-26 |
Family
ID=34587281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2004/016883 WO2005048239A1 (ja) | 2003-11-12 | 2004-11-12 | 音声認識装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20090018828A1 (ja) |
EP (1) | EP1691344B1 (ja) |
JP (1) | JP4516527B2 (ja) |
DE (1) | DE602004021716D1 (ja) |
WO (1) | WO2005048239A1 (ja) |
Families Citing this family (289)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
DE602005008005D1 (de) * | 2005-02-23 | 2008-08-21 | Harman Becker Automotive Sys | Spracherkennungssytem in einem Kraftfahrzeug |
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US7697827B2 (en) | 2005-10-17 | 2010-04-13 | Konicek Jeffrey C | User-friendlier interfaces for a camera |
JP2007318438A (ja) * | 2006-05-25 | 2007-12-06 | Yamaha Corp | 音声状況データ生成装置、音声状況可視化装置、音声状況データ編集装置、音声データ再生装置、および音声通信システム |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
CN101622663B (zh) * | 2007-03-02 | 2012-06-20 | 松下电器产业株式会社 | 编码装置以及编码方法 |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US10002189B2 (en) | 2007-12-20 | 2018-06-19 | Apple Inc. | Method and apparatus for searching using an active ontology |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US8532802B1 (en) * | 2008-01-18 | 2013-09-10 | Adobe Systems Incorporated | Graphic phase shifter |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
KR101178801B1 (ko) * | 2008-12-09 | 2012-08-31 | 한국전자통신연구원 | 음원분리 및 음원식별을 이용한 음성인식 장치 및 방법 |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
WO2010067118A1 (en) | 2008-12-11 | 2010-06-17 | Novauris Technologies Limited | Speech recognition involving a mobile device |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US20120309363A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Triggering notifications associated with tasks items that represent tasks to perform |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US8175617B2 (en) | 2009-10-28 | 2012-05-08 | Digimarc Corporation | Sensor-based mobile search, related methods and systems |
JP5622744B2 (ja) * | 2009-11-06 | 2014-11-12 | 株式会社東芝 | 音声認識装置 |
US20110125497A1 (en) * | 2009-11-20 | 2011-05-26 | Takahiro Unno | Method and System for Voice Activity Detection |
US9838784B2 (en) | 2009-12-02 | 2017-12-05 | Knowles Electronics, Llc | Directional audio capture |
US8560309B2 (en) * | 2009-12-29 | 2013-10-15 | Apple Inc. | Remote conferencing center |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US8676581B2 (en) * | 2010-01-22 | 2014-03-18 | Microsoft Corporation | Speech recognition analysis via identification information |
US8718290B2 (en) | 2010-01-26 | 2014-05-06 | Audience, Inc. | Adaptive noise reduction using level cues |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
CN102893327B (zh) * | 2010-03-19 | 2015-05-27 | 数字标记公司 | 直觉计算方法和系统 |
US9378754B1 (en) | 2010-04-28 | 2016-06-28 | Knowles Electronics, Llc | Adaptive spatial classifier for multi-microphone systems |
KR101750338B1 (ko) * | 2010-09-13 | 2017-06-23 | 삼성전자주식회사 | 마이크의 빔포밍 수행 방법 및 장치 |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
KR101791907B1 (ko) * | 2011-01-04 | 2017-11-02 | 삼성전자주식회사 | 위치 기반의 음향 처리 장치 및 방법 |
US9047867B2 (en) * | 2011-02-21 | 2015-06-02 | Adobe Systems Incorporated | Systems and methods for concurrent signal recognition |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
JP5479655B2 (ja) * | 2011-07-08 | 2014-04-23 | ゴーアテック インコーポレイテッド | 残留エコーを抑制するための方法及び装置 |
US9435873B2 (en) | 2011-07-14 | 2016-09-06 | Microsoft Technology Licensing, Llc | Sound source localization using phase spectrum |
US8994660B2 (en) | 2011-08-29 | 2015-03-31 | Apple Inc. | Text correction processing |
US9966088B2 (en) * | 2011-09-23 | 2018-05-08 | Adobe Systems Incorporated | Online source separation |
US8879761B2 (en) | 2011-11-22 | 2014-11-04 | Apple Inc. | Orientation-based audio |
BR112014015844A8 (pt) * | 2011-12-26 | 2017-07-04 | Intel Corp | determinação das entradas de áudio e visuais de ocupantes baseada em veículo |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US10417037B2 (en) | 2012-05-15 | 2019-09-17 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US9881616B2 (en) * | 2012-06-06 | 2018-01-30 | Qualcomm Incorporated | Method and systems having improved speech recognition |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US8831957B2 (en) * | 2012-08-01 | 2014-09-09 | Google Inc. | Speech recognition models based on location indicia |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
CN113470640B (zh) | 2013-02-07 | 2022-04-26 | 苹果公司 | 数字助理的语音触发器 |
US9311640B2 (en) | 2014-02-11 | 2016-04-12 | Digimarc Corporation | Methods and arrangements for smartphone payments and transactions |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
WO2014144579A1 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | System and method for updating an adaptive speech recognition model |
CN105027197B (zh) | 2013-03-15 | 2018-12-14 | 苹果公司 | 训练至少部分语音命令系统 |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197336A1 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
DE112014002747T5 (de) | 2013-06-09 | 2016-03-03 | Apple Inc. | Vorrichtung, Verfahren und grafische Benutzerschnittstelle zum Ermöglichen einer Konversationspersistenz über zwei oder mehr Instanzen eines digitalen Assistenten |
CN105265005B (zh) | 2013-06-13 | 2019-09-17 | 苹果公司 | 用于由语音命令发起的紧急呼叫的系统和方法 |
US9536540B2 (en) | 2013-07-19 | 2017-01-03 | Knowles Electronics, Llc | Speech signal separation and synthesis based on auditory scene analysis and speech modeling |
AU2014306221B2 (en) | 2013-08-06 | 2017-04-06 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
JP6148163B2 (ja) * | 2013-11-29 | 2017-06-14 | 本田技研工業株式会社 | 会話支援装置、会話支援装置の制御方法、及び会話支援装置のプログラム |
US10296160B2 (en) | 2013-12-06 | 2019-05-21 | Apple Inc. | Method for extracting salient dialog usage from live data |
US9338761B2 (en) * | 2014-02-26 | 2016-05-10 | Empire Technology Development Llc | Presence-based device mode modification |
US9390712B2 (en) * | 2014-03-24 | 2016-07-12 | Microsoft Technology Licensing, Llc. | Mixed speech recognition |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
AU2015266863B2 (en) | 2014-05-30 | 2018-03-15 | Apple Inc. | Multi-command single utterance input method |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9817634B2 (en) * | 2014-07-21 | 2017-11-14 | Intel Corporation | Distinguishing speech from multiple users in a computer interaction |
WO2016033269A1 (en) * | 2014-08-28 | 2016-03-03 | Analog Devices, Inc. | Audio processing using an intelligent microphone |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
CN107112025A (zh) | 2014-09-12 | 2017-08-29 | 美商楼氏电子有限公司 | 用于恢复语音分量的系统和方法 |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9606986B2 (en) | 2014-09-29 | 2017-03-28 | Apple Inc. | Integrated word N-gram and class M-gram language models |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US9626001B2 (en) | 2014-11-13 | 2017-04-18 | International Business Machines Corporation | Speech recognition candidate selection based on non-acoustic input |
US9881610B2 (en) | 2014-11-13 | 2018-01-30 | International Business Machines Corporation | Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10460227B2 (en) | 2015-05-15 | 2019-10-29 | Apple Inc. | Virtual assistant in a communication session |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
JP6543843B2 (ja) * | 2015-06-18 | 2019-07-17 | Honda Motor Co., Ltd. | Sound source separation device and sound source separation method |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
CN105005027A (zh) * | 2015-08-05 | 2015-10-28 | Zhang Yaguang | Positioning system for a target object within a regional area |
JP6543844B2 (ja) * | 2015-08-27 | 2019-07-17 | Honda Motor Co., Ltd. | Sound source identification device and sound source identification method |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
KR102444061B1 (ko) * | 2015-11-02 | 2022-09-16 | Samsung Electronics Co., Ltd. | Electronic device and method capable of speech recognition |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10264030B2 (en) | 2016-02-22 | 2019-04-16 | Sonos, Inc. | Networked microphone device control |
US9947316B2 (en) | 2016-02-22 | 2018-04-17 | Sonos, Inc. | Voice control of a media playback system |
US9965247B2 (en) | 2016-02-22 | 2018-05-08 | Sonos, Inc. | Voice controlled media playback system based on user profile |
US10509626B2 (en) | 2016-02-22 | 2019-12-17 | Sonos, Inc | Handling of loss of pairing between networked devices |
US9826306B2 (en) | 2016-02-22 | 2017-11-21 | Sonos, Inc. | Default playback device designation |
US10095470B2 (en) | 2016-02-22 | 2018-10-09 | Sonos, Inc. | Audio response playback |
US10142754B2 (en) | 2016-02-22 | 2018-11-27 | Sonos, Inc. | Sensor on moving component of transducer |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
EP3434024B1 (en) * | 2016-04-21 | 2023-08-02 | Hewlett-Packard Development Company, L.P. | Electronic device microphone listening modes |
US9820042B1 (en) | 2016-05-02 | 2017-11-14 | Knowles Electronics, Llc | Stereo separation and directional suppression with omni-directional microphones |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179588B1 (en) | 2016-06-09 | 2019-02-22 | Apple Inc. | INTELLIGENT AUTOMATED ASSISTANT IN A HOME ENVIRONMENT |
US9978390B2 (en) | 2016-06-09 | 2018-05-22 | Sonos, Inc. | Dynamic player selection for audio signal processing |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
US10134399B2 (en) | 2016-07-15 | 2018-11-20 | Sonos, Inc. | Contextualization of voice inputs |
US10152969B2 (en) | 2016-07-15 | 2018-12-11 | Sonos, Inc. | Voice detection by multiple devices |
US10115400B2 (en) | 2016-08-05 | 2018-10-30 | Sonos, Inc. | Multiple voice services |
JP6703460B2 (ja) * | 2016-08-25 | 2020-06-03 | Honda Motor Co., Ltd. | Speech processing device, speech processing method, and speech processing program |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
JP6672114B2 (ja) * | 2016-09-13 | 2020-03-25 | Honda Motor Co., Ltd. | Conversation member optimization device, conversation member optimization method, and program |
US9794720B1 (en) | 2016-09-22 | 2017-10-17 | Sonos, Inc. | Acoustic position measurement |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US9942678B1 (en) | 2016-09-27 | 2018-04-10 | Sonos, Inc. | Audio playback settings for voice interaction |
US9743204B1 (en) | 2016-09-30 | 2017-08-22 | Sonos, Inc. | Multi-orientation playback device microphones |
US10181323B2 (en) | 2016-10-19 | 2019-01-15 | Sonos, Inc. | Arbitration-based voice recognition |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
JP6805037B2 (ja) * | 2017-03-22 | 2020-12-23 | Toshiba Corporation | Speaker search device, speaker search method, and speaker search program |
US11183181B2 (en) | 2017-03-27 | 2021-11-23 | Sonos, Inc. | Systems and methods of multiple voice services |
CN107135443B (zh) * | 2017-03-29 | 2020-06-23 | Lenovo (Beijing) Co., Ltd. | Signal processing method and electronic device |
CN110603587A (zh) * | 2017-05-08 | 2019-12-20 | Sony Corporation | Information processing device |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | USER INTERFACE FOR CORRECTING RECOGNITION ERRORS |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | FAR-FIELD EXTENSION FOR DIGITAL ASSISTANT SERVICES |
US20180336275A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Intelligent automated assistant for media exploration |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10649060B2 (en) | 2017-07-24 | 2020-05-12 | Microsoft Technology Licensing, Llc | Sound source localization confidence estimation using machine learning |
US10475449B2 (en) | 2017-08-07 | 2019-11-12 | Sonos, Inc. | Wake-word detection suppression |
US10847149B1 (en) * | 2017-09-01 | 2020-11-24 | Amazon Technologies, Inc. | Speech-based attention span for voice user interface |
US10048930B1 (en) | 2017-09-08 | 2018-08-14 | Sonos, Inc. | Dynamic computation of system response volume |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10446165B2 (en) | 2017-09-27 | 2019-10-15 | Sonos, Inc. | Robust short-time fourier transform acoustic echo cancellation during audio playback |
US10051366B1 (en) | 2017-09-28 | 2018-08-14 | Sonos, Inc. | Three-dimensional beam forming with a microphone array |
US10621981B2 (en) | 2017-09-28 | 2020-04-14 | Sonos, Inc. | Tone interference cancellation |
US10482868B2 (en) | 2017-09-28 | 2019-11-19 | Sonos, Inc. | Multi-channel acoustic echo cancellation |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10466962B2 (en) | 2017-09-29 | 2019-11-05 | Sonos, Inc. | Media playback system with voice assistance |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
KR102469753B1 (ko) * | 2017-11-30 | 2022-11-22 | Samsung Electronics Co., Ltd. | Method of providing a service based on the location of a sound source, and speech recognition device therefor |
US10880650B2 (en) | 2017-12-10 | 2020-12-29 | Sonos, Inc. | Network microphone devices with automatic do not disturb actuation capabilities |
US10818290B2 (en) | 2017-12-11 | 2020-10-27 | Sonos, Inc. | Home graph |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
WO2019152722A1 (en) | 2018-01-31 | 2019-08-08 | Sonos, Inc. | Device designation of playback and network microphone device arrangements |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
WO2019169616A1 (zh) * | 2018-03-09 | 2019-09-12 | Shenzhen Goodix Technology Co., Ltd. | Speech signal processing method and apparatus |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11175880B2 (en) | 2018-05-10 | 2021-11-16 | Sonos, Inc. | Systems and methods for voice-assisted media content selection |
US10847178B2 (en) | 2018-05-18 | 2020-11-24 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US10959029B2 (en) | 2018-05-25 | 2021-03-23 | Sonos, Inc. | Determining and adapting to changes in microphone performance of playback devices |
DK179822B1 (da) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | VIRTUAL ASSISTANT OPERATION IN MULTI-DEVICE ENVIRONMENTS |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
US10681460B2 (en) | 2018-06-28 | 2020-06-09 | Sonos, Inc. | Systems and methods for associating playback devices with voice assistant services |
US10461710B1 (en) | 2018-08-28 | 2019-10-29 | Sonos, Inc. | Media playback system with maximum volume setting |
US11076035B2 (en) | 2018-08-28 | 2021-07-27 | Sonos, Inc. | Do not disturb feature for audio notifications |
US10878811B2 (en) | 2018-09-14 | 2020-12-29 | Sonos, Inc. | Networked devices, systems, and methods for intelligently deactivating wake-word engines |
US10587430B1 (en) | 2018-09-14 | 2020-03-10 | Sonos, Inc. | Networked devices, systems, and methods for associating playback devices based on sound codes |
CN109298642B (zh) * | 2018-09-20 | 2021-08-27 | Samsung Electronics (China) R&D Center | Method and apparatus for monitoring using a smart speaker |
US11024331B2 (en) | 2018-09-21 | 2021-06-01 | Sonos, Inc. | Voice detection optimization using sound metadata |
US10811015B2 (en) | 2018-09-25 | 2020-10-20 | Sonos, Inc. | Voice detection optimization based on selected voice assistant service |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US10726830B1 (en) * | 2018-09-27 | 2020-07-28 | Amazon Technologies, Inc. | Deep multi-channel acoustic modeling |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11100923B2 (en) | 2018-09-28 | 2021-08-24 | Sonos, Inc. | Systems and methods for selective wake word detection using neural network models |
US10692518B2 (en) | 2018-09-29 | 2020-06-23 | Sonos, Inc. | Linear filtering for noise-suppressed speech detection via multiple network microphone devices |
US11899519B2 (en) | 2018-10-23 | 2024-02-13 | Sonos, Inc. | Multiple stage network microphone device with reduced power consumption and processing load |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
EP3654249A1 (en) | 2018-11-15 | 2020-05-20 | Snips | Dilated convolutions and gating for efficient keyword spotting |
US11183183B2 (en) | 2018-12-07 | 2021-11-23 | Sonos, Inc. | Systems and methods of operating media playback systems having multiple voice assistant services |
US11132989B2 (en) | 2018-12-13 | 2021-09-28 | Sonos, Inc. | Networked microphone devices, systems, and methods of localized arbitration |
US10602268B1 (en) | 2018-12-20 | 2020-03-24 | Sonos, Inc. | Optimization of network microphone devices using noise classification |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
US10867604B2 (en) | 2019-02-08 | 2020-12-15 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing |
US11315556B2 (en) | 2019-02-08 | 2022-04-26 | Sonos, Inc. | Devices, systems, and methods for distributed voice processing by transmitting sound data associated with a wake word to an appropriate device for identification |
CN111627425B (zh) * | 2019-02-12 | 2023-11-28 | Alibaba Group Holding Ltd. | Speech recognition method and system |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11120794B2 (en) | 2019-05-03 | 2021-09-14 | Sonos, Inc. | Voice assistant persistence across multiple network microphone devices |
US11482217B2 (en) * | 2019-05-06 | 2022-10-25 | Google Llc | Selectively activating on-device speech recognition, and using recognized text in selectively activating on-device NLU and/or on-device fulfillment |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | USER ACTIVITY SHORTCUT SUGGESTIONS |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US10586540B1 (en) | 2019-06-12 | 2020-03-10 | Sonos, Inc. | Network microphone device with command keyword conditioning |
US11361756B2 (en) | 2019-06-12 | 2022-06-14 | Sonos, Inc. | Conditional wake word eventing based on environment |
US11200894B2 (en) | 2019-06-12 | 2021-12-14 | Sonos, Inc. | Network microphone device with command keyword eventing |
KR20190089125A (ko) * | 2019-07-09 | 2019-07-30 | LG Electronics Inc. | Communication robot and driving method thereof |
US10871943B1 (en) | 2019-07-31 | 2020-12-22 | Sonos, Inc. | Noise classification for event detection |
US11138975B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
US11138969B2 (en) | 2019-07-31 | 2021-10-05 | Sonos, Inc. | Locally distributed keyword detection |
WO2021056255A1 (en) | 2019-09-25 | 2021-04-01 | Apple Inc. | Text detection using global geometry estimators |
US11189286B2 (en) | 2019-10-22 | 2021-11-30 | Sonos, Inc. | VAS toggle based on device orientation |
US11200900B2 (en) | 2019-12-20 | 2021-12-14 | Sonos, Inc. | Offline voice control |
US11562740B2 (en) | 2020-01-07 | 2023-01-24 | Sonos, Inc. | Voice verification for media playback |
US11556307B2 (en) | 2020-01-31 | 2023-01-17 | Sonos, Inc. | Local voice data processing |
US11308958B2 (en) | 2020-02-07 | 2022-04-19 | Sonos, Inc. | Localized wakeword verification |
US11482224B2 (en) | 2020-05-20 | 2022-10-25 | Sonos, Inc. | Command keywords with input detection windowing |
US11727919B2 (en) | 2020-05-20 | 2023-08-15 | Sonos, Inc. | Memory allocation for keyword spotting engines |
US11308962B2 (en) | 2020-05-20 | 2022-04-19 | Sonos, Inc. | Input detection windowing |
US11698771B2 (en) | 2020-08-25 | 2023-07-11 | Sonos, Inc. | Vocal guidance engines for playback devices |
US11551700B2 (en) | 2021-01-25 | 2023-01-10 | Sonos, Inc. | Systems and methods for power-efficient keyword detection |
CN113838453B (zh) * | 2021-08-17 | 2022-06-28 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Speech processing method, apparatus, device, and computer storage medium |
CN116299179B (zh) * | 2023-05-22 | 2023-09-12 | Beijing Bianfeng Information Technology Co., Ltd. | Sound source localization method, sound source localization apparatus, and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH03274593A (ja) * | 1990-03-26 | 1991-12-05 | Ricoh Co Ltd | In-vehicle speech recognition device |
JPH0844387A (ja) * | 1994-08-04 | 1996-02-16 | Aqueous Res:Kk | Speech recognition device |
JPH11143486A (ja) * | 1997-11-10 | 1999-05-28 | Fuji Xerox Co Ltd | Speaker adaptation device and method |
JP2000066698A (ja) * | 1998-08-19 | 2000-03-03 | Nippon Telegr & Teleph Corp <Ntt> | Sound recognition device |
JP2002041079A (ja) * | 2000-07-31 | 2002-02-08 | Sharp Corp | Speech recognition device, speech recognition method, and program recording medium |
JP2002264051A (ja) * | 2001-03-09 | 2002-09-18 | Japan Science & Technology Corp | Robot audio-visual system |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6471420B1 (en) * | 1994-05-13 | 2002-10-29 | Matsushita Electric Industrial Co., Ltd. | Voice selection apparatus voice response apparatus, and game apparatus using word tables from which selected words are output as voice selections |
US5828997A (en) * | 1995-06-07 | 1998-10-27 | Sensimetrics Corporation | Content analyzer mixing inverse-direction-probability-weighted noise to input signal |
DE69815067T2 (de) * | 1997-12-12 | 2004-02-26 | Philips Intellectual Property & Standards Gmbh | Method for determining model-specific factors for pattern recognition, in particular for speech patterns |
FI116505B (fi) * | 1998-03-23 | 2005-11-30 | Nokia Corp | Method and system for processing directional sound in an acoustic virtual environment |
JP3195920B2 (ja) * | 1999-06-11 | 2001-08-06 | Japan Science and Technology Corporation | Sound source identification and separation device and method |
DE10047718A1 (de) * | 2000-09-27 | 2002-04-18 | Philips Corp Intellectual Pty | Method for speech recognition |
US7076433B2 (en) * | 2001-01-24 | 2006-07-11 | Honda Giken Kogyo Kabushiki Kaisha | Apparatus and program for separating a desired sound from a mixed input sound |
JP2003131683A (ja) * | 2001-10-22 | 2003-05-09 | Sony Corp | Speech recognition device, speech recognition method, program, and recording medium |
JP4195267B2 (ja) * | 2002-03-14 | 2008-12-10 | International Business Machines Corporation | Speech recognition device, speech recognition method, and program |
AU2003274445A1 (en) * | 2002-06-11 | 2003-12-22 | Sony Electronics Inc. | Microphone array with time-frequency source discrimination |
KR100493172B1 (ko) * | 2003-03-06 | 2005-06-02 | Samsung Electronics Co., Ltd. | Microphone array structure, method and apparatus for beamforming with constant directivity using the same, and method and apparatus for estimating sound source direction |
2004
- 2004-11-12 EP: application EP04818533A, published as EP1691344B1, status: Expired - Fee Related
- 2004-11-12 DE: application DE602004021716T, published as DE602004021716D1, status: Active
- 2004-11-12 WO: application PCT/JP2004/016883, published as WO2005048239A1, status: Application Filing
- 2004-11-12 US: application US10/579,235, published as US20090018828A1, status: Abandoned
- 2004-11-12 JP: application JP2005515466A, published as JP4516527B2, status: Expired - Fee Related
Non-Patent Citations (7)
Title |
---|
KAZUHIRO NAKADAI ET AL.: "A humanoid Listens to three simultaneous talkers by Integrating Active Audition and Face Recognition", IJCAI-03 WORKSHOP ON ISSUES IN DESIGNING PHYSICAL AGENTS FOR DYNAMIC REAL-TIME ENVIRONMENTS: WORLD MODELING, PLANNING, LEARNING AND COMMUNICATING, pages 117 - 124 |
NAKADAI ET AL.: "Robot recognizes three simultaneous speech by active audition", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, vol. 1, 14 September 2003 (2003-09-14), pages 398 - 405, XP010667329, DOI: doi:10.1109/ROBOT.2003.1241628 |
NAKADAI K. ET AL.: "Active audition ni yoru fukusu ongen no teii bunri ninshiki" [Localization, separation, and recognition of multiple sound sources by active audition], THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE KENKYUKAI SHIRYO SIG-CHALLENGE-0216-5, 22 November 2002 (2002-11-22), pages 25 - 32, XP002997485 *
NAKADAI K. ET AL.: "Kaisoteki na shichokaku togo to sanran riron o riyo shita robot ni yoru sanwasha doji hatsuwa ninshiki no kojo" [Improving recognition of simultaneous speech by three talkers with a robot using hierarchical audio-visual integration and scattering theory], THE ROBOTICS SOCIETY OF JAPAN GAKUJUTSU KOENKAI YOKOSHU, 20 September 2003 (2003-09-20), XP002997484 *
NAKADAI K. ET AL.: "Robot o taisho to shita sanran riron ni yoru sanwasha doji hatsuwa no teii bunri ninshiki no kojo" [Improving localization, separation, and recognition of simultaneous speech by three talkers using scattering theory for robots], THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE KENKYUKAI SHIRYO SIG-CHALLENGE-0216-5, 13 November 2003 (2003-11-13), pages 33 - 38, XP002997486 *
NAKADAI K. ET AL.: "Robot recognizes three simultaneous speech by active audition", PROC. OF THE 2003 IEEE, 14 September 2003 (2003-09-14), pages 398 - 405, XP010667329 *
See also references of EP1691344A4 |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007264328A (ja) * | 2006-03-28 | 2007-10-11 | Matsushita Electric Works Ltd | Bathroom apparatus and voice operation device used therewith |
JP2009020352A (ja) * | 2007-07-12 | 2009-01-29 | Yamaha Corp | Speech processing device and program |
JP2011146871A (ja) * | 2010-01-13 | 2011-07-28 | Hitachi Ltd | Sound source search device and sound source search method |
WO2015029296A1 (ja) * | 2013-08-29 | 2015-03-05 | Panasonic Intellectual Property Corporation of America | Speech recognition method and speech recognition device |
JPWO2015029296A1 (ja) * | 2013-08-29 | 2017-03-02 | Panasonic Intellectual Property Corporation of America | Speech recognition method and speech recognition device |
US9818403B2 (en) | 2013-08-29 | 2017-11-14 | Panasonic Intellectual Property Corporation Of America | Speech recognition method and speech recognition device |
JP2018517325A (ja) * | 2015-04-09 | 2018-06-28 | SINTEF TTO AS | Speech recognition |
JPWO2019138619A1 (ja) * | 2018-01-09 | 2021-01-28 | Sony Corporation | Information processing device, information processing method, and program |
JP7120254B2 (ja) | 2018-01-09 | 2022-08-17 | Sony Group Corporation | Information processing device, information processing method, and program |
US10880643B2 (en) | 2018-09-27 | 2020-12-29 | Fujitsu Limited | Sound-source-direction determining apparatus, sound-source-direction determining method, and storage medium |
TWI740315B (zh) * | 2019-08-23 | 2021-09-21 | 大陸商北京市商湯科技開發有限公司 | 聲音分離方法、電子設備和電腦可讀儲存媒體 |
CN113576527A (zh) * | 2021-08-27 | 2021-11-02 | Fudan University | Method for determining ultrasonic input using voice control |
Also Published As
Publication number | Publication date |
---|---|
JP4516527B2 (ja) | 2010-08-04 |
JPWO2005048239A1 (ja) | 2007-11-29 |
DE602004021716D1 (de) | 2009-08-06 |
EP1691344A1 (en) | 2006-08-16 |
US20090018828A1 (en) | 2009-01-15 |
EP1691344A4 (en) | 2008-04-02 |
EP1691344B1 (en) | 2009-06-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4516527B2 (ja) | Speech recognition device | |
US10901063B2 (en) | Localization algorithm for sound sources with known statistics | |
Pak et al. | Sound localization based on phase difference enhancement using deep neural networks | |
Srinivasan et al. | Binary and ratio time-frequency masks for robust speech recognition | |
JP5738020B2 (ja) | Speech recognition device and speech recognition method | |
Yamakawa et al. | Environmental sound recognition for robot audition using matching-pursuit | |
Venkatesan et al. | Binaural classification-based speech segregation and robust speaker recognition system | |
Yamamoto et al. | Assessment of general applicability of robot audition system by recognizing three simultaneous speeches | |
Grondin et al. | WISS, a speaker identification system for mobile robots | |
Okuno et al. | Computational auditory scene analysis and its application to robot audition | |
Okuno et al. | Robot audition: Missing feature theory approach and active audition | |
Kallasjoki et al. | Mask estimation and sparse imputation for missing data speech recognition in multisource reverberant environments | |
Venkatesan et al. | Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest | |
KR20090116055A (ko) | Noise mask estimation method using hidden Markov models and apparatus for performing the same | |
Kundegorski et al. | Two-Microphone dereverberation for automatic speech recognition of Polish | |
JP2011081324A (ja) | Speech recognition method using pitch cluster maps | |
Asaei et al. | Verified speaker localization utilizing voicing level in split-bands | |
Habib et al. | Auditory inspired methods for localization of multiple concurrent speakers | |
Lee et al. | Space-time voice activity detection | |
Okuno et al. | Computational auditory scene analysis and its application to robot audition: Five years experience | |
Kim et al. | Target speech detection and separation for humanoid robots in sparse dialogue with noisy home environments | |
Himawan | Speech recognition using ad-hoc microphone arrays | |
Yamamoto et al. | Genetic algorithm-based improvement of robot hearing capabilities in separating and recognizing simultaneous speech signals | |
Mahmoodzadeh et al. | Binaural speech separation based on the time-frequency binary mask | |
Yoon et al. | Acoustic model combination incorporated with mask-based multi-channel source separation for automatic speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
WWE | Wipo information: entry into national phase |
Ref document number: 2005515466 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 10579235 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2004818533 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2004818533 Country of ref document: EP |