CN103069480B - Speech and noise models for speech recognition - Google Patents
- Publication number
- CN103069480B (granted publication of application CN201180026390.4A)
- Authority
- CN
- China
- Prior art keywords
- user
- audio
- sound signal
- model
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
An audio signal generated by a device based on audio input from a user may be received. The audio signal may include at least a user audio portion that corresponds to one or more user utterances recorded by the device. A user speech model associated with the user may be accessed, and a determination may be made that background audio in the audio signal is below a defined threshold. In response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model may be adapted based on the audio signal to generate an adapted user speech model that models speech characteristics of the user. Noise compensation may be performed on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Description
Cross Reference to Related Applications
This application claims priority to U.S. Application Serial No. 12/814,665, titled "SPEECH AND NOISE MODELS FOR SPEECH RECOGNITION" and filed on June 14, 2010, the disclosure of which is incorporated herein by reference.
Technical field
This specification relates to speech recognition.
Background technology
Speech recognition may be used for voice search queries. In general, a search query includes one or more query terms that a user submits to a search engine when the user requests the search engine to perform a search. Among other ways, a user may enter the query terms of a search query by typing on a keyboard or, in the case of a voice query, by speaking the query terms into a microphone of, for example, a mobile device.
When a voice query is submitted through, for example, a mobile device, the microphone of the mobile device may record ambient noise or sounds in addition to the user's spoken utterances, otherwise referred to as "environmental audio" or "background audio." For example, environmental audio may include background chatter or talk of other people located around the user, or noise generated by nature (e.g., a dog barking) or by man-made sources (e.g., office, airport, or roadway noise, or construction activity). The environmental audio may partially obscure the user's voice, making it difficult for an automated speech recognition ("ASR") engine to accurately recognize the spoken utterances.
Summary of the invention
In one aspect, a system includes one or more processing devices and one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the one or more processing devices to: receive an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device; access a user speech model associated with the user; determine that background audio in the audio signal is below a defined threshold; in response to determining that the background audio in the audio signal is below the defined threshold, adapt the accessed user speech model based on the audio signal to generate an adapted user speech model that models the speech characteristics of the user; and perform noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Implementations may include one or more of the following features. For example, the audio signal may include an environmental audio portion that corresponds only to background audio around the user, and to determine that the background audio in the audio signal is below the defined threshold, the instructions may include instructions that, when executed, cause the one or more processing devices to determine an amount of energy in the environmental audio portion; and determine that the amount of energy in the environmental audio portion is below a threshold energy. To determine that the background audio in the audio signal is below the defined threshold, the instructions may include instructions that, when executed, cause the one or more processing devices to determine a signal-to-noise ratio of the audio signal; and determine that the signal-to-noise ratio is below a threshold signal-to-noise ratio. The audio signal may include an environmental audio portion that corresponds only to background audio around the user, and to determine the signal-to-noise ratio of the audio signal, the instructions may include instructions that, when executed, cause the one or more processing devices to determine an amount of energy in the user audio portion of the audio signal; determine an amount of energy in the environmental audio portion of the audio signal; and determine the signal-to-noise ratio as a ratio between the amounts of energy in the user audio portion and the environmental audio portion.
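The energy and signal-to-noise computations described above can be sketched as follows. This is not part of the patent; it is a minimal illustration in Python, and the function names, threshold values, and the direction of the final comparison are assumptions.

```python
import numpy as np

def segment_energy(samples):
    """Mean energy of an audio segment (sequence of float samples)."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples ** 2))

def snr_db(user_portion, ambient_portion):
    """Ratio between the energy in the user audio portion and the
    environment-only portion, expressed in decibels."""
    eps = 1e-12  # guard against a log of zero for silent segments
    return 10.0 * np.log10((segment_energy(user_portion) + eps)
                           / (segment_energy(ambient_portion) + eps))

def background_is_low(user_portion, ambient_portion,
                      energy_threshold=1e-4, snr_threshold_db=15.0):
    """Treat background audio as below the defined threshold when the
    ambient energy falls below a threshold energy, or when the
    user-to-ambient energy ratio is high (illustrative thresholds)."""
    return (segment_energy(ambient_portion) < energy_threshold
            or snr_db(user_portion, ambient_portion) > snr_threshold_db)
```

Either test on its own would suffice for the determination described in the claims; combining both is one plausible design choice.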
The accessed user speech model may include an alternative user speech model that has not yet been adapted to model the speech characteristics of the user. The instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to select the alternative user speech model; and associate the alternative user speech model with the user. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a gender of the user; and select the alternative user speech model from among multiple alternative user speech models based on the gender of the user. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a location of the user when the one or more utterances were recorded; and select the alternative user speech model from among multiple alternative user speech models based on the location of the user when the one or more utterances were recorded. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a language or accent of the user; and select the alternative user speech model from among multiple alternative user speech models based on the language or accent. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to receive an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device; determine similarity measures between multiple alternative user speech models and an expected user speech model for the user determined based on the initial audio signal; and select the alternative user speech model from among the multiple alternative user speech models based on the similarity measures.
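As a rough illustration of the similarity-based selection just described, each alternative model can be summarized by a feature vector and compared against a profile estimated from the initial audio signal. This sketch is not from the patent; the vector summaries, the model names, and the use of cosine similarity are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_alternative_model(candidates, expected_profile):
    """candidates: mapping of model name -> summary feature vector.
    expected_profile: vector estimated from the initial audio signal.
    Returns the name of the most similar alternative model."""
    return max(candidates,
               key=lambda name: cosine_similarity(candidates[name],
                                                  expected_profile))
```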
The instructions may include instructions that, when executed, cause the one or more processing devices to access a noise model associated with the user; and, to perform the noise compensation, the instructions may further include instructions that cause the one or more processing devices to perform noise compensation on the received audio signal using the adapted user speech model and the accessed noise model. To perform the noise compensation, the instructions may further include instructions that cause the one or more processing devices to adapt the accessed noise model based on the received audio signal to generate an adapted noise model that models characteristics of the background audio around the user; and perform noise compensation on the received audio signal using the adapted user speech model and the adapted noise model. The instructions may include instructions that, when executed, cause the one or more processing devices to receive a second audio signal that includes at least a second user audio portion corresponding to one or more user utterances recorded by the device; determine that background audio in the second audio signal is above the defined threshold; and, in response to determining that the background audio in the second audio signal is above the defined threshold, adapt the noise model associated with the user based on the second audio signal to generate an adapted noise model that models characteristics of the background audio around the user. The accessed noise model may include an alternative noise model that has not yet been adapted to model characteristics of the background audio around the user.
The instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to select the alternative noise model; and associate the alternative noise model with the user. To select the alternative noise model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to receive an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device; determine a location of the user when the one or more utterances corresponding to the initial user audio portion were recorded; and select the alternative noise model from among multiple alternative noise models based on the location of the user when the one or more utterances corresponding to the initial user audio portion were recorded.
To select the alternative noise model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to receive an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device; determine similarity measures between multiple alternative noise models and an expected noise model for the user determined based on the initial audio signal; and select the alternative noise model from among the multiple alternative noise models based on the similarity measures. Each of the multiple alternative noise models may model characteristics of background audio in a particular location. Each of the multiple alternative noise models may model characteristics of background audio in a particular kind of environmental condition.
To access the noise model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a location of the user when the one or more utterances were recorded; and select the noise model from among multiple noise models based on the location of the user.
The audio signal may correspond to a voice search query, and the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances; perform a search query using the one or more candidate transcriptions to generate search results; and send the search results to the device.
In another aspect, a system includes a client device and an automated speech recognition system. The client device is configured to send, to the automated speech recognition system, an audio signal that includes at least a user audio portion corresponding to one or more user utterances recorded by the device. The automated speech recognition system is configured to receive the audio signal from the client device; access a user speech model associated with the user; determine that background audio in the audio signal is below a defined threshold; in response to determining that the background audio in the audio signal is below the defined threshold, adapt the accessed user speech model based on the audio signal to generate an adapted user speech model that models the speech characteristics of the user; and perform noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Implementations may include the following features. For example, the automated speech recognition system may be configured to perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances. The system may include a search engine system configured to perform a search query using the one or more candidate transcriptions to generate search results; and send the search results to the client device.
In another aspect, a method includes receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device; accessing a user speech model associated with the user; determining that background audio in the audio signal is below a defined threshold; in response to determining that the background audio in the audio signal is below the defined threshold, adapting the accessed user speech model based on the audio signal to generate an adapted user speech model that models the speech characteristics of the user; and performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an example system that supports voice search queries.
FIG. 2 is a flow chart illustrating an example of a process.
FIG. 3 is a flow chart illustrating another example of a process.
FIG. 4 is a swim lane diagram illustrating an example of a process.
Detailed Description
FIG. 1 shows a schematic diagram of an example of a system 100 that supports voice search queries. The system 100 includes a search engine 106 and an automated speech recognition (ASR) engine 108, which are connected to a set of mobile devices 102a-102c and a mobile device 104 through one or more networks 110. In some implementations, the one or more networks 110 are a wireless cellular network, a wireless local area network (WLAN) or Wi-Fi network, a third generation (3G) mobile telecommunications network, a private network such as an intranet, a public network such as the Internet, or any appropriate combination thereof.
In general, a user of a device (such as the mobile device 104) may speak a search query into the microphone of the mobile device 104. An application running on the mobile device 104 records the user's spoken search query as an audio signal and sends the audio signal to the ASR engine 108 as part of a voice search query. After receiving the audio signal corresponding to the voice search query, the ASR engine 108 may translate or transcribe the user utterances in the audio signal into one or more textual candidate transcriptions, and may supply these candidate transcriptions to the search engine 106 as query terms, thereby supporting the voice search functionality of the mobile device 104. A query term may include one or more whole or partial words, characters, or strings of characters.
The search engine 106 may use the search query terms to provide search results (for example, Uniform Resource Identifiers (URIs) of web pages, images, documents, multimedia files, and so on) to the mobile device 104. For example, a search result may include a URI that references a resource that the search engine determines to be responsive to the search query. Additionally or alternatively, a search result may include other information, such as a title, a preview image, a user rating, a map or directions, a description of the corresponding resource, or a snippet of text that has been automatically or manually extracted from, or is otherwise associated with, the corresponding resource. In some examples, the search engine 106 may include a web search engine used to find references on the Internet, a phone-book-type search engine used to find businesses or individuals, or another specialized search engine (for example, one covering entertainment listings such as restaurant and movie theater information, or medical and pharmaceutical information).
As an example of the operation of the system 100, an audio signal 138 is included in a voice search query sent from the mobile device 104 to the ASR engine 108 over the network 110. The audio signal 138 includes the utterance 140 "Gym New York." The ASR engine 108 receives the voice search query that includes the audio signal 138. The ASR engine 108 processes the audio signal 138 to generate one or more textual candidate transcriptions that match the utterance detected in the audio signal 138, or a ranked set of textual candidate transcriptions 146. For example, the utterance in the audio signal 138 may yield "Gym New York" and "Jim Newark" as candidate transcriptions 146.
The one or more candidate transcriptions 146 generated by the speech recognition system 118 are passed from the ASR engine 108 to the search engine 106 as search query terms. The search engine 106 supplies the search query terms 146 to a search algorithm to generate one or more search results. The search engine 106 provides a set of search results 152 (for example, URIs of web pages, images, documents, multimedia files, and so on) to the mobile device 104.
The mobile device 104 displays the search results 152 in a display area. As shown in the screenshot 158, the utterance "Gym New York" 140 produces three search results 160: "Jim Newark" 160a, "New York Fitness" 160b, and "Manhattan Body Building" 160c. The first search result 160a corresponds to the candidate transcription "Jim Newark" and may, for example, provide a telephone number to the user, or may be used by the mobile device 104 to automatically dial Jim Newark when selected. The latter two search results 160b and 160c correspond to the candidate transcription "Gym New York" and include web page URIs. The candidate transcriptions and/or search results may be ranked based on confidence measures produced by the ASR engine 108, where a confidence measure indicates a level of confidence that a given candidate transcription accurately corresponds to the utterance in the audio signal.
To translate or transcribe the user utterances in an audio signal into one or more textual candidate transcriptions, the ASR engine 108 includes a noise compensation system 116, a speech recognition system 118, and a database 111 that stores noise models 112 and user speech models 114. The speech recognition system 118 performs speech recognition on audio signals to recognize the user utterances in the audio signals and translate those utterances into one or more textual candidate transcriptions. In some implementations, the speech recognition system 118 may generate multiple candidate transcriptions for a given utterance. For example, the speech recognition system 118 may transcribe an utterance into multiple terms and may assign a confidence level associated with each transcription of the utterance.
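The confidence-based ranking of candidate transcriptions described above might look like the following sketch. This is not from the patent; the pair representation is an assumption.

```python
def rank_transcriptions(candidates):
    """candidates: list of (transcription, confidence) pairs produced by the
    recognizer for one utterance. Returns the transcriptions ordered from
    highest to lowest confidence."""
    return [text for text, confidence
            in sorted(candidates, key=lambda pair: pair[1], reverse=True)]
```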
In some implementations, a particular variation of the speech recognition system 118 may be selected for processing an audio signal based on additional contextual information related to the audio signal, and the selected variation may be used to transcribe the utterances in the audio signal. For example, in some implementations, together with the audio signal containing the user utterances, a voice search query may include region or language information used to select a variation of the speech recognition system 118. In a particular example, the region in which the mobile device 104 is registered, or the language setting of the mobile device 104, may be provided to the ASR engine 108 and used by the ASR engine 108 to determine the likely language or accent of the user of the mobile device 104. A variation of the speech recognition system 118 may then be selected and used based on the expected language or accent of the user of the mobile device 104.
The ASR engine 108 may apply the noise compensation system 116 to an audio signal received, for example, from the mobile device 104 before performing speech recognition. The noise compensation system 116 may remove or reduce background or environmental audio in the audio signal to produce a filtered audio signal. Because the microphone of the mobile device 104 may capture environmental audio in addition to the user's utterances, the audio signal may include a mixture of user utterances and environmental audio. The audio signal may therefore include one or more environmental audio portions that contain only environmental audio, as well as a user audio portion that contains the user's utterances (and potentially environmental audio). In general, environmental audio includes any ambient sounds (natural or otherwise) occurring around the user. Environmental audio generally excludes the voice, utterances, or sounds of the user of the mobile device. The speech recognition system 118 may perform speech recognition on the filtered audio signal produced by the noise compensation system 116 to transcribe the user utterances. In some instances, performing speech recognition on the filtered audio signal may produce more accurate transcriptions than performing speech recognition directly on the received audio signal.
For a given audio signal, the noise compensation system 116 uses one of the noise models 112 and one of the user speech models 114 stored in the database 111 to remove or reduce the background or environmental audio in the audio signal. The noise models 112 include alternative noise models 120a and adapted noise models 120b. Similarly, the user speech models 114 include alternative user speech models 126a and adapted user speech models 126b. In general, an adapted noise model 120b and an adapted user speech model 126b are specific to a particular user and have been adapted to that user based on audio signals received from that user with previous voice search queries. When there is no adapted noise model or adapted user speech model for the particular user submitting the current voice search query, an alternative noise model 120a or an alternative user speech model 126a, respectively, is used.
In some instances, the performance of the noise compensation system 116 can be improved by using an adapted user speech model that has been trained on, or otherwise adapted to, the specific voice characteristics of the particular user submitting voice search queries. However, to adapt a speech model to a specific user, samples of that user's voice may be needed. In an environment such as the system 100, those samples may not be readily available at first. Therefore, in one implementation, if there is no adapted user speech model for the user when the user initially sends a voice search query (or for some other reason), the ASR engine 108 selects an alternative user speech model from among one or more alternative user speech models 126a. The selected alternative user speech model may be one determined to be a reasonable approximation of the speech characteristics of the user. The selected alternative user speech model is used to perform noise compensation on the initial audio signal. As the user submits subsequent voice search queries, some or all of the audio signals sent with those subsequent queries are used to train or adapt the selected alternative user speech model into an adapted user speech model that is specific to the user (that is, one that models the speech characteristics of the user), which is then used for noise compensation on those subsequent audio signals.
Such as, in one implementation, when receiving sound signal subsequently, ASR 108 determines whether environment or background audio are under specific threshold.If under specific threshold, then this sound signal be used for by alternative user speech model adaptation in or further adapt user speech model is adapted to specific user.If background audio is on threshold value, then sound signal is not used in adapt user speech model and (but may be used for adaptive noise model, as mentioned below).
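The routing decision in this paragraph (low background adapts the speech model, high background feeds the noise model instead) can be sketched as follows; the function names and the callback style are assumptions, not the patent's implementation.

```python
def route_adaptation(audio_signal, background_level, threshold,
                     adapt_speech_model, adapt_noise_model):
    """Use an incoming audio signal to adapt exactly one model:
    the user speech model when background audio is below the threshold,
    otherwise the noise model. Returns which model was adapted."""
    if background_level < threshold:
        adapt_speech_model(audio_signal)
        return "speech"
    adapt_noise_model(audio_signal)
    return "noise"
```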
A user speech model (whether an alternative user speech model 126a or an adapted user speech model 126b) may be implemented, for example, as a hidden Markov model (HMM) or a Gaussian mixture model (GMM). The user speech models may be trained, or otherwise adapted, using an expectation-maximization algorithm.
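As a toy illustration of the GMM/expectation-maximization approach named above, the sketch below fits a one-dimensional Gaussian mixture with EM. Real systems model multidimensional acoustic features (e.g., MFCC vectors); the one-dimensional setting, the quantile initialization, and the fixed iteration count are simplifying assumptions.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture model with EM.
    Returns (weights, means, variances) as numpy arrays."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Initialize means at spread-out quantiles of the data
    means = np.quantile(x, np.linspace(0.05, 0.95, k))
    variances = np.full(k, np.var(x) + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = (weights / np.sqrt(2.0 * np.pi * variances)
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances
```

Adapting an existing model to a new user could then amount to running further EM iterations seeded from the alternative model's parameters rather than from scratch.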
In some implementations, the user may be positively identified. For example, some implementations may prompt the user for identification before accepting a search query. Other implementations may use other available information to implicitly identify the user, such as the user's typing patterns or movement patterns (for example, as detected by an accelerometer that forms part of the device). When the user can be specifically identified, the adapted user speech model may be indexed by a user identifier corresponding to the identified user.
In other implementations, the user may not be specifically identified. In that case, the device used to enter the voice search query (such as the mobile device 104) may be used as a proxy for the particular user, and the adapted user speech model may be indexed based on a device identifier corresponding to the device used to submit the voice search query. In environments in which there is typically a single or primary user of a device, such as when a mobile phone is used as the input device, developing an adapted user speech model on a per-device basis may provide a speech model acceptable for meeting the performance constraints imposed by the noise compensation system 116 in particular, or the ASR engine 108 more generally.
In the same way that the performance of the noise compensation system 116 can be improved by an adapted user speech model, its performance can also be improved by using a noise model that has been trained on, or otherwise adapted to, the environmental audio typically around the user. As with speech samples, in an environment such as the system 100, samples of the environmental audio typically around the user may not be readily available at first. Therefore, in one implementation, if there is no adapted noise model for the user when the user initially sends a voice search query (or for some other reason), the ASR engine 108 selects an alternative noise model from among one or more alternative noise models 120a. The selected alternative noise model may be one determined, based on known or determined information, to be a reasonable approximation of the expected environmental audio around the user. The selected alternative noise model is used to perform noise compensation on the initial audio signal. As the user submits subsequent voice search queries, some or all of the audio signals sent with those queries are used to adapt the selected alternative noise model into an adapted noise model that is specific to the user (that is, one that models the characteristics of the environmental sounds typically around the user when submitting search queries), which is then used for noise compensation on those subsequent audio signals.
Such as, in one implementation, when receiving sound signal subsequently, ASR 108 determines whether environment or background audio are under specific threshold.If not under specific threshold, then this sound signal is used for being adapted to by alternative noise model or further adaptive noise model being adapted to specific user.In some implementation, no matter whether background audio is on specific threshold, and the sound signal of reception may be used to adaptively substitute noise model or adaptive noise model.
In some implementations, to help ensure that a sample of the environmental audio free of user utterances is obtained and can be used to adapt the noise model, the voice search application on the mobile device 104 may begin recording before the user speaks the search query and/or may continue recording after the user has finished speaking the search query. For example, the voice search application may capture two seconds of audio before or after the user speaks the search query to help ensure that a sample of the environmental audio is obtained.
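The padding strategy just described might be implemented along these lines. This is an illustrative sketch, not the patent's code, and the sample-index interface is an assumption.

```python
def split_recording(samples, sample_rate, speech_start, speech_end,
                    pad_seconds=2.0):
    """Split a recording into environment-only and user audio segments,
    assuming the app recorded pad_seconds of audio before and after the
    spoken query. speech_start/speech_end are sample indices bounding
    the detected utterance."""
    pad = int(pad_seconds * sample_rate)
    ambient = (samples[max(0, speech_start - pad):speech_start]
               + samples[speech_end:speech_end + pad])
    user = samples[speech_start:speech_end]
    return ambient, user
```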
In some implementations, a single alternative noise model may be selected and adapted into a single adapted noise model for the user across the various environments in which the user uses the voice search application. In other implementations, however, adapted noise models may be developed for the various locations the user frequents when using the voice search application. For example, different noise models may be developed for different locations and stored as alternative noise models 120a. When a voice search query is submitted, the location of the user may be sent to the ASR engine 108 by the mobile device 104, or the location of the user may be determined by other means at the time the voice search query is submitted. When an initial audio signal is received for a given location, the alternative noise model for that location may be selected, and as other voice search queries are received from that location, the associated audio signals may be used to adapt that particular noise model. This may occur for each of the various locations at which the user performs voice search queries, thereby producing multiple adapted noise models for the user, each of which is specific to a certain location. After a defined period of non-use (for example, the user has not performed a voice search at that location for a certain time), a location-specific noise model may be deleted.
The location of the user when submitting a voice search query, the location associated with a given noise model, and the location associated with a given speech model may each be defined at various levels of granularity, the most specific being longitude and latitude navigational coordinates or a region tightly defined by navigational coordinates (for example, a quarter mile or less across). Alternatively, a location may be given using a region identifier, such as a state name or identifier, a city name, a colloquial name (for example, "Central Park"), a country name, or an identifier of any defined region (for example, "cell/region ABC 123"). In some implementations, a location may indicate a type of place rather than a geographically specific location, such as, in some examples, at a beach, in a large city, at an amusement park, in a moving vehicle, on a boat, inside a building, outdoors, in the countryside, in an underground location (for example, a subway or a parking garage), on a street, inside a tall building (skyscraper), or in a forest. The level of granularity may be the same or different among the location of the user when submitting the voice search query, the location associated with a given noise model, and the location associated with a given speech model.
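One way to key per-location noise models at a coarse coordinate granularity, roughly matching the quarter-mile regions mentioned above, is sketched below. The quantization scheme and the 0.005-degree cell size are assumptions for illustration only.

```python
def location_key(lat, lon, cell_degrees=0.005):
    """Quantize coordinates into a region key; 0.005 degrees is roughly a
    quarter mile of latitude (an illustrative choice)."""
    return (round(lat / cell_degrees), round(lon / cell_degrees))

class PerLocationNoiseModels:
    """Keeps one noise model per quantized location for a user; creates a
    model via a factory (e.g., selecting an alternative model) on first use."""
    def __init__(self):
        self._models = {}

    def get_or_create(self, lat, lon, factory):
        key = location_key(lat, lon)
        if key not in self._models:
            self._models[key] = factory()
        return self._models[key]
```

A production system might also record a last-used timestamp per key so that location-specific models can be deleted after the non-use period described above.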
A noise model (whether alternative 120a or adaptive 120b) may be implemented, for example, as a hidden Markov model (HMM) or a Gaussian mixture model (GMM). A user speech model may be trained using the Expectation-Maximization algorithm or otherwise adapted.
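As a hedged illustration of the Expectation-Maximization training just mentioned, the following fits a two-component one-dimensional Gaussian mixture with a generic EM loop. It is a minimal sketch of the general technique, not the patent's model: all names are ours, and real speech and noise models operate on multidimensional acoustic feature vectors rather than scalars.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # crude initialization: place the two means at the 25th/75th percentiles
    w = np.array([0.5, 0.5])
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([np.var(x), np.var(x)]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

rng = np.random.default_rng(0)
# synthetic "features" drawn from two well-separated modes
x = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
w, mu, var = em_gmm_1d(x)
```

With well-separated data, the recovered means land near -3 and 3 and the mixture weights sum to one.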
As described above, in some implementations a user may be specifically identified, while in other implementations the device may be used as a proxy for the user. Accordingly, similar to the indexing of speech models, adaptive noise models may be indexed by a user identifier corresponding to the identified user when the user can be specifically identified, or by a device identifier corresponding to the device used to submit the voice search queries when the user cannot be specifically identified.
FIG. 2 is a flow chart showing an example of a process 200 that may be performed when an initial voice search query is received from a user or device, and FIG. 3 is a flow chart showing an example of a process 300 that may be performed when a subsequent voice search query is received from a user or device. In the following, process 200 and process 300 are described as being implemented by components of the system 100, but other components of system 100, or another system, may implement process 200 or process 300.
Referring to FIG. 2, the ASR 108 receives an initial voice search query from a device such as the mobile device 104 (202). The voice search query may be initial in that it is the first voice search query received for a particular user or device; in that it is the first received from the particular location from which it is submitted; or in that no adapted user speech model or adapted noise model (or neither) exists for the user or device for some other reason (for example, because the model was deleted after going unused for a certain period of time).
The voice search query includes an audio signal that contains a user audio portion and an environmental audio portion. The user audio portion contains one or more utterances spoken by the user into the microphone of the mobile device 104, potentially together with environmental audio. The environmental audio portion contains only environmental audio. As described below, the voice search query may also include context information.
When included, the ASR 108 accesses context information about the voice search query (204). The context information may, for example, provide an indication of the conditions surrounding the audio signal in the voice search query. It may include time information, date information, data citing the speed or amount of movement measured by the mobile device during recording, other device sensor data, device state data (for example, Bluetooth headset, speakerphone, or regular input method), a user identifier if the user chooses to provide one, or information identifying the type or model of the mobile device.
The context information may also include the location from which the voice search query is submitted. This location may be determined, for example, from the user's calendar; derived from user preferences (for example, stored in the user's account with the ASR engine 108 or the search engine 106) or from a default location; based on a past location (for example, the most recent position calculated by a Global Positioning System (GPS) module of the device used to submit the query, such as the mobile device 104); provided explicitly by the user when submitting the voice query; determined from the utterance itself; calculated by transmission-tower triangulation; provided by a GPS module in the mobile device 104 (for example, the voice search application may access the GPS device to determine the position and send it with the voice search query); or estimated using dead reckoning. If sent by the device, the location information may include accuracy information indicating the level of precision of the location information.
The ASR 108 may use such context information to aid speech recognition, for example by using it to select a particular variant of the speech recognition system, or to select an appropriate alternative user speech model or alternative noise model. The ASR 108 may also pass such context information to the search engine 106 to improve search results. Some or all of the context information may be received together with the voice search query.
If no adapted user speech model exists for the user, the ASR 108 selects an initial, or alternative, user speech model and associates this initial user speech model with the user or the device (depending, for example, on whether the user can be specifically identified) (206). For instance, as described above, the ASR 108 may select among several available alternative user speech models.
The selected alternative user speech model may be one determined, based on known or determinable information, to be a reasonable approximation of the user's speech characteristics, even though it has not yet been adapted with any samples of the user's own speech. For example, in one implementation there may be two alternative user speech models: one for male speech and one for female speech. The likely gender of the user may be determined, and the alternative user speech model appropriate to that gender may be selected. The user's gender may be determined, for example, by analyzing the audio signal received with the initial voice search query, or from information such as that voluntarily submitted by the user and included in the user's profile.
Additionally or alternatively, the adapted user speech models of other users (such as the users of the mobile devices 102a-102c) may be used as alternative user speech models. When the initial voice search query is received, similarity measures may be determined that represent the similarity between an expected model for the submitting user, determined from the initial audio signal included with the query, and the adapted user speech models stored in the database 111 (corresponding to the other users). For example, if the models are based on constrained maximum-likelihood linear regression, the similarity measure may be the L2 norm of the difference between the models (the sum of squared differences over the coefficients). Where GMM techniques are used, the similarity measure may be the Kullback-Leibler divergence between the two probability density functions or, if the models are GMMs and the expected model derived from a single utterance is a point in feature space, the probability density of the GMM evaluated at that point. In other implementations using GMMs, the similarity measure may be, for example, the distance between the means of the GMMs, or the distance between the means normalized by some norm of the covariance matrices.
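The similarity measures just named can be sketched as follows: the L2 norm over coefficient differences, and a closed-form Kullback-Leibler divergence, computed here between diagonal Gaussians as a simple stand-in (KL between full mixtures has no closed form and is typically approximated). This is an illustrative sketch, not the patent's implementation.

```python
import numpy as np

def l2_model_distance(coeffs_a, coeffs_b):
    # L2 norm of the model difference: sum of squared coefficient differences
    d = np.asarray(coeffs_a, float) - np.asarray(coeffs_b, float)
    return float(np.sum(d ** 2))

def kl_diag_gaussian(mu0, var0, mu1, var1):
    # KL( N(mu0, diag var0) || N(mu1, diag var1) ), closed form
    mu0, var0 = np.asarray(mu0, float), np.asarray(var0, float)
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    return float(0.5 * np.sum(np.log(var1 / var0)
                              + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0))
```

The candidate adapted model yielding the smallest distance (or divergence) from the expected model would then be chosen as the alternative user speech model.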
The adapted user speech model closest to the user's expected model (as indicated by the similarity measures) may be selected as the alternative user speech model for the user submitting the initial voice search query. For example, when the user of the device 104 submits an initial voice search query, the ASR 108 may determine a similarity measure representing the similarity between the expected user speech model for the user of device 104 and the adapted user speech model of the user of device 102a. Similarly, the ASR 108 may determine a similarity measure representing the similarity between the expected user speech model for the user of device 104 and the adapted user speech model of the user of device 102b. If the similarity measures indicate that the expected model for the user of device 104 is more similar to the model of the user of device 102a than to the model of the user of device 102b, then the model of the user of device 102a may be used as the alternative user speech model for the user of device 104.
As a particular example of an implementation employing GMMs, the voice search query may include an utterance containing both speech and an environmental signal. The query may be segmented into segments of, for example, 25 ms, where each segment is either speech or pure environment. For each segment, a feature vector x_t is computed, and the vectors corresponding to speech are denoted x_s. For each potential alternative model M_i in the database, the likelihood of each vector is computed:

p(x_t | M_i)

This is the standard GMM likelihood computation, and p(i) is the prior of the alternative model. Assuming independence of the observations, the probability of the set of speech vectors x_s may be expressed as:

p(x_s, i) = p(i) · ∏_t p(x_t | M_i)

where x_s is the set of speech vectors. The conditional probability of class i given the observations x_s is:

p(i | x_s) = p(x_s, i) / p(x_s)

where

p(x_s) = Σ_i p(x_s, i)

This conditional probability may be used as the similarity measure between the current utterance and a particular alternative speech model M_i. The alternative model with the highest conditional probability may be selected:

model_index = argmax_i p(i | x_s)
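The selection rule above can be illustrated numerically: accumulate per-frame log-likelihoods for each candidate model over the speech frames, weight by the priors p(i), normalize to obtain p(i | x_s), and take the argmax. In this toy sketch each "model" is a single one-dimensional Gaussian standing in for a full GMM; all names are illustrative.

```python
import numpy as np

def gauss_loglik(x, mu, var):
    # log p(x_t | M_i) for a single Gaussian component
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def select_model(frames, models, priors):
    # log p(x_s, i) = log p(i) + sum_t log p(x_t | M_i), computed in log space
    scores = np.array([np.log(p_i) + gauss_loglik(frames, mu, var).sum()
                       for (mu, var), p_i in zip(models, priors)])
    # normalize to conditional probabilities p(i | x_s)
    post = np.exp(scores - scores.max())
    post /= post.sum()
    return int(np.argmax(post)), post

frames = np.array([0.9, 1.1, 1.0, 0.8])   # speech feature values near 1.0
models = [(0.0, 1.0), (1.0, 1.0)]         # two candidate "speaker models"
idx, post = select_model(frames, models, [0.5, 0.5])
```

Working in log space and subtracting the maximum before exponentiating avoids the underflow that the raw product of frame likelihoods would cause on long utterances.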
Context information, such as the user's accent or expected language, may be used alone or in combination with the other techniques described above to select an alternative user speech model. For example, multiple alternative user speech models may be stored for different languages and/or accents. The user's location at the time a voice search query is submitted may be used by the ASR 108 to determine an expected language or accent, and the alternative user speech model corresponding to the expected language and/or accent may be selected. Similarly, language and/or location information for the user may be stored, for example, in the user's profile, and used to select the alternative user speech model corresponding to the user's language and/or accent.
If an adapted user speech model already exists (for example, because the voice search query is initial with respect to a particular location but not with respect to the user or device), action 206 may be skipped, or may be replaced by further adaptation of the adapted user speech model. For example, the audio signal received with the initial voice search query may be evaluated to determine whether its background audio is below a particular threshold and, if so, the audio signal may be used to further train or otherwise adapt the adapted user speech model.
The ASR 108 selects an initial, or alternative, noise model and associates this initial noise model with the user or the device (depending, for example, on whether the user can be specifically identified) (208). The selected alternative noise model may be one determined, based on known or determinable information, to be a reasonable approximation of the expected environmental audio around the user. For example, alternative noise models may be developed for various standard categories of environmental conditions (for example, in a car, at an airport, at home, or in a bar/restaurant). Data from other users of the system may be used to develop the alternative noise models. For instance, if some duration of low-noise data (for example, ten minutes) has been collected from users, that data may be used to generate an alternative model. When the initial audio signal is received, similarity measures representing the similarity between an expected noise model and the standard alternative noise models may be determined based on the initial audio signal, and one of the standard alternative noise models may be selected based on the similarity measures (for example, using techniques similar to those described above for selecting an alternative user model). The expected noise model may, for instance, be determined from the environmental audio portion of the signal. A set of alternative noise models (for example, one hundred of them) that exceed a particular dissimilarity threshold with respect to one another (determined, for example, based on KL divergence) may be retained as the standard alternative models, and the alternative model to use may be selected from this set using the similarity measures as described. This may minimize computation when selecting an alternative noise model.
Additionally or alternatively, different noise models may be developed for different locations and stored as the alternative noise models 120a. For example, noise models for location A 132a and location B 132b may be developed and stored as alternative noise models 120a. The noise model for a particular location may be developed from previous voice search queries initiated by other users at that location. The noise model for location B 132b may be developed, for example, from the audio signal 130b received by the ASR 108 as part of a voice search query from the user of device 102b while at location B 132b, and from the audio signal 130c received by the ASR 108 as part of a voice search query from the user of device 102c while at location B 132b. The noise model for location A 132a may be developed, for example, from the audio signal 130a received by the ASR 108 as part of a voice search query from the user of device 102a while at location A.
When the initial audio signal is received, an alternative noise model may be selected based on the user's location. For example, when the user of the mobile device 104 submits an initial voice search from location B 132b, the ASR 108 may select the alternative noise model for location B. In some implementations, the voice search application on the mobile device 104 may access the GPS on the device to determine the user's location, and may send the location information to the ASR 108 together with the voice search query. The location information may then be used by the ASR 108 to determine the appropriate alternative noise model based on the location. In other implementations, when the initial audio signal is received, similarity measures representing the similarity between an expected noise model and the location-specific alternative noise models stored in the database 111 may be determined based on the initial audio signal, and one of the location-specific alternative noise models may be selected based on the similarity measures.
Using the initial (or adapted) user speech model and the initial noise model, the noise compensation system 116 of the ASR 108 performs noise compensation on the audio signal received with the voice search query to remove or reduce the background audio in the signal, thereby producing a filtered audio signal (210). For example, an algorithm such as the Algonquin algorithm, described in "ALGONQUIN: Iterating Laplace's Method to Remove Multiple Types of Acoustic Distortion for Robust Speech Recognition," Eurospeech 2001 - Scandinavia, may be used to perform noise compensation using the initial user speech model and the initial noise model.
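Algonquin itself involves iterated Laplace approximations and is beyond a short sketch; as a much simpler stand-in for the noise-compensation step, the following applies plain magnitude spectral subtraction with a known noise estimate, assuming roughly stationary noise. It is illustrative only and is not the algorithm the text cites; all names and parameters are ours.

```python
import numpy as np

def spectral_subtract(signal, noise_est, frame=256):
    # magnitude spectral subtraction, frame by frame (no overlap, for brevity)
    noise_mag = np.abs(np.fft.rfft(noise_est[:frame]))  # noise magnitude estimate
    out = np.zeros_like(signal, dtype=float)
    for start in range(0, len(signal) - frame + 1, frame):
        spec = np.fft.rfft(signal[start:start + frame])
        # subtract the noise floor from the magnitude, keep the noisy phase
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
    return out

rng = np.random.default_rng(1)
t = np.arange(2048) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)            # tonal stand-in for "user audio"
noise = 0.3 * rng.standard_normal(2048)        # broadband "environmental audio"
filtered = spectral_subtract(clean + noise, noise)

err_noisy = float(np.mean(noise ** 2))
err_filtered = float(np.mean((filtered - clean) ** 2))
```

A real system would estimate the noise spectrum from the environmental audio portion of the signal rather than be given it, and would use overlapping windowed frames.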
A speech recognition system then performs speech recognition on the filtered audio signal to transcribe the utterances in the audio signal into one or more candidate transcriptions (212). A search query may be performed using the one or more candidate transcriptions. In some implementations, the ASR 108 may use context information to select a particular variant of the speech recognition system with which to perform the recognition. For example, the user's accent and/or expected or known language may be used to select an appropriate speech recognition system. The user's location at the time the voice search query is submitted may be used to determine the user's expected language, or the user's language may be included in the user's profile.
Referring to FIG. 3, the ASR 108 receives a subsequent voice search query from a device such as the mobile device 104 (302). The voice search query may be subsequent in that it is received after a previous voice search query for the particular user or device, or in that an alternative or adapted user speech model or noise model already exists for the user or device.
The subsequent voice search query includes an audio signal that contains a user audio portion and an environmental audio portion. The user audio portion contains one or more utterances spoken by the user into the microphone of the mobile device 104, potentially together with environmental audio. The environmental audio portion contains only environmental audio. As described below, the voice search query may also include context information.
When included, the ASR 108 accesses context information about the voice search query (304). The ASR 108 may use such context information to aid speech recognition, for example by using it to select a particular variant of the speech recognition system. Additionally or alternatively, the context information may be used to aid the selection and/or adaptation of the alternative or adapted user speech model and/or the alternative or adaptive noise model. The ASR 108 may pass such context information to the search engine 106 to improve search results. Some or all of the context information may be received together with the voice search query.
The ASR 108 determines whether the environmental audio in the audio signal received with the voice search query is below a defined threshold (306). For example, a speech activity detector may be used to identify the user audio portion and the environmental audio portion of the received audio signal. The ASR 108 may then determine the energy in the environmental audio portion and compare the determined energy to a threshold energy. If the energy is below the threshold energy, the environmental audio is considered below the defined threshold. In another example, the ASR 108 may determine the energy in the user audio portion, determine the energy in the environmental audio portion, and then determine the ratio of the two. This ratio may represent the signal-to-noise ratio (SNR) of the audio signal. The SNR of the audio signal may then be compared to a threshold SNR and, when the SNR of the audio signal is above the threshold SNR, the environmental audio is considered below the defined threshold.
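The SNR-based test just described can be sketched as follows, using a crude energy split in place of a real speech activity detector; the frame size and threshold values are illustrative assumptions, not values from the patent.

```python
import numpy as np

def background_below_threshold(signal, frame=160, threshold_snr_db=15.0):
    # frame energies, then a trivial energy-based VAD: above-average == speech
    frames = signal[:len(signal) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    split = energy.mean()
    speech_e = energy[energy >= split].mean()   # "user audio" energy
    backgr_e = energy[energy < split].mean()    # "environmental audio" energy
    snr_db = 10.0 * np.log10(speech_e / backgr_e)
    # background counts as below the defined threshold when SNR is high enough
    return snr_db > threshold_snr_db, snr_db

rng = np.random.default_rng(2)
# quiet background, loud utterance in the middle
speech = np.concatenate([0.02 * rng.standard_normal(800),
                         1.0 * rng.standard_normal(800),
                         0.02 * rng.standard_normal(800)])
quiet_ok, snr_db = background_below_threshold(speech)
# uniformly noisy signal: no clear speech/background separation
noisy_ok, _ = background_below_threshold(rng.standard_normal(2400))
```

The clean recording passes the threshold test (and so would be used to adapt the user speech model), while the uniformly noisy one fails it.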
If the environmental audio in the audio signal received with the voice search query is not below the defined threshold, the audio signal is used to adapt the alternative (or adaptive) noise model to generate an adapted noise model (312). In some implementations, the particular noise model to adapt is selected based on the user's location. For example, where different noise models are maintained for the different locations from which the user frequently submits voice search queries, the ASR 108 may use the location of the user or device to select the alternative or adaptive noise model for that location.
The noise model may be adapted on the whole audio signal, or the environmental audio portion may be extracted and used to adapt the noise model, depending on the particular implementation of the noise model and of the speech enhancement or speech separation algorithm. Techniques such as hidden Markov models or Gaussian mixture models may be used to implement the noise model, and techniques such as Expectation-Maximization may be used to adapt the noise model.
If the environmental audio in the audio signal received with the voice search query is below the defined threshold, the audio signal is used to adapt the previously selected alternative user speech model (if the alternative model has not yet been adapted into an adapted user speech model) or the adapted user speech model (308). The user speech model may be adapted on the whole audio signal, or the user audio portion may be extracted and used to adapt the user speech model, depending on the particular implementation of the user speech model. As with the noise model, techniques such as hidden Markov models or Gaussian mixture models may be used to implement the user speech model, and techniques such as Expectation-Maximization or maximum a posteriori (MAP) adaptation may be used to adapt it.
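The MAP adaptation named above can be illustrated for the simplest case, the mean of a single Gaussian component: the adapted mean interpolates between the prior mean and the data mean, weighted by a relevance factor tau. This is the textbook update reduced to one dimension for clarity; it is a generic sketch, not the patent's specific procedure.

```python
import numpy as np

def map_adapt_mean(prior_mu, data, tau=10.0):
    # MAP update: (tau * prior + n * data_mean) / (tau + n)
    # small n  -> stays near the prior; large n -> moves toward the data
    n = len(data)
    return (tau * prior_mu + n * float(np.mean(data))) / (tau + n)

prior_mu = 0.0                   # mean from the alternative (unadapted) model
utterance = np.full(90, 1.0)     # new low-noise user audio centred at 1.0
adapted = map_adapt_mean(prior_mu, utterance, tau=10.0)   # -> 0.9
```

With 90 observations and tau = 10, the adapted mean lands 90% of the way from the prior toward the data mean, which is why repeated low-noise queries gradually personalize the model.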
In some implementations, the ASR 108 also trains or otherwise adapts the alternative or adaptive noise model based on the audio signal in which the background audio is below the threshold (310). While in some implementations the user speech model is trained or adapted only on audio signals in which the background audio is below the defined threshold, in some instances the noise model may be trained or adapted both on such audio signals and on audio signals in which the background audio is above the threshold, depending on the particular technique used to implement the noise model. For example, some noise models may include parameters that reflect environments in which the background audio is below the threshold, and such models may therefore benefit from adaptation on audio signals whose background audio is below the threshold.
Using the alternative or adapted user speech model (depending on whether the alternative speech model has been adapted) and the alternative or adapted noise model (depending on whether the alternative noise model has been adapted), the noise compensation system 116 of the ASR 108 performs noise compensation on the audio signal received with the voice search query, in the same manner as described above, to remove or reduce the background audio in the signal, thereby producing a filtered audio signal (314). A speech recognition system performs speech recognition on the filtered audio signal, in the same manner as described above, to transcribe the speech in the audio signal into one or more candidate transcriptions (316).
While process 300 shows the noise model and/or user speech model being adapted before being used for noise compensation, the adaptation may instead occur after the noise compensation is performed, with the noise compensation based on the noise and/or user speech models as they were before the further adaptation. This may be the case, for example, when the adaptation is computationally intensive. In that case, the expected response time to a voice search query may be achieved by using the current noise and user speech models for noise compensation and updating them afterwards based on the new audio signal.
FIG. 4 is a swim-lane diagram showing an example of a process 400 performed by the mobile device 104, the ASR 108, and the search engine 106 to handle a voice search query. The mobile device 104 sends a voice search query to the ASR 108 (402). As described above, the voice search query includes an audio signal containing an environmental audio portion and a user audio portion; the environmental audio portion contains environmental audio without user utterances, and the user audio portion contains user utterances (and potentially environmental audio). The voice search query may also include context information, such as that described above.
The ASR 108 receives the voice search query (402) and selects both a noise model and a user speech model (404). The ASR 108 may, for example, select a stored adapted user speech model based on a user identifier or device identifier included with the voice search query or otherwise accessible to the ASR 108. Similarly, the ASR 108 may, for example, select a stored adaptive noise model based on a user identifier or device identifier included with the voice search query or otherwise accessible to the ASR 108. In implementations using different noise models for particular locations, the ASR 108 may select the stored adaptive noise model from among multiple location-specific adaptive noise models based on the user or device identifier and a location identifier corresponding to the user's location at the time the voice search query was submitted. The ASR 108 may obtain the location information from the context information sent with the voice search query or otherwise available to the ASR 108.
In the event that no adapted user speech model exists for the user or device, the ASR 108 selects an alternative user speech model, for example using the techniques described above (404). Similarly, if no adaptive noise model exists for the user or device, or at least not for the particular location of the user at the time the voice search query was submitted, the ASR 108 selects an alternative noise model, for example using the techniques described above.
The ASR 108 then uses the audio signal received with the voice search query to adapt the selected user speech model (406) and/or the selected noise model (408) to generate an adapted user speech model or adapted noise model, depending on the background audio in the audio signal. As described above, when the background audio is below the defined threshold, the audio signal is used to adapt the selected user speech model and, in some implementations, the selected noise model as well. When the background audio is above the defined threshold, then, at least in some implementations, the audio signal is used to adapt only the selected noise model.
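The routing just described reduces to a small decision, sketched here with illustrative names: clean audio adapts the user speech model (and, in some implementations, the noise model too), while noisy audio adapts only the noise model. This is a schematic sketch of the control flow, not the patent's code.

```python
def route_adaptation(background_below_threshold, adapt_noise_on_clean=True):
    # Decide which models the current audio signal should adapt.
    if background_below_threshold:
        targets = ["user_speech_model"]
        if adapt_noise_on_clean:
            # some implementations also adapt the noise model on clean audio
            targets.append("noise_model")
        return targets
    # background above threshold: adapt only the noise model
    return ["noise_model"]
```

For example, `route_adaptation(True)` yields both model names, while `route_adaptation(False)` yields only `"noise_model"`.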
The ASR 108 uses the adapted user speech model and the adapted noise model to perform noise compensation on the audio signal (410) to generate a filtered audio signal with reduced or removed background audio compared to the received audio signal.
The ASR 108 performs speech recognition on the filtered audio signal to transcribe the one or more utterances in the audio signal into textual candidate transcriptions (412). The ASR 108 forwards the generated transcriptions to the search engine 106 (414). If the ASR 108 generates multiple transcriptions, the transcriptions may optionally be ranked in order of confidence. The ASR 108 may optionally provide context data, such as the geographic position, to the search engine 106, and the search engine 106 may use the context data to filter or rank the search results.
The search engine 106 performs a search operation using the transcriptions (416). The search engine 106 may locate one or more URIs related to the transcription terms.
The search engine 106 provides the search query results to the mobile device 104 (418). For example, the search engine 106 may forward HTML code that generates a visual listing of the located URIs.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, while the techniques above have been described with respect to performing speech recognition on the audio signal in a voice search query, they may be used in other systems, such as computerized voice dictation systems or dialog systems implemented on mobile or other devices. In addition, the various forms of the flows shown above may be used with steps reordered, added, or removed. Accordingly, other implementations are within the scope of the following claims.
Embodiments and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments may be implemented as one or more computer program products, that is, one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, a data processing apparatus. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, for example code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, for example a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver apparatus.
A computer program (also known as a program, software, a software application, a script, or code) may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subroutines, or portions of code). A computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows may also be performed by, and apparatus may also be implemented as, special-purpose logic circuitry, for example an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general- and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, for example a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, or a Global Positioning System (GPS) receiver, to name just a few. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices (for example, EPROM, EEPROM, and flash memory devices); magnetic disks (for example, internal hard disks or removable disks); magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
To provide for interaction with a user, embodiments may be implemented on a computer having a display device, for example a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, for example a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
Embodiments may be implemented in a computing system that includes a back-end component (for example, as a data server), or that includes a middleware component (for example, an application server), or that includes a front-end component (for example, a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation), or any combination of one or more such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication, for example a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), for example, the Internet.
The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
In each instance where an HTML file is mentioned, other file types or formats may be substituted. For instance, an HTML file may be replaced by an XML, JSON, plain text, or other type of file. Moreover, where a table or hash table is mentioned, other data structures (such as spreadsheets, relational databases, or structured files) may be used.
Thus, particular embodiments have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results.
Claims (24)
1. A system for speech recognition, comprising:
means for receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device;
means for accessing a user speech model associated with the user;
means for determining that background audio in the audio signal is below a defined threshold;
means for adapting, in response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
means for performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
2. The system of claim 1, wherein the audio signal includes an environmental audio portion that corresponds only to background audio around the user, and wherein, to determine that the background audio in the audio signal is below the defined threshold, the system includes:
means for determining an amount of energy in the environmental audio portion; and
means for determining that the amount of energy in the environmental audio portion is below a threshold energy.
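The energy test recited in claim 2 can be sketched as a sum-of-squares comparison against a threshold. This is only an illustrative interpretation, not the patented implementation; the function names, sample values, and the threshold are hypothetical.

```python
def signal_energy(samples):
    """Sum-of-squares energy of a sequence of audio samples."""
    return sum(s * s for s in samples)

def background_below_threshold(environment_samples, threshold_energy):
    """True when the energy of the environment-only portion of the
    audio signal is below the defined threshold energy."""
    return signal_energy(environment_samples) < threshold_energy

quiet = [0.01] * 16000   # 1 s of near-silence at 16 kHz (energy 1.6)
loud = [0.5] * 16000     # strong background audio (energy 4000.0)
assert background_below_threshold(quiet, threshold_energy=10.0)
assert not background_below_threshold(loud, threshold_energy=10.0)
```

In practice the environment-only portion might be taken from the audio captured just before or after the user's utterance, but the claim does not specify how it is isolated.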
3. The system of claim 2, wherein, to determine that the background audio in the audio signal is below the defined threshold, the system includes:
means for determining a signal-to-noise ratio of the audio signal; and
means for determining that the signal-to-noise ratio is below a threshold signal-to-noise ratio.
4. The system of claim 3, wherein the audio signal includes an environmental audio portion that corresponds only to background audio around the user, and wherein, to determine the signal-to-noise ratio of the audio signal, the system includes:
means for determining an amount of energy in the user audio portion of the audio signal;
means for determining an amount of energy in the environmental audio portion of the audio signal; and
means for determining the signal-to-noise ratio by determining a ratio between the amounts of energy in the user audio portion and the environmental audio portion.
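The signal-to-noise computation of claims 3 and 4 can be illustrated as the ratio of the two energies, conventionally expressed in decibels. This sketch is an assumption about one reasonable reading of the claims; the dB conversion and all values are hypothetical.

```python
import math

def signal_energy(samples):
    """Sum-of-squares energy of a sequence of audio samples."""
    return sum(s * s for s in samples)

def snr_db(user_samples, environment_samples):
    """Signal-to-noise ratio in dB: ratio between the energy of the
    user audio portion and the energy of the environment-only portion."""
    ratio = signal_energy(user_samples) / signal_energy(environment_samples)
    return 10.0 * math.log10(ratio)

speech = [0.5] * 8000    # user utterance portion
noise = [0.05] * 8000    # environment-only portion
assert round(snr_db(speech, noise)) == 20   # energy ratio 100 -> 20 dB
```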
5. The system of claim 1, wherein the accessed user speech model includes an alternative user speech model that has not yet been adapted to model the speech characteristics of the user.
6. The system of claim 5, wherein the system includes:
means for selecting the alternative user speech model; and
means for associating the alternative user speech model with the user.
7. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for determining a gender of the user; and
means for selecting the alternative user speech model from among multiple alternative user speech models based on the gender of the user.
8. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for determining a location of the user when the one or more utterances were recorded; and
means for selecting the alternative user speech model from among multiple alternative user speech models based on the location of the user when the one or more utterances were recorded.
9. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for determining a language or accent of the user; and
means for selecting the alternative user speech model from among multiple alternative user speech models based on the language or accent.
10. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for receiving an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device;
means for determining a measure of similarity between multiple alternative user speech models and an expected user speech model for the user determined based on the initial audio signal; and
means for selecting the alternative user speech model from among the multiple alternative user speech models based on the measure of similarity.
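The similarity-based selection in claim 10 can be illustrated with a toy sketch in which each candidate model is reduced to a mean feature vector and the candidate closest (by Euclidean distance) to the model expected from the initial audio signal is chosen. A real system would compare full acoustic models (e.g., Gaussian mixture models), so every name and value below is a hypothetical stand-in.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_alternative_model(candidate_models, expected_model):
    """From a dict of {name: feature_vector} candidates, pick the one
    most similar to the expected model derived from the initial audio."""
    return min(candidate_models,
               key=lambda name: euclidean(candidate_models[name], expected_model))

candidates = {
    "model_a": [1.0, 2.0, 3.0],
    "model_b": [4.0, 4.0, 4.0],
}
expected = [1.1, 2.1, 2.9]
assert select_alternative_model(candidates, expected) == "model_a"
```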
11. The system of claim 1, wherein the system includes:
means for accessing a noise model associated with the user; and
wherein, to perform noise compensation, the system further includes means for performing noise compensation on the received audio signal using the adapted user speech model and the accessed noise model.
12. The system of claim 11, wherein, to perform noise compensation, the system further includes:
means for adapting the accessed noise model based on the received audio signal to generate an adapted noise model that models characteristics of the background audio around the user; and
means for performing noise compensation on the received audio signal using the adapted user speech model and the adapted noise model.
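Claims 11 and 12 combine a user speech model with a noise model during noise compensation. One classical noise-compensation technique consistent with this description is magnitude spectral subtraction; the claims do not mandate any particular algorithm, so the sketch below is only an illustration, and its names and values are hypothetical.

```python
def spectral_subtract(signal_mags, noise_mags, floor=0.0):
    """Subtract an estimated noise magnitude spectrum (the 'noise
    model' here) from the signal's magnitude spectrum, flooring the
    result at zero to avoid negative magnitudes."""
    return [max(s - n, floor) for s, n in zip(signal_mags, noise_mags)]

# Flat noise estimate of 0.25 per frequency bin; the last bin is
# dominated by noise and is floored to zero.
assert spectral_subtract([1.0, 0.5, 0.25], [0.25, 0.25, 0.25]) == [0.75, 0.25, 0.0]
```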
13. The system of claim 11, wherein the system includes:
means for receiving a second audio signal that includes at least a second user audio portion corresponding to one or more user utterances recorded by the device;
means for determining that background audio in the second audio signal is above the defined threshold; and
means for adapting, in response to determining that the background audio in the second audio signal is above the defined threshold, the noise model associated with the user based on the second audio signal to generate an adapted noise model that models characteristics of the background audio around the user.
14. The system of claim 11, wherein the accessed noise model includes an alternative noise model that has not yet been adapted to model characteristics of the background audio around the user.
15. The system of claim 14, wherein the system includes:
means for selecting the alternative noise model; and
means for associating the alternative noise model with the user.
16. The system of claim 15, wherein, to select the alternative noise model, the system includes:
means for receiving an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device;
means for determining a location of the user when the one or more utterances corresponding to the initial user audio portion were recorded; and
means for selecting the alternative noise model from among multiple alternative noise models based on the location of the user when the one or more utterances corresponding to the initial user audio portion were recorded.
17. The system of claim 15, wherein, to select the alternative noise model, the system includes:
means for receiving an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device;
means for determining a measure of similarity between multiple alternative noise models and an expected noise model for the user determined based on the initial audio signal; and
means for selecting the alternative noise model from among the multiple alternative noise models based on the measure of similarity.
18. The system of claim 17, wherein each alternative noise model of the multiple alternative noise models models characteristics of background audio at a particular location.
19. The system of claim 17, wherein each alternative noise model of the multiple alternative noise models models characteristics of background audio under a particular type of environmental condition.
20. The system of claim 11, wherein, to access the noise model, the system includes:
means for determining a location of the user when the one or more utterances were recorded; and
means for selecting the noise model from among multiple noise models based on the location of the user.
21. The system of claim 1, wherein the audio signal corresponds to a voice search query, and the system includes:
means for performing speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances;
means for performing a search query using the one or more candidate transcriptions to generate search results; and
means for sending the search results to the device.
22. A system for speech recognition, comprising:
means for sending, to an automated speech recognition system, an audio signal that includes at least a user audio portion corresponding to one or more recorded user utterances;
means for receiving the audio signal;
means for accessing a user speech model associated with the user;
means for determining that background audio in the audio signal is below a defined threshold;
means for adapting, in response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
means for performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
23. The system of claim 22, wherein the system further includes means for performing speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances, and wherein the system further includes:
means for performing a search query using the one or more candidate transcriptions to generate search results; and
means for sending the search results.
24. A method for speech recognition, comprising:
receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device;
accessing a user speech model associated with the user;
determining that background audio in the audio signal is below a defined threshold;
in response to determining that the background audio in the audio signal is below the defined threshold, adapting the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
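The control flow of the method of claim 24 can be sketched as follows. The adaptation and compensation callables are hypothetical stand-ins (here the "model" is just a running average of sample values and compensation is an identity placeholder); only the branching, adapt the model when the background is quiet, then compensate with the current model, mirrors the claim.

```python
def run_front_end(audio_signal, background_energy, threshold,
                  user_model, adapt_model, noise_compensate):
    """Adapt the user speech model only when background audio is below
    the defined threshold, then use the (possibly adapted) model to
    noise-compensate the received signal."""
    if background_energy < threshold:
        user_model = adapt_model(user_model, audio_signal)
    return noise_compensate(audio_signal, user_model), user_model

def adapt(model, sig):
    # Toy adaptation: blend the old model with the signal's mean.
    return 0.5 * model + 0.5 * (sum(sig) / len(sig))

def compensate(sig, model):
    # Identity placeholder for the noise-compensation step.
    return list(sig)

filtered, model = run_front_end([0.25, 0.25, 0.25, 0.25],
                                background_energy=1.0, threshold=5.0,
                                user_model=0.0, adapt_model=adapt,
                                noise_compensate=compensate)
assert model == 0.125   # adapted, since background was below threshold
```

When the background energy is at or above the threshold, the model passes through unchanged, which is the branch claim 13 exploits to adapt the noise model instead.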
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/814,665 | 2010-06-14 | ||
US12/814,665 US8234111B2 (en) | 2010-06-14 | 2010-06-14 | Speech and noise models for speech recognition |
PCT/US2011/040225 WO2011159628A1 (en) | 2010-06-14 | 2011-06-13 | Speech and noise models for speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103069480A CN103069480A (en) | 2013-04-24 |
CN103069480B true CN103069480B (en) | 2014-12-24 |
Family
ID=44303537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201180026390.4A Active CN103069480B (en) | 2010-06-14 | 2011-06-13 | Speech and noise models for speech recognition |
Country Status (5)
Country | Link |
---|---|
US (3) | US8234111B2 (en) |
EP (1) | EP2580751B1 (en) |
CN (1) | CN103069480B (en) |
AU (1) | AU2011267982B2 (en) |
WO (1) | WO2011159628A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105719645A (en) * | 2014-12-17 | 2016-06-29 | 现代自动车株式会社 | Speech recognition apparatus, vehicle including the same, and method of controlling the same |
US9466310B2 (en) * | 2013-12-20 | 2016-10-11 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Compensating for identifiable background content in a speech recognition device |
JP6375521B2 (en) * | 2014-03-28 | 2018-08-22 | パナソニックIpマネジメント株式会社 | Voice search device, voice search method, and display device |
US10446168B2 (en) * | 2014-04-02 | 2019-10-15 | Plantronics, Inc. | Noise level measurement with mobile devices, location services, and environmental response |
KR102257910B1 (en) * | 2014-05-02 | 2021-05-27 | 삼성전자주식회사 | Apparatus and method for speech recognition, apparatus and method for generating noise-speech recognition model |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
AU2015266863B2 (en) | 2014-05-30 | 2018-03-15 | Apple Inc. | Multi-command single utterance input method |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9904851B2 (en) | 2014-06-11 | 2018-02-27 | At&T Intellectual Property I, L.P. | Exploiting visual information for enhancing audio signals via source separation and beamforming |
US9858922B2 (en) | 2014-06-23 | 2018-01-02 | Google Inc. | Caching speech recognition scores |
US9639854B2 (en) | 2014-06-26 | 2017-05-02 | Nuance Communications, Inc. | Voice-controlled information exchange platform, such as for providing information to supplement advertising |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9837102B2 (en) * | 2014-07-02 | 2017-12-05 | Microsoft Technology Licensing, Llc | User environment aware acoustic noise reduction |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9953646B2 (en) | 2014-09-02 | 2018-04-24 | Belleau Technologies | Method and system for dynamic speech recognition and tracking of prewritten script |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9299347B1 (en) | 2014-10-22 | 2016-03-29 | Google Inc. | Speech recognition using associative mapping |
US10999636B1 (en) * | 2014-10-27 | 2021-05-04 | Amazon Technologies, Inc. | Voice-based content searching on a television based on receiving candidate search strings from a remote server |
US9667321B2 (en) * | 2014-10-31 | 2017-05-30 | Pearson Education, Inc. | Predictive recommendation engine |
US10116563B1 (en) | 2014-10-30 | 2018-10-30 | Pearson Education, Inc. | System and method for automatically updating data packet metadata |
US10318499B2 (en) | 2014-10-30 | 2019-06-11 | Pearson Education, Inc. | Content database generation |
EP3213232A1 (en) | 2014-10-30 | 2017-09-06 | Pearson Education, Inc. | Content database generation |
US10110486B1 (en) | 2014-10-30 | 2018-10-23 | Pearson Education, Inc. | Automatic determination of initial content difficulty |
US10735402B1 (en) | 2014-10-30 | 2020-08-04 | Pearson Education, Inc. | Systems and method for automated data packet selection and delivery |
US10333857B1 (en) | 2014-10-30 | 2019-06-25 | Pearson Education, Inc. | Systems and methods for data packet metadata stabilization |
US10218630B2 (en) | 2014-10-30 | 2019-02-26 | Pearson Education, Inc. | System and method for increasing data transmission rates through a content distribution network |
JP2016109725A (en) * | 2014-12-02 | 2016-06-20 | ソニー株式会社 | Information-processing apparatus, information-processing method, and program |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10460227B2 (en) | 2015-05-15 | 2019-10-29 | Apple Inc. | Virtual assistant in a communication session |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10504509B2 (en) | 2015-05-27 | 2019-12-10 | Google Llc | Providing suggested voice-based action queries |
US10200824B2 (en) | 2015-05-27 | 2019-02-05 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
US9786270B2 (en) | 2015-07-09 | 2017-10-10 | Google Inc. | Generating acoustic models |
US10008199B2 (en) * | 2015-08-22 | 2018-06-26 | Toyota Motor Engineering & Manufacturing North America, Inc. | Speech recognition system with abbreviated training |
US10614368B2 (en) | 2015-08-28 | 2020-04-07 | Pearson Education, Inc. | System and method for content provisioning with dual recommendation engines |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10740384B2 (en) | 2015-09-08 | 2020-08-11 | Apple Inc. | Intelligent automated assistant for media search and playback |
US10331312B2 (en) | 2015-09-08 | 2019-06-25 | Apple Inc. | Intelligent automated assistant in a media environment |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US11631421B2 (en) | 2015-10-18 | 2023-04-18 | Solos Technology Limited | Apparatuses and methods for enhanced speech recognition in variable environments |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10468016B2 (en) | 2015-11-24 | 2019-11-05 | International Business Machines Corporation | System and method for supporting automatic speech recognition of regional accents based on statistical information and user corrections |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10229672B1 (en) | 2015-12-31 | 2019-03-12 | Google Llc | Training acoustic models using connectionist temporal classification |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US11138987B2 (en) * | 2016-04-04 | 2021-10-05 | Honeywell International Inc. | System and method to distinguish sources in a multiple audio source environment |
US11188841B2 (en) | 2016-04-08 | 2021-11-30 | Pearson Education, Inc. | Personalized content distribution |
US10789316B2 (en) | 2016-04-08 | 2020-09-29 | Pearson Education, Inc. | Personalized automatic content aggregation generation |
US10642848B2 (en) | 2016-04-08 | 2020-05-05 | Pearson Education, Inc. | Personalized automatic content aggregation generation |
US10325215B2 (en) | 2016-04-08 | 2019-06-18 | Pearson Education, Inc. | System and method for automatic content aggregation generation |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
CN109313896B (en) * | 2016-06-08 | 2020-06-30 | 谷歌有限责任公司 | Extensible dynamic class language modeling method, system for generating an utterance transcription, computer-readable medium |
DK179588B1 (en) | 2016-06-09 | 2019-02-22 | Apple Inc. | Intelligent automated assistant in a home environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
US20180018973A1 (en) | 2016-07-15 | 2018-01-18 | Google Inc. | Speaker verification |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10951720B2 (en) | 2016-10-24 | 2021-03-16 | Bank Of America Corporation | Multi-channel cognitive resource platform |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | User interface for correcting recognition errors |
DK180048B1 (en) | 2017-05-11 | 2020-02-04 | Apple Inc. | MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | Low-latency intelligent automated assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
US20180336892A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Detecting a trigger of a digital assistant |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10706840B2 (en) | 2017-08-18 | 2020-07-07 | Google Llc | Encoder-decoder models for sequence to sequence mapping |
US10096311B1 (en) | 2017-09-12 | 2018-10-09 | Plantronics, Inc. | Intelligent soundscape adaptation utilizing mobile devices |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
CN107908742A (en) * | 2017-11-15 | 2018-04-13 | 百度在线网络技术(北京)有限公司 | Method and apparatus for output information |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
KR102446637B1 (en) * | 2017-12-28 | 2022-09-23 | 삼성전자주식회사 | Sound output system and speech processing method |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
CN108182270A (en) * | 2018-01-17 | 2018-06-19 | 广东小天才科技有限公司 | Search for content transmission and searching method, smart pen, search terminal and storage medium |
KR102609430B1 (en) * | 2018-01-23 | 2023-12-04 | 구글 엘엘씨 | Selective adaptation and utilization of noise reduction technique in invocation phrase detection |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
KR102585231B1 (en) * | 2018-02-02 | 2023-10-05 | 삼성전자주식회사 | Speech signal processing mehtod for speaker recognition and electric apparatus thereof |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10923139B2 (en) * | 2018-05-02 | 2021-02-16 | Melo Inc. | Systems and methods for processing meeting information obtained from multiple sources |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
CN109087659A (en) * | 2018-08-03 | 2018-12-25 | 三星电子(中国)研发中心 | Audio optimization method and apparatus |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
CN111415653B (en) * | 2018-12-18 | 2023-08-01 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing speech |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
CN109841227B (en) * | 2019-03-11 | 2020-10-02 | 南京邮电大学 | Background noise removing method based on learning compensation |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
DK201970511A1 (en) | 2019-05-31 | 2021-02-15 | Apple Inc | Voice identification in digital assistant systems |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11227599B2 (en) | 2019-06-01 | 2022-01-18 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11848023B2 (en) * | 2019-06-10 | 2023-12-19 | Google Llc | Audio noise reduction |
CN112201247A (en) * | 2019-07-08 | 2021-01-08 | 北京地平线机器人技术研发有限公司 | Speech enhancement method and apparatus, electronic device, and storage medium |
KR102260216B1 (en) * | 2019-07-29 | 2021-06-03 | 엘지전자 주식회사 | Intelligent voice recognizing method, voice recognizing apparatus, intelligent computing device and server |
CN110648680A (en) * | 2019-09-23 | 2020-01-03 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, electronic equipment and readable storage medium |
WO2021056255A1 (en) | 2019-09-25 | 2021-04-01 | Apple Inc. | Text detection using global geometry estimators |
US11489794B2 (en) | 2019-11-04 | 2022-11-01 | Bank Of America Corporation | System for configuration and intelligent transmission of electronic communications and integrated resource processing |
CN110956955B (en) * | 2019-12-10 | 2022-08-05 | 思必驰科技股份有限公司 | Voice interaction method and device |
CN112820307B (en) * | 2020-02-19 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Voice message processing method, device, equipment and medium |
CN111461438B (en) * | 2020-04-01 | 2024-01-05 | 中国人民解放军空军93114部队 | Signal detection method and device, electronic equipment and storage medium |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US11038934B1 (en) | 2020-05-11 | 2021-06-15 | Apple Inc. | Digital assistant hardware abstraction |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
US11580959B2 (en) * | 2020-09-28 | 2023-02-14 | International Business Machines Corporation | Improving speech recognition transcriptions |
CN112652304B (en) * | 2020-12-02 | 2022-02-01 | 北京百度网讯科技有限公司 | Voice interaction method and device of intelligent equipment and electronic equipment |
CN112669867B (en) * | 2020-12-15 | 2023-04-11 | 阿波罗智联(北京)科技有限公司 | Debugging method and device of noise elimination algorithm and electronic equipment |
CN112634932B (en) * | 2021-03-09 | 2021-06-22 | 赣州柏朗科技有限公司 | Audio signal processing method and device, server and related equipment |
CN113053382A (en) * | 2021-03-30 | 2021-06-29 | 联想(北京)有限公司 | Processing method and device |
US11875798B2 (en) | 2021-05-03 | 2024-01-16 | International Business Machines Corporation | Profiles for enhanced speech recognition training |
CN114333881B (en) * | 2022-03-09 | 2022-05-24 | 深圳市迪斯声学有限公司 | Audio transmission noise reduction method, device and medium based on environment self-adaptation |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1453767A (en) * | 2002-04-26 | 2003-11-05 | 日本先锋公司 | Speech recognition apparatus and speech recognition method |
US6718302B1 (en) * | 1997-10-20 | 2004-04-06 | Sony Corporation | Method for utilizing validity constraints in a speech endpoint detector |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970446A (en) * | 1997-11-25 | 1999-10-19 | At&T Corp | Selective noise/channel/coding models and recognizers for automatic speech recognition |
US7209880B1 (en) * | 2001-03-20 | 2007-04-24 | At&T Corp. | Systems and methods for dynamic re-configurable speech recognition |
JP3826032B2 (en) * | 2001-12-28 | 2006-09-27 | 株式会社東芝 | Speech recognition apparatus, speech recognition method, and speech recognition program |
JP4357867B2 (en) * | 2003-04-25 | 2009-11-04 | パイオニア株式会社 | Voice recognition apparatus, voice recognition method, voice recognition program, and recording medium recording the same |
US7321852B2 (en) * | 2003-10-28 | 2008-01-22 | International Business Machines Corporation | System and method for transcribing audio files of various languages |
JP4340686B2 (en) * | 2004-03-31 | 2009-10-07 | パイオニア株式会社 | Speech recognition apparatus and speech recognition method |
DE102004017486A1 (en) * | 2004-04-08 | 2005-10-27 | Siemens Ag | Method for noise reduction in a voice input signal |
DE602007004733D1 (en) * | 2007-10-10 | 2010-03-25 | Harman Becker Automotive Sys | speaker recognition |
US20100145687A1 (en) | 2008-12-04 | 2010-06-10 | Microsoft Corporation | Removing noise from speech |
- 2010
- 2010-06-14 US US12/814,665 patent/US8234111B2/en active Active
- 2011
- 2011-06-13 EP EP11731192.8A patent/EP2580751B1/en active Active
- 2011-06-13 AU AU2011267982A patent/AU2011267982B2/en active Active
- 2011-06-13 CN CN201180026390.4A patent/CN103069480B/en active Active
- 2011-06-13 WO PCT/US2011/040225 patent/WO2011159628A1/en active Application Filing
- 2011-09-30 US US13/250,777 patent/US8249868B2/en active Active
- 2012
- 2012-06-22 US US13/530,614 patent/US8666740B2/en active Active
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105719645A (en) * | 2014-12-17 | 2016-06-29 | 现代自动车株式会社 | Speech recognition apparatus, vehicle including the same, and method of controlling the same |
CN105719645B (en) * | 2014-12-17 | 2020-09-18 | 现代自动车株式会社 | Voice recognition apparatus, vehicle including the same, and method of controlling voice recognition apparatus |
Also Published As
Publication number | Publication date |
---|---|
AU2011267982A1 (en) | 2012-11-01 |
US8249868B2 (en) | 2012-08-21 |
US20120022860A1 (en) | 2012-01-26 |
CN103069480A (en) | 2013-04-24 |
AU2011267982B2 (en) | 2015-02-05 |
EP2580751A1 (en) | 2013-04-17 |
US8234111B2 (en) | 2012-07-31 |
US20120259631A1 (en) | 2012-10-11 |
US8666740B2 (en) | 2014-03-04 |
US20110307253A1 (en) | 2011-12-15 |
WO2011159628A1 (en) | 2011-12-22 |
EP2580751B1 (en) | 2014-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103069480B (en) | Speech and noise models for speech recognition | |
CN104575493B (en) | Use the acoustic model adaptation of geography information | |
EP3923281B1 (en) | Noise compensation using geotagged audio signals | |
AU2014200999B2 (en) | Geotagged environmental audio for enhanced speech recognition accuracy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
Address after: California, USA. Patentee after: Google LLC. Address before: California, USA. Patentee before: Google Inc. |