CN103069480B - Speech and noise models for speech recognition - Google Patents
- Publication number
- CN103069480B (granted publication of application CN201180026390.4A)
- Authority
- CN
- China
- Prior art keywords
- user
- audio
- sound signal
- model
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
An audio signal generated by a device based on audio input from a user may be received. The audio signal may include at least a user audio portion that corresponds to one or more user utterances recorded by the device. A user speech model associated with the user may be accessed, and a determination may be made that background audio in the audio signal is below a defined threshold. In response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model may be adapted based on the audio signal to generate an adapted user speech model that models speech characteristics of the user. Noise compensation may be performed on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Description
Cross Reference to Related Applications
This application claims priority to U.S. Application Serial No. 12/814,665, titled "SPEECH AND NOISE MODELS FOR SPEECH RECOGNITION" and filed on June 14, 2010, the disclosure of which is incorporated herein by reference.
Technical field
This specification relates to speech recognition.
Background technology
Speech recognition may be used for voice search queries. In general, a search query includes one or more query terms that a user submits to a search engine when the user requests the search engine to perform a search. Among other ways, a user may enter the query terms of a search query by typing on a keyboard or, in the case of a voice query, by speaking the query terms into a microphone of, for example, a mobile device.
When a voice query is submitted through, for example, a mobile device, the microphone of the mobile device may record ambient noise or sounds in addition to the user's spoken utterances, otherwise referred to as "environmental audio" or "background audio." For example, environmental audio may include background chatter or talk of other people located around the user, or noise generated by nature (e.g., a dog barking) or by man-made sources (e.g., office, airport, or roadway noise, or construction activity). The environmental audio may partially obscure the user's voice, making it difficult for an automated speech recognition ("ASR") engine to accurately recognize the spoken utterances.
Summary of the invention
In one aspect, a system includes one or more processing devices and one or more storage devices storing instructions that, when executed by the one or more processing devices, cause the one or more processing devices to: receive an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device; access a user speech model associated with the user; determine that background audio in the audio signal is below a defined threshold; in response to determining that the background audio in the audio signal is below the defined threshold, adapt the accessed user speech model based on the audio signal to generate an adapted user speech model that models the speech characteristics of the user; and perform noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Implementations may include one or more of the following features. For example, the audio signal may include an environmental audio portion that corresponds only to background audio around the user, and to determine that the background audio in the audio signal is below the defined threshold, the instructions may include instructions that, when executed, cause the one or more processing devices to determine an amount of energy in the environmental audio portion; and determine that the amount of energy in the environmental audio portion is below a threshold energy. To determine that the background audio in the audio signal is below the defined threshold, the instructions may include instructions that, when executed, cause the one or more processing devices to determine a signal-to-noise ratio of the audio signal; and determine that the signal-to-noise ratio is below a threshold signal-to-noise ratio. The audio signal may include an environmental audio portion that corresponds only to background audio around the user, and to determine the signal-to-noise ratio of the audio signal, the instructions may include instructions that, when executed, cause the one or more processing devices to determine an amount of energy in the user audio portion of the audio signal; determine an amount of energy in the environmental audio portion of the audio signal; and determine the signal-to-noise ratio as a ratio between the amounts of energy in the user audio portion and the environmental audio portion.
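The energy and signal-to-noise computations described above can be sketched as follows. This is not part of the patent; it is a minimal illustration in Python, and the function names, threshold values, and the direction of the final comparison are assumptions.

```python
import numpy as np

def segment_energy(samples):
    """Mean energy of an audio segment (sequence of float samples)."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples ** 2))

def snr_db(user_portion, ambient_portion):
    """Ratio between the energy in the user audio portion and the
    environment-only portion, expressed in decibels."""
    eps = 1e-12  # guard against a log of zero for silent segments
    return 10.0 * np.log10((segment_energy(user_portion) + eps)
                           / (segment_energy(ambient_portion) + eps))

def background_is_low(user_portion, ambient_portion,
                      energy_threshold=1e-4, snr_threshold_db=15.0):
    """Treat background audio as below the defined threshold when the
    ambient energy falls below a threshold energy, or when the
    user-to-ambient energy ratio is high (illustrative thresholds)."""
    return (segment_energy(ambient_portion) < energy_threshold
            or snr_db(user_portion, ambient_portion) > snr_threshold_db)
```

Either test on its own would suffice for the determination described in the claims; combining both is one plausible design choice.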
The accessed user speech model may include an alternative user speech model that has not yet been adapted to model the speech characteristics of the user. The instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to select the alternative user speech model; and associate the alternative user speech model with the user. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a gender of the user; and select the alternative user speech model from among multiple alternative user speech models based on the gender of the user. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a location of the user when the one or more utterances were recorded; and select the alternative user speech model from among multiple alternative user speech models based on the location of the user when the one or more utterances were recorded. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a language or accent of the user; and select the alternative user speech model from among multiple alternative user speech models based on the language or accent. To select the alternative user speech model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to receive an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device; determine similarity measures between multiple alternative user speech models and an expected user speech model for the user determined based on the initial audio signal; and select the alternative user speech model from among the multiple alternative user speech models based on the similarity measures.
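As a rough illustration of the similarity-based selection just described, each alternative model can be summarized by a feature vector and compared against a profile estimated from the initial audio signal. This sketch is not from the patent; the vector summaries, the model names, and the use of cosine similarity are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_alternative_model(candidates, expected_profile):
    """candidates: mapping of model name -> summary feature vector.
    expected_profile: vector estimated from the initial audio signal.
    Returns the name of the most similar alternative model."""
    return max(candidates,
               key=lambda name: cosine_similarity(candidates[name],
                                                  expected_profile))
```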
The instructions may include instructions that, when executed, cause the one or more processing devices to access a noise model associated with the user; and, to perform the noise compensation, the instructions may further include instructions that cause the one or more processing devices to perform noise compensation on the received audio signal using the adapted user speech model and the accessed noise model. To perform the noise compensation, the instructions may further include instructions that cause the one or more processing devices to adapt the accessed noise model based on the received audio signal to generate an adapted noise model that models characteristics of the background audio around the user; and perform noise compensation on the received audio signal using the adapted user speech model and the adapted noise model. The instructions may include instructions that, when executed, cause the one or more processing devices to receive a second audio signal that includes at least a second user audio portion corresponding to one or more user utterances recorded by the device; determine that background audio in the second audio signal is above the defined threshold; and, in response to determining that the background audio in the second audio signal is above the defined threshold, adapt the noise model associated with the user based on the second audio signal to generate an adapted noise model that models characteristics of the background audio around the user. The accessed noise model may include an alternative noise model that has not yet been adapted to model characteristics of the background audio around the user.
The instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to select the alternative noise model; and associate the alternative noise model with the user. To select the alternative noise model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to receive an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device; determine a location of the user when the one or more utterances corresponding to the initial user audio portion were recorded; and select the alternative noise model from among multiple alternative noise models based on the location of the user when the one or more utterances corresponding to the initial user audio portion were recorded.
To select the alternative noise model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to receive an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device; determine similarity measures between multiple alternative noise models and an expected noise model for the user determined based on the initial audio signal; and select the alternative noise model from among the multiple alternative noise models based on the similarity measures. Each of the multiple alternative noise models may model characteristics of background audio in a particular location. Each of the multiple alternative noise models may model characteristics of background audio in a particular kind of environmental condition.
To access the noise model, the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to determine a location of the user when the one or more utterances were recorded; and select the noise model from among multiple noise models based on the location of the user.
The audio signal may correspond to a voice search query, and the instructions may include instructions that, when executed by the one or more processing devices, cause the one or more processing devices to perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances; perform a search query using the one or more candidate transcriptions to generate search results; and send the search results to the device.
In another aspect, a system includes a client device and an automated speech recognition system. The client device is configured to send, to the automated speech recognition system, an audio signal that includes at least a user audio portion corresponding to one or more user utterances recorded by the device. The automated speech recognition system is configured to receive the audio signal from the client device; access a user speech model associated with the user; determine that background audio in the audio signal is below a defined threshold; in response to determining that the background audio in the audio signal is below the defined threshold, adapt the accessed user speech model based on the audio signal to generate an adapted user speech model that models the speech characteristics of the user; and perform noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Implementations may include the following features. For example, the automated speech recognition system may be configured to perform speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances. The system may include a search engine system configured to perform a search query using the one or more candidate transcriptions to generate search results; and send the search results to the client device.
In another aspect, a method includes receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device; accessing a user speech model associated with the user; determining that background audio in the audio signal is below a defined threshold; in response to determining that the background audio in the audio signal is below the defined threshold, adapting the accessed user speech model based on the audio signal to generate an adapted user speech model that models the speech characteristics of the user; and performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an example system that supports voice search queries.
FIG. 2 is a flow chart illustrating an example of a process.
FIG. 3 is a flow chart illustrating another example of a process.
FIG. 4 is a swim lane diagram illustrating an example of a process.
Detailed Description
FIG. 1 shows a schematic diagram of an example of a system 100 that supports voice search queries. The system 100 includes a search engine 106 and an automated speech recognition (ASR) engine 108, which are connected to a set of mobile devices 102a-102c and a mobile device 104 through one or more networks 110. In some implementations, the one or more networks 110 are a wireless cellular network, a wireless local area network (WLAN) or Wi-Fi network, a third generation (3G) mobile telecommunications network, a private network such as an intranet, a public network such as the Internet, or any appropriate combination thereof.
In general, a user of a device (such as the mobile device 104) may speak a search query into the microphone of the mobile device 104. An application running on the mobile device 104 records the user's spoken search query as an audio signal and sends the audio signal to the ASR engine 108 as part of a voice search query. After receiving the audio signal corresponding to the voice search query, the ASR engine 108 may translate or transcribe the user utterances in the audio signal into one or more textual candidate transcriptions, and may supply these candidate transcriptions to the search engine 106 as query terms, thereby supporting the voice search functionality of the mobile device 104. A query term may include one or more whole or partial words, characters, or strings of characters.
The search engine 106 may use the search query terms to provide search results (for example, Uniform Resource Identifiers (URIs) of web pages, images, documents, multimedia files, and so on) to the mobile device 104. For example, a search result may include a URI that references a resource that the search engine determines to be responsive to the search query. Additionally or alternatively, a search result may include other information, such as a title, a preview image, a user rating, a map or directions, a description of the corresponding resource, or a snippet of text that has been automatically or manually extracted from, or is otherwise associated with, the corresponding resource. In some examples, the search engine 106 may include a web search engine used to find references on the Internet, a phone-book-type search engine used to find businesses or individuals, or another specialized search engine (for example, one covering entertainment listings such as restaurant and movie theater information, or medical and pharmaceutical information).
As an example of the operation of the system 100, an audio signal 138 is included in a voice search query sent from the mobile device 104 to the ASR engine 108 over the network 110. The audio signal 138 includes the utterance 140 "Gym New York." The ASR engine 108 receives the voice search query that includes the audio signal 138. The ASR engine 108 processes the audio signal 138 to generate one or more textual candidate transcriptions that match the utterance detected in the audio signal 138, or a ranked set of textual candidate transcriptions 146. For example, the utterance in the audio signal 138 may yield "Gym New York" and "Jim Newark" as candidate transcriptions 146.
The one or more candidate transcriptions 146 generated by the speech recognition system 118 are passed from the ASR engine 108 to the search engine 106 as search query terms. The search engine 106 supplies the search query terms 146 to a search algorithm to generate one or more search results. The search engine 106 provides a set of search results 152 (for example, URIs of web pages, images, documents, multimedia files, and so on) to the mobile device 104.
The mobile device 104 displays the search results 152 in a display area. As shown in the screenshot 158, the utterance "Gym New York" 140 produces three search results 160: "Jim Newark" 160a, "New York Fitness" 160b, and "Manhattan Body Building" 160c. The first search result 160a corresponds to the candidate transcription "Jim Newark" and may, for example, provide a telephone number to the user, or may be used by the mobile device 104 to automatically dial Jim Newark when selected. The latter two search results 160b and 160c correspond to the candidate transcription "Gym New York" and include web page URIs. The candidate transcriptions and/or search results may be ranked based on confidence measures produced by the ASR engine 108, where a confidence measure indicates a level of confidence that a given candidate transcription accurately corresponds to the utterance in the audio signal.
To translate or transcribe the user utterances in an audio signal into one or more textual candidate transcriptions, the ASR engine 108 includes a noise compensation system 116, a speech recognition system 118, and a database 111 that stores noise models 112 and user speech models 114. The speech recognition system 118 performs speech recognition on audio signals to recognize the user utterances in the audio signals and translate those utterances into one or more textual candidate transcriptions. In some implementations, the speech recognition system 118 may generate multiple candidate transcriptions for a given utterance. For example, the speech recognition system 118 may transcribe an utterance into multiple terms and may assign a confidence level associated with each transcription of the utterance.
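The confidence-based ranking of candidate transcriptions described above might look like the following sketch. This is not from the patent; the pair representation is an assumption.

```python
def rank_transcriptions(candidates):
    """candidates: list of (transcription, confidence) pairs produced by the
    recognizer for one utterance. Returns the transcriptions ordered from
    highest to lowest confidence."""
    return [text for text, confidence
            in sorted(candidates, key=lambda pair: pair[1], reverse=True)]
```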
In some implementations, a particular variation of the speech recognition system 118 may be selected for processing an audio signal based on additional contextual information related to the audio signal, and the selected variation may be used to transcribe the utterances in the audio signal. For example, in some implementations, together with the audio signal containing the user utterances, a voice search query may include region or language information used to select a variation of the speech recognition system 118. In a particular example, the region in which the mobile device 104 is registered, or the language setting of the mobile device 104, may be provided to the ASR engine 108 and used by the ASR engine 108 to determine the likely language or accent of the user of the mobile device 104. A variation of the speech recognition system 118 may then be selected and used based on the expected language or accent of the user of the mobile device 104.
The ASR engine 108 may apply the noise compensation system 116 to an audio signal received, for example, from the mobile device 104 before performing speech recognition. The noise compensation system 116 may remove or reduce background or environmental audio in the audio signal to produce a filtered audio signal. Because the microphone of the mobile device 104 may capture environmental audio in addition to the user's utterances, the audio signal may include a mixture of user utterances and environmental audio. The audio signal may therefore include one or more environmental audio portions that contain only environmental audio, as well as a user audio portion that contains the user's utterances (and potentially environmental audio). In general, environmental audio includes any ambient sounds (natural or otherwise) occurring around the user. Environmental audio generally excludes the voice, utterances, or sounds of the user of the mobile device. The speech recognition system 118 may perform speech recognition on the filtered audio signal produced by the noise compensation system 116 to transcribe the user utterances. In some instances, performing speech recognition on the filtered audio signal may produce more accurate transcriptions than performing speech recognition directly on the received audio signal.
For a given audio signal, the noise compensation system 116 uses one of the noise models 112 and one of the user speech models 114 stored in the database 111 to remove or reduce the background or environmental audio in the audio signal. The noise models 112 include alternative noise models 120a and adapted noise models 120b. Similarly, the user speech models 114 include alternative user speech models 126a and adapted user speech models 126b. In general, an adapted noise model 120b and an adapted user speech model 126b are specific to a particular user and have been adapted to that user based on audio signals received from that user with previous voice search queries. When there is no adapted noise model or adapted user speech model for the particular user submitting the current voice search query, an alternative noise model 120a or an alternative user speech model 126a, respectively, is used.
In some instances, the performance of the noise compensation system 116 can be improved by using an adapted user speech model that has been trained on, or otherwise adapted to, the specific voice characteristics of the particular user submitting voice search queries. However, to adapt a speech model to a specific user, samples of that user's voice may be needed. In an environment such as the system 100, those samples may not be readily available at first. Therefore, in one implementation, if there is no adapted user speech model for the user when the user initially sends a voice search query (or for some other reason), the ASR engine 108 selects an alternative user speech model from among one or more alternative user speech models 126a. The selected alternative user speech model may be one determined to be a reasonable approximation of the speech characteristics of the user. The selected alternative user speech model is used to perform noise compensation on the initial audio signal. As the user submits subsequent voice search queries, some or all of the audio signals sent with those subsequent queries are used to train or adapt the selected alternative user speech model into an adapted user speech model that is specific to the user (that is, one that models the speech characteristics of the user), which is then used for noise compensation on those subsequent audio signals.
Such as, in one implementation, when receiving sound signal subsequently, ASR 108 determines whether environment or background audio are under specific threshold.If under specific threshold, then this sound signal be used for by alternative user speech model adaptation in or further adapt user speech model is adapted to specific user.If background audio is on threshold value, then sound signal is not used in adapt user speech model and (but may be used for adaptive noise model, as mentioned below).
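The routing decision in this paragraph (low background adapts the speech model, high background feeds the noise model instead) can be sketched as follows; the function names and the callback style are assumptions, not the patent's implementation.

```python
def route_adaptation(audio_signal, background_level, threshold,
                     adapt_speech_model, adapt_noise_model):
    """Use an incoming audio signal to adapt exactly one model:
    the user speech model when background audio is below the threshold,
    otherwise the noise model. Returns which model was adapted."""
    if background_level < threshold:
        adapt_speech_model(audio_signal)
        return "speech"
    adapt_noise_model(audio_signal)
    return "noise"
```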
A user speech model (whether an alternative user speech model 126a or an adapted user speech model 126b) may be implemented, for example, as a hidden Markov model (HMM) or a Gaussian mixture model (GMM). The user speech models may be trained, or otherwise adapted, using an expectation-maximization algorithm.
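As a toy illustration of the GMM/expectation-maximization approach named above, the sketch below fits a one-dimensional Gaussian mixture with EM. Real systems model multidimensional acoustic features (e.g., MFCC vectors); the one-dimensional setting, the quantile initialization, and the fixed iteration count are simplifying assumptions.

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture model with EM.
    Returns (weights, means, variances) as numpy arrays."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Initialize means at spread-out quantiles of the data
    means = np.quantile(x, np.linspace(0.05, 0.95, k))
    variances = np.full(k, np.var(x) + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = (weights / np.sqrt(2.0 * np.pi * variances)
                * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances
```

Adapting an existing model to a new user could then amount to running further EM iterations seeded from the alternative model's parameters rather than from scratch.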
In some implementations, the user may be positively identified. For example, some implementations may prompt the user for identification before accepting a search query. Other implementations may use other available information to implicitly identify the user, such as the user's typing patterns or movement patterns (for example, as detected by an accelerometer that forms part of the device). When the user can be specifically identified, the adapted user speech model may be indexed by a user identifier corresponding to the identified user.
In other implementations, the user may not be specifically identified. In that case, the device used to enter the voice search query (such as the mobile device 104) may be used as a proxy for the particular user, and the adapted user speech model may be indexed based on a device identifier corresponding to the device used to submit the voice search query. In environments in which there is typically a single or primary user of a device, such as when a mobile phone is used as the input device, developing an adapted user speech model on a per-device basis may provide a speech model acceptable for meeting the performance constraints imposed by the noise compensation system 116 in particular, or the ASR engine 108 more generally.
In the same way that the performance of the noise compensation system 116 can be improved by an adapted user speech model, its performance can also be improved by using a noise model that has been trained on, or otherwise adapted to, the environmental audio typically around the user. As with speech samples, in an environment such as the system 100, samples of the environmental audio typically around the user may not be readily available at first. Therefore, in one implementation, if there is no adapted noise model for the user when the user initially sends a voice search query (or for some other reason), the ASR engine 108 selects an alternative noise model from among one or more alternative noise models 120a. The selected alternative noise model may be one determined, based on known or determined information, to be a reasonable approximation of the expected environmental audio around the user. The selected alternative noise model is used to perform noise compensation on the initial audio signal. As the user submits subsequent voice search queries, some or all of the audio signals sent with those queries are used to adapt the selected alternative noise model into an adapted noise model that is specific to the user (that is, one that models the characteristics of the environmental sounds typically around the user when submitting search queries), which is then used for noise compensation on those subsequent audio signals.
Such as, in one implementation, when receiving sound signal subsequently, ASR 108 determines whether environment or background audio are under specific threshold.If not under specific threshold, then this sound signal is used for being adapted to by alternative noise model or further adaptive noise model being adapted to specific user.In some implementation, no matter whether background audio is on specific threshold, and the sound signal of reception may be used to adaptively substitute noise model or adaptive noise model.
In some implementations, to help ensure that a sample of the environmental audio free of user utterances is obtained and can be used to adapt the noise model, the voice search application on the mobile device 104 may begin recording before the user speaks the search query and/or may continue recording after the user has finished speaking the search query. For example, the voice search application may capture two seconds of audio before or after the user speaks the search query to help ensure that a sample of the environmental audio is obtained.
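The padding strategy just described might be implemented along these lines. This is an illustrative sketch, not the patent's code, and the sample-index interface is an assumption.

```python
def split_recording(samples, sample_rate, speech_start, speech_end,
                    pad_seconds=2.0):
    """Split a recording into environment-only and user audio segments,
    assuming the app recorded pad_seconds of audio before and after the
    spoken query. speech_start/speech_end are sample indices bounding
    the detected utterance."""
    pad = int(pad_seconds * sample_rate)
    ambient = (samples[max(0, speech_start - pad):speech_start]
               + samples[speech_end:speech_end + pad])
    user = samples[speech_start:speech_end]
    return ambient, user
```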
In some implementations, a single alternative noise model may be selected and adapted into a single adapted noise model for the user across the various environments in which the user uses the voice search application. In other implementations, however, adapted noise models may be developed for the various locations the user frequents when using the voice search application. For example, different noise models may be developed for different locations and stored as alternative noise models 120a. When a voice search query is submitted, the location of the user may be sent to the ASR engine 108 by the mobile device 104, or the location of the user may be determined by other means at the time the voice search query is submitted. When an initial audio signal is received for a given location, the alternative noise model for that location may be selected, and as other voice search queries are received from that location, the associated audio signals may be used to adapt that particular noise model. This may occur for each of the various locations at which the user performs voice search queries, thereby producing multiple adapted noise models for the user, each of which is specific to a certain location. After a defined period of non-use (for example, the user has not performed a voice search at that location for a certain time), a location-specific noise model may be deleted.
The location of the user when submitting a voice search query, the location associated with a given noise model, and the location associated with a given speech model may each be defined at various levels of granularity, the most specific being longitude and latitude navigational coordinates or a region tightly defined by navigational coordinates (for example, a quarter mile or less across). Alternatively, a location may be given using a region identifier, such as a state name or identifier, a city name, a colloquial name (for example, "Central Park"), a country name, or an identifier of any defined region (for example, "cell/region ABC 123"). In some implementations, a location may indicate a type of place rather than a geographically specific location, such as, in some examples, at a beach, in a large city, at an amusement park, in a moving vehicle, on a boat, inside a building, outdoors, in the countryside, in an underground location (for example, a subway or a parking garage), on a street, inside a tall building (skyscraper), or in a forest. The level of granularity may be the same or different among the location of the user when submitting the voice search query, the location associated with a given noise model, and the location associated with a given speech model.
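One way to key per-location noise models at a coarse coordinate granularity, roughly matching the quarter-mile regions mentioned above, is sketched below. The quantization scheme and the 0.005-degree cell size are assumptions for illustration only.

```python
def location_key(lat, lon, cell_degrees=0.005):
    """Quantize coordinates into a region key; 0.005 degrees is roughly a
    quarter mile of latitude (an illustrative choice)."""
    return (round(lat / cell_degrees), round(lon / cell_degrees))

class PerLocationNoiseModels:
    """Keeps one noise model per quantized location for a user; creates a
    model via a factory (e.g., selecting an alternative model) on first use."""
    def __init__(self):
        self._models = {}

    def get_or_create(self, lat, lon, factory):
        key = location_key(lat, lon)
        if key not in self._models:
            self._models[key] = factory()
        return self._models[key]
```

A production system might also record a last-used timestamp per key so that location-specific models can be deleted after the non-use period described above.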
A noise model (whether alternative 120a or adaptive 120b) may be implemented, for example, as a hidden Markov model (HMM) or a Gaussian mixture model (GMM). A user speech model may be trained using the Expectation-Maximization algorithm or otherwise adapted.
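As a hedged illustration of the Expectation-Maximization training just mentioned, the following fits a two-component one-dimensional Gaussian mixture with a generic EM loop. It is a minimal sketch of the general technique, not the patent's model: all names are ours, and real speech and noise models operate on multidimensional acoustic feature vectors rather than scalars.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # crude initialization: place the two means at the 25th/75th percentiles
    w = np.array([0.5, 0.5])
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([np.var(x), np.var(x)]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

rng = np.random.default_rng(0)
# synthetic "features" drawn from two well-separated modes
x = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
w, mu, var = em_gmm_1d(x)
```

With well-separated data, the recovered means land near -3 and 3 and the mixture weights sum to one.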
As described above, in some implementations a user may be specifically identified, while in other implementations the device may be used as a proxy for the user. Accordingly, similar to the indexing of speech models, adaptive noise models may be indexed by a user identifier corresponding to the identified user when the user can be specifically identified, or by a device identifier corresponding to the device used to submit the voice search queries when the user cannot be specifically identified.
FIG. 2 is a flow chart showing an example of a process 200 that may be performed when an initial voice search query is received from a user or device, and FIG. 3 is a flow chart showing an example of a process 300 that may be performed when a subsequent voice search query is received from a user or device. In the following, process 200 and process 300 are described as being implemented by components of the system 100, but other components of system 100, or another system, may implement process 200 or process 300.
Referring to FIG. 2, the ASR 108 receives an initial voice search query from a device such as the mobile device 104 (202). The voice search query may be initial in that it is the first voice search query received for a particular user or device; in that it is the first received from the particular location from which it is submitted; or in that no adapted user speech model or adapted noise model (or neither) exists for the user or device for some other reason (for example, because the model was deleted after going unused for a certain period of time).
The voice search query includes an audio signal that contains a user audio portion and an environmental audio portion. The user audio portion contains one or more utterances spoken by the user into the microphone of the mobile device 104, potentially together with environmental audio. The environmental audio portion contains only environmental audio. As described below, the voice search query may also include context information.
When included, the ASR 108 accesses context information about the voice search query (204). The context information may, for example, provide an indication of the conditions surrounding the audio signal in the voice search query. It may include time information, date information, data citing the speed or amount of movement measured by the mobile device during recording, other device sensor data, device state data (for example, Bluetooth headset, speakerphone, or regular input method), a user identifier if the user chooses to provide one, or information identifying the type or model of the mobile device.
The context information may also include the location from which the voice search query is submitted. This location may be determined, for example, from the user's calendar; derived from user preferences (for example, stored in the user's account with the ASR engine 108 or the search engine 106) or from a default location; based on a past location (for example, the most recent position calculated by a Global Positioning System (GPS) module of the device used to submit the query, such as the mobile device 104); provided explicitly by the user when submitting the voice query; determined from the utterance itself; calculated by transmission-tower triangulation; provided by a GPS module in the mobile device 104 (for example, the voice search application may access the GPS device to determine the position and send it with the voice search query); or estimated using dead reckoning. If sent by the device, the location information may include accuracy information indicating the level of precision of the location information.
The ASR 108 may use such context information to aid speech recognition, for example by using it to select a particular variant of the speech recognition system, or to select an appropriate alternative user speech model or alternative noise model. The ASR 108 may also pass such context information to the search engine 106 to improve search results. Some or all of the context information may be received together with the voice search query.
If no adapted user speech model exists for the user, the ASR 108 selects an initial, or alternative, user speech model and associates this initial user speech model with the user or the device (depending, for example, on whether the user can be specifically identified) (206). For instance, as described above, the ASR 108 may select among several available alternative user speech models.
The selected alternative user speech model may be one determined, based on known or determinable information, to be a reasonable approximation of the user's speech characteristics, even though it has not yet been adapted with any samples of the user's own speech. For example, in one implementation there may be two alternative user speech models: one for male speech and one for female speech. The likely gender of the user may be determined, and the alternative user speech model appropriate to that gender may be selected. The user's gender may be determined, for example, by analyzing the audio signal received with the initial voice search query, or from information such as that voluntarily submitted by the user and included in the user's profile.
Additionally or alternatively, the adapted user speech models of other users (such as the users of the mobile devices 102a-102c) may be used as alternative user speech models. When the initial voice search query is received, similarity measures may be determined that represent the similarity between an expected model for the submitting user, determined from the initial audio signal included with the query, and the adapted user speech models stored in the database 111 (corresponding to the other users). For example, if the models are based on constrained maximum-likelihood linear regression, the similarity measure may be the L2 norm of the difference between the models (the sum of squared differences over the coefficients). Where GMM techniques are used, the similarity measure may be the Kullback-Leibler divergence between the two probability density functions or, if the models are GMMs and the expected model derived from a single utterance is a point in feature space, the probability density of the GMM evaluated at that point. In other implementations using GMMs, the similarity measure may be, for example, the distance between the means of the GMMs, or the distance between the means normalized by some norm of the covariance matrices.
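The similarity measures just named can be sketched as follows: the L2 norm over coefficient differences, and a closed-form Kullback-Leibler divergence, computed here between diagonal Gaussians as a simple stand-in (KL between full mixtures has no closed form and is typically approximated). This is an illustrative sketch, not the patent's implementation.

```python
import numpy as np

def l2_model_distance(coeffs_a, coeffs_b):
    # L2 norm of the model difference: sum of squared coefficient differences
    d = np.asarray(coeffs_a, float) - np.asarray(coeffs_b, float)
    return float(np.sum(d ** 2))

def kl_diag_gaussian(mu0, var0, mu1, var1):
    # KL( N(mu0, diag var0) || N(mu1, diag var1) ), closed form
    mu0, var0 = np.asarray(mu0, float), np.asarray(var0, float)
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    return float(0.5 * np.sum(np.log(var1 / var0)
                              + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0))
```

The candidate adapted model yielding the smallest distance (or divergence) from the expected model would then be chosen as the alternative user speech model.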
The adapted user speech model closest to the user's expected model (as indicated by the similarity measures) may be selected as the alternative user speech model for the user submitting the initial voice search query. For example, when the user of the device 104 submits an initial voice search query, the ASR 108 may determine a similarity measure representing the similarity between the expected user speech model for the user of device 104 and the adapted user speech model of the user of device 102a. Similarly, the ASR 108 may determine a similarity measure representing the similarity between the expected user speech model for the user of device 104 and the adapted user speech model of the user of device 102b. If the similarity measures indicate that the expected model for the user of device 104 is more similar to the model of the user of device 102a than to the model of the user of device 102b, then the model of the user of device 102a may be used as the alternative user speech model for the user of device 104.
As a particular example of an implementation employing GMMs, the voice search query may include an utterance containing both speech and an environmental signal. The query may be segmented into segments of, for example, 25 ms, where each segment is either speech or pure environment. For each segment, a feature vector x_t is computed, and the vectors corresponding to speech are denoted x_s. For each potential alternative model M_i in the database, the likelihood of each vector is computed:

p(x_t | M_i)

This is the standard GMM likelihood computation, and p(i) is the prior of the alternative model. Assuming independence of the observations, the probability of the set of speech vectors x_s may be expressed as:

p(x_s, i) = p(i) · ∏_t p(x_t | M_i)

where x_s is the set of speech vectors. The conditional probability of class i given the observations x_s is:

p(i | x_s) = p(x_s, i) / p(x_s)

where

p(x_s) = Σ_i p(x_s, i)

This conditional probability may be used as the similarity measure between the current utterance and a particular alternative speech model M_i. The alternative model with the highest conditional probability may be selected:

model_index = argmax_i p(i | x_s)
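The selection rule above can be illustrated numerically: accumulate per-frame log-likelihoods for each candidate model over the speech frames, weight by the priors p(i), normalize to obtain p(i | x_s), and take the argmax. In this toy sketch each "model" is a single one-dimensional Gaussian standing in for a full GMM; all names are illustrative.

```python
import numpy as np

def gauss_loglik(x, mu, var):
    # log p(x_t | M_i) for a single Gaussian component
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def select_model(frames, models, priors):
    # log p(x_s, i) = log p(i) + sum_t log p(x_t | M_i), computed in log space
    scores = np.array([np.log(p_i) + gauss_loglik(frames, mu, var).sum()
                       for (mu, var), p_i in zip(models, priors)])
    # normalize to conditional probabilities p(i | x_s)
    post = np.exp(scores - scores.max())
    post /= post.sum()
    return int(np.argmax(post)), post

frames = np.array([0.9, 1.1, 1.0, 0.8])   # speech feature values near 1.0
models = [(0.0, 1.0), (1.0, 1.0)]         # two candidate "speaker models"
idx, post = select_model(frames, models, [0.5, 0.5])
```

Working in log space and subtracting the maximum before exponentiating avoids the underflow that the raw product of frame likelihoods would cause on long utterances.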
Context information, such as the user's accent or expected language, may be used alone or in combination with the other techniques described above to select an alternative user speech model. For example, multiple alternative user speech models may be stored for different languages and/or accents. The user's location at the time a voice search query is submitted may be used by the ASR 108 to determine an expected language or accent, and the alternative user speech model corresponding to the expected language and/or accent may be selected. Similarly, language and/or location information for the user may be stored, for example, in the user's profile, and used to select the alternative user speech model corresponding to the user's language and/or accent.
If an adapted user speech model already exists (for example, because the voice search query is initial with respect to a particular location but not with respect to the user or device), action 206 may be skipped, or may be replaced by further adaptation of the adapted user speech model. For example, the audio signal received with the initial voice search query may be evaluated to determine whether its background audio is below a particular threshold and, if so, the audio signal may be used to further train or otherwise adapt the adapted user speech model.
The ASR 108 selects an initial, or alternative, noise model and associates this initial noise model with the user or the device (depending, for example, on whether the user can be specifically identified) (208). The selected alternative noise model may be one determined, based on known or determinable information, to be a reasonable approximation of the expected environmental audio around the user. For example, alternative noise models may be developed for various standard categories of environmental conditions (for example, in a car, at an airport, at home, or in a bar/restaurant). Data from other users of the system may be used to develop the alternative noise models. For instance, if some duration of low-noise data (for example, ten minutes) has been collected from users, that data may be used to generate an alternative model. When the initial audio signal is received, similarity measures representing the similarity between an expected noise model and the standard alternative noise models may be determined based on the initial audio signal, and one of the standard alternative noise models may be selected based on the similarity measures (for example, using techniques similar to those described above for selecting an alternative user model). The expected noise model may, for instance, be determined from the environmental audio portion of the signal. A set of alternative noise models (for example, one hundred of them) that exceed a particular dissimilarity threshold with respect to one another (determined, for example, based on KL divergence) may be retained as the standard alternative models, and the alternative model to use may be selected from this set using the similarity measures as described. This may minimize computation when selecting an alternative noise model.
Additionally or alternatively, different noise models may be developed for different locations and stored as the alternative noise models 120a. For example, noise models for location A 132a and location B 132b may be developed and stored as alternative noise models 120a. The noise model for a particular location may be developed from previous voice search queries initiated by other users at that location. The noise model for location B 132b may be developed, for example, from the audio signal 130b received by the ASR 108 as part of a voice search query from the user of device 102b while at location B 132b, and from the audio signal 130c received by the ASR 108 as part of a voice search query from the user of device 102c while at location B 132b. The noise model for location A 132a may be developed, for example, from the audio signal 130a received by the ASR 108 as part of a voice search query from the user of device 102a while at location A.
When the initial audio signal is received, an alternative noise model may be selected based on the user's location. For example, when the user of the mobile device 104 submits an initial voice search from location B 132b, the ASR 108 may select the alternative noise model for location B. In some implementations, the voice search application on the mobile device 104 may access the GPS on the device to determine the user's location, and may send the location information to the ASR 108 together with the voice search query. The location information may then be used by the ASR 108 to determine the appropriate alternative noise model based on the location. In other implementations, when the initial audio signal is received, similarity measures representing the similarity between an expected noise model and the location-specific alternative noise models stored in the database 111 may be determined based on the initial audio signal, and one of the location-specific alternative noise models may be selected based on the similarity measures.
Using the initial (or adapted) user speech model and the initial noise model, the noise compensation system 116 of the ASR 108 performs noise compensation on the audio signal received with the voice search query to remove or reduce the background audio in the signal, thereby producing a filtered audio signal (210). For example, an algorithm such as the Algonquin algorithm, described in "ALGONQUIN: Iterating Laplace's Method to Remove Multiple Types of Acoustic Distortion for Robust Speech Recognition," Eurospeech 2001 - Scandinavia, may be used to perform noise compensation using the initial user speech model and the initial noise model.
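Algonquin itself involves iterated Laplace approximations and is beyond a short sketch; as a much simpler stand-in for the noise-compensation step, the following applies plain magnitude spectral subtraction with a known noise estimate, assuming roughly stationary noise. It is illustrative only and is not the algorithm the text cites; all names and parameters are ours.

```python
import numpy as np

def spectral_subtract(signal, noise_est, frame=256):
    # magnitude spectral subtraction, frame by frame (no overlap, for brevity)
    noise_mag = np.abs(np.fft.rfft(noise_est[:frame]))  # noise magnitude estimate
    out = np.zeros_like(signal, dtype=float)
    for start in range(0, len(signal) - frame + 1, frame):
        spec = np.fft.rfft(signal[start:start + frame])
        # subtract the noise floor from the magnitude, keep the noisy phase
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
    return out

rng = np.random.default_rng(1)
t = np.arange(2048) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)            # tonal stand-in for "user audio"
noise = 0.3 * rng.standard_normal(2048)        # broadband "environmental audio"
filtered = spectral_subtract(clean + noise, noise)

err_noisy = float(np.mean(noise ** 2))
err_filtered = float(np.mean((filtered - clean) ** 2))
```

A real system would estimate the noise spectrum from the environmental audio portion of the signal rather than be given it, and would use overlapping windowed frames.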
A speech recognition system then performs speech recognition on the filtered audio signal to transcribe the utterances in the audio signal into one or more candidate transcriptions (212). A search query may be performed using the one or more candidate transcriptions. In some implementations, the ASR 108 may use context information to select a particular variant of the speech recognition system with which to perform the recognition. For example, the user's accent and/or expected or known language may be used to select an appropriate speech recognition system. The user's location at the time the voice search query is submitted may be used to determine the user's expected language, or the user's language may be included in the user's profile.
Referring to FIG. 3, the ASR 108 receives a subsequent voice search query from a device such as the mobile device 104 (302). The voice search query may be subsequent in that it is received after a previous voice search query for the particular user or device, or in that an alternative or adapted user speech model or noise model already exists for the user or device.
The subsequent voice search query includes an audio signal that contains a user audio portion and an environmental audio portion. The user audio portion contains one or more utterances spoken by the user into the microphone of the mobile device 104, potentially together with environmental audio. The environmental audio portion contains only environmental audio. As described below, the voice search query may also include context information.
When included, the ASR 108 accesses context information about the voice search query (304). The ASR 108 may use such context information to aid speech recognition, for example by using it to select a particular variant of the speech recognition system. Additionally or alternatively, the context information may be used to aid the selection and/or adaptation of the alternative or adapted user speech model and/or the alternative or adaptive noise model. The ASR 108 may pass such context information to the search engine 106 to improve search results. Some or all of the context information may be received together with the voice search query.
The ASR 108 determines whether the environmental audio in the audio signal received with the voice search query is below a defined threshold (306). For example, a speech activity detector may be used to identify the user audio portion and the environmental audio portion of the received audio signal. The ASR 108 may then determine the energy in the environmental audio portion and compare the determined energy to a threshold energy. If the energy is below the threshold energy, the environmental audio is considered below the defined threshold. In another example, the ASR 108 may determine the energy in the user audio portion, determine the energy in the environmental audio portion, and then determine the ratio of the two. This ratio may represent the signal-to-noise ratio (SNR) of the audio signal. The SNR of the audio signal may then be compared to a threshold SNR and, when the SNR of the audio signal is above the threshold SNR, the environmental audio is considered below the defined threshold.
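The SNR-based test just described can be sketched as follows, using a crude energy split in place of a real speech activity detector; the frame size and threshold values are illustrative assumptions, not values from the patent.

```python
import numpy as np

def background_below_threshold(signal, frame=160, threshold_snr_db=15.0):
    # frame energies, then a trivial energy-based VAD: above-average == speech
    frames = signal[:len(signal) // frame * frame].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    split = energy.mean()
    speech_e = energy[energy >= split].mean()   # "user audio" energy
    backgr_e = energy[energy < split].mean()    # "environmental audio" energy
    snr_db = 10.0 * np.log10(speech_e / backgr_e)
    # background counts as below the defined threshold when SNR is high enough
    return snr_db > threshold_snr_db, snr_db

rng = np.random.default_rng(2)
# quiet background, loud utterance in the middle
speech = np.concatenate([0.02 * rng.standard_normal(800),
                         1.0 * rng.standard_normal(800),
                         0.02 * rng.standard_normal(800)])
quiet_ok, snr_db = background_below_threshold(speech)
# uniformly noisy signal: no clear speech/background separation
noisy_ok, _ = background_below_threshold(rng.standard_normal(2400))
```

The clean recording passes the threshold test (and so would be used to adapt the user speech model), while the uniformly noisy one fails it.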
If the environmental audio in the audio signal received with the voice search query is not below the defined threshold, the audio signal is used to adapt the alternative (or adaptive) noise model to generate an adapted noise model (312). In some implementations, the particular noise model to adapt is selected based on the user's location. For example, where different noise models are maintained for the different locations from which the user frequently submits voice search queries, the ASR 108 may use the location of the user or device to select the alternative or adaptive noise model for that location.
The noise model may be adapted on the whole audio signal, or the environmental audio portion may be extracted and used to adapt the noise model, depending on the particular implementation of the noise model and of the speech enhancement or speech separation algorithm. Techniques such as hidden Markov models or Gaussian mixture models may be used to implement the noise model, and techniques such as Expectation-Maximization may be used to adapt the noise model.
If the environmental audio in the audio signal received with the voice search query is below the defined threshold, the audio signal is used to adapt the previously selected alternative user speech model (if the alternative model has not yet been adapted into an adapted user speech model) or the adapted user speech model (308). The user speech model may be adapted on the whole audio signal, or the user audio portion may be extracted and used to adapt the user speech model, depending on the particular implementation of the user speech model. As with the noise model, techniques such as hidden Markov models or Gaussian mixture models may be used to implement the user speech model, and techniques such as Expectation-Maximization or maximum a posteriori (MAP) adaptation may be used to adapt it.
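The MAP adaptation named above can be illustrated for the simplest case, the mean of a single Gaussian component: the adapted mean interpolates between the prior mean and the data mean, weighted by a relevance factor tau. This is the textbook update reduced to one dimension for clarity; it is a generic sketch, not the patent's specific procedure.

```python
import numpy as np

def map_adapt_mean(prior_mu, data, tau=10.0):
    # MAP update: (tau * prior + n * data_mean) / (tau + n)
    # small n  -> stays near the prior; large n -> moves toward the data
    n = len(data)
    return (tau * prior_mu + n * float(np.mean(data))) / (tau + n)

prior_mu = 0.0                   # mean from the alternative (unadapted) model
utterance = np.full(90, 1.0)     # new low-noise user audio centred at 1.0
adapted = map_adapt_mean(prior_mu, utterance, tau=10.0)   # -> 0.9
```

With 90 observations and tau = 10, the adapted mean lands 90% of the way from the prior toward the data mean, which is why repeated low-noise queries gradually personalize the model.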
In some implementations, the ASR 108 also trains or otherwise adapts the alternative or adaptive noise model based on the audio signal in which the background audio is below the threshold (310). While in some implementations the user speech model is trained or adapted only on audio signals in which the background audio is below the defined threshold, in some instances the noise model may be trained or adapted both on such audio signals and on audio signals in which the background audio is above the threshold, depending on the particular technique used to implement the noise model. For example, some noise models may include parameters that reflect environments in which the background audio is below the threshold, and such models may therefore benefit from adaptation on audio signals whose background audio is below the threshold.
Using the alternative or adapted user speech model (depending on whether the alternative speech model has been adapted) and the alternative or adapted noise model (depending on whether the alternative noise model has been adapted), the noise compensation system 116 of the ASR 108 performs noise compensation on the audio signal received with the voice search query, in the same manner as described above, to remove or reduce the background audio in the signal, thereby producing a filtered audio signal (314). A speech recognition system performs speech recognition on the filtered audio signal, in the same manner as described above, to transcribe the speech in the audio signal into one or more candidate transcriptions (316).
While process 300 shows the noise model and/or user speech model being adapted before being used for noise compensation, the adaptation may instead occur after the noise compensation is performed, with the noise compensation based on the noise and/or user speech models as they were before the further adaptation. This may be the case, for example, when the adaptation is computationally intensive. In that case, the expected response time to a voice search query may be achieved by using the current noise and user speech models for noise compensation and updating them afterwards based on the new audio signal.
FIG. 4 is a swim-lane diagram showing an example of a process 400 performed by the mobile device 104, the ASR 108, and the search engine 106 to handle a voice search query. The mobile device 104 sends a voice search query to the ASR 108 (402). As described above, the voice search query includes an audio signal containing an environmental audio portion and a user audio portion; the environmental audio portion contains environmental audio without user utterances, and the user audio portion contains user utterances (and potentially environmental audio). The voice search query may also include context information, such as that described above.
The ASR 108 receives the voice search query (402) and selects both a noise model and a user speech model (404). The ASR 108 may, for example, select a stored adapted user speech model based on a user identifier or device identifier included with the voice search query or otherwise accessible to the ASR 108. Similarly, the ASR 108 may, for example, select a stored adaptive noise model based on a user identifier or device identifier included with the voice search query or otherwise accessible to the ASR 108. In implementations using different noise models for particular locations, the ASR 108 may select the stored adaptive noise model from among multiple location-specific adaptive noise models based on the user or device identifier and a location identifier corresponding to the user's location at the time the voice search query was submitted. The ASR 108 may obtain the location information from the context information sent with the voice search query or otherwise available to the ASR 108.
In the event that no adapted user speech model exists for the user or device, the ASR 108 selects an alternative user speech model, for example using the techniques described above (404). Similarly, if no adaptive noise model exists for the user or device, or at least not for the particular location of the user at the time the voice search query was submitted, the ASR 108 selects an alternative noise model, for example using the techniques described above.
The ASR 108 then uses the audio signal received with the voice search query to adapt the selected user speech model (406) and/or the selected noise model (408) to generate an adapted user speech model or adapted noise model, depending on the background audio in the audio signal. As described above, when the background audio is below the defined threshold, the audio signal is used to adapt the selected user speech model and, in some implementations, the selected noise model as well. When the background audio is above the defined threshold, then, at least in some implementations, the audio signal is used to adapt only the selected noise model.
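The routing just described reduces to a small decision, sketched here with illustrative names: clean audio adapts the user speech model (and, in some implementations, the noise model too), while noisy audio adapts only the noise model. This is a schematic sketch of the control flow, not the patent's code.

```python
def route_adaptation(background_below_threshold, adapt_noise_on_clean=True):
    # Decide which models the current audio signal should adapt.
    if background_below_threshold:
        targets = ["user_speech_model"]
        if adapt_noise_on_clean:
            # some implementations also adapt the noise model on clean audio
            targets.append("noise_model")
        return targets
    # background above threshold: adapt only the noise model
    return ["noise_model"]
```

For example, `route_adaptation(True)` yields both model names, while `route_adaptation(False)` yields only `"noise_model"`.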
The ASR 108 uses the adapted user speech model and the adapted noise model to perform noise compensation on the audio signal (410) to generate a filtered audio signal with reduced or removed background audio compared to the received audio signal.
The ASR 108 performs speech recognition on the filtered audio signal to transcribe the one or more utterances in the audio signal into textual candidate transcriptions (412). The ASR 108 forwards the generated transcriptions to the search engine 106 (414). If the ASR 108 generates multiple transcriptions, the transcriptions may optionally be ranked in order of confidence. The ASR 108 may optionally provide context data, such as the geographic position, to the search engine 106, and the search engine 106 may use the context data to filter or rank the search results.
The search engine 106 performs a search operation using the transcriptions (416). The search engine 106 may locate one or more URIs related to the transcription terms.
The search engine 106 provides the search query results to the mobile device 104 (418). For example, the search engine 106 may forward HTML code that generates a visual listing of the located URIs.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, while the techniques above have been described with respect to performing speech recognition on the audio signal in a voice search query, they may be used in other systems, such as computerized voice dictation systems or dialog systems implemented on mobile or other devices. In addition, the various forms of the flows shown above may be used with steps reordered, added, or removed. Accordingly, other implementations are within the scope of the following claims.
Embodiments and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments may be implemented as one or more computer program products, that is, one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, a data processing apparatus. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, for example code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, for example a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver apparatus.
A computer program (also known as a program, software, a software application, a script, or code) may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subroutines, or portions of code). A computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows may also be performed by, and apparatus may also be implemented as, special-purpose logic circuitry, for example an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general- and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, for example magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, for example a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, or a Global Positioning System (GPS) receiver, to name just a few. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices (for example, EPROM, EEPROM, and flash memory devices); magnetic disks (for example, internal hard disks or removable disks); magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.
To provide for interaction with a user, embodiments may be implemented on a computer having a display device, for example a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, for example a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
Embodiments may be implemented in a computing system that includes a back-end component (for example, as a data server), or that includes a middleware component (for example, an application server), or that includes a front-end component (for example, a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation), or any combination of one or more such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication, for example a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), for example, the Internet.
The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
In each instance where an HTML file is mentioned, other file types or formats may be substituted. For instance, an HTML file may be replaced by an XML, JSON, plain text, or other type of file. Moreover, where a table or hash table is mentioned, other data structures (such as spreadsheets, relational databases, or structured files) may be used.
Thus, particular embodiments have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results.
Claims (24)
1. A system for speech recognition, comprising:
means for receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device;
means for accessing a user speech model associated with the user;
means for determining that background audio in the audio signal is below a defined threshold;
means for adapting, in response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
means for performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
2. The system of claim 1, wherein the audio signal includes an environmental audio portion that corresponds only to background audio around the user, and wherein, to determine that the background audio in the audio signal is below the defined threshold, the system includes:
means for determining an amount of energy in the environmental audio portion; and
means for determining that the amount of energy in the environmental audio portion is below a threshold energy.
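The energy test recited in claim 2 can be sketched as a sum-of-squares comparison against a threshold. This is only an illustrative interpretation, not the patented implementation; the function names, sample values, and the threshold are hypothetical.

```python
def signal_energy(samples):
    """Sum-of-squares energy of a sequence of audio samples."""
    return sum(s * s for s in samples)

def background_below_threshold(environment_samples, threshold_energy):
    """True when the energy of the environment-only portion of the
    audio signal is below the defined threshold energy."""
    return signal_energy(environment_samples) < threshold_energy

quiet = [0.01] * 16000   # 1 s of near-silence at 16 kHz (energy 1.6)
loud = [0.5] * 16000     # strong background audio (energy 4000.0)
assert background_below_threshold(quiet, threshold_energy=10.0)
assert not background_below_threshold(loud, threshold_energy=10.0)
```

In practice the environment-only portion might be taken from the audio captured just before or after the user's utterance, but the claim does not specify how it is isolated.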
3. The system of claim 2, wherein, to determine that the background audio in the audio signal is below the defined threshold, the system includes:
means for determining a signal-to-noise ratio of the audio signal; and
means for determining that the signal-to-noise ratio is below a threshold signal-to-noise ratio.
4. The system of claim 3, wherein the audio signal includes an environmental audio portion that corresponds only to background audio around the user, and wherein, to determine the signal-to-noise ratio of the audio signal, the system includes:
means for determining an amount of energy in the user audio portion of the audio signal;
means for determining an amount of energy in the environmental audio portion of the audio signal; and
means for determining the signal-to-noise ratio by determining a ratio between the amounts of energy in the user audio portion and the environmental audio portion.
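The signal-to-noise computation of claims 3 and 4 can be illustrated as the ratio of the two energies, conventionally expressed in decibels. This sketch is an assumption about one reasonable reading of the claims; the dB conversion and all values are hypothetical.

```python
import math

def signal_energy(samples):
    """Sum-of-squares energy of a sequence of audio samples."""
    return sum(s * s for s in samples)

def snr_db(user_samples, environment_samples):
    """Signal-to-noise ratio in dB: ratio between the energy of the
    user audio portion and the energy of the environment-only portion."""
    ratio = signal_energy(user_samples) / signal_energy(environment_samples)
    return 10.0 * math.log10(ratio)

speech = [0.5] * 8000    # user utterance portion
noise = [0.05] * 8000    # environment-only portion
assert round(snr_db(speech, noise)) == 20   # energy ratio 100 -> 20 dB
```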
5. The system of claim 1, wherein the accessed user speech model includes an alternative user speech model that has not yet been adapted to model the speech characteristics of the user.
6. The system of claim 5, wherein the system includes:
means for selecting the alternative user speech model; and
means for associating the alternative user speech model with the user.
7. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for determining a gender of the user; and
means for selecting the alternative user speech model from among multiple alternative user speech models based on the gender of the user.
8. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for determining a location of the user when the one or more utterances were recorded; and
means for selecting the alternative user speech model from among multiple alternative user speech models based on the location of the user when the one or more utterances were recorded.
9. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for determining a language or accent of the user; and
means for selecting the alternative user speech model from among multiple alternative user speech models based on the language or accent.
10. The system of claim 6, wherein, to select the alternative user speech model, the system includes:
means for receiving an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device;
means for determining a measure of similarity between multiple alternative user speech models and an expected user speech model for the user determined based on the initial audio signal; and
means for selecting the alternative user speech model from among the multiple alternative user speech models based on the measure of similarity.
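The similarity-based selection in claim 10 can be illustrated with a toy sketch in which each candidate model is reduced to a mean feature vector and the candidate closest (by Euclidean distance) to the model expected from the initial audio signal is chosen. A real system would compare full acoustic models (e.g., Gaussian mixture models), so every name and value below is a hypothetical stand-in.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_alternative_model(candidate_models, expected_model):
    """From a dict of {name: feature_vector} candidates, pick the one
    most similar to the expected model derived from the initial audio."""
    return min(candidate_models,
               key=lambda name: euclidean(candidate_models[name], expected_model))

candidates = {
    "model_a": [1.0, 2.0, 3.0],
    "model_b": [4.0, 4.0, 4.0],
}
expected = [1.1, 2.1, 2.9]
assert select_alternative_model(candidates, expected) == "model_a"
```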
11. The system of claim 1, wherein the system includes:
means for accessing a noise model associated with the user; and
wherein, to perform noise compensation, the system further includes means for performing noise compensation on the received audio signal using the adapted user speech model and the accessed noise model.
12. The system of claim 11, wherein, to perform noise compensation, the system further includes:
means for adapting the accessed noise model based on the received audio signal to generate an adapted noise model that models characteristics of the background audio around the user; and
means for performing noise compensation on the received audio signal using the adapted user speech model and the adapted noise model.
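Claims 11 and 12 combine a user speech model with a noise model during noise compensation. One classical noise-compensation technique consistent with this description is magnitude spectral subtraction; the claims do not mandate any particular algorithm, so the sketch below is only an illustration, and its names and values are hypothetical.

```python
def spectral_subtract(signal_mags, noise_mags, floor=0.0):
    """Subtract an estimated noise magnitude spectrum (the 'noise
    model' here) from the signal's magnitude spectrum, flooring the
    result at zero to avoid negative magnitudes."""
    return [max(s - n, floor) for s, n in zip(signal_mags, noise_mags)]

# Flat noise estimate of 0.25 per frequency bin; the last bin is
# dominated by noise and is floored to zero.
assert spectral_subtract([1.0, 0.5, 0.25], [0.25, 0.25, 0.25]) == [0.75, 0.25, 0.0]
```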
13. The system of claim 11, wherein the system includes:
means for receiving a second audio signal that includes at least a second user audio portion corresponding to one or more user utterances recorded by the device;
means for determining that background audio in the second audio signal is above the defined threshold; and
means for adapting, in response to determining that the background audio in the second audio signal is above the defined threshold, the noise model associated with the user based on the second audio signal to generate an adapted noise model that models characteristics of the background audio around the user.
14. The system of claim 11, wherein the accessed noise model includes an alternative noise model that has not yet been adapted to model characteristics of the background audio around the user.
15. The system of claim 14, wherein the system includes:
means for selecting the alternative noise model; and
means for associating the alternative noise model with the user.
16. The system of claim 15, wherein, to select the alternative noise model, the system includes:
means for receiving an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device;
means for determining a location of the user when the one or more utterances corresponding to the initial user audio portion were recorded; and
means for selecting the alternative noise model from among multiple alternative noise models based on the location of the user when the one or more utterances corresponding to the initial user audio portion were recorded.
17. The system of claim 15, wherein, to select the alternative noise model, the system includes:
means for receiving an initial audio signal that includes at least an initial user audio portion corresponding to one or more user utterances recorded by the device;
means for determining a measure of similarity between multiple alternative noise models and an expected noise model for the user determined based on the initial audio signal; and
means for selecting the alternative noise model from among the multiple alternative noise models based on the measure of similarity.
18. The system of claim 17, wherein each alternative noise model of the multiple alternative noise models models characteristics of background audio at a particular location.
19. The system of claim 17, wherein each alternative noise model of the multiple alternative noise models models characteristics of background audio under a particular type of environmental condition.
20. The system of claim 11, wherein, to access the noise model, the system includes:
means for determining a location of the user when the one or more utterances were recorded; and
means for selecting the noise model from among multiple noise models based on the location of the user.
21. The system of claim 1, wherein the audio signal corresponds to a voice search query, and the system includes:
means for performing speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances;
means for performing a search query using the one or more candidate transcriptions to generate search results; and
means for sending the search results to the device.
22. A system for speech recognition, comprising:
means for sending, to an automated speech recognition system, an audio signal that includes at least a user audio portion corresponding to one or more recorded user utterances;
means for receiving the audio signal;
means for accessing a user speech model associated with the user;
means for determining that background audio in the audio signal is below a defined threshold;
means for adapting, in response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
means for performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
23. The system of claim 22, wherein the system further includes means for performing speech recognition on the filtered audio signal to generate one or more candidate transcriptions of the one or more user utterances, and wherein the system further includes:
means for performing a search query using the one or more candidate transcriptions to generate search results; and
means for sending the search results.
24. A method for speech recognition, comprising:
receiving an audio signal generated by a device based on audio input from a user, the audio signal including at least a user audio portion that corresponds to one or more user utterances recorded by the device;
accessing a user speech model associated with the user;
determining that background audio in the audio signal is below a defined threshold;
in response to determining that the background audio in the audio signal is below the defined threshold, adapting the accessed user speech model based on the audio signal to generate an adapted user speech model that models speech characteristics of the user; and
performing noise compensation on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
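The control flow of the method of claim 24 can be sketched as follows. The adaptation and compensation callables are hypothetical stand-ins (here the "model" is just a running average of sample values and compensation is an identity placeholder); only the branching, adapt the model when the background is quiet, then compensate with the current model, mirrors the claim.

```python
def run_front_end(audio_signal, background_energy, threshold,
                  user_model, adapt_model, noise_compensate):
    """Adapt the user speech model only when background audio is below
    the defined threshold, then use the (possibly adapted) model to
    noise-compensate the received signal."""
    if background_energy < threshold:
        user_model = adapt_model(user_model, audio_signal)
    return noise_compensate(audio_signal, user_model), user_model

def adapt(model, sig):
    # Toy adaptation: blend the old model with the signal's mean.
    return 0.5 * model + 0.5 * (sum(sig) / len(sig))

def compensate(sig, model):
    # Identity placeholder for the noise-compensation step.
    return list(sig)

filtered, model = run_front_end([0.25, 0.25, 0.25, 0.25],
                                background_energy=1.0, threshold=5.0,
                                user_model=0.0, adapt_model=adapt,
                                noise_compensate=compensate)
assert model == 0.125   # adapted, since background was below threshold
```

When the background energy is at or above the threshold, the model passes through unchanged, which is the branch claim 13 exploits to adapt the noise model instead.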
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/814,665 | 2010-06-14 | ||
US12/814,665 US8234111B2 (en) | 2010-06-14 | 2010-06-14 | Speech and noise models for speech recognition |
PCT/US2011/040225 WO2011159628A1 (en) | 2010-06-14 | 2011-06-13 | Speech and noise models for speech recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103069480A CN103069480A (en) | 2013-04-24 |
CN103069480B true CN103069480B (en) | 2014-12-24 |
Family
ID=44303537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201180026390.4A Active CN103069480B (en) | 2010-06-14 | 2011-06-13 | Speech and noise models for speech recognition |
Country Status (5)
Country | Link |
---|---|
US (3) | US8234111B2 (en) |
EP (1) | EP2580751B1 (en) |
CN (1) | CN103069480B (en) |
AU (1) | AU2011267982B2 (en) |
WO (1) | WO2011159628A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105719645A (en) * | 2014-12-17 | 2016-06-29 | 现代自动车株式会社 | Speech recognition apparatus, vehicle including the same, and method of controlling the same |
US9466310B2 (en) * | 2013-12-20 | 2016-10-11 | Lenovo Enterprise Solutions (Singapore) Pte. Ltd. | Compensating for identifiable background content in a speech recognition device |
JP6375521B2 (en) * | 2014-03-28 | 2018-08-22 | パナソニックIpマネジメント株式会社 | Voice search device, voice search method, and display device |
US10446168B2 (en) * | 2014-04-02 | 2019-10-15 | Plantronics, Inc. | Noise level measurement with mobile devices, location services, and environmental response |
KR102257910B1 (en) * | 2014-05-02 | 2021-05-27 | 삼성전자주식회사 | Apparatus and method for speech recognition, apparatus and method for generating noise-speech recognition model |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
AU2015266863B2 (en) | 2014-05-30 | 2018-03-15 | Apple Inc. | Multi-command single utterance input method |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9904851B2 (en) | 2014-06-11 | 2018-02-27 | At&T Intellectual Property I, L.P. | Exploiting visual information for enhancing audio signals via source separation and beamforming |
US9858922B2 (en) | 2014-06-23 | 2018-01-02 | Google Inc. | Caching speech recognition scores |
US9639854B2 (en) | 2014-06-26 | 2017-05-02 | Nuance Communications, Inc. | Voice-controlled information exchange platform, such as for providing information to supplement advertising |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9837102B2 (en) * | 2014-07-02 | 2017-12-05 | Microsoft Technology Licensing, Llc | User environment aware acoustic noise reduction |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9953646B2 (en) | 2014-09-02 | 2018-04-24 | Belleau Technologies | Method and system for dynamic speech recognition and tracking of prewritten script |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9299347B1 (en) | 2014-10-22 | 2016-03-29 | Google Inc. | Speech recognition using associative mapping |
US10999636B1 (en) * | 2014-10-27 | 2021-05-04 | Amazon Technologies, Inc. | Voice-based content searching on a television based on receiving candidate search strings from a remote server |
US9667321B2 (en) * | 2014-10-31 | 2017-05-30 | Pearson Education, Inc. | Predictive recommendation engine |
US10116563B1 (en) | 2014-10-30 | 2018-10-30 | Pearson Education, Inc. | System and method for automatically updating data packet metadata |
US10318499B2 (en) | 2014-10-30 | 2019-06-11 | Pearson Education, Inc. | Content database generation |
EP3213232A1 (en) | 2014-10-30 | 2017-09-06 | Pearson Education, Inc. | Content database generation |
US10110486B1 (en) | 2014-10-30 | 2018-10-23 | Pearson Education, Inc. | Automatic determination of initial content difficulty |
US10735402B1 (en) | 2014-10-30 | 2020-08-04 | Pearson Education, Inc. | Systems and method for automated data packet selection and delivery |
US10333857B1 (en) | 2014-10-30 | 2019-06-25 | Pearson Education, Inc. | Systems and methods for data packet metadata stabilization |
US10218630B2 (en) | 2014-10-30 | 2019-02-26 | Pearson Education, Inc. | System and method for increasing data transmission rates through a content distribution network |
JP2016109725A (en) * | 2014-12-02 | 2016-06-20 | ソニー株式会社 | Information-processing apparatus, information-processing method, and program |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US10152299B2 (en) | 2015-03-06 | 2018-12-11 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10460227B2 (en) | 2015-05-15 | 2019-10-29 | Apple Inc. | Virtual assistant in a communication session |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10504509B2 (en) | 2015-05-27 | 2019-12-10 | Google Llc | Providing suggested voice-based action queries |
US10200824B2 (en) | 2015-05-27 | 2019-02-05 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
US9786270B2 (en) | 2015-07-09 | 2017-10-10 | Google Inc. | Generating acoustic models |
US10008199B2 (en) * | 2015-08-22 | 2018-06-26 | Toyota Motor Engineering & Manufacturing North America, Inc. | Speech recognition system with abbreviated training |
US10614368B2 (en) | 2015-08-28 | 2020-04-07 | Pearson Education, Inc. | System and method for content provisioning with dual recommendation engines |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10740384B2 (en) | 2015-09-08 | 2020-08-11 | Apple Inc. | Intelligent automated assistant for media search and playback |
US10331312B2 (en) | 2015-09-08 | 2019-06-25 | Apple Inc. | Intelligent automated assistant in a media environment |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US11631421B2 (en) | 2015-10-18 | 2023-04-18 | Solos Technology Limited | Apparatuses and methods for enhanced speech recognition in variable environments |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10468016B2 (en) | 2015-11-24 | 2019-11-05 | International Business Machines Corporation | System and method for supporting automatic speech recognition of regional accents based on statistical information and user corrections |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10229672B1 (en) | 2015-12-31 | 2019-03-12 | Google Llc | Training acoustic models using connectionist temporal classification |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US11138987B2 (en) * | 2016-04-04 | 2021-10-05 | Honeywell International Inc. | System and method to distinguish sources in a multiple audio source environment |
US11188841B2 (en) | 2016-04-08 | 2021-11-30 | Pearson Education, Inc. | Personalized content distribution |
US10789316B2 (en) | 2016-04-08 | 2020-09-29 | Pearson Education, Inc. | Personalized automatic content aggregation generation |
US10642848B2 (en) | 2016-04-08 | 2020-05-05 | Pearson Education, Inc. | Personalized automatic content aggregation generation |
US10325215B2 (en) | 2016-04-08 | 2019-06-18 | Pearson Education, Inc. | System and method for automatic content aggregation generation |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US11227589B2 (en) | 2016-06-06 | 2022-01-18 | Apple Inc. | Intelligent list reading |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
CN109313896B (en) * | 2016-06-08 | 2020-06-30 | 谷歌有限责任公司 | Extensible dynamic class language modeling method, system for generating an utterance transcription, computer-readable medium |
DK179588B1 (en) | 2016-06-09 | 2019-02-22 | Apple Inc. | Intelligent automated assistant in a home environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
US20180018973A1 (en) | 2016-07-15 | 2018-01-18 | Google Inc. | Speaker verification |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10951720B2 (en) | 2016-10-24 | 2021-03-16 | Bank Of America Corporation | Multi-channel cognitive resource platform |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
DK201770383A1 (en) | 2017-05-09 | 2018-12-14 | Apple Inc. | User interface for correcting recognition errors |
DK180048B1 (en) | 2017-05-11 | 2020-02-04 | Apple Inc. | MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | Low-latency intelligent automated assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
US20180336892A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Detecting a trigger of a digital assistant |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10706840B2 (en) | 2017-08-18 | 2020-07-07 | Google Llc | Encoder-decoder models for sequence to sequence mapping |
US10096311B1 (en) | 2017-09-12 | 2018-10-09 | Plantronics, Inc. | Intelligent soundscape adaptation utilizing mobile devices |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
CN107908742A (en) * | 2017-11-15 | 2018-04-13 | 百度在线网络技术(北京)有限公司 | Method and apparatus for output information |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
KR102446637B1 (en) * | 2017-12-28 | 2022-09-23 | 삼성전자주식회사 | Sound output system and speech processing method |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
CN108182270A (en) * | 2018-01-17 | 2018-06-19 | 广东小天才科技有限公司 | Search for content transmission and searching method, smart pen, search terminal and storage medium |
KR102609430B1 (en) * | 2018-01-23 | 2023-12-04 | 구글 엘엘씨 | Selective adaptation and utilization of noise reduction technique in invocation phrase detection |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
KR102585231B1 (en) * | 2018-02-02 | 2023-10-05 | 삼성전자주식회사 | Speech signal processing mehtod for speaker recognition and electric apparatus thereof |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10923139B2 (en) * | 2018-05-02 | 2021-02-16 | Melo Inc. | Systems and methods for processing meeting information obtained from multiple sources |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
DK201870355A1 (en) | 2018-06-01 | 2019-12-16 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
CN109087659A (en) * | 2018-08-03 | 2018-12-25 | 三星电子(中国)研发中心 | Audio optimization method and apparatus |
US11010561B2 (en) | 2018-09-27 | 2021-05-18 | Apple Inc. | Sentiment prediction from textual data |
US11170166B2 (en) | 2018-09-28 | 2021-11-09 | Apple Inc. | Neural typographical error modeling via generative adversarial networks |
US10839159B2 (en) | 2018-09-28 | 2020-11-17 | Apple Inc. | Named entity normalization in a spoken dialog system |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
CN111415653B (en) * | 2018-12-18 | 2023-08-01 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing speech |
US11638059B2 (en) | 2019-01-04 | 2023-04-25 | Apple Inc. | Content playback on multiple devices |
CN109841227B (en) * | 2019-03-11 | 2020-10-02 | 南京邮电大学 | Background noise removing method based on learning compensation |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
DK201970511A1 (en) | 2019-05-31 | 2021-02-15 | Apple Inc | Voice identification in digital assistant systems |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11227599B2 (en) | 2019-06-01 | 2022-01-18 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11848023B2 (en) * | 2019-06-10 | 2023-12-19 | Google Llc | Audio noise reduction |
CN112201247A (en) * | 2019-07-08 | 2021-01-08 | 北京地平线机器人技术研发有限公司 | Speech enhancement method and apparatus, electronic device, and storage medium |
KR102260216B1 (en) * | 2019-07-29 | 2021-06-03 | 엘지전자 주식회사 | Intelligent voice recognizing method, voice recognizing apparatus, intelligent computing device and server |
CN110648680A (en) * | 2019-09-23 | 2020-01-03 | 腾讯科技(深圳)有限公司 | Voice data processing method and device, electronic equipment and readable storage medium |
WO2021056255A1 (en) | 2019-09-25 | 2021-04-01 | Apple Inc. | Text detection using global geometry estimators |
US11489794B2 (en) | 2019-11-04 | 2022-11-01 | Bank Of America Corporation | System for configuration and intelligent transmission of electronic communications and integrated resource processing |
CN110956955B (en) * | 2019-12-10 | 2022-08-05 | 思必驰科技股份有限公司 | Voice interaction method and device |
CN112820307B (en) * | 2020-02-19 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Voice message processing method, device, equipment and medium |
CN111461438B (en) * | 2020-04-01 | 2024-01-05 | 中国人民解放军空军93114部队 | Signal detection method and device, electronic equipment and storage medium |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US11038934B1 (en) | 2020-05-11 | 2021-06-15 | Apple Inc. | Digital assistant hardware abstraction |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
US11580959B2 (en) * | 2020-09-28 | 2023-02-14 | International Business Machines Corporation | Improving speech recognition transcriptions |
CN112652304B (en) * | 2020-12-02 | 2022-02-01 | 北京百度网讯科技有限公司 | Voice interaction method and device of intelligent equipment and electronic equipment |
CN112669867B (en) * | 2020-12-15 | 2023-04-11 | 阿波罗智联(北京)科技有限公司 | Debugging method and device of noise elimination algorithm and electronic equipment |
CN112634932B (en) * | 2021-03-09 | 2021-06-22 | 赣州柏朗科技有限公司 | Audio signal processing method and device, server and related equipment |
CN113053382A (en) * | 2021-03-30 | 2021-06-29 | 联想(北京)有限公司 | Processing method and device |
US11875798B2 (en) | 2021-05-03 | 2024-01-16 | International Business Machines Corporation | Profiles for enhanced speech recognition training |
CN114333881B (en) * | 2022-03-09 | 2022-05-24 | 深圳市迪斯声学有限公司 | Audio transmission noise reduction method, device and medium based on environment self-adaptation |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1453767A (en) * | 2002-04-26 | 2003-11-05 | 日本先锋公司 | Speech recognition apparatus and speech recognition method |
US6718302B1 (en) * | 1997-10-20 | 2004-04-06 | Sony Corporation | Method for utilizing validity constraints in a speech endpoint detector |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970446A (en) * | 1997-11-25 | 1999-10-19 | At&T Corp | Selective noise/channel/coding models and recognizers for automatic speech recognition |
US7209880B1 (en) * | 2001-03-20 | 2007-04-24 | At&T Corp. | Systems and methods for dynamic re-configurable speech recognition |
JP3826032B2 (en) * | 2001-12-28 | 2006-09-27 | 株式会社東芝 | Speech recognition apparatus, speech recognition method, and speech recognition program |
JP4357867B2 (en) * | 2003-04-25 | 2009-11-04 | パイオニア株式会社 | Voice recognition apparatus, voice recognition method, voice recognition program, and recording medium recording the same |
US7321852B2 (en) * | 2003-10-28 | 2008-01-22 | International Business Machines Corporation | System and method for transcribing audio files of various languages |
JP4340686B2 (en) * | 2004-03-31 | 2009-10-07 | パイオニア株式会社 | Speech recognition apparatus and speech recognition method |
DE102004017486A1 (en) * | 2004-04-08 | 2005-10-27 | Siemens Ag | Method for noise reduction in a voice input signal |
DE602007004733D1 (en) * | 2007-10-10 | 2010-03-25 | Harman Becker Automotive Sys | speaker recognition |
US20100145687A1 (en) | 2008-12-04 | 2010-06-10 | Microsoft Corporation | Removing noise from speech |
- 2010
- 2010-06-14 US US12/814,665 patent/US8234111B2/en active Active
- 2011
- 2011-06-13 EP EP11731192.8A patent/EP2580751B1/en active Active
- 2011-06-13 AU AU2011267982A patent/AU2011267982B2/en active Active
- 2011-06-13 CN CN201180026390.4A patent/CN103069480B/en active Active
- 2011-06-13 WO PCT/US2011/040225 patent/WO2011159628A1/en active Application Filing
- 2011-09-30 US US13/250,777 patent/US8249868B2/en active Active
- 2012
- 2012-06-22 US US13/530,614 patent/US8666740B2/en active Active
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105719645A (en) * | 2014-12-17 | 2016-06-29 | 现代自动车株式会社 | Speech recognition apparatus, vehicle including the same, and method of controlling the same |
CN105719645B (en) * | 2014-12-17 | 2020-09-18 | 现代自动车株式会社 | Voice recognition apparatus, vehicle including the same, and method of controlling voice recognition apparatus |
Also Published As
Publication number | Publication date |
---|---|
AU2011267982A1 (en) | 2012-11-01 |
US8249868B2 (en) | 2012-08-21 |
US20120022860A1 (en) | 2012-01-26 |
CN103069480A (en) | 2013-04-24 |
AU2011267982B2 (en) | 2015-02-05 |
EP2580751A1 (en) | 2013-04-17 |
US8234111B2 (en) | 2012-07-31 |
US20120259631A1 (en) | 2012-10-11 |
US8666740B2 (en) | 2014-03-04 |
US20110307253A1 (en) | 2011-12-15 |
WO2011159628A1 (en) | 2011-12-22 |
EP2580751B1 (en) | 2014-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103069480B (en) | Speech and noise models for speech recognition | |
CN104575493B (en) | Use the acoustic model adaptation of geography information | |
EP3923281B1 (en) | Noise compensation using geotagged audio signals | |
AU2014200999B2 (en) | Geotagged environmental audio for enhanced speech recognition accuracy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
Address after: California, USA. Patentee after: Google LLC. Address before: California, USA. Patentee before: Google Inc. |