US20150154002A1 - User interface customization based on speaker characteristics - Google Patents


Info

Publication number
US20150154002A1
Authority
US
United States
Prior art keywords
client device
user interface
speaker
customizations
user profile
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/096,608
Inventor
Eugene Weinstein
Ignacio L. Moreno
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to US14/096,608
Assigned to GOOGLE INC. Assignors: MORENO, Ignacio L., WEINSTEIN, EUGENE
Publication of US20150154002A1
Priority to US15/230,891 (US11137977B2)
Assigned to GOOGLE LLC (change of name from GOOGLE INC.)
Priority to US17/136,069 (US11403065B2)
Priority to US17/811,793 (US11620104B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04817 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance, using icons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G06F 40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/451 Execution arrangements for user interfaces

Definitions

  • This specification describes technologies relating to adjustment of a user interface based on characteristics of a speaker.
  • Client devices may be shared by multiple users, each of whom may have different characteristics and preferences.
  • Characteristics of a speaker may be estimated using speech processing and machine learning.
  • the characteristics of the speaker such as age, gender, emotion, and/or dialect, may be used to automatically customize a user interface of a client device for the speaker.
  • the speaker's characteristics may also be provided to other applications executing on the client device to enhance their content and provide a richer user experience.
  • one aspect of the subject matter includes the action of selecting a user profile associated with a user interface.
  • the actions further include, after selecting the user profile, obtaining an audio signal encoding an utterance of a speaker.
  • the actions also include processing the audio signal to identify at least one characteristic of the speaker.
  • the actions include customizing the user interface associated with the user profile based on the at least one characteristic.
  • the characteristic may include, for example, an age, gender, dialect, or emotion of the speaker.
  • Some implementations include the further action of providing the at least one characteristic of the speaker to a third-party application.
  • customizing the user interface associated with the user profile based on the at least one characteristic may include changing a font size of the user interface based on the at least one characteristic, changing a color scheme of the user interface based on the at least one characteristic, restricting access to one or more applications on the user interface based on the at least one characteristic, providing access to one or more applications on the user interface based on the at least one characteristic, and/or restricting access to one or more applications provided by the user interface based on the at least one characteristic.
  • selecting a user profile includes selecting, at a client device, a user profile based on one or more of a password, voice recognition, speech recognition, fingerprint recognition, or facial recognition.
  • selecting a user profile includes selecting, at a client device, a default user profile for the client device.
  • processing the audio signal to identify at least one characteristic of the speaker includes the actions of providing the audio signal as an input to a neural network and receiving a set of likelihoods associated with the at least one characteristic as an output of the neural network.
  • Another general aspect of the subject matter includes the action of obtaining an audio signal encoding an utterance of a speaker.
  • the actions further include performing speech recognition, voice recognition, or both on the audio signal to select a user profile associated with a user interface.
  • the actions also include processing the audio signal to identify at least one characteristic of the speaker.
  • the actions include customizing the user interface associated with the user profile based on the at least one characteristic.
  • the characteristic may include, for example, an age, gender, dialect, or emotion of the speaker.
  • Some implementations include the further action of providing the at least one characteristic of the speaker to a third-party application.
  • customizing the user interface associated with the user profile based on the at least one characteristic may include changing a font size of the user interface based on the at least one characteristic, changing a color scheme of the user interface based on the at least one characteristic, restricting access to one or more applications on the user interface based on the at least one characteristic, providing access to one or more applications on the user interface based on the at least one characteristic, and/or restricting access to one or more applications provided by the user interface based on the at least one characteristic.
  • processing the audio signal to identify at least one characteristic of the speaker includes the actions of providing the audio signal as an input to a neural network and receiving a set of likelihoods associated with the at least one characteristic as an output of the neural network.
  • Some implementations may advantageously customize a user interface based on characteristics of a speaker, thus providing a rich user experience. Some implementations also may advantageously provide characteristics of a speaker to other applications to enhance their content and provide a richer user experience.
  • FIG. 1 is a diagram that illustrates a client device configured to identify characteristics of a speaker and customize a user interface based on the identified characteristics.
  • FIG. 2 is a diagram that illustrates an example of processing for speech recognition using neural networks.
  • FIG. 3 is a diagram that illustrates an example of processing to generate latent variables of factor analysis.
  • FIG. 4 is a flow diagram that illustrates an example of a process for customizing a user interface based on characteristics of a speaker.
  • FIG. 5 is a flow diagram that illustrates another example of a process for customizing a user interface based on characteristics of a speaker.
  • Typical client devices do not automatically perform customization based on characteristics of the person using the device. When users share devices, this may mean that the user interface for any given user may not match that user's preferences. In addition, even when only a single user operates a given client device, it may be desirable to automatically tailor the user interface to settings that may be typically preferred by individuals with the user's characteristics. For example, a young child operating a client device may prefer large icons, little or no text, and the ability to play some games and call home. The child's parents also may prefer that the client device restrict access to most applications while the child is using the client device. As another example, an elderly individual may prefer large icons and text and unrestricted access to applications on the client device. By detecting characteristics of the user, the client device may be able to automatically customize the user interface to provide such settings without requiring the user to manually apply the settings.
  • a client device may customize a user interface based on characteristics that it estimates using the speaker's voice. For example, when a user logs into a client device using a voice unlock feature with speaker verification and/or uses voice search or voice input, a speech recording may be made and analyzed by a classifier to identify characteristics of the user. These characteristics may include, for example, the user's age, gender, emotion, and/or dialect.
  • a trait classification pipeline may be activated.
  • the pipeline may include processes at the client device and/or a speech recognition server.
  • a speech recognition processor could be applied to compute mel-frequency cepstral coefficient (MFCC), perceptual linear prediction (PLP), and/or filterbank energy features.
  • the output of the feature computation step could then be provided to a statistical classifier such as a Gaussian mixture model or a neural network trained to classify the features of interest.
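As a concrete illustration, the two-step pipeline above (feature computation followed by statistical classification) might be sketched as follows. The per-frame log-energy feature is a deliberately simple stand-in for MFCC/PLP/filterbank features, and the single-Gaussian class models use invented parameters; a real system would train a Gaussian mixture model or neural network on annotated data.

```python
import math

def log_energy_features(samples, frame_len=400, hop=160):
    """Per-frame log energy, a simple stand-in for MFCC/PLP/filterbank
    features (400-sample frames = 25 ms at 16 kHz, 160-sample hop = 10 ms)."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))
    return feats

# Toy single-Gaussian model per class over the mean feature value.
# These (mean, variance) parameters are invented for illustration.
CLASS_MODELS = {"child": (-1.0, 1.0), "adult": (-3.0, 1.0)}

def classify(feats):
    """Return normalized class likelihoods for the utterance."""
    x = sum(feats) / len(feats)
    scores = {c: math.exp(-0.5 * (x - mu) ** 2 / var)
              for c, (mu, var) in CLASS_MODELS.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

feats = log_energy_features([0.5] * 4000)   # a quarter second of 16 kHz audio
probs = classify(feats)
```

The output is a set of normalized class likelihoods, matching the probability outputs described later for the trained classifier.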
  • Training data for the classifier could include, for example, voice search logs annotated as corresponding to: child, adult, or elderly speech; male or female speech; happiness or sadness; or regionally dialected speech. Classifications for certain characteristics may be determined using other techniques, such as, for example, using the pitch characteristics of the speech recording to detect child speech, or by clustering data for male and female speakers.
  • Such customizations may be particularly advantageous in the case of children, who are likely to benefit from a specially customized user interface, and who also may be prone to altering preferred settings of adult users of a device, e.g., by changing settings, deleting files, etc. Additionally, children may be vulnerable to being harmed by unfiltered speech from the Internet, and they may therefore benefit from having the client device automatically provide a “safe mode” that restricts web browsing and only allows the use of selected child-safe applications when child speech is detected. For example, if likely child speech is detected, the user interface may change to a simplified safe mode that allows restricted web browsing and application use.
  • the client device may not store the child's inputs in search logs or adapt the speaker profile on the client device based on the child's inputs.
  • Some implementations may provide an override feature for adult users in case the client device erroneously determines that the user has vocal characteristics corresponding to child speech.
  • the user interface may provide large icons and text.
  • the client device may also provide identified characteristics of a speaker to native or third party applications, for example, in the form of an application programming interface (API).
  • a user may have the option to selectively enable and/or disable the sharing of their characteristics with these other applications.
  • the client device may request access before such an API may be installed, in which case the user may affirmatively approve installation of the API.
  • customizing the user interface based on characteristics of a speaker should be distinguished from selecting or accessing a user profile based on voice and/or speech recognition.
  • Some client devices may allow a user to establish a user profile and store a set of preferences for operating the client device in the user profile.
  • the user profile can include, for example, the files and folders saved by the user; the applications, software, and programs downloaded to the computing device by the user; security settings for loading the user profile; operation restrictions for the user; the user interface of the client device, including font size, icon size, type of wallpaper, and icons to be displayed; and any other items or settings for operation of the user profile on the client device.
  • customizing the user interface based on the characteristics of a speaker may be performed in addition to, or instead of, selecting a user profile.
  • the client device may perform speech and/or voice recognition to access a user profile and unlock a client device.
  • the same utterance used to access the user profile may also be analyzed to identify characteristics of the speaker and customize the user interface associated with the user profile.
  • the client device may access a default user profile and the user may unlock a device using, for example, a PIN, password, username and password, or biometrics.
  • the user may make an utterance and the client device may customize the user interface based on characteristics of the user's voice.
  • FIG. 1 illustrates a client device 100 configured to identify characteristics of a speaker and customize a user interface 110 based on the identified characteristics.
  • a user 102 speaks an utterance 104 into the client device 100 , which generates an audio signal encoding the utterance.
  • the client device 100 then processes the audio signal to identify characteristics of the user 102 .
  • the client device may provide the audio signal to a trained classifier (e.g., a neural network) at the client device 100 .
  • the client device 100 may provide the audio signal to a server 120 via a network 130 , and the server then provides the audio signal to a trained classifier at the server.
  • the client device 100 and/or the server uses output from the trained classifier to identify voice characteristics of the user 102 . These voice characteristics are then used to customize the user interface 110 of the client device. As shown in FIG. 1 , the user interface 110 a represents the display of the client device before analyzing the voice characteristics of the user 102 , and the user interface 110 b represents the display after analyzing the voice characteristics of the user 102 .
  • the client device 100 can be, for example, a desktop computer, laptop computer, a tablet computer, a wearable computer, a cellular phone, a smart phone, a music player, an e-book reader, a navigation system, or any other appropriate computing device.
  • the functions performed by the server 120 can be performed by individual computer systems or can be distributed across multiple computer systems.
  • the network 130 can be wired or wireless or a combination of both and can include the Internet.
  • a user 102 of the client device 100 initiates a speech recognition session such that the client device encodes an audio signal that includes the utterance 104 of the user.
  • the user may, for example, press a button on the client device 100 to perform a voice search or input a voice command or hotword, speak an utterance, and then release the button on the client device 100 .
  • the user may select a user interface control on the client device 100 before speaking the utterance.
  • the user 102 may activate a voice unlock feature on the client device 100 by speaking an utterance.
  • the client device 100 encodes the utterance into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16 kHz lossless audio.
  • the client device 100 may perform speech and/or voice recognition on the utterance to identify the speaker as an authorized user of the device and then unlock the device. For example, the client device 100 and/or the server 120 may compare a voice signature of the utterance with one or more voice signatures associated with authorized users that are stored on the client device. Alternatively or in addition, the client device 100 and/or the server 120 may perform speech recognition on the utterance to identify an authorized password or phrase associated with an authorized user of the client device. In some aspects, different users of the same client device 100 may each establish a user profile that includes preferences for operating the client device associated with the user and the applications of interest to the user.
  • Each user profile may also be associated with one or more voice signatures, passwords, and/or passphrases.
  • when the client device 100 identifies a voice signature, password, and/or passphrase associated with a user profile, the client device may select the associated user profile, unlock the client device, and provide access to that user profile.
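A minimal sketch of this profile-selection step, assuming voice signatures are fixed-length vectors compared by cosine similarity (the patent does not specify a particular matching technique, and the threshold value is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def select_profile(signature, stored_signatures, threshold=0.8):
    """Compare an utterance's voice signature against the signatures stored
    for each user profile; return the best match above the threshold, or
    None (device stays locked or falls back to a default profile)."""
    best_name, best_score = None, threshold
    for name, stored in stored_signatures.items():
        score = cosine(signature, stored)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

profiles = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.3]}
print(select_profile([0.85, 0.15, 0.25], profiles))  # alice
```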
  • the client device 100 and/or the server 120 then identify audio characteristics. These audio characteristics may be independent of the words spoken by the user 102 .
  • the audio characteristics may indicate audio features that likely correspond to one or more of the speaker's gender, the speaker's age, speaker's emotional state, and/or the speaker's dialect. While feature vectors may be indicative of audio characteristics of specific portions of the particular words spoken, the audio characteristics may be indicative of time-independent characteristics of the audio signal.
  • the audio characteristics can include latent variables of multivariate factor analysis (MFA) of the audio signal.
  • the latent variables may be accessed from data storage, received from another system, or calculated by the client device 100 and/or the server 120 .
  • feature vectors derived from the audio signal may be analyzed by a factor analysis model.
  • the factor analysis model may create a probabilistic partition of an acoustic space using a Gaussian Mixture Model, and then average the feature vectors associated with each partition.
  • the averaging can be a soft averaging weighted by the probability that each feature vector belongs to the partition.
  • the result of processing with the factor analysis model can be an i-vector, as discussed further below.
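The soft averaging described above can be sketched for 1-D features as follows: each feature value contributes to every mixture component, weighted by the posterior probability that it belongs to that component. Real i-vector extraction additionally projects these statistics through a trained total-variability model, which is omitted here.

```python
import math

def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def soft_average(features, components):
    """Soft-average 1-D feature values over a GMM partition of the
    acoustic space. `components` is a list of (mean, variance, weight)."""
    sums = [0.0] * len(components)
    norms = [0.0] * len(components)
    for x in features:
        likes = [w * gauss(x, mu, var) for (mu, var, w) in components]
        total = sum(likes) or 1e-300
        for k, like in enumerate(likes):
            post = like / total      # posterior responsibility of component k
            sums[k] += post * x
            norms[k] += post
    return [s / n if n else 0.0 for s, n in zip(sums, norms)]

gmm = [(0.0, 1.0, 0.5), (5.0, 1.0, 0.5)]    # two illustrative components
avgs = soft_average([0.0, 0.2, 4.8, 5.0], gmm)
```

With well-separated components as above, the per-component averages land near the means of the features each component "owns".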
  • the client device 100 and/or the server 120 inputs the audio characteristics into a trained classifier.
  • the audio characteristics may be represented by, for example, an i-vector and/or acoustic features such as MFCCs or PLPs.
  • the classifier may be, for example, a Gaussian mixture model, a neural network, a logistic regression classifier, or a support vector machine (SVM).
  • the classifier has been trained to classify the features of interest.
  • Training data for the classifier could include, for example, voice search logs annotated as corresponding to: child, adult, or elderly speech; male or female speech; or regionally dialected speech.
  • Classifications for certain characteristics may be determined using other techniques, such as, for example, using the pitch characteristics of the speech recording to detect child speech, or by clustering data for male and female speakers.
  • the dialect may be determined by the client device 100 and/or the server 120 using the automated dialect-identifying processes described, for example, in D. Martinez, O. Plchot, L. Burget, O. Glembek, and P. Matejka, “Language Recognition in iVectors Space,” INTERSPEECH, pp. 861-864, ISCA (2011).
  • emotions of a speaker may be classified using techniques such as those described in K. Rao, S. Koolagudi, and R. Vempada, “Emotion Recognition From Speech Using Global and Local Prosodic Features,” Intl Journal of Speech Technology, Vol. 16, Issue 2, pp. 143-160 (June 2013) or D. Ververidis, C. Kotropoulos, “Emotional speech recognition: Resources, features, and methods,” Speech Communication 48, pp. 1162-81 (2006).
  • the classifier then outputs data classifying various characteristics of the speaker.
  • the classifier may output a set of likelihoods for the speaker's age, gender, emotion, and/or dialect.
  • the output may be a normalized probability between zero and one for one or more of these characteristics.
  • Output for a gender classification may be, for example, male—0.80, female—0.20.
  • Output for an age classification may be, for example, child—0.6, adult—0.3, elderly—0.1.
  • Output for an emotion classification may be, for example, happy—0.6, angry—0.3, sad—0.1.
  • Output for a dialect classification may be, for example, British English—0.5, Kiwi English—0.2, Indian English—0.1, Australian English—0.1, Irish English—0.1.
  • the client device 100 and/or the server 120 then identifies characteristics of the user 102 based on the output of the classifier. For example, the client device and/or the server 120 may select the characteristics having the highest probability. To continue the example above, the client device and/or the server 120 may identify the user 102 as a male child speaking British English. In some cases, the client device 102 may apply a minimum threshold probability to the selection. In such instances, where the classifier does not identify a characteristic as having a probability that exceeds the threshold, the client device 100 and/or server 120 may select a default characteristic such as “unknown,” may prompt the user for additional information, and/or may cause an error message to be outputted to the user 102 (e.g., “please say that again,” or “please provide additional information”).
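Using the illustrative probabilities above, the selection-with-threshold logic might be sketched as follows (the 0.5 minimum threshold is an assumption, not a value from the patent):

```python
def select_characteristic(likelihoods, threshold=0.5):
    """Pick the label with the highest probability, falling back to a
    default of "unknown" when no probability clears the threshold."""
    label = max(likelihoods, key=likelihoods.get)
    return label if likelihoods[label] >= threshold else "unknown"

print(select_characteristic({"child": 0.6, "adult": 0.3, "elderly": 0.1}))
# child
print(select_characteristic({"child": 0.4, "adult": 0.35, "elderly": 0.25}))
# unknown
```

In the "unknown" case, the device could then prompt the user for additional information, as described above.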
  • Customizing the user interface 110 may include, for example, changing layout, font size, icon size, color scheme, wallpaper, icons and/or text to be displayed, animations, and any other items or settings for operation of the client device.
  • Customizing user interface may also include, for example, restricting and/or modifying the operation of one or more applications executing on the client device. The specific customizations that will correspond to various characteristics may be determined based on user information, demographic information, surveys, empirical observations, and/or any other suitable techniques.
  • the client device modifies the user interface to correspond to a child safe mode.
  • the user interface 110 a includes a full complement of applications that may be accessed, including a camera application, a contact application, a calendar application, a search application, a messaging application, a browser application, a call application, and an email application.
  • the user interface 110 b has been modified to permit access only to a camera application and a phone call application.
  • the icons for both of these applications have also been enlarged to make it easier for children to understand and operate.
  • the user interface 110 b for a child mode may also restrict or modify the operations of the applications that are accessible, for example by limiting the phone numbers that can be called (e.g., only able to call home), limiting the number of pictures that can be taken, or restricting the websites that can be accessed.
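The kind of characteristic-driven customization described above, including the child-safe mode, can be sketched as follows; the app names, settings, and labels are illustrative assumptions rather than values from the patent:

```python
# Full complement of applications, as in the example user interface 110a.
FULL_APPS = ["camera", "contacts", "calendar", "search",
             "messaging", "browser", "call", "email"]

def customize_ui(characteristic):
    """Return UI settings for an identified speaker characteristic."""
    ui = {"apps": list(FULL_APPS), "icon_size": "normal", "font_size": 14}
    if characteristic == "child":
        # Child-safe mode: only selected apps, enlarged icons, and
        # restricted operation (e.g., calls to home only).
        ui.update(apps=["camera", "call"], icon_size="large",
                  call_whitelist=["home"])
    elif characteristic == "elderly":
        # Larger icons and text, unrestricted application access.
        ui.update(icon_size="large", font_size=22)
    return ui

print(customize_ui("child")["apps"])  # ['camera', 'call']
```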
  • the client device 100 may provide information regarding the characteristics of the user to native and/or third-party applications with an API on the client device, which the applications may use to modify their operations.
  • the users may be provided with an opportunity to control whether programs or features collect personal information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), or to control whether and/or how to receive content from the content server that may be more relevant to the user.
  • certain data may be anonymized in one or more ways before it is stored or used, so that personally identifiable information is removed.
  • a user's identity may be anonymized so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined.
  • the user may have control over how information is collected about him or her and used by a content server.
  • FIG. 2 is a diagram 200 that illustrates an example of processing for speech recognition using neural networks.
  • the operations discussed are described as being performed by the server 120 , but may be performed by other systems, including combinations of the client device 100 and/or multiple computing systems.
  • While the example architecture described with reference to FIG. 2 includes i-vector inputs into a neural network classifier, the present disclosure is not limited to this architecture.
  • any suitable inputs representing audio characteristics of the speaker such as MFCCs, or PLPs, could be used.
  • a neural network could be trained directly from acoustic features (MFCCs or PLPs), e.g., a neural network could receive an MFCC vector X and predict the characteristic L that maximizes P(L|X).
  • any suitable classifier may be used such as an SVM or logistic regression.
  • the server 120 receives data about an audio signal 210 that includes speech to be recognized.
  • the server 120 or another system then performs feature extraction on the audio signal 210 .
  • the server 120 analyzes different segments or analysis windows 220 of the audio signal 210 .
  • the windows 220 are labeled w0 . . . wn, and as illustrated, the windows 220 can overlap.
  • each window 220 may include 25 ms of the audio signal 210 , and a new window 220 may begin every 10 ms.
  • the window 220 labeled w0 may represent the portion of the audio signal 210 from a start time of 0 ms to an end time of 25 ms, and the next window 220, labeled w1, may represent the portion from 10 ms to 35 ms; in this manner, each window 220 includes 15 ms of the audio signal 210 that is included in the previous window 220.
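The windowing arithmetic above (25 ms windows beginning every 10 ms, hence 15 ms of overlap, at a 16 kHz sample rate) can be sketched as:

```python
def analysis_windows(num_samples, sr=16000, win_ms=25, hop_ms=10):
    """(start, end) sample indices for overlapping analysis windows:
    25 ms windows starting every 10 ms, so consecutive windows share
    15 ms of audio."""
    win = sr * win_ms // 1000    # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000    # 160 samples at 16 kHz
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]

windows = analysis_windows(16000)        # one second of 16 kHz audio
print(windows[:2])  # [(0, 400), (160, 560)]
```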
  • the server 120 performs a Fast Fourier Transform (FFT) on the audio in each window 220 .
  • the results of the FFT are shown as time-frequency representations 230 of the audio in each window 220 .
  • the server 120 extracts features that are represented as an acoustic feature vector 240 for the window 220 .
  • the acoustic features may be determined by binning according to filterbank energy coefficients, using an MFCC transform, using a PLP transform, or using other techniques.
  • the logarithm of the energy in each of various bands of the FFT may be used to determine acoustic features.
  • the acoustic feature vectors 240 include values corresponding to each of multiple dimensions.
  • each acoustic feature vector 240 may include a value for a PLP feature, a value for a first order temporal difference, and a value for a second order temporal difference, for each of 13 dimensions, for a total of 39 dimensions per acoustic feature vector 240 .
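The 39-dimensional construction above (13 static values plus first- and second-order temporal differences) can be sketched as follows. Simple one-step differences are used as an assumption; production front ends typically compute deltas by regression over several neighboring frames.

```python
def add_deltas(static_frames):
    """Append first- and second-order temporal differences to each static
    feature vector, turning 13-dim frames into 39-dim vectors."""
    def diffs(frames):
        out = [[0.0] * len(frames[0])]           # no predecessor: zeros
        out += [[c - p for c, p in zip(cur, prev)]
                for prev, cur in zip(frames, frames[1:])]
        return out
    d = diffs(static_frames)                     # first-order differences
    dd = diffs(d)                                # second-order differences
    return [s + x + y for s, x, y in zip(static_frames, d, dd)]

frames = [[float(t + i) for i in range(13)] for t in range(4)]
print(len(add_deltas(frames)[0]))  # 39
```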
  • Each acoustic feature vector 240 represents characteristics of the portion of the audio signal 210 within its corresponding window 220 .
  • the server 120 may also obtain an i-vector 250 .
  • the server 120 may process the audio signal 210 with an acoustic model 260 to obtain the i-vector 250 .
  • the i-vector 250 indicates latent variables of multivariate factor analysis.
  • the i-vector 250 may be normalized, for example, to have zero mean and unit variance.
  • the i-vector 250 may be projected, for example, using principal component analysis (PCA) or linear discriminant analysis (LDA). Techniques for obtaining an i-vector are described further below with respect to FIG. 3 .
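One simple reading of the zero-mean, unit-variance normalization above, applied across the dimensions of a single i-vector (systems may instead normalize per dimension over a training set; PCA/LDA projection is omitted here), is:

```python
import math

def normalize(ivector):
    """Shift and scale an i-vector so its components have zero mean
    and unit variance."""
    n = len(ivector)
    mean = sum(ivector) / n
    var = sum((x - mean) ** 2 for x in ivector) / n
    std = math.sqrt(var) if var > 0 else 1.0
    return [(x - mean) / std for x in ivector]

v = normalize([1.0, 2.0, 3.0, 4.0])
print(round(abs(sum(v)), 6), round(sum(x * x for x in v) / len(v), 6))
# 0.0 1.0
```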
  • the server 120 uses a neural network 270 that can serve as an acoustic model and indicate likelihoods that acoustic feature vectors 240 represent different phonetic units.
  • the neural network 270 includes an input layer 271 , a number of hidden layers 272 a - 272 c , and an output layer 273 .
  • the neural network 270 receives an i-vector as input.
  • the first hidden layer 272 a has connections from the i-vector input portion of the input layer 271 , where such connections are not present in typical neural networks used for speech recognition.
  • the neural network 270 has been trained to estimate likelihoods that an i-vector represents various speaker characteristics. For example, during training, input to the neural network 270 may be i-vectors corresponding to the utterances from which the acoustic feature vectors were derived.
  • the various training data sets can include i-vectors derived from utterances from multiple speakers.
  • the server 120 inputs the i-vector 250 at the input layer 271 of the neural network 270 .
  • the neural network 270 indicates likelihoods that the speech corresponds to specific speaker characteristics.
  • the output layer 273 provides predictions or probabilities for these characteristics given the data at the input layer 271 .
  • the output layer 273 can provide a value for each of the speaker characteristics of interest. Because the i-vector 250 indicates constant or overall properties of the audio signal 210 as a whole, the information in the i-vector 250 is independent of the particular acoustic states that may occur at specific windows 220.
  • the i-vector 250 is based on a current utterance i-vector derived from the current utterance (e.g., the particular audio signal 210 ) being recognized.
  • the i-vector 250 may be a speaker i-vector generated using multiple utterances of the speaker (e.g., utterances from multiple different recording sessions, such as recordings on different days). For example, multiple utterances for a speaker may be stored in association with a user profile, and the utterances may be retrieved to update a speaker i-vector for that user profile.
  • an i-vector can be determined for each utterance in the set of multiple utterances of the speaker. The i-vectors can be averaged together to generate the speaker i-vector.
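The averaging step above can be sketched in a few lines; the 2-element vectors in the example are toy values, whereas real i-vectors would have hundreds of elements:

```python
import numpy as np

def speaker_i_vector(utterance_i_vectors):
    """Average per-utterance i-vectors into a single speaker i-vector."""
    return np.mean(np.stack(utterance_i_vectors), axis=0)

# Toy usage: two per-utterance i-vectors averaged into one speaker i-vector.
v = speaker_i_vector([np.array([1.0, 1.0]), np.array([3.0, 3.0])])
```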
  • post processing may apply discriminative training, such as LDA, to identify attributes that are indicative of speaker characteristics. For example, various techniques can be used to isolate speaker characteristics, independent of noise, room characteristics, and other non-speaker-dependent characteristics.
  • the server 120 may identify the speaker and select an i-vector based on the speaker's identity. An i-vector may be calculated for each of multiple users, and the i-vectors may be stored in association with user profiles for those users for later use in recognizing speech of the corresponding users.
  • the server 120 may receive a device identifier for a device, such as a mobile phone, that the speaker is using to record speech.
  • the server 120 may receive a user identifier that identifies the user, such as a name or user account login.
  • the server 120 may identify the speaker as a user that owns the device or a user that is logged into a user account on the device.
  • the server 120 may identify the speaker before recognition begins, or before audio is received during the current session. The server 120 may then look up the i-vector that corresponds to the identified user and use that i-vector to recognize received speech.
  • a successive approximation technique may be used to approximate and re-estimate the i-vector 250 while audio is received.
  • the i-vector 250 may be re-estimated at a predetermined interval, for example, each time a threshold amount of new audio has been received. For example, a first i-vector may be estimated using the initial three seconds of audio received. Then, after another three seconds of audio has been received, a second i-vector may be estimated using the six seconds of audio received so far. After another three seconds, a third i-vector may be estimated using all nine seconds of audio received, and so on. The re-estimation period may occur at longer intervals, such as 10 seconds or 30 seconds, to reduce the amount of computation required. In some implementations, i-vectors are re-estimated at pauses in speech (e.g., as detected by a speech energy or voice activity detection algorithm), rather than at predetermined intervals.
  • An i-vector derived from a small segment of an utterance may introduce some inaccuracy compared to an i-vector for the entire utterance, but as more audio is received, the estimated i-vectors approach the accuracy of an i-vector derived from the whole utterance.
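The interval-based re-estimation schedule described above can be sketched as a streaming loop. The chunk layout, the 16 kHz sample rate, and the `estimate_i_vector` callback are assumptions for illustration:

```python
import numpy as np

def stream_i_vectors(audio_chunks, estimate_i_vector, interval_s=3.0, rate=16000):
    """Re-estimate an i-vector each time `interval_s` of new audio has
    arrived, always using all audio received so far in the session."""
    buffered, since_last = [], 0.0
    for chunk in audio_chunks:
        buffered.append(chunk)
        since_last += len(chunk) / rate
        if since_last >= interval_s:
            since_last = 0.0
            yield estimate_i_vector(np.concatenate(buffered))

# Toy usage: 7 one-second chunks; a stand-in estimator that just reports
# how many samples each estimate was based on.
chunks = [np.zeros(16000) for _ in range(7)]
estimates = list(stream_i_vectors(chunks, lambda audio: len(audio)))
```

With a 3-second interval, the estimator fires after 3 and 6 seconds of audio, each time over the full buffer, matching the "three seconds, then six, then nine" schedule described above.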
  • the i-vector 250 may be estimated using audio from recent utterances (e.g., audio from a predetermined number of most recent utterances or audio acquired within a threshold period of the current time).
  • the server 120 transitions from using a first i-vector to a second i-vector during recognition of an utterance.
  • the server 120 may begin by using a first i-vector derived from a previous utterance.
  • then, once a threshold amount of audio (e.g., 3, 5, 10, or 30 seconds) has been received in the current session,
  • the server 120 generates a second i-vector based on the audio received in the current session and uses the second i-vector to process subsequently received audio.
  • FIG. 3 is a diagram 300 that illustrates an example of processing to generate latent variables of factor analysis.
  • the example of FIG. 3 shows techniques for determining an i-vector, which includes these latent variables of factor analysis.
  • I-vectors are time-independent components that represent overall characteristics of an audio signal rather than characteristics at a specific segment of time within an utterance. I-vectors can summarize a variety of characteristics of audio that are independent of the phonetic units spoken, for example, information indicative of the age, gender, emotion, and/or dialect of the speaker.
  • the example of FIG. 3 illustrates processing to calculate an i-vector 250 for a sample utterance 310 .
  • the server 120 accesses training data 320 that includes a number of utterances 321 .
  • the training data 320 may include utterances 321 including speech from different speakers, utterances 321 having different background noise conditions, and utterances 321 having other differences.
  • Each of the utterances 321 is represented as a set of acoustic feature vectors.
  • Each of the acoustic feature vectors can be, for example, a 39-dimensional vector determined in the same manner that the acoustic feature vectors 240 are determined in the example of FIG. 2 .
  • the server 120 uses the utterances 321 to train a Gaussian mixture model (GMM) 330 .
  • the GMM 330 may include 1000 39-dimensional Gaussians 331 .
  • the GMM 330 is trained using the acoustic feature vectors of the utterances 321 regardless of the phones or acoustic states that the acoustic feature vectors represent. As a result, acoustic feature vectors corresponding to different phones and acoustic states are used to train the GMM 330 .
  • all of the acoustic feature vectors from all of the utterances 321 in the training data 320 can be used to train the GMM 330 .
  • the GMM 330 is different from GMMs that are trained with only the acoustic feature vectors for a single phone or a single acoustic state.
  • the server 120 determines acoustic feature vectors that describe the utterance 310 .
  • the server 120 classifies the acoustic feature vectors of the utterance 310 using the GMM 330 . For example, the Gaussian 331 that corresponds to each acoustic feature vector of the sample utterance 310 may be identified.
  • the server 120 then re-estimates the Gaussians 331 that are observed in the sample utterance 310 , illustrated as re-estimated Gaussians 335 shown in dashed lines.
  • a set of one or more acoustic feature vectors of the sample utterance 310 may be classified as matching a particular Gaussian 331 a from the GMM 330 . Based on this set of acoustic feature vectors, the server 120 calculates a re-estimated Gaussian 335 a having a mean and/or variance different from the Gaussian 331 a . Typically, only some of the Gaussians 331 in the GMM 330 are observed in the sample utterance 310 and re-estimated.
  • the server 120 then identifies differences between the Gaussians 331 and the corresponding re-estimated Gaussians 335 . For example, the server 120 may generate difference vectors that each indicate changes in parameters between a Gaussian 331 and its corresponding re-estimated Gaussian 335 . Since each of the Gaussians is 39-dimensional, each difference vector can have 39 values, where each value indicates a change in one of the 39 dimensions.
  • the server 120 concatenates or stacks the difference vectors to generate a supervector 340 . Because only some of the Gaussians 331 were observed and re-estimated, a value of zero (e.g., indicating no change from the original Gaussian 331 ) is included in the supervector 340 for each of the 39 dimensions of each Gaussian 331 that was not observed in the sample utterance 310 . For a GMM 330 having 1000 Gaussians that are each 39-dimensional, the supervector 340 would include 39,000 elements. In many instances, Gaussians 331 and the corresponding re-estimated Gaussians 335 differ only in their mean values. The supervector 340 can represent the differences between the mean values of the Gaussians 331 and the mean values of the corresponding re-estimated Gaussians 335 .
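The construction of the supervector from per-Gaussian mean differences, with zeros for Gaussians not observed in the utterance, can be sketched as follows. The dictionary of re-estimated means is an assumed representation chosen for the sketch:

```python
import numpy as np

def build_supervector(gmm_means, reestimated_means):
    """Stack per-Gaussian mean-difference vectors into one supervector.
    Gaussians with no re-estimated mean contribute zeros (no change)."""
    diffs = []
    for idx, mean in enumerate(gmm_means):          # gmm_means: shape (C, D)
        if idx in reestimated_means:                # dict: Gaussian index -> new mean
            diffs.append(reestimated_means[idx] - mean)
        else:
            diffs.append(np.zeros_like(mean))
    return np.concatenate(diffs)                    # length C * D

# Toy usage: 3 two-dimensional Gaussians, only Gaussian 1 was observed.
sv = build_supervector(np.zeros((3, 2)), {1: np.array([0.5, -0.5])})
```

For the patent's example of 1000 Gaussians of 39 dimensions, the same construction yields the 39,000-element supervector described above.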
  • In addition to generating the supervector 340 , the server 120 also generates a count vector 345 for the utterance 310 .
  • the values in the count vector 345 can represent 0 th order Baum-Welch statistics, referred to as counts or accumulated posteriors.
  • the count vector 345 can indicate the relative importance of the Gaussians 331 in the GMM 330 .
  • the count vector 345 includes a value for each Gaussian 331 in the GMM 330 . As a result, for a GMM 330 having 1000 Gaussians, the count vector 345 for the utterance 310 would include 1,000 elements.
  • Each value in the vector 345 can be the sum of the posterior probabilities of the feature vectors of the utterance 310 with respect to a particular Gaussian 331 .
  • the posterior probability of each feature vector in the utterance 310 is computed (e.g., the probability of occurrence of the feature vector as indicated by the first Gaussian 331 a ).
  • the sum of the posterior probabilities for the feature vectors in the utterance 310 is used as the value for the first Gaussian 331 a in the count vector 345 .
  • Posterior probabilities for each feature vector in the utterance 310 can be calculated and summed for each of the other Gaussians 331 to complete the count vector 345 .
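The count vector computation, i.e., summing per-frame posterior responsibilities for each Gaussian, can be sketched for diagonal-covariance Gaussians. The diagonal-covariance assumption and the toy array shapes are illustrative:

```python
import numpy as np

def count_vector(features, means, variances, weights):
    """0th-order Baum-Welch statistics: for each Gaussian, the sum over
    frames of its posterior responsibility (features: (T, D) array)."""
    # log-density of each frame under each diagonal-covariance Gaussian
    log_dens = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                       + (((features[:, None, :] - means) ** 2) / variances).sum(axis=2))
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stabilization
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # per-frame responsibilities
    return post.sum(axis=0)                           # one accumulated count per Gaussian

# Toy usage: 10 frames of 2-dim features against a 2-component GMM.
rng = np.random.default_rng(1)
counts = count_vector(rng.standard_normal((10, 2)),
                      np.array([[0.0, 0.0], [5.0, 5.0]]),
                      np.ones((2, 2)), np.array([0.5, 0.5]))
```

Since each frame's responsibilities sum to one, the counts sum to the number of frames, consistent with the "accumulated posteriors" interpretation above.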
  • In the same manner that the supervector 340 and count vector 345 were generated for the sample utterance 310 , the server 120 generates a supervector 350 and a count vector 355 for each of the utterances 321 in the training data 320 .
  • the GMM 330 , the supervectors 350 , and the count vectors 355 may be generated and stored before receiving the sample utterance 310 . Then, when the sample utterance 310 is received, the previously generated GMM 330 , supervectors 350 , and count vectors can be accessed from storage, which limits the amount of computation necessary to generate an i-vector for the sample utterance 310 .
  • the server 120 uses the supervectors 350 to create a factor analysis module 360 .
  • the factor analysis module 360 , like the GMM 330 and the supervectors 350 , may be generated in advance of receiving the sample utterance 310 .
  • the factor analysis module 360 can perform multivariate factor analysis to project a supervector to a lower-dimensional vector that represents particular factors of interest. For example, the factor analysis module may project a supervector of 39,000 elements to a vector of only a few thousand elements or only a few hundred elements.
  • the factor analysis module 360 is trained using a collection of utterances, which may be the utterances in the same training data 320 used to generate the GMM 330 .
  • An adapted or re-estimated GMM may be determined for each of the i utterances [U 1 , U 2 , . . . , U i ] in the training data 320 , in the same manner that the re-estimated Gaussians 335 are determined for the utterance 310 .
  • the factor analysis module 360 is trained to learn the common range of movement of the adapted or re-estimated GMMs for the utterances [U 1 , U 2 , . . . , U i ] relative to the general GMM 330 . Difference parameters between re-estimated GMMs and the GMM 330 are then constrained to move only over the identified common directions of movement in the space of the supervectors.
  • Movement is limited to a manifold, and the variables that describe the position of the difference parameters over the manifold are denoted as i-vectors.
  • the server 120 inputs the supervector 340 and count vector 345 for the sample utterance 310 to the trained factor analysis module 360 .
  • the output of the factor analysis module 360 is the i-vector 250 , which includes latent variables of multivariate factor analysis.
  • the i-vector 250 represents time-independent characteristics of the sample utterance 310 rather than characteristics of a particular window or subset of windows within the sample utterance 310 .
  • the i-vector 250 may include, for example, approximately 300 elements.
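A hedged sketch of the final projection step, using the standard total-variability formulation in which the i-vector is the posterior mean of the latent factor. The total-variability matrix `T`, the diagonal inverse covariance, and the conversion of mean differences back to first-order statistics via the counts are assumptions consistent with common i-vector implementations, not necessarily the exact computation of the factor analysis module 360:

```python
import numpy as np

def extract_i_vector(T, sigma_inv, counts, supervector):
    """Posterior mean of the latent factor under a total-variability model.
    T: (C*D, R) projection matrix; sigma_inv: (C*D,) diagonal inverse
    covariance; counts: (C,) 0th-order stats; supervector: (C*D,) stacked
    per-Gaussian mean differences."""
    D = T.shape[0] // counts.shape[0]
    N = np.repeat(counts, D)                       # expand counts per dimension
    F = N * supervector                            # centered 1st-order statistics
    precision = np.eye(T.shape[1]) + T.T @ (N[:, None] * sigma_inv[:, None] * T)
    return np.linalg.solve(precision, T.T @ (sigma_inv * F))

# Toy usage: 4 Gaussians x 3 dimensions projected down to 2 latent factors
# (a real system would project ~39,000 elements down to ~300).
rng = np.random.default_rng(0)
T_mat = rng.standard_normal((12, 2))
w = extract_i_vector(T_mat, np.ones(12), np.array([2.0, 0.0, 1.0, 3.0]),
                     rng.standard_normal(12))
```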
  • FIG. 4 is a flow diagram that illustrates an example of a process for customizing a user interface based on characteristics of a speaker.
  • the process 400 may be performed by data processing apparatus, such as the client device 100 described above or another data processing apparatus.
  • the client device selects a user profile associated with the user interface. For example, the client device may select a user profile based on one or more of a password, voice recognition, speech recognition, fingerprint recognition, or facial recognition. In some instances, the client device may have a default user profile, in which case the client device typically selects and operates using the default user profile.
  • the client device obtains an audio signal encoding an utterance of the speaker.
  • the client device may receive an utterance of the speaker at a microphone, and encode the utterance into an audio signal such as, for example, a 16 kHz lossless audio signal.
  • the client device processes the audio signal to identify one or more characteristics of the speaker.
  • the characteristics may include one or more of an age, gender, emotion, and/or dialect of the speaker.
  • the client device may provide the audio signal to a trained classifier (e.g., a neural network, SVM, or a Gaussian mixture model) that outputs likelihoods associated with one or more characteristics of the speaker.
  • the client device may then select characteristics having the highest likelihood, and/or apply a threshold to identify characteristics of the speaker.
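The select-highest-likelihood-then-threshold step can be sketched as follows; the grouping of likelihoods by characteristic and the 0.5 threshold are illustrative assumptions:

```python
def identify_characteristics(likelihoods, threshold=0.5):
    """For each characteristic group, pick the label with the highest
    likelihood, keeping it only if it clears the threshold."""
    identified = {}
    for group, scores in likelihoods.items():   # e.g. {"age": {"child": 0.8, ...}}
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            identified[group] = label
    return identified

# Toy usage: "child" clears the threshold; the gender scores are too close,
# so no gender characteristic is identified.
traits = identify_characteristics(
    {"age": {"child": 0.8, "adult": 0.2},
     "gender": {"male": 0.45, "female": 0.4}})
```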
  • the client device may transmit the audio signal to a server, which inputs the audio signal to a trained classifier, identifies characteristics of the speaker using the classifier, and then transmits the identified characteristics back to the client device.
  • the client device customizes the user interface associated with the user profile based on the identified characteristics. For example, the client device may change layout, font size, icon size, color scheme, wallpaper, icons and/or text to be displayed, animations, and any other items or settings for operation of the client device.
  • the client device may restrict access to applications that were previously accessible, and/or provide access to applications that were previously inaccessible.
  • the client device also may, for example, restrict and/or modify the operation of one or more applications executing on the client device. For example, the client device may restrict a web browsing application to provide access to only a limited set of websites.
  • the client device may also provide one or more of the characteristics of the speaker to native and/or third-party applications executing on the client device.
  • the client device may provide users with an option to decide whether to share this information with native and/or third-party applications.
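The mapping from identified characteristics to user interface customizations described above might be sketched as a simple settings table. The specific setting names, default values, and restricted application names are hypothetical examples, not settings disclosed by the specification:

```python
def customize_ui(characteristics):
    """Map identified speaker characteristics to illustrative UI settings."""
    settings = {"font_size": "medium", "icon_size": "medium",
                "safe_mode": False, "restricted_apps": []}
    if characteristics.get("age") == "child":
        # Simplified safe mode: large icons, restricted browsing and apps.
        settings.update(icon_size="large", safe_mode=True,
                        restricted_apps=["browser", "email"])
    elif characteristics.get("age") == "elderly":
        # Large icons and text, unrestricted access.
        settings.update(font_size="large", icon_size="large")
    return settings

child_ui = customize_ui({"age": "child"})
```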
  • FIG. 5 is a flow diagram that illustrates another example of a process for customizing a user interface based on characteristics of a speaker.
  • the process 500 may be performed by data processing apparatus, such as the client device 100 described above or another data processing apparatus.
  • the client device selects a user profile associated with the user interface.
  • the client device obtains an audio signal encoding an utterance of the speaker.
  • the client device may receive an utterance of the speaker at a microphone, and encode the utterance into an audio signal such as, for example, a 16 kHz lossless audio signal.
  • the client device (optionally in combination with one or more servers) performs speech recognition, voice recognition, or both on the audio signal to select and/or access the user profile associated with the user interface.
  • the client device may have a default user profile, in which case the client device may provide access to the default user profile when speech recognition and/or voice recognition successfully authenticate the speaker.
  • the client device also processes the audio signal to identify one or more characteristics of the speaker.
  • the characteristics may include one or more of an age, gender, emotion, and/or dialect of the speaker.
  • the client device may provide the audio signal to a trained classifier (e.g., a neural network, SVM, or a Gaussian mixture model) that outputs likelihoods associated with one or more characteristics of the speaker.
  • the client device may then select characteristics having the highest likelihood, and/or apply a threshold to identify characteristics of the speaker.
  • the client device may transmit the audio signal to a server, which inputs the audio signal to a trained classifier, identifies characteristics of the speaker using the classifier, and then transmits the identified characteristics back to the client device.
  • the client device customizes the user interface associated with the user profile based on the identified characteristics. For example, the client device may change layout, font size, icon size, color scheme, wallpaper, icons and/or text to be displayed, animations, and any other items or settings for operation of the client device.
  • the client device may restrict access to applications that were previously accessible, and/or provide access to applications that were previously inaccessible.
  • the client device also may, for example, restrict and/or modify the operation of one or more applications executing on the client device. For example, the client device may restrict a web browsing application to provide access to only a limited set of websites.
  • the client device may also provide one or more of the characteristics of the speaker to native and/or third-party applications executing on the client device.
  • the client device may provide users with an option to decide whether to share this information with native and/or third-party applications.
  • Embodiments and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus.
  • the computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them.
  • the computer-readable medium may be a non-transitory computer-readable medium.
  • data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
  • a computer program (also known as a program, software, software application, script, or code) may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program does not necessarily correspond to a file in a file system.
  • a program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer may be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
  • embodiments may be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user may provide input to the computer.
  • Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Embodiments may be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the techniques disclosed, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • the computing system may include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Abstract

Characteristics of a speaker are estimated using speech processing and machine learning. The characteristics of the speaker are used to automatically customize a user interface of a client device for the speaker.

Description

    FIELD
  • This specification describes technologies relating to adjustment of a user interface based on characteristics of a speaker.
  • BACKGROUND
  • Client devices may be shared by multiple users, each of whom may have different characteristics and preferences.
  • SUMMARY
  • Characteristics of a speaker may be estimated using speech processing and machine learning. The characteristics of the speaker, such as age, gender, emotion, and/or dialect, may be used to automatically customize a user interface of a client device for the speaker. The speaker's characteristics may also be provided to other applications executing on the client device to enhance their content and provide a richer user experience.
  • In general, one aspect of the subject matter includes the action of selecting a user profile associated with a user interface. The actions further include, after selecting the user profile, obtaining an audio signal encoding an utterance of a speaker. The actions also include processing the audio signal to identify at least one characteristic of the speaker. And the actions include customizing the user interface associated with the user profile based on the at least one characteristic. The characteristic may include, for example, an age, gender, dialect, or emotion of the speaker. Some implementations include the further action of providing the at least one characteristic of the speaker to a third-party application.
  • In some implementations, customizing the user interface associated with the user profile based on the at least one characteristic may include changing a font size of the user interface based on the at least one characteristic, changing a color scheme of the user interface based on the at least one characteristic, restricting access to one or more applications on the user interface based on the at least one characteristic, providing access to one or more applications on the user interface based on the at least one characteristic, or restricting access to one or more applications provided by the user interface based on the at least one characteristic.
  • In some implementations, selecting a user profile includes selecting, at a client device, a user profile based on one or more of a password, voice recognition, speech recognition, fingerprint recognition, or facial recognition. Alternatively or in addition, in some implementations, selecting a user profile includes selecting, at a client device, a default user profile for the client device.
  • In some implementations, processing the audio signal to identify at least one characteristic of the speaker includes the actions of providing the audio signal as an input to a neural network and receiving a set of likelihoods associated with the at least one characteristic as an output of the neural network.
  • Another general aspect of the subject matter includes the action of obtaining an audio signal encoding an utterance of a speaker. The actions further include performing speech recognition, voice recognition, or both on the audio signal to select a user profile associated with a user interface. The actions also include processing the audio signal to identify at least one characteristic of the speaker. And the actions include customizing the user interface associated with the user profile based on the at least one characteristic. The characteristic may include, for example, an age, gender, dialect, or emotion of the speaker. Some implementations include the further action of providing the at least one characteristic of the speaker to a third-party application.
  • In some implementations, customizing the user interface associated with the user profile based on the at least one characteristic may include changing a font size of the user interface based on the at least one characteristic, changing a color scheme of the user interface based on the at least one characteristic, restricting access to one or more applications on the user interface based on the at least one characteristic, providing access to one or more applications on the user interface based on the at least one characteristic, or restricting access to one or more applications provided by the user interface based on the at least one characteristic.
  • In some implementations, processing the audio signal to identify at least one characteristic of the speaker includes the actions of providing the audio signal as an input to a neural network and receiving a set of likelihoods associated with the at least one characteristic as an output of the neural network.
  • Some implementations may advantageously customize a user interface based on characteristics of a speaker, thus providing a rich user experience. Some implementations also may advantageously provide characteristics of a speaker to other applications to enhance their content and provide a richer user experience.
  • The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a diagram that illustrates a client device configured to identify characteristics of a speaker and customize a user interface based on the identified characteristics.
  • FIG. 2 is a diagram that illustrates an example of processing for speech recognition using neural networks.
  • FIG. 3 is a diagram that illustrates an example of processing to generate latent variables of factor analysis.
  • FIG. 4 is a flow diagram that illustrates an example of a process for customizing a user interface based on characteristics of a speaker.
  • FIG. 5 is a flow diagram that illustrates another example of a process for customizing a user interface based on characteristics of a speaker.
  • Like reference numbers and designations in the various drawings indicate like elements.
  • DETAILED DESCRIPTION
  • Typical client devices do not automatically perform customization based on characteristics of the person using the device. When users share devices, this may mean that the user interface for any given user may not match that user's preferences. In addition, even when only a single user operates a given client device, it may be desirable to automatically tailor the user interface to settings that may be typically preferred by individuals with the user's characteristics. For example, a young child operating a client device may prefer large icons, little or no text, and the ability to play some games and call home. The child's parents also may prefer that the client device restrict access to most applications while the child is using the client device. As another example, an elderly individual may prefer large icons and text and unrestricted access to applications on the client device. By detecting characteristics of the user, the client device may be able to automatically customize the user interface to provide such settings without requiring the user to manually apply the settings.
  • As described in this disclosure, a client device may customize a user interface based on characteristics that it estimates using the speaker's voice. For example, when a user logs into a client device using a voice unlock feature with speaker verification and/or uses voice search or voice input, a speech recording may be made and analyzed by a classifier to identify characteristics of the user. These characteristics may include, for example, the user's age, gender, emotion, and/or dialect.
  • Once a speech recording is made, a trait classification pipeline may be activated. The pipeline may include processes at the client device and/or a speech recognition server. In the pipeline, a speech recognition processor could be applied to compute mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP), and/or filterbank energy features. The output of the feature computation step could then be provided to a statistical classifier such as a Gaussian mixture model or a neural network trained to classify the features of interest. Training data for the classifier could include, for example, voice search logs annotated as corresponding to: child, adult, or elderly speech; male or female speech; happiness or sadness; or regionally dialected speech. Classifications for certain characteristics may be determined using other techniques, such as, for example, using the pitch characteristics of the speech recording to detect child speech, or by clustering data for male and female speakers.
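The pitch-based child-speech check mentioned at the end of this paragraph can be sketched with a simple autocorrelation pitch estimator. This is a minimal illustration, not the disclosed classifier: the 80-400 Hz search range and the 250 Hz child threshold are assumptions chosen for the example.

```python
import numpy as np

def estimate_pitch_hz(samples, sample_rate=16000, fmin=80.0, fmax=400.0):
    """Estimate fundamental frequency via autocorrelation peak picking."""
    x = samples - np.mean(samples)
    autocorr = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    peak_lag = lag_min + int(np.argmax(autocorr[lag_min:lag_max]))
    return sample_rate / peak_lag

def is_likely_child_speech(samples, sample_rate=16000, child_f0_threshold=250.0):
    # Children's voices typically have a higher fundamental frequency;
    # the 250 Hz threshold here is illustrative, not from this disclosure.
    return estimate_pitch_hz(samples, sample_rate) > child_f0_threshold
```

A 300 Hz tone (roughly child-like pitch) trips the check, while a 120 Hz tone (typical adult male pitch) does not.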
  • Various customizations based on the identified characteristics may be implemented. Such customizations may be particularly advantageous in the case of children, who are likely to benefit from a specially customized user interface, and who also may be prone to altering preferred settings of adult users of a device, e.g., by changing settings, deleting files, etc. Additionally, children may be vulnerable to being harmed by unfiltered speech from the Internet, and they may therefore benefit from having the client device automatically provide a “safe mode” that restricts web browsing and only allows the use of selected child-safe applications when child speech is detected. For example, if likely child speech is detected, the user interface may change to a simplified safe mode that allows restricted web browsing and application use. In some implementations, when child speech is detected, the client device may not store the child's inputs in search logs or adapt the speaker profile on the client device based on the child's inputs. Some implementations may provide an override feature for adult users in case the client device erroneously determines that the user has vocal characteristics corresponding to child speech. As another example, if elderly speech is detected, the user interface may provide large icons and text.
  • The client device may also provide identified characteristics of a speaker to native or third party applications, for example, in the form of an application programming interface (API). A user may have the option to selectively enable and/or disable the sharing of their characteristics with these other applications. In some implementations, the client device may request access before such an API may be installed, in which case the user may affirmatively approve installation of the API.
  • As described herein, customizing the user interface based on characteristics of a speaker should be distinguished from selecting or accessing a user profile based on voice and/or speech recognition. Some client devices may allow a user to establish a user profile and store a set of preferences for operating the client device in the user profile. The user profile can include, for example, the files and folders saved by the user; the applications, software, and programs downloaded to the computing device by the user; security settings for loading the user profile; operation restrictions for the user; the user interface of the client device, including font size, icon size, type of wallpaper, and icons to be displayed; and any other items or settings for operation of the user profile on the client device.
  • However, customizing the user interface based on the characteristics of a speaker may be performed in addition to, or instead of, selecting a user profile. For example, in some implementations, the client device may perform speech and/or voice recognition to access a user profile and unlock a client device. The same utterance used to access the user profile may also be analyzed to identify characteristics of the speaker and customize the user interface associated with the user profile. Alternatively or in addition, the client device may access a default user profile and the user may unlock a device using, for example, a PIN, password, username and password, or biometrics. After the client device has been unlocked, the user may make an utterance and the client device may customize the user interface based on characteristics of the user's voice.
  • FIG. 1 illustrates a client device 100 configured to identify characteristics of a speaker and customize a user interface 110 based on the identified characteristics. In particular, a user 102 speaks an utterance 104 into the client device 100, which generates an audio signal encoding the utterance. The client device 100 then processes the audio signal to identify characteristics of the user 102. For example, the client device may provide the audio signal to a trained classifier (e.g., a neural network) at the client device 100. Alternatively or in addition, the client device 100 may provide the audio signal to a server 120 via a network 130, and the server then provides the audio signal to a trained classifier at the server. The client device 100 and/or the server then uses output from the trained classifier to identify voice characteristics of the user 102. These voice characteristics are then used to customize the user interface 110 of the client device. As shown in FIG. 1, the user interface 110 a represents the display of the client device before analyzing the voice characteristics of the user 102, and the user interface 110 b represents the display after analyzing the voice characteristics of the user 102.
  • The client device 100 can be, for example, a desktop computer, laptop computer, a tablet computer, a wearable computer, a cellular phone, a smart phone, a music player, an e-book reader, a navigation system, or any other appropriate computing device. The functions performed by the server 120 can be performed by individual computer systems or can be distributed across multiple computer systems. The network 130 can be wired or wireless or a combination of both and can include the Internet.
  • In more detail, a user 102 of the client device 100 initiates a speech recognition session such that the client device encodes an audio signal that includes the utterance 104 of the user. The user may, for example, press a button on the client device 100 to perform a voice search or input a voice command or hotword, speak an utterance, and then release the button on the client device 100. In another example, the user may select a user interface control on the client device 100 before speaking the utterance. As another example, the user 102 may activate a voice unlock feature on the client device 100 by speaking an utterance. The client device 100 encodes the utterance into an audio signal, which may be, for example, a snippet of relatively high quality audio, such as 16 kHz lossless audio.
  • In some implementations involving a voice unlock feature, the client device 100 may perform speech and/or voice recognition on the utterance to identify the speaker as an authorized user of the device and then unlock the device. For example, the client device 100 and/or the server 120 may compare a voice signature of the utterance with one or more voice signatures associated with authorized users that are stored on the client device. Alternatively or in addition, the client device 100 and/or the server 120 may perform speech recognition on the utterance to identify an authorized password or phrase associated with an authorized user of the client device. In some aspects, different users of the same client device 100 may each establish a user profile that includes preferences for operating the client device associated with the user and the applications of interest to the user. Each user profile may also be associated with one or more voice signatures, passwords, and/or passphrases. When the client device 100 identifies a voice signature, password, and/or passphrase associated with a user profile, the client device may select the associated user profile, unlock the client device, and provide access to that user profile.
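One way the voice-signature comparison described above might be sketched is as a nearest-neighbor match over stored per-profile signature vectors. The cosine-similarity measure and the 0.8 acceptance threshold are illustrative assumptions; the actual signature representation is not specified in the text.

```python
import numpy as np

def match_voice_signature(utterance_sig, profile_sigs, threshold=0.8):
    """Return the profile whose stored signature best matches the
    utterance, or None if no similarity clears the threshold.

    Signature vectors and the 0.8 threshold are illustrative stand-ins
    for whatever embedding the speaker-verification system produces."""
    best_profile, best_score = None, -1.0
    for profile, sig in profile_sigs.items():
        a, b = np.asarray(utterance_sig, dtype=float), np.asarray(sig, dtype=float)
        score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if score > best_score:
            best_profile, best_score = profile, score
    return best_profile if best_score >= threshold else None
```

Returning `None` when nothing clears the threshold corresponds to leaving the device locked (or falling back to a default profile, as described below).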
  • The client device 100 and/or the server 120 then identify audio characteristics. These audio characteristics may be independent of the words spoken by the user 102. For example, the audio characteristics may indicate audio features that likely correspond to one or more of the speaker's gender, the speaker's age, the speaker's emotional state, and/or the speaker's dialect. While feature vectors may be indicative of audio characteristics of specific portions of the particular words spoken, the audio characteristics may be indicative of time-independent characteristics of the audio signal.
  • As discussed further below, the audio characteristics can include latent variables of multivariate factor analysis (MFA) of the audio signal. The latent variables may be accessed from data storage, received from another system, or calculated by the client device 100 and/or the server 120. To obtain the audio characteristics, feature vectors derived from the audio signal may be analyzed by a factor analysis model. The factor analysis model may create a probabilistic partition of an acoustic space using a Gaussian Mixture Model, and then average the feature vectors associated with each partition. The averaging can be a soft averaging weighted by the probability that each feature vector belongs to the partition. The result of processing with the factor analysis model can be an i-vector, as discussed further below.
  • In the illustrated example, the client device 100 and/or the server 120 inputs the audio characteristics into a trained classifier. The audio characteristics may be represented by, for example, an i-vector and/or acoustic features such as MFCCs or PLPs. The classifier may be, for example, a Gaussian mixture model, a neural network, a logistic regression classifier, or a support vector machine (SVM). The classifier has been trained to classify the features of interest. Training data for the classifier could include, for example, voice search logs annotated as corresponding to: child, adult, or elderly speech; male or female speech; or regionally dialected speech. Classifications for certain characteristics may be determined using other techniques, such as, for example, using the pitch characteristics of the speech recording to detect child speech, or by clustering data for male and female speakers. As another example, in some implementations, the dialect may be determined by the client device 100 and/or the server 120 using the automated dialect-identifying processes described, for example, in D. Martinez, O. Plchot, L. Burget, O. Glembek, and P. Matejka, "Language Recognition in iVectors Space," Proc. INTERSPEECH, pp. 861-864, ISCA (2011); K. Hirose et al., "Accent Type Recognition and Syntactic Boundary Detection of Japanese Using Statistical Modeling of Moraic Transitions of Fundamental Frequency Contours," Proc. IEEE ICASSP '98 (1998); T. Chen et al., "Automatic Accent Identification Using Gaussian Mixture Models," IEEE Workshop on ASRU (2001); or R. A. Cole, J. W. T. Inouye, Y. K. Muthusamy, and M. Gopalakrishnan, "Language Identification with Neural Networks: A Feasibility Study," Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1989, pp. 525-529. As another example, emotions of a speaker may be classified using techniques such as those described in K. Rao, S. Koolagudi, and R.
Vempada, "Emotion Recognition From Speech Using Global and Local Prosodic Features," Int'l Journal of Speech Technology, Vol. 16, Issue 2, pp. 143-160 (June 2013), or D. Ververidis and C. Kotropoulos, "Emotional Speech Recognition: Resources, Features, and Methods," Speech Communication, Vol. 48, pp. 1162-1181 (2006).
  • The classifier then outputs data classifying various characteristics of the speaker. For example, the classifier may output a set of likelihoods for the speaker's age, gender, emotion, and/or dialect. In particular, the output may be a normalized probability between zero and one for one or more of these characteristics. Output for a gender classification may be, for example, male—0.80, female—0.20. Output for an age classification may be, for example, child—0.6, adult—0.3, elderly—0.1. Output for an emotion classification may be, for example, happy—0.6, angry—0.3, sad—0.1. Output for a dialect classification may be, for example, British English—0.5, New Zealand English—0.2, Indian English—0.1, Australian English—0.1, Irish English—0.1.
  • The client device 100 and/or the server 120 then identifies characteristics of the user 102 based on the output of the classifier. For example, the client device and/or the server 120 may select the characteristics having the highest probability. To continue the example above, the client device and/or the server 120 may identify the user 102 as a male child speaking British English. In some cases, the client device 100 may apply a minimum threshold probability to the selection. In such instances, where the classifier does not identify a characteristic as having a probability that exceeds the threshold, the client device 100 and/or server 120 may select a default characteristic such as "unknown," may prompt the user for additional information, and/or may cause an error message to be output to the user 102 (e.g., "please say that again," or "please provide additional information").
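The selection logic described here, picking the highest-probability characteristic but falling back to a default such as "unknown" when no probability clears a minimum threshold, can be sketched as follows (the 0.5 threshold is an illustrative choice, not a value from this disclosure):

```python
def select_characteristic(scores, threshold=0.5, default="unknown"):
    """Pick the most likely label from classifier output, falling back
    to a default when nothing clears the minimum threshold."""
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else default
```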
  • Once the characteristics of the user 102 have been identified, the client device 100 customizes the user interface 110 based on these characteristics. Customizing the user interface 110 may include, for example, changing the layout, font size, icon size, color scheme, wallpaper, icons and/or text to be displayed, animations, and any other items or settings for operation of the client device. Customizing the user interface may also include, for example, restricting and/or modifying the operation of one or more applications executing on the client device. The specific customizations that will correspond to various characteristics may be determined based on user information, demographic information, surveys, empirical observations, and/or any other suitable techniques.
  • For example, to continue the example where the user 102 has been identified as a male child speaking British English, the client device modifies the user interface to correspond to a child safe mode. In particular, before analyzing the voice characteristics of the user 102, the user interface 110 a includes a full complement of applications that may be accessed, including a camera application, a contact application, a calendar application, a search application, a messaging application, a browser application, a call application, and an email application. In contrast, after analyzing the voice characteristics, the user interface 110 b has been modified to permit access only to a camera application and a phone call application. The icons for both of these applications have also been enlarged to make them easier for children to understand and operate. In some implementations, the user interface 110 b for a child mode may also restrict or modify the operations of the applications that are accessible, for example by limiting the phone numbers that can be called (e.g., only able to call home), limiting the number of pictures that can be taken, or restricting the websites that can be accessed. For example, the client device 100 may provide information regarding the characteristics of the user to native and/or third-party applications with an API on the client device, which the applications may use to modify their operations.
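A hypothetical mapping from the identified age group to the customizations described above might look like the following; the profile keys and setting names are invented for illustration and do not represent an actual client-device API:

```python
# Hypothetical mapping from identified age group to UI customizations.
# The specific settings mirror the examples in the text, but the keys
# and structure are illustrative assumptions.
UI_PROFILES = {
    "child": {
        "icon_size": "large",
        "show_text_labels": False,
        "allowed_apps": ["camera", "call"],
        "call_whitelist": ["home"],
        "web_browsing": "restricted",
    },
    "elderly": {
        "icon_size": "large",
        "font_size": "large",
        "allowed_apps": "all",
        "web_browsing": "unrestricted",
    },
    "adult": {
        "icon_size": "normal",
        "font_size": "normal",
        "allowed_apps": "all",
        "web_browsing": "unrestricted",
    },
}

def customize_user_interface(age_group):
    # Fall back to the adult profile for unknown classifications.
    return UI_PROFILES.get(age_group, UI_PROFILES["adult"])
```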
  • For situations in which the systems discussed here collect personal information about users, or may make use of personal information, the users may be provided with an opportunity to control whether programs or features collect personal information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current location), or to control whether and/or how to receive content from the content server that may be more relevant to the user. In addition, certain data may be anonymized in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user's identity may be anonymized so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where location information is obtained (such as to a city, ZIP code, or state level), so that a particular location of a user cannot be determined. Thus, the user may have control over how information is collected about him or her and used by a content server.
  • FIG. 2 is a diagram 200 that illustrates an example of processing for speech recognition using neural networks. The operations discussed are described as being performed by the server 120, but may be performed by other systems, including combinations of the client device 100 and/or multiple computing systems. While the example architecture described with reference to FIG. 2 includes i-vector inputs into a neural network classifier, the present disclosure is not limited to this architecture. For example, any suitable inputs representing audio characteristics of the speaker, such as MFCCs or PLPs, could be used. In particular, a neural network could be trained directly from acoustic features (MFCCs or PLPs), e.g., a neural network could receive an MFCC vector (X) and predict the characteristic that maximizes P(L|X) in a similar manner as described below. As another example, any suitable classifier may be used, such as an SVM or logistic regression.
  • The server 120 receives data about an audio signal 210 that includes speech to be recognized. The server 120 or another system then performs feature extraction on the audio signal 210. For example, the server 120 analyzes different segments or analysis windows 220 of the audio signal 210. The windows 220 are labeled w0 . . . wn, and as illustrated, the windows 220 can overlap. For example, each window 220 may include 25 ms of the audio signal 210, and a new window 220 may begin every 10 ms. For example, the window 220 labeled w0 may represent the portion of the audio signal 210 from a start time of 0 ms to an end time of 25 ms, and the next window 220, labeled w1, may represent the portion of the audio signal 210 from a start time of 10 ms to an end time of 35 ms. In this manner, each window 220 includes 15 ms of the audio signal 210 that is included in the previous window 220.
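The windowing scheme described above (25 ms windows starting every 10 ms, so that consecutive windows share 15 ms of audio) can be sketched as:

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, window_ms=25, hop_ms=10):
    """Split audio into overlapping analysis windows: 25 ms windows
    starting every 10 ms, so consecutive windows share 15 ms."""
    win = int(sample_rate * window_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(samples) - win) // hop)
    return np.stack([samples[i * hop : i * hop + win] for i in range(n_frames)])
```

For one second of 16 kHz audio this yields 98 windows of 400 samples each, and the last 240 samples (15 ms) of each window reappear at the start of the next.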
  • The server 120 performs a Fast Fourier Transform (FFT) on the audio in each window 220. The results of the FFT are shown as time-frequency representations 230 of the audio in each window 220. From the FFT data for a window 220, the server 120 extracts features that are represented as an acoustic feature vector 240 for the window 220. The acoustic features may be determined by binning according to filterbank energy coefficients, using an MFCC transform, using a PLP transform, or using other techniques. In some implementations, the logarithm of the energy in each of various bands of the FFT may be used to determine acoustic features.
  • The acoustic feature vectors 240, labeled v1 . . . vn, include values corresponding to each of multiple dimensions. As an example, each acoustic feature vector 240 may include a value for a PLP feature, a value for a first order temporal difference, and a value for a second order temporal difference, for each of 13 dimensions, for a total of 39 dimensions per acoustic feature vector 240. Each acoustic feature vector 240 represents characteristics of the portion of the audio signal 210 within its corresponding window 220.
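Appending first- and second-order temporal differences to 13 base features to form 39-dimensional vectors, as described above, can be sketched as follows; simple adjacent-frame differences are used here for brevity, whereas production front ends typically fit a regression over several neighboring frames:

```python
import numpy as np

def add_deltas(features):
    """Stack base features with first- and second-order temporal
    differences, turning 13-dimensional vectors into 39-dimensional
    ones (base + delta + delta-delta)."""
    delta = np.diff(features, axis=0, prepend=features[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([features, delta, delta2])
```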
  • The server 120 may also obtain an i-vector 250. For example, the server 120 may process the audio signal 210 with an acoustic model 260 to obtain the i-vector 250. In the example, the i-vector 250 indicates latent variables of multivariate factor analysis. The i-vector 250 may be normalized, for example, to have zero mean and unit variance. In addition, or as an alternative, the i-vector 250 may be projected, for example, using principal component analysis (PCA) or linear discriminant analysis (LDA). Techniques for obtaining an i-vector are described further below with respect to FIG. 3.
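The zero-mean, unit-variance normalization mentioned above can be sketched as follows; for simplicity the statistics are estimated from the batch itself here, whereas a deployed system would typically use statistics computed over training data:

```python
import numpy as np

def normalize_ivectors(ivectors):
    """Normalize a batch of i-vectors to zero mean and unit variance
    per dimension."""
    mean = ivectors.mean(axis=0)
    std = ivectors.std(axis=0) + 1e-8   # guard against zero variance
    return (ivectors - mean) / std
```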
  • The server 120 uses a neural network 270 that can serve as an acoustic model and indicate likelihoods that acoustic feature vectors 240 represent different phonetic units. The neural network 270 includes an input layer 271, a number of hidden layers 272 a-272 c, and an output layer 273. The neural network 270 receives an i-vector as input. For example, the first hidden layer 272 a has connections from the i-vector input portion of the input layer 271, where such connections are not present in typical neural networks used for speech recognition.
  • The neural network 270 has been trained to estimate likelihoods that an i-vector represents various speaker characteristics. For example, during training, input to the neural network 270 may be i-vectors corresponding to the utterances from which the acoustic feature vectors were derived. The various training data sets can include i-vectors derived from utterances from multiple speakers.
  • To classify speaker characteristics from the audio signal 210 using the neural network 270, the server 120 inputs the i-vector 250 at the input layer 271 of the neural network 270. At the output layer 273, the neural network 270 indicates likelihoods that the speech corresponds to specific speaker characteristics. The output layer 273 provides predictions or probabilities for these characteristics given the data at the input layer 271. The output layer 273 can provide a value, for each of the speaker characteristics of interest. Because the i-vector 250 indicates constant or overall properties of the audio signal 210 as a whole, the information in the i-vector 250 is independent of the particular acoustic states that may occur at specific windows 220.
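A minimal forward pass for a network of this shape, taking an i-vector at the input layer and producing a probability for each speaker characteristic at the output layer, might look like the following. The layer sizes and ReLU activation are assumptions chosen for the sketch; trained weights would come from the annotated training data described earlier:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_ivector(ivector, weights):
    """Forward pass of a small feed-forward network mapping an i-vector
    to a probability distribution over speaker-characteristic labels.
    `weights` is a list of (W, b) pairs, one per layer."""
    h = ivector
    for w, b in weights[:-1]:
        h = np.maximum(0.0, h @ w + b)   # hidden layers with ReLU
    w_out, b_out = weights[-1]
    return softmax(h @ w_out + b_out)    # output layer: probabilities
```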
  • In some implementations, the i-vector 250 is based on a current utterance i-vector derived from the current utterance (e.g., the particular audio signal 210) being recognized. In some implementations, the i-vector 250 may be a speaker i-vector generated using multiple utterances of the speaker (e.g., utterances from multiple different recording sessions, such as recordings on different days). For example, multiple utterances for a speaker may be stored in association with a user profile, and the utterances may be retrieved to update a speaker i-vector for that user profile. To generate a speaker i-vector, an i-vector can be determined for each utterance in the set of multiple utterances of the speaker. The i-vectors can be averaged together to generate the speaker i-vector. In some implementations, where a speaker i-vector is used rather than an utterance i-vector derived from the utterance being recognized, post processing may include discriminative training, such as LDA, to identify attributes that are indicative of speaker characteristics. For example, various techniques can be used to isolate speaker characteristics, independent of noise, room characteristics, and other non-speaker-dependent characteristics.
  • In some implementations, the server 120 may identify the speaker and select an i-vector based on the speaker's identity. An i-vector may be calculated for each of multiple users, and the i-vectors may be stored in association with user profiles for those users for later use in recognizing speech of the corresponding users. The server 120 may receive a device identifier for a device, such as a mobile phone, that the speaker is using to record speech. In addition, or as an alternative, the server 120 may receive a user identifier that identifies the user, such as a name or user account login. The server 120 may identify the speaker as a user that owns the device or a user who is logged into a user account on the device. In some implementations, the server 120 may identify the speaker before recognition begins, or before audio is received during the current session. The server 120 may then look up the i-vector that corresponds to the identified user and use that i-vector to recognize received speech.
  • In some implementations, a successive approximation technique may be used to approximate and re-estimate the i-vector 250 while audio is received. The i-vector 250 may be re-estimated at a predetermined interval, for example, each time a threshold amount of new audio has been received. For example, a first i-vector may be estimated using the initial three seconds of audio received. Then, after another three seconds of audio has been received, a second i-vector may be estimated using the six seconds of audio received so far. After another three seconds, a third i-vector may be estimated using all nine seconds of audio received, and so on. The re-estimation period may occur at longer intervals, such as 10 seconds or 30 seconds, to reduce the amount of computation required. In some implementations, i-vectors are re-estimated at pauses in speech (e.g., as detected by a speech energy or voice activity detection algorithm), rather than at predetermined intervals.
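The fixed-interval re-estimation schedule described above can be sketched as follows, with `estimate_fn` standing in for the actual i-vector extractor (a hypothetical placeholder, not part of this disclosure):

```python
def reestimate_on_schedule(total_audio_s, interval_s=3.0, estimate_fn=None):
    """Re-estimate the i-vector each time another `interval_s` seconds
    of audio has accumulated, always using all audio received so far."""
    estimates = []
    received = interval_s
    while received <= total_audio_s:
        # Each estimate uses the full 0..received span, not just the
        # newest chunk, so accuracy improves as audio accumulates.
        estimates.append(estimate_fn(received) if estimate_fn else received)
        received += interval_s
    return estimates
```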
  • An i-vector derived from a small segment of an utterance may introduce some inaccuracy compared to an i-vector for the entire utterance, but as more audio is received, the estimated i-vectors approach the accuracy of an i-vector derived from the whole utterance. In addition, audio from recent utterances (e.g., audio from a predetermined number of most recent utterances or audio acquired within a threshold period of the current time) may be used with received audio to estimate the i-vectors, which may further reduce any inaccuracy present in the estimates.
  • In some implementations, the server 120 transitions from using a first i-vector to a second i-vector during recognition of an utterance. For example, the server 120 may begin by using a first i-vector derived from a previous utterance. After a threshold amount of audio has been received (e.g., 3, 5, 10, or 30 seconds), the server 120 generates a second i-vector based on the audio received in the current session and uses the second i-vector to process subsequently received audio.
  • FIG. 3 is a diagram 300 that illustrates an example of processing to generate latent variables of factor analysis. The example of FIG. 3 shows techniques for determining an i-vector, which includes these latent variables of factor analysis. I-vectors are time-independent components that represent overall characteristics of an audio signal rather than characteristics at a specific segment of time within an utterance. I-vectors can summarize a variety of characteristics of audio that are independent of the phonetic units spoken, for example, information indicative of the age, gender, emotion, and/or dialect of the speaker.
  • The example of FIG. 3 illustrates processing to calculate an i-vector 250 for a sample utterance 310. The server 120 accesses training data 320 that includes a number of utterances 321. The training data 320 may include utterances 321 including speech from different speakers, utterances 321 having different background noise conditions, and utterances 321 having other differences. Each of the utterances 321 is represented as a set of acoustic feature vectors. Each of the acoustic feature vectors can be, for example, a 39-dimensional vector determined in the same manner that the acoustic feature vectors 240 are determined in the example of FIG. 2.
  • The server 120 uses the utterances 321 to train a Gaussian mixture model (GMM) 330. For example, the GMM 330 may include 1000 39-dimensional Gaussians 331. The GMM 330 is trained using the acoustic feature vectors of the utterances 321 regardless of the phones or acoustic states that the acoustic feature vectors represent. As a result, acoustic feature vectors corresponding to different phones and acoustic states are used to train the GMM 330. For example, all of the acoustic feature vectors from all of the utterances 321 in the training data 320 can be used to train the GMM 330. In this respect, the GMM 330 is different from GMMs that are trained with only the acoustic feature vectors for a single phone or a single acoustic state.
  • When the sample utterance 310 is received, the server 120 determines acoustic feature vectors that describe the utterance 310. The server 120 classifies the acoustic feature vectors of the utterance 310 using the GMM 330. For example, the Gaussian 331 that corresponds to each acoustic feature vector of the sample utterance 310 may be identified. The server 120 then re-estimates the Gaussians 331 that are observed in the sample utterance 310, illustrated as re-estimated Gaussians 335 shown in dashed lines. As an example, a set of one or more acoustic feature vectors of the sample utterance 310 may be classified as matching a particular Gaussian 331 a from the GMM 330. Based on this set of acoustic feature vectors, the server 120 calculates a re-estimated Gaussian 335 a having a mean and/or variance different from the Gaussian 331 a. Typically, only some of the Gaussians 331 in the GMM 330 are observed in the sample utterance 310 and re-estimated.
  • The server 120 then identifies differences between the Gaussians 331 and the corresponding re-estimated Gaussians 335. For example, the server 120 may generate difference vectors that each indicate changes in parameters between a Gaussian 331 and its corresponding re-estimated Gaussian 335. Since each of the Gaussians is 39-dimensional, each difference vector can have 39 values, where each value indicates a change in one of the 39 dimensions.
  • The server 120 concatenates or stacks the difference vectors to generate a supervector 340. Because only some of the Gaussians 331 were observed and re-estimated, a value of zero (e.g., indicating no change from the original Gaussian 331) is included in the supervector 340 for each of the 39 dimensions of each Gaussian 331 that was not observed in the sample utterance 310. For a GMM 330 having 1000 Gaussians that are each 39-dimensional, the supervector 340 would include 39,000 elements. In many instances, Gaussians 331 and the corresponding re-estimated Gaussians 335 differ only in their mean values. The supervector 340 can represent the differences between the mean values of the Gaussians 331 and the mean values of the corresponding re-estimated Gaussians 335.
  • In addition to generating the supervector 340, the server 120 also generates a count vector 345 for the utterance 310. The values in the count vector 345 can represent 0th order Baum-Welch statistics, referred to as counts or accumulated posteriors. The count vector 345 can indicate the relative importance of the Gaussians 331 in the GMM 330. The count vector 345 includes a value for each Gaussian 331 in the GMM 330. As a result, for a GMM 330 having 1000 Gaussians, the count vector 345 for the utterance 310 would include 1,000 elements. Each value in the vector 345 can be the sum of the posterior probabilities of the feature vectors of the utterance 310 with respect to a particular Gaussian 331. For example, for a first Gaussian 331 a, the posterior probability of each feature vector in the utterance 310 is computed (e.g., the probability of occurrence of the feature vector as indicated by the first Gaussian 331 a). The sum of the posterior probabilities for the feature vectors in the utterance 310 is used as the value for the first Gaussian 331 a in the count vector 345. Posterior probabilities for each feature vector in the utterance 310 can be calculated and summed for each of the other Gaussians 331 to complete the count vector 345.
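The 0th-order statistics and mean-difference supervector described in this and the preceding paragraphs can be sketched for a small diagonal-covariance GMM as follows (an unoptimized illustration, not the production extraction pipeline):

```python
import numpy as np

def gmm_posteriors(frames, means, variances, weights):
    """Posterior probability of each diagonal-covariance Gaussian for
    each feature vector."""
    log_p = []
    for m, v, w in zip(means, variances, weights):
        ll = -0.5 * np.sum((frames - m) ** 2 / v + np.log(2 * np.pi * v), axis=1)
        log_p.append(np.log(w) + ll)
    log_p = np.stack(log_p, axis=1)              # (n_frames, n_gauss)
    log_p -= log_p.max(axis=1, keepdims=True)    # stabilize exponentiation
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def supervector_and_counts(frames, means, variances, weights):
    """Count vector (0th-order statistics: summed posteriors) and
    mean-difference supervector (re-estimated mean minus original mean,
    stacked, with zeros where a Gaussian is effectively unobserved)."""
    post = gmm_posteriors(frames, means, variances, weights)
    counts = post.sum(axis=0)                    # (n_gauss,)
    diffs = []
    for k in range(len(means)):
        if counts[k] > 1e-8:
            new_mean = (post[:, k:k+1] * frames).sum(axis=0) / counts[k]
            diffs.append(new_mean - means[k])
        else:
            diffs.append(np.zeros_like(means[k]))  # unobserved Gaussian
    return np.concatenate(diffs), counts
```

With two 2-dimensional Gaussians and frames that all fall near the first one, the second Gaussian's block of the supervector stays zero and the counts sum to the number of frames.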
  • In the same manner that the supervector 340 and count vector 345 were generated for the sample utterance 310, the server 120 generates a supervector 350 and a count vector 355 for each of the utterances 321 in the training data 320. The GMM 330, the supervectors 350, and the count vectors 355 may be generated and stored before receiving the sample utterance 310. Then, when the sample utterance 310 is received, the previously generated GMM 330, supervectors 350, and count vectors 355 can be accessed from storage, which limits the amount of computation necessary to generate an i-vector for the sample utterance 310.
  • The server 120 uses the supervectors 350 to create a factor analysis module 360. The factor analysis module 360, like the GMM 330 and the supervectors 350, may be generated in advance of receiving the sample utterance 310. The factor analysis module 360 can perform multivariate factor analysis to project a supervector to a lower-dimensional vector that represents particular factors of interest. For example, the factor analysis module may project a supervector of 39,000 elements to a vector of only a few thousand elements or only a few hundred elements.
  • The factor analysis module 360, like the GMM 330, is trained using a collection of utterances, which may be the utterances in the same training data 320 used to generate the GMM 330. An adapted or re-estimated GMM may be determined for each of the i utterances [U1, U2, . . . , Ui] in the training data 320, in the same manner that the re-estimated Gaussians 335 are determined for the utterance 310. A supervector 350 [S1, S2, . . . , Si] and count vector 355 [C1, C2, . . . , Ci] for each utterance [U1, U2, . . . , Ui] is also determined. Using the vector pairs [Si, Ci] for each utterance, the factor analysis module 360 is trained to learn the common range of movement of the adapted or re-estimated GMMs for the utterances [U1, U2, . . . , Ui] relative to the general GMM 330. Difference parameters between re-estimated GMMs and the GMM 330 are then constrained to move only over the identified common directions of movement in the space of the supervectors. Movement is limited to a manifold, and the variables that describe the position of the difference parameters over the manifold are denoted as i-vectors. As a result, the factor analysis module 360 learns a correspondence [Si, Ci]-->i-vectori, such that Si/Ci=f(i-vectori), where f( ) is a linear function f(x)=T*x and T is a matrix.
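Given the linear relation stated above, Si/Ci = T*i-vectori, an i-vector for a new utterance can be recovered from its supervector and count vector once T is known. The sketch below is a deliberate simplification — production i-vector extractors use a count-weighted MAP point estimate rather than plain least squares, and the names here are invented for illustration:

```python
import numpy as np

def estimate_ivector(supervector, counts, T_matrix, dims_per_gaussian):
    """Solve the relation S/C = T @ ivec for ivec by least squares.

    supervector: (M * D,) stacked mean differences for one utterance
    counts: (M,) 0th-order stats; T_matrix: (M * D, K) learned factor matrix
    """
    # Normalize each Gaussian's block of the supervector by its count,
    # guarding against unobserved Gaussians (count near zero).
    norm = np.repeat(np.maximum(counts, 1e-8), dims_per_gaussian)
    target = supervector / norm
    ivec, *_ = np.linalg.lstsq(T_matrix, target, rcond=None)
    return ivec
```

Here K would be the low dimensionality of the i-vector (e.g., a few hundred elements versus the 39,000 elements of the supervector).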
  • The server 120 inputs the supervector 340 and count vector 345 for the sample utterance 310 to the trained factor analysis module 360. The output of the factor analysis module 360 is the i-vector 250, which includes latent variables of multivariate factor analysis. The i-vector 250 represents time-independent characteristics of the sample utterance 310 rather than characteristics of a particular window or subset of windows within the sample utterance 310. In some implementations, the i-vector 250 may include, for example, approximately 300 elements.
  • FIG. 4 is a flow diagram that illustrates an example of a process for customizing a user interface based on characteristics of a speaker. The process 400 may be performed by data processing apparatus, such as the client device 100 described above or another data processing apparatus.
  • In step 402, the client device selects a user profile associated with the user interface. For example, the client device may select a user profile based on one or more of a password, voice recognition, speech recognition, fingerprint recognition, or facial recognition. In some instances, the client device may have a default user profile, in which case the client device typically selects and operates using the default user profile.
  • In step 404, after selecting the user profile, the client device obtains an audio signal encoding an utterance of the speaker. For example, the client device may receive an utterance of the speaker at a microphone, and encode the utterance into an audio signal such as, for example, a 16 kHz lossless audio signal.
  • The client device (optionally in combination with one or more servers), in step 406, processes the audio signal to identify one or more characteristics of the speaker. In some implementations, the characteristics may include one or more of an age, gender, emotion, and/or dialect of the speaker. For example, the client device may provide the audio signal to a trained classifier (e.g., a neural network, SVM, or a Gaussian mixture model) that outputs likelihoods associated with one or more characteristics of the speaker. The client device may then select characteristics having the highest likelihood, and/or apply a threshold to identify characteristics of the speaker. Alternatively or in addition, the client device may transmit the audio signal to a server, which inputs the audio signal to a trained classifier, identifies characteristics of the speaker using the classifier, and then transmits the identified characteristics back to the client device.
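The highest-likelihood selection with a confidence threshold described in this step can be sketched as below. The dictionary shape of the classifier output and the threshold value are assumptions for illustration, not details from the disclosure:

```python
def select_characteristics(likelihoods, threshold=0.6):
    """Pick the most likely value per attribute, keeping only confident ones.

    likelihoods: hypothetical classifier output, e.g.
        {"age": {"child": 0.85, "adult": 0.15}, "gender": {...}}
    """
    selected = {}
    for attribute, scores in likelihoods.items():
        value, prob = max(scores.items(), key=lambda kv: kv[1])
        if prob >= threshold:  # drop attributes the classifier is unsure about
            selected[attribute] = value
    return selected
```

An attribute whose best score falls below the threshold is simply omitted, so downstream customization can fall back to defaults for it.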
  • Finally, in step 408, the client device customizes the user interface associated with the user profile based on the identified characteristics. For example, the client device may change layout, font size, icon size, color scheme, wallpaper, icons and/or text to be displayed, animations, and any other items or settings for operation of the client device. In some cases, the client device may restrict access to applications that were previously accessible, and/or provide access to applications that were previously inaccessible. The client device also may, for example, restrict and/or modify the operation of one or more applications executing on the client device. For example, the client device may restrict a web browsing application to provide access to only a limited set of websites.
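A characteristic-to-customization mapping of the kind described in step 408 might look like the following sketch. The rules, setting names, and the allowlisted domain are all hypothetical stand-ins for whatever policy a real device would apply:

```python
def customize_interface(settings, characteristics):
    """Return a new settings dict adjusted for the identified speaker traits."""
    updated = dict(settings)  # leave the caller's settings untouched
    age = characteristics.get("age")
    if age == "child":
        updated["font_size"] = "large"
        # Restrict previously accessible applications and web browsing.
        updated["blocked_apps"] = updated.get("blocked_apps", []) + ["payments"]
        updated["browser_allowlist"] = ["kids.example.com"]  # illustrative domain
    elif age == "senior":
        updated["font_size"] = "extra_large"
        updated["icon_size"] = "large"
    return updated
```

Because the function returns a modified copy, the profile's baseline settings survive for the next speaker.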
  • In some implementations, the client device may also provide one or more of the characteristics of the speaker to native and/or third-party applications executing on the client device. The client device may provide users with an option to decide whether to share this information with native and/or third-party applications.
  • FIG. 5 is a flow diagram that illustrates another example of a process for customizing a user interface based on characteristics of a speaker. The process 500 may be performed by data processing apparatus, such as the client device 100 described above or another data processing apparatus.
  • In step 502, the client device obtains an audio signal encoding an utterance of the speaker. For example, the client device may receive an utterance of the speaker at a microphone, and encode the utterance into an audio signal such as, for example, a 16 kHz lossless audio signal.
  • In step 504, the client device (optionally in combination with one or more servers) performs speech recognition, voice recognition, or both on the audio signal to select and/or access the user profile associated with the user interface. In some instances, the client device may have a default user profile, in which case the client device may provide access to the default user profile when speech recognition and/or voice recognition successfully authenticates the speaker.
  • The client device (optionally in combination with one or more servers), in step 506, also processes the audio signal to identify one or more characteristics of the speaker. In some implementations, the characteristics may include one or more of an age, gender, emotion, and/or dialect of the speaker. For example, the client device may provide the audio signal to a trained classifier (e.g., a neural network, SVM, or a Gaussian mixture model) that outputs likelihoods associated with one or more characteristics of the speaker. The client device may then select characteristics having the highest likelihood, and/or apply a threshold to identify characteristics of the speaker. Alternatively or in addition, the client device may transmit the audio signal to a server, which inputs the audio signal to a trained classifier, identifies characteristics of the speaker using the classifier, and then transmits the identified characteristics back to the client device.
  • Finally, in step 508, the client device customizes the user interface associated with the user profile based on the identified characteristics. For example, the client device may change layout, font size, icon size, color scheme, wallpaper, icons and/or text to be displayed, animations, and any other items or settings for operation of the client device. In some cases, the client device may restrict access to applications that were previously accessible, and/or provide access to applications that were previously inaccessible. The client device also may, for example, restrict and/or modify the operation of one or more applications executing on the client device. For example, the client device may restrict a web browsing application to provide access to only a limited set of websites.
  • In some implementations, the client device may also provide one or more of the characteristics of the speaker to native and/or third-party applications executing on the client device. The client device may provide users with an option to decide whether to share this information with native and/or third-party applications.
  • A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, various forms of the flows shown above may be used, with steps re-ordered, added, or removed.
  • Embodiments and all of the functional operations described in this specification may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments may be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of them. The computer-readable medium may be a non-transitory computer-readable medium. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
  • A computer program (also known as a program, software, software application, script, or code) may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program may be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program may be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • The processes and logic flows described in this specification may be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows may also be performed by, and apparatus may also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer may be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
  • To provide for interaction with a user, embodiments may be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
  • Embodiments may be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the techniques disclosed, or any combination of one or more such back end, middleware, or front end components. The components of the system may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • While this specification contains many specifics, these should not be construed as limitations, but rather as descriptions of features specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
  • Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.
  • Thus, particular embodiments have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims may be performed in a different order and still achieve desirable results.

Claims (24)

1. A computer-implemented method comprising:
selecting, at a client device, a user profile associated with a user interface;
obtaining, at the client device, an audio signal encoding an utterance of a speaker;
providing the audio signal as an input to a neural network;
identifying a particular characteristic that (i) is indicated, based on an output of the neural network, as likely associated with the speaker, and (ii) is associated with at least one person other than the speaker;
determining one or more particular customizations that correspond to the particular characteristic based on customizations for the at least one person other than the speaker; and
customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations.
2. The method of claim 1, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises changing a font size of the user interface based on the one or more particular customizations.
3. The method of claim 1, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises changing a color scheme of the user interface based on the one or more particular customizations.
4. The method of claim 1, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises restricting access to one or more applications on the user interface based on the one or more particular customizations.
5. The method of claim 1, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises providing access to one or more applications on the user interface based on the one or more particular customizations.
6. (canceled)
7. The method of claim 1, wherein selecting, at a client device, a user profile comprises selecting, at a client device, a user profile based on one or more of a password, voice recognition, speech recognition, fingerprint recognition, or facial recognition.
8. The method of claim 1, wherein selecting, at a client device, a user profile comprises selecting, at a client device, a default user profile for the client device.
9. The method of claim 1, wherein the particular characteristic comprises one or more of an age, a gender, an emotion, or a dialect.
10. (canceled)
11. The method of claim 1, further comprising providing the particular characteristic to a third-party application to allow the third-party application to modify operation of the third-party application using the particular characteristic.
12. A computer-implemented method comprising:
obtaining, at a client device, an audio signal encoding an utterance of a speaker;
performing speech recognition, voice recognition, or both on the audio signal to select a user profile associated with a user interface;
providing the audio signal as an input to a neural network;
identifying a particular characteristic that (i) is indicated, based on an output of the neural network, as likely associated with the speaker, and (ii) is associated with at least one person other than the speaker;
determining one or more particular customizations that correspond to the particular characteristic based on customizations for the at least one person other than the speaker; and
customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations.
13. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
selecting, at a client device, a user profile associated with a user interface;
obtaining, at the client device, an audio signal encoding an utterance of a speaker;
providing the audio signal as an input to a neural network;
identifying a particular characteristic that (i) is indicated, based on an output of the neural network, as likely associated with the speaker, and (ii) is associated with at least one person other than the speaker;
determining one or more particular customizations that correspond to the particular characteristic based on customizations for the at least one person other than the speaker; and
customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations.
14. The computer-readable medium of claim 13, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises changing a font size of the user interface based on the one or more particular customizations.
15. The computer-readable medium of claim 13, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises changing a color scheme of the user interface based on the one or more particular customizations.
16. The computer-readable medium of claim 13, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises restricting access to one or more applications on the user interface based on the one or more particular customizations.
17. The computer-readable medium of claim 13, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises providing access to one or more applications on the user interface based on the one or more particular customizations.
18. (canceled)
19. The computer-readable medium of claim 13, wherein selecting, at a client device, a user profile comprises selecting, at a client device, a user profile based on one or more of a password, voice recognition, speech recognition, fingerprint recognition, or facial recognition.
20. The computer-readable medium of claim 13, wherein selecting, at a client device, a user profile comprises selecting, at a client device, a default user profile for the client device.
21. The method of claim 1, wherein customizing, by the client device, the user interface associated with the user profile based on the one or more particular customizations comprises restricting access to one or more features of an application based on the one or more particular customizations.
22. The method of claim 21, wherein restricting access to one or more features of the application based on the one or more particular customizations comprises providing a safe mode operation of the application.
23. The method of claim 1, further comprising disabling, by the client device based on the one or more particular customizations, updating of search logs.
24. The method of claim 1, further comprising disabling, by the client device based on the one or more particular customizations, adaptation of a speaker profile in response to any utterances by the speaker.
US14/096,608 2013-12-04 2013-12-04 User interface customization based on speaker characteristics Abandoned US20150154002A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US14/096,608 US20150154002A1 (en) 2013-12-04 2013-12-04 User interface customization based on speaker characteristics
US15/230,891 US11137977B2 (en) 2013-12-04 2016-08-08 User interface customization based on speaker characteristics
US17/136,069 US11403065B2 (en) 2013-12-04 2020-12-29 User interface customization based on speaker characteristics
US17/811,793 US11620104B2 (en) 2013-12-04 2022-07-11 User interface customization based on speaker characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US14/096,608 US20150154002A1 (en) 2013-12-04 2013-12-04 User interface customization based on speaker characteristics

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/230,891 Continuation US11137977B2 (en) 2013-12-04 2016-08-08 User interface customization based on speaker characteristics

Publications (1)

Publication Number Publication Date
US20150154002A1 true US20150154002A1 (en) 2015-06-04

Family

ID=53265383

Family Applications (4)

Application Number Title Priority Date Filing Date
US14/096,608 Abandoned US20150154002A1 (en) 2013-12-04 2013-12-04 User interface customization based on speaker characteristics
US15/230,891 Active 2034-02-22 US11137977B2 (en) 2013-12-04 2016-08-08 User interface customization based on speaker characteristics
US17/136,069 Active US11403065B2 (en) 2013-12-04 2020-12-29 User interface customization based on speaker characteristics
US17/811,793 Active US11620104B2 (en) 2013-12-04 2022-07-11 User interface customization based on speaker characteristics

Family Applications After (3)

Application Number Title Priority Date Filing Date
US15/230,891 Active 2034-02-22 US11137977B2 (en) 2013-12-04 2016-08-08 User interface customization based on speaker characteristics
US17/136,069 Active US11403065B2 (en) 2013-12-04 2020-12-29 User interface customization based on speaker characteristics
US17/811,793 Active US11620104B2 (en) 2013-12-04 2022-07-11 User interface customization based on speaker characteristics

Country Status (1)

Country Link
US (4) US20150154002A1 (en)

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130215126A1 (en) * 2012-02-17 2013-08-22 Monotype Imaging Inc. Managing Font Distribution
US20150161995A1 (en) * 2013-12-06 2015-06-11 Nuance Communications, Inc. Learning front-end speech recognition parameters within neural network training
US20150161999A1 (en) * 2013-12-09 2015-06-11 Ravi Kalluri Media content consumption with individualized acoustic speech recognition
US20170316790A1 (en) * 2016-04-27 2017-11-02 Knuedge Incorporated Estimating Clean Speech Features Using Manifold Modeling
US20170372706A1 (en) * 2015-02-11 2017-12-28 Bang & Olufsen A/S Speaker recognition in multimedia system
US20180039888A1 (en) * 2016-08-08 2018-02-08 Interactive Intelligence Group, Inc. System and method for speaker change detection
US20180067664A1 (en) * 2016-09-06 2018-03-08 Acronis International Gmbh System and method for backing up social network data
US9934785B1 (en) * 2016-11-30 2018-04-03 Spotify Ab Identification of taste attributes from an audio signal
US9983775B2 (en) 2016-03-10 2018-05-29 Vignet Incorporated Dynamic user interfaces based on multiple data sources
US20180211651A1 (en) * 2017-01-26 2018-07-26 David R. Hall Voice-Controlled Secure Remote Actuation System
US20180293990A1 (en) * 2015-12-30 2018-10-11 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for processing voiceprint authentication
US10115215B2 (en) 2015-04-17 2018-10-30 Monotype Imaging Inc. Pairing fonts for presentation
US20180359197A1 (en) * 2016-07-20 2018-12-13 Ping An Technology (Shenzhen) Co., Ltd. Automatic reply method, device, apparatus, and storage medium
US20180374498A1 (en) * 2017-06-23 2018-12-27 Casio Computer Co., Ltd. Electronic Device, Emotion Information Obtaining System, Storage Medium, And Emotion Information Obtaining Method
US20190005949A1 (en) * 2017-06-30 2019-01-03 International Business Machines Corporation Linguistic profiling for digital customization and personalization
US10304475B1 (en) * 2017-08-14 2019-05-28 Amazon Technologies, Inc. Trigger word based beam selection
US10311874B2 (en) 2017-09-01 2019-06-04 4Q Catalyst, LLC Methods and systems for voice-based programming of a voice-controlled device
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
WO2019162054A1 (en) * 2018-02-20 2019-08-29 Koninklijke Philips N.V. System and method for client-side physiological condition estimations based on a video of an individual
US20200050347A1 (en) * 2018-08-13 2020-02-13 Cal-Comp Big Data, Inc. Electronic makeup mirror device and script operation method thereof
Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10042593B2 (en) * 2016-09-02 2018-08-07 Datamax-O'neil Corporation Printer smart folders using USB mass storage profile
US11514465B2 (en) * 2017-03-02 2022-11-29 The Nielsen Company (Us), Llc Methods and apparatus to perform multi-level hierarchical demographic classification
CN109146450A (en) * 2017-06-16 2019-01-04 阿里巴巴集团控股有限公司 Method of payment, client, electronic equipment, storage medium and server
US10715604B1 (en) 2017-10-26 2020-07-14 Amazon Technologies, Inc. Remote system processing based on a previously identified user
US10567515B1 (en) * 2017-10-26 2020-02-18 Amazon Technologies, Inc. Speech processing performed with respect to first and second user profiles in a dialog session
CN111316277A (en) * 2017-11-09 2020-06-19 深圳传音通讯有限公司 Mobile terminal and computer readable storage medium for data erasure
CN109448735B (en) * 2018-12-21 2022-05-20 深圳创维-Rgb电子有限公司 Method and device for adjusting video parameters based on voiceprint recognition and read storage medium
CN110083430B (en) * 2019-04-30 2022-03-29 成都映潮科技股份有限公司 System theme color changing method, device and medium
US11444893B1 (en) * 2019-12-13 2022-09-13 Wells Fargo Bank, N.A. Enhanced chatbot responses during conversations with unknown users based on maturity metrics determined from history of chatbot interactions
US11875785B2 (en) * 2021-08-27 2024-01-16 Accenture Global Solutions Limited Establishing user persona in a conversational system

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040030750A1 (en) * 2002-04-02 2004-02-12 Worldcom, Inc. Messaging response system
US20040059705A1 (en) * 2002-09-25 2004-03-25 Wittke Edward R. System for timely delivery of personalized aggregations of, including currently-generated, knowledge
US20040224771A1 (en) * 2003-05-09 2004-11-11 Chen Ling Tony Web access to secure data
US20040235530A1 (en) * 2003-05-23 2004-11-25 General Motors Corporation Context specific speaker adaptation user interface
US20050016360A1 (en) * 2003-07-24 2005-01-27 Tong Zhang System and method for automatic classification of music
US6868525B1 (en) * 2000-02-01 2005-03-15 Alberti Anemometer Llc Computer graphic display visualization system and method
US20080147404A1 (en) * 2000-05-15 2008-06-19 Nusuara Technologies Sdn Bhd System and methods for accent classification and adaptation
US20080172610A1 (en) * 2005-03-11 2008-07-17 Paul Blair Customizable User Interface For Electronic Devices
US20100086277A1 (en) * 2008-10-03 2010-04-08 Guideworks, Llc Systems and methods for deleting viewed portions of recorded programs
US8050922B2 (en) * 2006-02-21 2011-11-01 Sony Computer Entertainment Inc. Voice recognition with dynamic filter bank adjustment based on speaker categorization
US20120166190A1 (en) * 2010-12-23 2012-06-28 Electronics And Telecommunications Research Institute Apparatus for removing noise for sound/voice recognition and method thereof
US20120215639A1 (en) * 2005-09-14 2012-08-23 Jorey Ramer System for Targeting Advertising to Mobile Communication Facilities Using Third Party Data
US20120260294A1 (en) * 2000-03-31 2012-10-11 United Video Properties, Inc. User speech interfaces for interactive media guidance applications
US20130110511A1 (en) * 2011-10-31 2013-05-02 Telcordia Technologies, Inc. System, Method and Program for Customized Voice Communication
US20130139229A1 (en) * 2011-11-10 2013-05-30 Lawrence Fried System for sharing personal and qualifying data with a third party
US20130145457A1 (en) * 2011-12-01 2013-06-06 Matthew Nicholas Papakipos Protecting Personal Information Upon Sharing a Personal Computing Device
US20140282061A1 (en) * 2013-03-14 2014-09-18 United Video Properties, Inc. Methods and systems for customizing user input interfaces

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7263489B2 (en) * 1998-12-01 2007-08-28 Nuance Communications, Inc. Detection of characteristics of human-machine interactions for dialog customization and analysis
US6308155B1 (en) * 1999-01-20 2001-10-23 International Computer Science Institute Feature extraction for automatic speech recognition
US6594629B1 (en) * 1999-08-06 2003-07-15 International Business Machines Corporation Methods and apparatus for audio-visual speech detection and recognition
US7222075B2 (en) * 1999-08-31 2007-05-22 Accenture Llp Detecting emotions using voice signal analysis
EP1217610A1 (en) * 2000-11-28 2002-06-26 Siemens Aktiengesellschaft Method and system for multilingual speech recognition
US6964023B2 (en) * 2001-02-05 2005-11-08 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
US20020162031A1 (en) * 2001-03-08 2002-10-31 Shmuel Levin Method and apparatus for automatic control of access
US20020194003A1 (en) * 2001-06-05 2002-12-19 Mozer Todd F. Client-server security system and method
US20030233233A1 (en) * 2002-06-13 2003-12-18 Industrial Technology Research Institute Speech recognition involving a neural network
US8321427B2 (en) * 2002-10-31 2012-11-27 Promptu Systems Corporation Method and apparatus for generation and augmentation of search terms from external and internal sources
US20050070276A1 (en) 2003-09-26 2005-03-31 Mcgarry Rob Systems and methods that provide modes of access for a phone
US8078465B2 (en) * 2007-01-23 2011-12-13 Lena Foundation System and method for detection and analysis of speech
EP1800293B1 (en) * 2004-09-17 2011-04-13 Agency for Science, Technology and Research Spoken language identification system and methods for training and operating same
KR100655491B1 (en) * 2004-12-21 2006-12-11 한국전자통신연구원 Two stage utterance verification method and device of speech recognition system
DE102005012729B4 (en) 2005-03-19 2020-09-24 Wera Werkzeuge Gmbh Screwdriving tool with exchangeable blade
ES2339130T3 (en) * 2005-06-01 2010-05-17 Loquendo S.P.A. PROCEDURE FOR ADAPTATION OF A NEURAL NETWORK OF AN AUTOMATIC SPEECH RECOGNITION DEVICE.
WO2007017853A1 (en) * 2005-08-08 2007-02-15 Nice Systems Ltd. Apparatus and methods for the detection of emotions in audio interactions
US8131718B2 (en) 2005-12-13 2012-03-06 Muse Green Investments LLC Intelligent data retrieval system
CA2536976A1 (en) * 2006-02-20 2007-08-20 Diaphonics, Inc. Method and apparatus for detecting speaker change in a voice transaction
US20080028326A1 (en) 2006-07-26 2008-01-31 Research In Motion Limited System and method for adaptive theming of a mobile device
US8370751B2 (en) 2007-08-31 2013-02-05 Sap Ag User interface customization system
US10460085B2 (en) * 2008-03-13 2019-10-29 Mattel, Inc. Tablet computer
US8799417B2 (en) 2008-04-24 2014-08-05 Centurylink Intellectual Property Llc System and method for customizing settings in a communication device for a user
US9667726B2 (en) 2009-06-27 2017-05-30 Ridetones, Inc. Vehicle internet radio interface
US8280726B2 (en) * 2009-12-23 2012-10-02 Qualcomm Incorporated Gender detection in mobile phones
US8718633B2 (en) * 2011-07-13 2014-05-06 Qualcomm Incorporated Intelligent parental controls for wireless devices
US8498491B1 (en) * 2011-08-10 2013-07-30 Google Inc. Estimating age using multiple classifiers
US20130097416A1 (en) 2011-10-18 2013-04-18 Google Inc. Dynamic profile switching
US9400893B2 (en) 2011-12-15 2016-07-26 Facebook, Inc. Multi-user login for shared mobile devices
TWI473080B (en) * 2012-04-10 2015-02-11 Nat Univ Chung Cheng The use of phonological emotions or excitement to assist in resolving the gender or age of speech signals
US20130297927A1 (en) 2012-05-07 2013-11-07 Samsung Electronics Co., Ltd. Electronic device and method for managing an electronic device setting thereof
US9570066B2 (en) * 2012-07-16 2017-02-14 General Motors Llc Sender-responsive text-to-speech processing
KR102118209B1 (en) * 2013-02-07 2020-06-02 애플 인크. Voice trigger for a digital assistant
US10438581B2 (en) * 2013-07-31 2019-10-08 Google Llc Speech recognition using neural networks
US20150154002A1 (en) 2013-12-04 2015-06-04 Google Inc. User interface customization based on speaker characteristics

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6868525B1 (en) * 2000-02-01 2005-03-15 Alberti Anemometer Llc Computer graphic display visualization system and method
US20120260294A1 (en) * 2000-03-31 2012-10-11 United Video Properties, Inc. User speech interfaces for interactive media guidance applications
US20130185080A1 (en) * 2000-03-31 2013-07-18 United Video Properties, Inc. User speech interfaces for interactive media guidance applications
US20080147404A1 (en) * 2000-05-15 2008-06-19 Nusuara Technologies Sdn Bhd System and methods for accent classification and adaptation
US20040030750A1 (en) * 2002-04-02 2004-02-12 Worldcom, Inc. Messaging response system
US20040059705A1 (en) * 2002-09-25 2004-03-25 Wittke Edward R. System for timely delivery of personalized aggregations of, including currently-generated, knowledge
US20040224771A1 (en) * 2003-05-09 2004-11-11 Chen Ling Tony Web access to secure data
US20040235530A1 (en) * 2003-05-23 2004-11-25 General Motors Corporation Context specific speaker adaptation user interface
US20050016360A1 (en) * 2003-07-24 2005-01-27 Tong Zhang System and method for automatic classification of music
US20080172610A1 (en) * 2005-03-11 2008-07-17 Paul Blair Customizable User Interface For Electronic Devices
US20120215639A1 (en) * 2005-09-14 2012-08-23 Jorey Ramer System for Targeting Advertising to Mobile Communication Facilities Using Third Party Data
US8050922B2 (en) * 2006-02-21 2011-11-01 Sony Computer Entertainment Inc. Voice recognition with dynamic filter bank adjustment based on speaker categorization
US20100086277A1 (en) * 2008-10-03 2010-04-08 Guideworks, Llc Systems and methods for deleting viewed portions of recorded programs
US20120166190A1 (en) * 2010-12-23 2012-06-28 Electronics And Telecommunications Research Institute Apparatus for removing noise for sound/voice recognition and method thereof
US20130110511A1 (en) * 2011-10-31 2013-05-02 Telcordia Technologies, Inc. System, Method and Program for Customized Voice Communication
US20130139229A1 (en) * 2011-11-10 2013-05-30 Lawrence Fried System for sharing personal and qualifying data with a third party
US20130145457A1 (en) * 2011-12-01 2013-06-06 Matthew Nicholas Papakipos Protecting Personal Information Upon Sharing a Personal Computing Device
US20140282061A1 (en) * 2013-03-14 2014-09-18 United Video Properties, Inc. Methods and systems for customizing user input interfaces

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Neural Networks, 1994 IEEE World Congress on Computational Intelligence / 1994 IEEE International Conference on Neural Networks (Volume 7), Date of Conference: 27 Jun-2 Jul 1994, Page(s): 4483-4486 vol. 7, Print ISBN: 0-7803-1901-X, INSPEC Accession Number: 4956992 *

Cited By (86)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10572574B2 (en) 2010-04-29 2020-02-25 Monotype Imaging Inc. Dynamic font subsetting using a file size threshold for an electronic document
US20130215126A1 (en) * 2012-02-17 2013-08-22 Monotype Imaging Inc. Managing Font Distribution
US11137977B2 (en) 2013-12-04 2021-10-05 Google Llc User interface customization based on speaker characteristics
US11403065B2 (en) 2013-12-04 2022-08-02 Google Llc User interface customization based on speaker characteristics
US11620104B2 (en) 2013-12-04 2023-04-04 Google Llc User interface customization based on speaker characteristics
US20150161995A1 (en) * 2013-12-06 2015-06-11 Nuance Communications, Inc. Learning front-end speech recognition parameters within neural network training
US10360901B2 (en) * 2013-12-06 2019-07-23 Nuance Communications, Inc. Learning front-end speech recognition parameters within neural network training
US20150161999A1 (en) * 2013-12-09 2015-06-11 Ravi Kalluri Media content consumption with individualized acoustic speech recognition
US10868858B2 (en) 2014-05-15 2020-12-15 Universal Electronics Inc. System and method for appliance detection and app configuration
US11445011B2 (en) 2014-05-15 2022-09-13 Universal Electronics Inc. Universal voice assistant
US11451618B2 (en) * 2014-05-15 2022-09-20 Universal Electronics Inc. Universal voice assistant
US10893094B2 (en) * 2014-05-15 2021-01-12 Universal Electronics Inc. System and method for appliance detection and app configuration
US10354657B2 (en) * 2015-02-11 2019-07-16 Bang & Olufsen A/S Speaker recognition in multimedia system
US20170372706A1 (en) * 2015-02-11 2017-12-28 Bang & Olufsen A/S Speaker recognition in multimedia system
US10115215B2 (en) 2015-04-17 2018-10-30 Monotype Imaging Inc. Pairing fonts for presentation
US11537262B1 (en) 2015-07-21 2022-12-27 Monotype Imaging Inc. Using attributes for font recommendations
US20180293990A1 (en) * 2015-12-30 2018-10-11 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for processing voiceprint authentication
US10685658B2 (en) * 2015-12-30 2020-06-16 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for processing voiceprint authentication
US9983775B2 (en) 2016-03-10 2018-05-29 Vignet Incorporated Dynamic user interfaces based on multiple data sources
US20170316790A1 (en) * 2016-04-27 2017-11-02 Knuedge Incorporated Estimating Clean Speech Features Using Manifold Modeling
US11366568B1 (en) * 2016-06-20 2022-06-21 Amazon Technologies, Inc. Identifying and recommending events of interest in real-time media content
US10909989B2 (en) * 2016-07-15 2021-02-02 Tencent Technology (Shenzhen) Company Limited Identity vector generation method, computer device, and computer-readable storage medium
US20180359197A1 (en) * 2016-07-20 2018-12-13 Ping An Technology (Shenzhen) Co., Ltd. Automatic reply method, device, apparatus, and storage medium
US10404629B2 (en) * 2016-07-20 2019-09-03 Ping An Technology (Shenzhen) Co., Ltd. Automatic reply method, device, apparatus, and storage medium
US10535000B2 (en) * 2016-08-08 2020-01-14 Interactive Intelligence Group, Inc. System and method for speaker change detection
US20180039888A1 (en) * 2016-08-08 2018-02-08 Interactive Intelligence Group, Inc. System and method for speaker change detection
US10712951B2 (en) * 2016-09-06 2020-07-14 Acronis International Gmbh System and method for backing up social network data
US20180067664A1 (en) * 2016-09-06 2018-03-08 Acronis International Gmbh System and method for backing up social network data
US11675971B1 (en) 2016-09-29 2023-06-13 Vignet Incorporated Context-aware surveys and sensor data collection for health research
US11507737B1 (en) 2016-09-29 2022-11-22 Vignet Incorporated Increasing survey completion rates and data quality for health monitoring programs
US11501060B1 (en) 2016-09-29 2022-11-15 Vignet Incorporated Increasing effectiveness of surveys for digital health monitoring
US11244104B1 (en) 2016-09-29 2022-02-08 Vignet Incorporated Context-aware surveys and sensor data collection for health research
US10891948B2 (en) 2016-11-30 2021-01-12 Spotify Ab Identification of taste attributes from an audio signal
US9934785B1 (en) * 2016-11-30 2018-04-03 Spotify Ab Identification of taste attributes from an audio signal
US10810999B2 (en) * 2017-01-26 2020-10-20 Hall Labs Llc Voice-controlled secure remote actuation system
US20180211651A1 (en) * 2017-01-26 2018-07-26 David R. Hall Voice-Controlled Secure Remote Actuation System
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
US10580433B2 (en) * 2017-06-23 2020-03-03 Casio Computer Co., Ltd. Electronic device, emotion information obtaining system, storage medium, and emotion information obtaining method
US20180374498A1 (en) * 2017-06-23 2018-12-27 Casio Computer Co., Ltd. Electronic Device, Emotion Information Obtaining System, Storage Medium, And Emotion Information Obtaining Method
US11202729B2 (en) 2017-06-27 2021-12-21 Stryker Corporation Patient support apparatus user interfaces
US11096850B2 (en) 2017-06-27 2021-08-24 Stryker Corporation Patient support apparatus control systems
US11337872B2 (en) 2017-06-27 2022-05-24 Stryker Corporation Patient support systems and methods for assisting caregivers with patient care
US10811136B2 (en) 2017-06-27 2020-10-20 Stryker Corporation Access systems for use with patient support apparatuses
US11559450B2 (en) 2017-06-27 2023-01-24 Stryker Corporation Patient support apparatus user interfaces
US11382812B2 (en) 2017-06-27 2022-07-12 Stryker Corporation Patient support systems and methods for assisting caregivers with patient care
US11710556B2 (en) 2017-06-27 2023-07-25 Stryker Corporation Access systems for use with patient support apparatuses
US11810667B2 (en) 2017-06-27 2023-11-07 Stryker Corporation Patient support systems and methods for assisting caregivers with patient care
US11484451B1 (en) 2017-06-27 2022-11-01 Stryker Corporation Patient support apparatus user interfaces
US10762895B2 (en) * 2017-06-30 2020-09-01 International Business Machines Corporation Linguistic profiling for digital customization and personalization
US20190005949A1 (en) * 2017-06-30 2019-01-03 International Business Machines Corporation Linguistic profiling for digital customization and personalization
US10304475B1 (en) * 2017-08-14 2019-05-28 Amazon Technologies, Inc. Trigger word based beam selection
US10311874B2 (en) 2017-09-01 2019-06-04 4Q Catalyst, LLC Methods and systems for voice-based programming of a voice-controlled device
US11334750B2 (en) 2017-09-07 2022-05-17 Monotype Imaging Inc. Using attributes for predicting imagery performance
US11727939B2 (en) 2017-09-11 2023-08-15 Telefonaktiebolaget Lm Ericsson (Publ) Voice-controlled management of user profiles
US11227605B2 (en) 2017-09-11 2022-01-18 Telefonaktiebolaget Lm Ericsson (Publ) Voice-controlled management of user profiles
US11430449B2 (en) * 2017-09-11 2022-08-30 Telefonaktiebolaget Lm Ericsson (Publ) Voice-controlled management of user profiles
US10909429B2 (en) 2017-09-27 2021-02-02 Monotype Imaging Inc. Using attributes for identifying imagery for selection
US11657602B2 (en) 2017-10-30 2023-05-23 Monotype Imaging Inc. Font identification from imagery
WO2019162054A1 (en) * 2018-02-20 2019-08-29 Koninklijke Philips N.V. System and method for client-side physiological condition estimations based on a video of an individual
US20210038088A1 (en) * 2018-02-20 2021-02-11 Koninklijke Philips N.V. System and method for client-side physiological condition estimations based on a video of an individual
US11904224B2 (en) * 2018-02-20 2024-02-20 Koninklijke Philips N.V. System and method for client-side physiological condition estimations based on a video of an individual
US11409417B1 (en) 2018-08-10 2022-08-09 Vignet Incorporated Dynamic engagement of patients in clinical and digital health research
US11520466B1 (en) 2018-08-10 2022-12-06 Vignet Incorporated Efficient distribution of digital health programs for research studies
US10775974B2 (en) 2018-08-10 2020-09-15 Vignet Incorporated User responsive dynamic architecture
US20200050347A1 (en) * 2018-08-13 2020-02-13 Cal-Comp Big Data, Inc. Electronic makeup mirror device and script operation method thereof
US11315553B2 (en) 2018-09-20 2022-04-26 Samsung Electronics Co., Ltd. Electronic device and method for providing or obtaining data for training thereof
US11436414B2 (en) * 2018-11-15 2022-09-06 National University Of Defense Technology Device and text representation method applied to sentence embedding
US11514920B2 (en) 2018-12-18 2022-11-29 Yandex Europe Ag Method and system for determining speaker-user of voice-controllable device
US11011174B2 (en) 2018-12-18 2021-05-18 Yandex Europe Ag Method and system for determining speaker-user of voice-controllable device
US11700412B2 (en) * 2019-01-08 2023-07-11 Universal Electronics Inc. Universal voice assistant
US20210120301A1 (en) * 2019-01-08 2021-04-22 Universal Electronics Inc. Universal voice assistant
WO2020146105A1 (en) * 2019-01-08 2020-07-16 Universal Electronics Inc. Universal voice assistant
US11792185B2 (en) 2019-01-08 2023-10-17 Universal Electronics Inc. Systems and methods for associating services and/or devices with a voice assistant
US11665757B2 (en) * 2019-01-08 2023-05-30 Universal Electronics Inc. Universal audio device pairing assistant
US20210368562A1 (en) * 2019-01-08 2021-11-25 Universal Electronics Inc. Universal audio device pairing assistant
US11776539B2 (en) 2019-01-08 2023-10-03 Universal Electronics Inc. Voice assistant with sound metering capabilities
US11238979B1 (en) 2019-02-01 2022-02-01 Vignet Incorporated Digital biomarkers for health research, digital therapeautics, and precision medicine
US10803875B2 (en) * 2019-02-08 2020-10-13 Nec Corporation Speaker recognition system and method of using the same
US20200258527A1 (en) * 2019-02-08 2020-08-13 Nec Corporation Speaker recognition system and method of using the same
EP3751403A1 (en) 2019-06-12 2020-12-16 Koninklijke Philips N.V. An apparatus and method for generating a personalized virtual user interface
CN112083985A (en) * 2019-06-12 2020-12-15 皇家飞利浦有限公司 Apparatus and method for generating personalized virtual user interface
EP3751402A1 (en) * 2019-06-12 2020-12-16 Koninklijke Philips N.V. An apparatus and method for generating a personalized virtual user interface
US11763919B1 (en) 2020-10-13 2023-09-19 Vignet Incorporated Platform to increase patient engagement in clinical trials through surveys presented on mobile devices
US11923079B1 (en) 2021-07-21 2024-03-05 Vignet Incorporated Creating and testing digital bio-markers based on genetic and phenotypic data for therapeutic interventions and clinical trials
US11705230B1 (en) 2021-11-30 2023-07-18 Vignet Incorporated Assessing health risks using genetic, epigenetic, and phenotypic data sources
US11901083B1 (en) 2021-11-30 2024-02-13 Vignet Incorporated Using genetic and phenotypic data sets for drug discovery clinical trials

Also Published As

Publication number Publication date
US20160342389A1 (en) 2016-11-24
US11620104B2 (en) 2023-04-04
US20220342632A1 (en) 2022-10-27
US11403065B2 (en) 2022-08-02
US11137977B2 (en) 2021-10-05
US20210117153A1 (en) 2021-04-22

Similar Documents

Publication Publication Date Title
US11620104B2 (en) User interface customization based on speaker characteristics
US10930271B2 (en) Speech recognition using neural networks
US9311915B2 (en) Context-based speech recognition
US10832662B2 (en) Keyword detection modeling using contextual information
TWI719304B (en) Method, apparatus and system for speaker verification
US10997980B2 (en) System and method for determining voice characteristics
US10771627B2 (en) Personalized support routing based on paralinguistic information
US9401143B2 (en) Cluster specific speech model
CN105723450B (en) The method and system that envelope for language detection compares
US9711148B1 (en) Dual model speaker identification
US9589560B1 (en) Estimating false rejection rate in a detection system
WO2019027531A1 (en) Neural networks for speaker verification
US11562744B1 (en) Stylizing text-to-speech (TTS) voice response for assistant systems
KR20230116886A (en) Self-supervised speech representation for fake audio detection
US11831644B1 (en) Anomaly detection in workspaces
KR20230156145A (en) Hybrid multilingual text-dependent and text-independent speaker verification
Lopez‐Otero et al. Influence of speaker de‐identification in depression detection
Chavakula Analysis of audio data to measure social interaction in the treatment of autism spectrum disorder using speaker diarization and identification
JP2024510798A (en) Hybrid multilingual text-dependent and text-independent speaker verification
Yang A Real-Time Speech Processing System for Medical Conversations

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WEINSTEIN, EUGENE;MORENO, IGNACIO L.;REEL/FRAME:032002/0907

Effective date: 20131204

AS Assignment

Owner name: GOOGLE LLC, CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044567/0001

Effective date: 20170929

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION