WO2021216299A1 - Voice characteristic machine learning modelling - Google Patents

Voice characteristic machine learning modelling

Info

Publication number
WO2021216299A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
phrase
vocalized
voice
voice model
Prior art date
Application number
PCT/US2021/026476
Other languages
French (fr)
Inventor
Saurjya Sarkar
Nidhin Balakumar VELAVENDAN
Original Assignee
Qualcomm Incorporated
Priority date
Filing date
Publication date
Application filed by Qualcomm Incorporated filed Critical Qualcomm Incorporated
Publication of WO2021216299A1 publication Critical patent/WO2021216299A1/en


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L17/24 - Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/26 - Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Definitions

  • Machine learning algorithms, neural networks and other advanced intelligence computing systems often rely on training using data sets of labelled data.
  • The task of labelling the data (e.g., text, audio, images, video, etc.), however, can be manual and time consuming.
  • One way to obtain labelled images has been to use a “captcha” system.
  • In a captcha system, a user is provided with image data and is told to select, for example, all the items that include a vehicle.
  • These captcha systems have been used as a way to confirm that a user is not a robot and further gain labelled image sets that can be used to train data image classifiers.
  • A voice modelling system (e.g., a voice captcha system) can be used in a similar way to gather labelled voice data.
  • The voice modelling system instructs the user to say a phrase, for example, to confirm the user is not a robot. While confirming the user is not a robot is useful, the features of the user’s voice may also be extracted from the known vocalization. Unlike a speech-to-text system, which tries to decipher what the user is saying, this system knows what the user has said (e.g., because the system provided the text for the user to say). The features of the vocalization are extracted and used, along with known characteristics of the user, to train a voice model.
  • the voice model may map the features (e.g., phonemes, formants, and/or frequency modulations of the vocalization that may be related to the user’s vocabulary, accent, and/or temperament) to the known characteristics of the user.
  • the trained voice model may later be used to predict characteristics about an unknown user.
  • the trained voice model may be used to generate a vocalization of textual data with specific characteristics.
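  • As an illustration of the characteristic-prediction use described above, the following is a minimal sketch (not the patent's implementation) of a voice model that maps per-utterance feature vectors to a known user characteristic; the feature dimensionality, labels, and choice of a random forest classifier are assumptions made for the example.

```python
# Hypothetical sketch: train a "voice model" that maps extracted vocal features
# to a known user characteristic (here, an age group), then predict the
# characteristic for an unseen vocalization.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data: one row of extracted features per vocalized phrase
# (e.g., mean F0, formant estimates, MFCC statistics) and one known label.
X_train = np.random.rand(200, 16)
y_train = np.random.choice(["18-30", "31-50", "51+"], size=200)

voice_model = RandomForestClassifier(n_estimators=100, random_state=0)
voice_model.fit(X_train, y_train)

# Features extracted from an unknown speaker's vocalization can later be
# mapped to a predicted characteristic.
predicted_age_group = voice_model.predict(np.random.rand(1, 16))
```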
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a computer-implemented method performed by a computing system.
  • a computer-implemented method includes: providing, by a computing system to a user device, a textual phrase for a user to vocalize; receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extracting, by the computing system, features from the vocalized phrase; and generating a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
  • a system includes one or more processors and a memory having stored thereon instructions that upon execution of the instructions by the one or more processors, cause the one or more processors to: provide, to a user device, a textual phrase for a user to vocalize; receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extract features from the vocalized phrase; and generate a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
  • a non-transitory computer-readable medium has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: provide, to a user device, a textual phrase for a user to vocalize; receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extract features from the vocalized phrase; and generate a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
  • In another example, an apparatus includes: means for providing, to a user device, a textual phrase for a user to vocalize; means for receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; means for extracting, by the computing system, features from the vocalized phrase; and means for generating a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
  • the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
  • the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
  • extracted features and known characteristics of many users are used to train the machine learning algorithm.
  • the method, apparatuses, and computer-readable medium described above further comprise: receiving, by the computing system, a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; providing, by the computing system, the second vocalized phrase to the voice model; and generating, by the voice model, a prediction of characteristics of the second user.
  • the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
  • the method, apparatuses, and computer-readable medium described above further comprise: providing, by the computing system, a textual input and at least one desired characteristic to the voice model; and receiving, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristic.
  • the voice model maps the desired characteristic to features of a voice to generate the predicted voice representation used to generate the audio sample.
  • the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
  • the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
  • training the machine learning algorithm is based on a context of the vocalized phrase.
  • the context of the vocalized phrase includes background noise or environmental sounds.
  • the apparatus is, is part of, and/or includes a camera, a mobile device (e.g., a mobile telephone or so-called “smart phone” or other mobile device), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer, a vehicle or a computing device or component of a vehicle, or other device.
  • the apparatus includes a camera or multiple cameras for capturing one or more images.
  • the apparatus further includes a display for displaying one or more images, notifications, and/or other displayable data.
  • the apparatuses described above can include one or more sensors (e.g., one or more inertial measurement units (IMUs), such as one or more gyrometers, one or more accelerometers, any combination thereof, and/or other sensor).
  • FIG. 1 is a voice modelling system for generating a voice model by training a machine learning algorithm, according to some embodiments.
  • FIG. 2 is a deployed system for predicting characteristics of an unknown user based on a vocalization from the user using the voice model, according to some embodiments.
  • FIG. 3 is a deployed system for generating a vocalization of textual content having desired characteristics using a trained voice model, according to some embodiments.
  • FIG. 4 is a flow diagram of a method of training a machine learning algorithm, according to some embodiments.
  • FIG. 5 is a block diagram illustrating an example of a deep learning neural network, according to some embodiments.
  • FIG. 6 is a block diagram illustrating an example of a convolutional neural network (CNN), according to some embodiments.
  • FIG. 7 is a block diagram of an embodiment of a user equipment (UE), according to some embodiments.
  • FIG. 8 is a block diagram of an embodiment of a computer system.
  • multiple instances of an element may be indicated by following a first number for the element with a letter or a hyphen and a second number.
  • multiple instances of an element 105 may be indicated as 105-1, 105-2, 105-3 etc. or as 105a, 105b, 105c, etc.
  • when only the first number is used, any instance of the element is to be understood (e.g., element 105 in the previous example would refer to elements 105-1, 105-2, and 105-3 or to elements 105a, 105b, and 105c).
  • the described system can be used to generate large amounts of labelled and/or annotated voice data, which can be used to generate voice models for classifying characteristics of users based on their voices or to generate voices based on characteristics of users.
  • the voice modelling system 100 includes user devices 105a, 105b, through 105n, a training system 110, and a data set database 115.
  • the voice modelling system 100 may include more or fewer components than illustrated in FIG. 1.
  • one or more components may be cloud based.
  • training system 110 may be provided as a cloud service.
  • N is an integer greater than or equal to one.
  • User devices 105a, 105b, through 105n may be any suitable user devices including, for example, tablets, home computers, smartphones, and the like.
  • the user devices 105a, 105b, through 105n may include some or all of the components of computer system 800 described with respect to FIG. 8, and/or the user devices 105a, 105b, through 105n may be the user equipment (UE) 700 described with respect to FIG. 7.
  • An example user device 105a may include display 120, microphone 125, audio embedder 130, characteristic collection module 135, communication module 140, and user interface (UI) subsystem 145.
  • the user devices 105b through 105n may contain similar components that perform similar functionality to the components described for example user device 105a.
  • the user devices 105a, 105b, through 105n may include more or fewer components to perform the functionality described herein without departing from the scope of the description.
  • Display 120 may be any visual display device that can display textual content.
  • display 120 may be a touchscreen, non-touchscreen, liquid crystal display (LCD), light emitting diode (LED) display, or the like.
  • Microphone 125 may be any microphone device that can capture audio from the user. Specifically, when the user vocalizes or reads text aloud, the microphone captures the audio of the user’s voice vocalization of the text the user read or spoke.
  • Audio embedder 130 may be any audio embedder that uses the captured audio from the microphone to embed the audio into an embedding space.
  • audio embedder 130 may convert the captured audio into embedded audio in the form of raw audio, compressed audio, a mel spectrogram, mel-frequency cepstral coefficients (MFCC), a constant Q transform (CQT), or a short-term Fourier Transform (STFT).
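  • As a hedged illustration of the kinds of embeddings mentioned above, the following sketch (assuming the librosa library and a hypothetical audio file) computes a mel spectrogram, MFCCs, a CQT, and an STFT from a captured vocalization.

```python
# Illustrative audio embedding of a captured vocalization (file name is hypothetical).
import librosa
import numpy as np

y, sr = librosa.load("vocalized_phrase.wav", sr=16000)        # raw audio

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)   # mel spectrogram
log_mel = librosa.power_to_db(mel)                            # log-scaled mel, common model input
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # mel-frequency cepstral coefficients
cqt = np.abs(librosa.cqt(y, sr=sr))                           # constant Q transform magnitude
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # short-term Fourier transform magnitude
```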
  • Characteristic collection module 135 may collect characteristics of the owner (or, in some cases, characteristics of multiple users) of a user device 105a of the user devices 105a, 105b, through 105n. Characteristics may include the user’s age, nationality, gender, primary language, education level, health, social preferences, geo-location including residential geographic location as well as current location, and the like. Because user device 105a belongs to the user, the user’s known characteristics may be captured from the device in many contexts. For example, when setting up the user device 105a, the user may select a primary language. The user may have accounts, such as email accounts or other accounts for which the user provides other characteristic information including their age, nationality, gender, and so forth.
  • the user device 105a may include a global positioning system transceiver such that the location of the user device 105a at the time of a vocalization may be known.
  • characteristics of the user may be collected from additional sensors of the user device 105a (not shown). For example, heart rate and other information can provide an indication of the user’s current state such as, for example, relaxed, excited, agitated, and the like.
  • some characteristics of the user can be collected from a wearable device in communication with user device 105a.
  • the wearable device can also be a user device (e.g., user device 105b) containing similar components to user device 105a.
  • the characteristic collection module 135 may collect the underlying information (e.g., user’s heart rate) and determine the user’s state or provide the underlying information as a characteristic.
  • the information on the characteristics of the user may be collected by the characteristic collection module 135 and provided to the communication module 140 with the embedded audio.
  • the characteristic collection module 135 may also collect information that may be used to determine the context of the user’s vocalization; for example, background noise or environmental sounds may provide context for the vocalization.
  • the characteristic collection module may combine the background noise or environmental sound information with geo-location information to determine a context of the vocalization.
  • the context may also be provided with the characteristics and text embedding to the communication module 140.
  • Communication module 140 may be any suitable communication module for transmitting data to communication module 150 of training system 110.
  • communication module 140 may include the hardware and software needed for sending the information (e.g., user characteristics, context of the user’s vocalization, and/or embedded audio) via packets on a network.
  • the communication module 140 may be communication subsystem 830 as described with respect to FIG. 8 and/or wireless communication interface 730 with antenna 732 for sending signals 734 containing the information as described with respect to FIG. 7.
  • UI subsystem 145 may be any suitable user interface system that allows the user to interface with components of user device 105a and the training system 110.
  • the user may use the display 120 to view a menu that allows the user to request a textual sample for reading aloud.
  • the user may request a story to read to their child.
  • the UI subsystem 145 may also allow the information from training system to be displayed on display 120.
  • the user may be able to make selections for participation in the training system including, for example, to allow the training system to request a voice authentication (e.g., a voice “captcha”) every time or some frequency of times the user may access the user device 105a.
  • the user may access the user device 105a using a password or pin, for example, and the training system 110 may provide a text sample to use for voice authentication to access the user device 105a as well.
  • the user may be able to select, using the UI subsystem 145, that voice authentication is requested only one time per day, for example.
  • the UI subsystem 145 may also provide the context of the captured voice for a given instance of providing audio to the training system 110. For example, if the user requested a story, the UI subsystem 145 may provide the context as a story reading session for the audio captured by microphone 125 and provided via communication module 140.
  • the UI subsystem 145 may provide the context as voice authentication phrase, for example.
  • Training system 110 may be any suitable computer system or group of computer systems that can receive the information (e.g., data) from the user devices 105a, 105b, through 105n and use it to train modelling module 160.
  • training system 110 may be a computing system such as computer system 800 as described with respect to FIG. 8.
  • training system 110 may be a cloud-based system that offers the training and functionality of training system 110 as a cloud based service.
  • Communication module 150 may be any suitable communication module for transmitting data to communication modules 140 of user devices 105a, 105b, through 105n.
  • communication module 150 may include the hardware and software needed for sending information packets on a network for communication with user devices 105a, 105b, through 105n.
  • the communication module 150 may be communication subsystem 830 as described with respect to FIG. 8.
  • communication module 150 may be the same or similar to the wireless communication interface 730 with antenna 732 for sending signals 734 as described with respect to FIG. 7.
  • Feature extraction module 180 may receive the embedded audio from user devices 105a, 105b, through 105n and extract features from the audio.
  • the embedded audio may be, for example, the raw audio, compressed audio, MFCC, CQT, or STFT.
  • the feature extraction module 180 can be trained to extract features from the above-mentioned representations, including identifying phonemes, formants, and/or frequency modulations that are related to the user’s vocabulary, accent, and/or temperament.
  • the feature extraction module 180 can be trained using embedded audio and labels as training data. During inference (after the feature extraction module 180 has been trained), the feature extraction module can extract features from embedded audio.
  • the extracted features may include a fundamental frequency modulation, harmonics, formants, phonemes, or a noise profile for the vocalization.
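  • The following sketch (librosa assumed; the thresholds and feature choices are illustrative, not the patent's) extracts a few of the features named above, such as a fundamental frequency contour and its modulation, a harmonic/residual split, and a rough noise profile; formant and phoneme extraction would require additional tooling.

```python
# Illustrative feature extraction from a vocalized phrase (file name hypothetical).
import librosa
import numpy as np

y, sr = librosa.load("vocalized_phrase.wav", sr=16000)

# Fundamental frequency (F0) track; its spread over time approximates the
# "fundamental frequency modulation" of the vocalization.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
f0_mean = np.nanmean(f0)
f0_modulation = np.nanstd(f0)

# Harmonic/percussive separation gives a crude split between harmonic content
# and the residual, noise-like part of the signal (a rough noise profile).
harmonic, residual = librosa.effects.hpss(y)
noise_profile = np.mean(np.abs(librosa.stft(residual)), axis=1)

features = np.concatenate([[f0_mean, f0_modulation], noise_profile[:32]])
```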
  • the feature extraction module 180 may also extract information from the data provided from communication module 150 including a context of the audio. For example, the context of reading a story may be different than speaking a phrase to gain access to the user device 105a (e.g., voice authentication). Accordingly, a context of the audio may be included in the data received from user device 105a.
  • the feature extraction module 180 may be included on the user device 105a and the extracted features provided to the training system 110 rather than the entire embedded audio.
  • the user’s characteristics and/or context of the audio extracted by feature extraction model 180 can be sent to the training system 110 with the extracted features.
  • the feature extraction module 180 may be, in some embodiments, a neural network based encoder-decoder pair for feature extraction from raw audio that is jointly optimized with an end-to-end training paradigm to extract a data driven feature representation optimized for a given classification or generation task.
  • the encoder may be a temporal convolutional neural network.
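  • As a sketch of the encoder side only (PyTorch assumed; layer sizes are arbitrary, and the decoder and end-to-end objective are omitted), a temporal convolutional encoder over raw audio might look like the following.

```python
# Hypothetical temporal convolutional encoder for raw audio of shape [batch, 1, samples].
import torch
import torch.nn as nn

class TemporalConvEncoder(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=8), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(128, feature_dim, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),               # pool over the time axis
        )

    def forward(self, raw_audio):                  # raw_audio: [batch, 1, samples]
        return self.net(raw_audio).squeeze(-1)     # -> [batch, feature_dim]

encoder = TemporalConvEncoder()
features = encoder(torch.randn(2, 1, 16000))       # two one-second clips at 16 kHz
```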
  • Modelling module 160 may be any suitable machine learning algorithm, such as a neural network, support vector machine (SVM), non-negative matrix factorization (NMF), Gaussian mixture models (GMMs), Bayesian inference models, hidden Markov models (HMM), independent component analysis, independent subspace analysis, decision tree, deep clustering, random forests, or the like.
  • the modelling module 160 may be a neural network that includes an input layer, an output layer, and hidden layers.
  • the layers may include input cells, hidden cells, output cells, recurrent cells, memory cells, kernels, convolutional cells, pool cells, or any other suitable cell types.
  • the modelling module 160 may be any type of machine learning algorithm including a neural network such as, for example, a radial basis neural network, recurrent neural network, long/short term memory neural network, gated recurrent neural network, auto encoder (AE) neural network, variational AE neural network, denoising AE neural network, sparse AE neural network, Markov chain neural network, Hopfield neural network, Boltzmann machine neural network, convolutional neural network, deconvolutional neural network, generative adversarial network, liquid state machine neural network, extreme learning machine neural network, Kohonen network, support vector machine neural network, neural Turing machine neural network, deep residual network, or any other type of neural network.
  • training system 110 may train the modelling module 160 to learn latent representations for vocal feature-user characteristic mapping (e.g., a voice model).
  • the voice model may map user characteristics to extracted features of the vocalizations.
  • modelling module 160 can be trained to map the extracted feature information relating to a user’s accent to the regional characteristic of the user.
  • Modelling module 160 may also be trained to map the other extracted features from the user’s vocalizations (e.g., temperament, vocabulary, etc.) to the characteristics of the user (e.g., age, nationality, education level, etc.).
  • the modelling module 160 may output characteristic information from input features extracted from voice embeddings (e.g., embedded audio from the audio embedder 130) and the loss calculation subsystem 165 may provide feedback loss values to the modelling module 160 for adjusting parameters of the modelling module 160, such as in a supervised training process using the annotated data received from user device 105a.
  • Illustrative examples of training neural networks using supervised learning are described below with respect to FIG. 5 and FIG. 6.
  • the modelling module 160 can be trained to generate the most accurate mapping of the extracted features from the vocalization of the user (e.g., phonemes, formants, frequency modulations, and/or a noise profile that may be related to the user’s vocabulary, accent, and/or temperament, and/or the like) to the characteristics of the user that were collected by the characteristic collection module 135.
  • the modelling module 160 may receive features extracted from embedded audio of vocalizations and determine characteristics of the user that uttered the vocalization based on the extracted features.
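  • The supervised training flow described above can be sketched as follows (PyTorch assumed; the network, loss, and optimizer choices are illustrative): the modelling module predicts characteristics from extracted features, a loss is computed against the known characteristics, and the loss is fed back to adjust the model's parameters.

```python
# Minimal supervised training step for a feature-to-characteristic voice model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 3))  # 3 characteristic classes
loss_fn = nn.CrossEntropyLoss()                  # stand-in for the loss calculation subsystem
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(extracted_features, known_characteristics):
    """extracted_features: [batch, 128]; known_characteristics: [batch] class indices."""
    predicted = model(extracted_features)        # predicted characteristics
    loss = loss_fn(predicted, known_characteristics)
    optimizer.zero_grad()
    loss.backward()                              # feed the loss back ...
    optimizer.step()                             # ... and adjust the model parameters
    return loss.item()

loss_value = training_step(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```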
  • training system 110 may train modelling module 160 to generate vocalizations (e.g., speech synthesis) based on a provided text or phrase and desired characteristics for the vocalization.
  • the desired characteristics for the vocalization can correspond to the known user characteristics described above.
  • the modelling module 160 can be trained to generate vocalizations by utilizing the embedded audio, the features generated by feature extraction module 180, and the known characteristics of the user.
  • modelling module 160 can be trained to directly generate mel spectrograms from text data and desired characteristics of the vocalization.
  • the modelling module 160 can be trained for use as a parametric speech synthesizer.
  • the modelling module 160 can be trained for speech synthesis in conjunction with a neural network that is trained using annotated data to rate the accuracy of generated speech for given characteristics.
  • a neural network may be pre-trained to use as a loss function for speech synthesis, or be used as the encoder for a variational autoencoder, or be jointly trained with the speech synthesizer as the discriminator in an adversarial training paradigm, for example.
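  • As a highly simplified sketch of conditioning speech synthesis on both text and desired characteristics (PyTorch assumed; this is not the patent's synthesizer, and a real system would typically use an autoregressive or adversarially trained decoder), a generator might produce mel-spectrogram frames as follows.

```python
# Hypothetical mel-spectrogram generator conditioned on text and desired characteristics.
import torch
import torch.nn as nn

class ConditionedMelGenerator(nn.Module):
    def __init__(self, vocab_size=256, num_characteristics=8, n_mels=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, 128)          # character-level text embedding
        self.char_embed = nn.Linear(num_characteristics, 128)    # desired-characteristic embedding
        self.decoder = nn.GRU(256, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, text_ids, characteristics):
        # text_ids: [batch, T]; characteristics: [batch, num_characteristics]
        t = self.text_embed(text_ids)                            # [batch, T, 128]
        c = self.char_embed(characteristics).unsqueeze(1)        # [batch, 1, 128]
        c = c.expand(-1, t.size(1), -1)                          # broadcast over time
        hidden, _ = self.decoder(torch.cat([t, c], dim=-1))      # [batch, T, 256]
        return self.to_mel(hidden)                               # [batch, T, n_mels]

generator = ConditionedMelGenerator()
mel = generator(torch.randint(0, 256, (1, 40)), torch.rand(1, 8))  # one 40-character phrase
```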
  • the text generation module 155 may generate or select the text provided for the user to vocalize.
  • the text may be randomly generated.
  • a large database of phrases may be created in advance and, in some embodiments, grouped into various categories to ensure that the corpus of phrases the user vocalizes provides a sufficient selection for training the modelling module 160.
  • a categorization of phrases that test the user’s vocabulary, accent, pronunciation of various sounds, and so forth may be generated. This categorization may be automated by, for example, using the length of words within the phrases to test vocabulary, using the number of letters within the phrase to test accent and pronunciation, and the like. A large number of phrases may then be categorized, and phrases selected such that at least some of the phrases from each category are presented to the user to ensure the provided vocalizations for that user are sufficient.
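  • A simple, assumption-laden sketch of such automated categorization (the thresholds are arbitrary) might look like the following.

```python
# Categorize a phrase for the phrase database: long words probe vocabulary,
# while phrases with many letters probe accent and pronunciation.
def categorize_phrase(phrase):
    words = phrase.split()
    categories = set()
    if max(len(w) for w in words) >= 9:        # long words -> vocabulary test
        categories.add("vocabulary")
    if sum(len(w) for w in words) >= 60:       # many letters -> accent/pronunciation test
        categories.add("accent_pronunciation")
    return categories or {"general"}

categorize_phrase("The quick brown fox jumps over the lazy dog")  # {'general'}
```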
  • the data set creation subsystem 170 may be used to generate labeled/annotated data sets that may be stored in data set database 115 and used for training other audio classification systems in the future. For example, the features for a user and the text spoken may be collected and stored together with other samples (features and known generated text associated with the features) for a data set that is specific to a user. In some embodiments, the samples may be combined to generate user generic data sets.
  • the user may attempt to access the user device 105a or request a story via the UI subsystem 145.
  • the communication module 140 may provide a notification of the request to the communication module 150, which provides the request to the text generation module 155.
  • the text generation module 155 generates a textual phrase or retrieves a story based on the request and provides the textual phrase or story text to communication module 150.
  • the communication module 150 provides the textual phrase to communication module 140.
  • the communication module 140 provides the textual phrase to display 120.
  • the user device 105a may have requested the textual phrase for display to the user in order to, for example, unlock the user device 105a. Having the user state a phrase to unlock the device may or may not be used for authentication; it may instead be used to ensure the user is not a robot.
  • an example purpose of having the user utter the phrase can be to capture the user’s vocalization of the generated phrase.
  • the phrase may be a single word, a set of words, a story, a portion of an ebook, an entire ebook, or any other suitable length of phrase (i.e., textual content).
  • if the user uses a storytelling application to request a story to tell their child, for example, the text generation module 155 may select a story and provide the story to the communication module 150, which provides the story to the communication module 140, which displays the story via the storytelling application on the display 120.
  • when the user receives the text from the text generation module, the user can vocalize the displayed text, and microphone 125 can capture the vocalization.
  • the audio embedder 130 captures the vocalization via the microphone 125 and embeds the vocalization into an embedding space (e.g., embedded audio).
  • the audio embedder 130 generates a mel spectrogram from the vocalization.
  • the audio embedder 130 provides the embedded audio to the characteristic collection module 135.
  • the characteristic collection module 135 may collect and attach the characteristic information of the user to the embedded audio for delivery to the communication module 140.
  • the audio embedder 130 may send a signal to characteristic collection module 135 to obtain the characteristics.
  • the audio embedder 130 may attach the characteristic information and provide the data to the communication module 140.
  • the communication module 140 can collect the characteristics from the characteristic collection module 135 and the embedded audio from the audio embedder 130.
  • the communication module 140, having received the information including the embedded audio of the vocalization of the phrase from the user in addition to the user characteristics, provides the information to the communication module 150.
  • communication module 150 provides the packet of information including the user characteristics and audio embedding to the feature extraction module 180.
  • the feature extraction module 180 can extract features from the embedded audio.
  • the features may include information that can be used to identify the user’s vocabulary, accent, and/or temperament.
  • the extracted features may also include a fundamental frequency modulation, phonemes, formants, harmonics, and/or a noise profile from the audio embedding.
  • the feature extraction module 180 may format the user characteristics and the extracted features from the audio embedding into a format suitable for providing to the loss calculation subsystem 165 and the modelling module 160, respectively.
  • the feature extraction module 180 provides the extracted features of the audio embedding to the modelling module 160.
  • the modelling module 160 uses the extracted features and a voice model to predict characteristics of the user and outputs them.
  • the loss calculation subsystem 165 receives the predicted characteristics from the modelling module 160 as well as the known user characteristics from the feature extraction module 180 and calculates one or more loss values. The loss values are fed back into the modelling module 160 to adjust parameters of the modelling module 160 to better predict user characteristics based on the extracted features.
  • the text generation module 155 may provide the known text that the user was asked to vocalize to the data set creation subsystem 170.
  • the feature extraction module 180 may provide the known user characteristics and the extracted features from the user’s vocalization to the data set creation subsystem 170.
  • in embodiments in which the audio sample of the vocalization is provided to the training system 110 for audio embedding, the entire audio sample may be provided to the data set creation subsystem 170.
  • the data set creation subsystem 170 may store the user characteristics, the extracted features, the audio sample, and/or the text sample in the data set database 115.
  • Each user device 105a, 105b, through 105n may have an associated user (or multiple associated users). Each associated user may provide many vocalizations. Accordingly, the data set creation subsystem 170 may create a data set for each user having many different text samples, vocalizations, features extracted from the vocalizations, and the user’s characteristics all associated in the data set. In some embodiments, all of the user information for many users is in a single data set.
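  • One possible way to store such an annotated sample is sketched below; the record schema and the JSON-lines storage are assumptions for illustration, not the patent's data set design.

```python
# Hypothetical record stored by a data set creation subsystem.
import json
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional

@dataclass
class VoiceSample:
    user_id: str
    text_phrase: str                        # the known text the user was asked to vocalize
    extracted_features: List[float]         # features extracted from the vocalization
    user_characteristics: Dict[str, str]    # e.g., {"age": "34", "primary_language": "en"}
    context: str = "voice_authentication"   # e.g., story reading vs. authentication
    audio_path: Optional[str] = None        # optional pointer to the stored audio sample

sample = VoiceSample("user-001", "open sesame", [0.12, 0.53], {"age": "34"})
with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(asdict(sample)) + "\n")
```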
  • FIG. 2 illustrates a system 200 for predicting characteristics of a user based on a captured vocalization.
  • the system 200 may include a capture device 205 and a prediction system 210. While a single capture device 205 is shown, the system 200 may include many capture devices 205.
  • the prediction system 210 may be a cloud based service or a remote server, for example.
  • the capture device 205 may be any suitable computing system having the described components in FIG. 2 and may include more components.
  • the capture device 205 may be a computing system such as computer system 800 as described with respect to FIG. 8, and capture device 205 may therefore include components of computer system 800, which have been omitted from capture device 205 for ease of description.
  • Capture device 205 may include a microphone 215, audio embedder 220, communication module 230, and user interface (UI) subsystem 225.
  • Microphone 215 may be the same as and/or perform similar functions to microphone 125 as described with respect to FIG. 1.
  • Audio embedder 220 may be the same as and/or perform similar functions to audio embedder 130 as described with respect to FIG. 1.
  • Communication module 230 may be the same as and/or perform similar functions to communication module 140 as described with respect to FIG. 1.
  • UI subsystem 225 may be an optional component that may provide a user interface to a user of capture device 205.
  • the capture device 205 may be an advertising system having a display screen.
  • the advertising system may be within a vehicle such as a shared ride vehicle, within an elevator, or any other location for which targeted advertising of a person may be desirable.
  • Prediction system 210 may be any suitable computing system having the described components in FIG. 2 and may include more components.
  • the prediction system 210 may be a computing system, such as computer system 800 as described with respect to FIG. 8, and prediction system 210 may therefore include components of computer system 800, which have been omitted from prediction system 210 for ease of description.
  • Prediction system 210 may include communication module 150, feature extraction module 180, characteristics subsystem 235, and modelling module 260.
  • modelling module 260 can be a trained modelling module 160 as illustrated with respect to FIG. 1 above, after having been trained by training system 110.
  • Characteristics subsystem 235 may obtain the output from modelling module 260 of predicted characteristics. Characteristics subsystem 235 may format the characteristics for use by a UI system and provide the predicted characteristics to communication module 150.
  • a user may speak, for example on a cellular device, to himself, or to another person within the vicinity.
  • the microphone 215 may capture the user’s spoken words (i.e., an audio sample), and the audio embedder 220 may embed the audio sample to generate, for example, a mel spectrogram of the audio sample.
  • the embedded audio sample can be provided to the communication module 230.
  • the communication module 230 can transmit the embedded audio to the communication module 150.
  • the communication module 150 provides the embedded audio to the feature extraction module 180.
  • the feature extraction module 180 extracts features from the embedded audio.
  • the extracted features may include information related to the user’s vocabulary, accent, and/or temperament.
  • the extracted features may also include a fundamental frequency modulation, phonemes, formant, harmonics, and/or a noise profile for the vocalization.
  • the extracted features are input to the modelling module 260, which uses a voice model (developed during training in training system 110) to predict characteristics of the user.
  • the predicted characteristics are output from modelling module 260 to characteristics subsystem 235.
  • Characteristics subsystem 235 may format the predicted characteristics and provide the information to communication module 150.
  • Communication module 150 may provide the predicted characteristics to communication module 230, which may provide the predicted characteristics to UI subsystem 225.
  • the UI subsystem 225 may use the predicted characteristics to generate output in the user interface. For example, the predicted characteristics may be output to a graphical user interface. As another example, advertising may be selected based on the predicted characteristics of the person.
  • FIG. 3 illustrates a system 300 for generating a voice representation (e.g., a vocalization or synthesized speech) based on desired characteristics and a textual selection.
  • the system 300 may include a user device 305 and a prediction system 310. While a single user device 305 is shown, the system 300 may include many user devices 305.
  • the prediction system 310 may be a cloud based service or a remote server, for example.
  • the user device 305 may be any suitable computing system having the described components in FIG. 3 and may include more components.
  • the user device 305 may be a computing system such as computer system 800 as described with respect to FIG. 8, and user device 305 may therefore include components of computer system 800, which have been omitted from user device 305 for ease of description.
  • User device 305 may include a speaker 315, UI subsystem 320, characteristic selection subsystem 325, communication module 335, and textual selection subsystem 330.
  • Speaker 315 may be any speaker device that can output an audible sound.
  • UI subsystem 320 may be any user interface for providing visual and audible output to a user via a display or, for example, speaker 315.
  • Characteristic selection subsystem 325 may be a system that a user may use, via the UI subsystem 320 to select the characteristics of a desired vocalization.
  • Textual selection subsystem 330 may be a system that a user may use, via the UI subsystem 320, to select desired text.
  • the user device 305 may be a system in which the user may select text, using the textual selection subsystem 330, and the user may select desired characteristics for a generated voice that vocalizes the selected text using the characteristic selection subsystem 325.
  • an audio book may be selected and a user may select desired characteristics for each character for vocalizing their selected portions or quotes. For example, an elderly woman from the Southern United States would have a different generated voice from a teenage boy from Chicago, Illinois.
  • Prediction system 310 may be any suitable computing system having the described components in FIG. 3 and may include more components.
  • the prediction system 310 may be a computing system, such as computer system 800 as described with respect to FIG. 8 and prediction system 310 may therefore include components of computer system 800, which have been omitted from prediction system 310 for ease of description.
  • Prediction system 310 may include communication module 150, characteristic extraction subsystem 340, textual extraction subsystem 345, audio generation subsystem 350, and modelling module 360.
  • modelling module 360 can be a trained modelling module 160 as illustrated in FIG. 1 above, after having been trained by training system 110.
  • Characteristic extraction subsystem 340 may obtain the information from communication module 150 and extract the desired characteristics selected by the user using characteristic selection subsystem 325.
  • Textual extraction subsystem 345 may obtain the information from communication module 150 and extract the selected text by the user using textual selection subsystem 330.
  • the audio generation subsystem 350 may obtain the selected text from the textual extraction subsystem 345 and the predicted voice features from the modelling module 360 that are output based on the selected desired voice characteristics.
  • a user may select desired text via the UI subsystem 320 using the textual selection subsystem 330 as well as the desired characteristics using the characteristic selection subsystem 325.
  • the UI subsystem 320 may provide the selected text and desired characteristics to communication module 335.
  • the communication module 335 may provide the selected text and desired characteristic information to the communication module 150.
  • the communication module 150 may provide the information to the characteristic extraction subsystem 340 and the textual extraction subsystem 345.
  • the characteristic extraction subsystem 340 may extract the selected characteristics, format the characteristics as needed by the modelling module 360, and submit the characteristic information to the modelling module 360.
  • the modelling module 360 uses the voice models to map the desired characteristics to the features of the voice for the user with the desired characteristics.
  • the modelling module 360 provides the predicted vocal features to the audio generation subsystem 350.
  • the textual extraction subsystem 345 may extract the selected text from the information and provide the selected text to the audio generation subsystem 350.
  • the audio generation subsystem 350 uses the selected text and the predicted vocal features to generate an audio sample of a predicted voice representation based on the selected features to vocalize the selected text.
  • the audio sample may be provided to the communication module 150.
  • the communication module 150 provides the audio sample to the communication module 335.
  • the communication module 335 provides the audio sample to the UI subsystem 320.
  • the UI subsystem 320 outputs the audio sample to the speaker 315 for the user to hear.
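  • An end-to-end sketch of this generation path, reusing the hypothetical ConditionedMelGenerator from the earlier example and librosa's Griffin-Lim based mel inversion as a stand-in vocoder, might look like the following; a deployed system would normally use a trained model and a neural vocoder.

```python
# Hypothetical text + desired-characteristics to audio synthesis path.
import librosa
import numpy as np
import torch

def synthesize(text_ids, desired_characteristics, generator, sr=16000):
    with torch.no_grad():
        mel = generator(text_ids, desired_characteristics)   # [1, T, n_mels]
    mel = np.abs(mel.squeeze(0).T.numpy())                   # -> non-negative [n_mels, T]
    # Invert the mel spectrogram to an audio sample the UI subsystem could
    # output through the speaker.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)

# audio = synthesize(torch.randint(0, 256, (1, 40)), torch.rand(1, 8), ConditionedMelGenerator())
```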
  • FIG. 4 is a flow diagram of a method 400 for generating voice models by training a machine learning algorithm.
  • Alternative embodiments may vary in function by combining, separating, or otherwise varying the functionality described in the blocks illustrated in FIG. 4.
  • Means for performing the functionality of one or more of the blocks illustrated in FIG. 4 may comprise hardware and/or software components of a computer system, such as the computer system 800 illustrated in FIG. 8 and described in more detail below.
  • the training system may provide a textual phrase for a user to vocalize to a user device (e.g., user device 105a).
  • the textual phrase may be generated randomly or selected from a database of phrases by, for example, text generation module 155.
  • Means for performing the functionality at block 405 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
  • the training system 110 may receive known characteristics of the user and a vocalized phrase associated with the user (e.g., spoken by the user), the vocalized phrase being a vocalization of the textual phrase.
  • the vocalized phrase can be received, for example, by a microphone.
  • the textual phrase may be output to a display (e.g., display 120) for the user to view, and the display may request that the user vocalize the textual phrase.
  • the display may ask the user to say the phrase to access the user device (e.g., user device 105a).
  • the user device may embed the audio sample obtained when the user speaks the phrase using the microphone (e.g., microphone 125).
  • the embedded audio sample, which may be a mel spectrogram based on the embedding, is provided to the training system (e.g., training system 110).
  • Means for performing the functionality at block 410 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
  • the training system 110 may extract features from the vocalized phrase.
  • the audio sample, which may be embedded into a mel spectrogram (or any other suitable representation), may have features extracted including, for example, information related to the user’s vocabulary, accent, and/or temperament.
  • the extracted features may also include fundamental frequency modulations, phonemes, formants, harmonics, and/or a noise profile for the vocalization.
  • the features may be extracted by, for example, feature extraction module 180.
  • Means for performing the functionality at block 415 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
  • the training system 110 may generate a voice model by training a machine learning algorithm (e.g., modelling module 160) using the extracted features and the known characteristics of the user.
  • the machine learning algorithm may receive the extracted features and map the features to predict the characteristics of the user.
  • the machine learning algorithm may receive the raw audio and generate feature representations to map to the characteristics of the user.
  • the known characteristics of the user may be provided to a loss calculation subsystem for generating a loss value based on a comparison of the predicted characteristics and the known characteristics. The loss value may be fed back to the machine learning algorithm to adjust parameters of the machine learning algorithm to reduce the loss value.
  • Means for performing the functionality at block 420 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
  • the method 400 may include any of a variety of additional features, depending on desired functionality. For example, in some cases many users provide multiple vocalized phrases and known characteristics of the users (e.g., age, gender, nationality, primary language, geographic region of residence, education level, and so forth) to the training system.
  • Other additional features of method 400 may include deploying the voice model and receiving a vocalization from which the machine learning algorithm predicts characteristics of the user.
  • the deployed voice model may predict vocal features of a user based on receipt of desired characteristics of the user.
  • the voice model may be used to map features of the user’s voice (e.g., phonemes, formants, frequency modulations and/or a noise profile that are related to the user’s vocabulary, accent, and/or temperament, etc.) to characteristics of the user (age, gender, geographical region of residence, etc.).
  • the same voice models may be used to map the known or desired characteristics of the user to features of the user’s voice.
  • the speech synthesis may be performed using a parametric speech synthesizer where the parameters are provided by the learned feature-characteristic representation.
  • the speech synthesis may also be achieved by directly generating mel spectrograms from text data, conditioned on user characteristics. Speech synthesis from text data and user characteristics may also be achieved by using a conditioned autoregressive or generative model that generates audio samples.
  • FIG. 5 is an illustrative example of a deep learning neural network 500 that can be used to implement the voice modelling system described above.
  • the deep learning neural network includes an input layer 520 that includes input data.
  • the input layer 520 can include data representing the pixels of an input video frame.
  • the neural network 500 also includes multiple hidden layers 522a, 522b, through 522n.
  • the hidden layers 522a, 522b, through 522n include “n” number of hidden layers, where “n” is an integer greater than or equal to one.
  • the number of hidden layers can be made to include as many layers as needed for the given application.
  • the neural network 500 further includes an output layer 521 that provides an output resulting from the processing performed by the hidden layers 522a, 522b, through 522n.
  • the output layer 521 can provide a classification for an object in an input video frame.
  • the classification can include a class identifying the type of activity (e.g., playing soccer, playing piano, listening to piano, playing guitar, etc.).
  • the neural network 500 is a multi-layer neural network of interconnected nodes. Each node can represent a piece of information. Information associated with the nodes is shared among the different layers and each layer retains information as information is processed.
  • the neural network 500 can include a feed-forward network, in which case there are no feedback connections where outputs of the network are fed back into itself.
  • the neural network 500 can include a recurrent neural network, which can have loops that allow information to be carried across nodes while reading in input.
  • Nodes of the input layer 520 can activate a set of nodes in the first hidden layer 522a.
  • each of the input nodes of the input layer 520 is connected to each of the nodes of the first hidden layer 522a.
  • the nodes of the first hidden layer 522a can transform the information of each input node by applying activation functions to the input node information.
  • the information derived from the transformation can then be passed to and can activate the nodes of the next hidden layer 522b, which can perform their own designated functions.
  • Example functions include convolutional, up-sampling, data transformation, and/or any other suitable functions.
  • the output of the hidden layer 522b can then activate nodes of the next hidden layer, and so on.
  • the output of the last hidden layer 522n can activate one or more nodes of the output layer 521, at which an output is provided.
  • while nodes (e.g., node 526) in the neural network 500 are shown as having multiple output lines, a node has a single output and all lines shown as being output from a node represent the same output value.
  • each node or interconnection between nodes can have a weight that is a set of parameters derived from the training of the neural network 500.
  • the neural network 500 can be referred to as a trained neural network, which can be used to classify one or more activities.
  • an interconnection between nodes can represent a piece of information learned about the interconnected nodes.
  • the interconnection can have a tunable numeric weight that can be tuned (e.g., based on a training data set), allowing the neural network 500 to be adaptive to inputs and able to learn as more and more data is processed.
  • the neural network 500 is pre-trained to process the features from the data in the input layer 520 using the different hidden layers 522a, 522b, through 522n in order to provide the output through the output layer 521.
  • the neural network 500 can be trained using training data that includes both frames and labels, as described above. For instance, training frames can be input into the network, with each training frame having a label indicating the features in the frames (for the feature extraction machine learning system) or a label indicating classes of an activity in each frame.
  • a training frame can include an image of a number 2, in which case the label for the image can be [0 0 1 0 0 0 0 0 0 0].
  • the neural network 500 can adjust the weights of the nodes using a training process called backpropagation.
  • a backpropagation process can include a forward pass, a loss function, a backward pass, and a weight update.
  • the forward pass, loss function, backward pass, and parameter update are performed for one training iteration.
  • the process can be repeated for a certain number of iterations for each set of training images until the neural network 500 is trained well enough so that the weights of the layers are accurately tuned.
  • the forward pass can include passing a training frame through the neural network 500.
  • the weights are initially randomized before the neural network 500 is trained.
  • a frame can include an array of numbers representing the pixels of the image. Each number in the array can include a value from 0 to 255 describing the pixel intensity at that position in the array.
  • the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (e.g., red, green, and blue, or luma and two chroma components, or the like).
  • the output will likely include values that do not give preference to any particular class due to the weights being randomly selected at initialization. For example, if the output is a vector with probabilities that the object includes different classes, the probability value for each of the different classes may be equal or at least very similar (e.g., for ten possible classes, each class may have a probability value of 0.1). With the initial weights, the neural network 500 is unable to determine low level features and thus cannot make an accurate determination of what the classification of the object might be.
  • one example of a loss function is the mean squared error (MSE), which computes the average of the squared differences between the predicted output and the target (label) values.
  • the loss (or error) will be high for the first training images since the actual values will be much different than the predicted output.
  • the goal of training is to minimize the amount of loss so that the predicted output is the same as the training label.
  • the neural network 500 can perform a backward pass by determining which inputs (weights) most contributed to the loss of the network, and can adjust the weights so that the loss decreases and is eventually minimized.
  • a derivative of the loss with respect to the weights (denoted as dL/dW, where W are the weights at a particular layer) can be computed to determine the weights that contributed most to the loss of the network. After the derivative is computed, a weight update can be performed by updating all the weights of the filters.
  • the weights can be updated so that they change in the opposite direction of the gradient.
  • the learning rate can be set to any suitable value, with a high learning rate resulting in larger weight updates and a lower learning rate resulting in smaller weight updates.
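The training iteration outlined above (forward pass, loss function, backward pass, and weight update) can be sketched for a single linear layer as follows. The layer shape, the synthetic data, and the learning rate of 0.01 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))          # 4 training frames, 8 features each
target = rng.standard_normal((4, 10))    # desired outputs (training labels)

W = 0.01 * rng.standard_normal((8, 10))  # randomly initialized weights
lr = 0.01                                # learning rate

pred = x @ W                             # forward pass
loss = np.mean((pred - target) ** 2)     # loss function (mean squared error)
grad = 2.0 * x.T @ (pred - target) / pred.size  # backward pass: dL/dW
W -= lr * grad                           # weight update, opposite the gradient
print(loss)
```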
  • the neural network 500 can include any suitable deep network.
  • One example includes a convolutional neural network (CNN), which includes an input layer and an output layer, with multiple hidden layers between the input and output layers.
  • the hidden layers of a CNN include a series of convolutional, nonlinear, pooling (for downsampling), and fully connected layers.
  • the neural network 500 can include any other deep network other than a CNN, such as an autoencoder, deep belief networks (DBNs), recurrent neural networks (RNNs), among others.
  • FIG. 6 is an illustrative example of a convolutional neural network (CNN) 600.
  • the input layer 620 of the CNN 600 includes data representing an image or frame.
  • the data can include an array of numbers representing the pixels of the image, with each number in the array including a value from 0 to 255 describing the pixel intensity at that position in the array.
  • the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (e.g., red, green, and blue, or luma and two chroma components, or the like).
  • the image can be passed through a convolutional hidden layer 622a, an optional non-linear activation layer, a pooling hidden layer 622b, and fully connected hidden layers 622c to get an output at the output layer 624. While only one of each hidden layer is shown in FIG. 6, one of ordinary skill will appreciate that multiple convolutional hidden layers, non-linear layers, pooling hidden layers, and/or fully connected layers can be included in the CNN 600. As previously described, the output can indicate a single class of an object or can include a probability of classes that best describe the object in the image.
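A minimal sketch of the layer stack of FIG. 6 using PyTorch, with the example dimensions given later in this description (a 28x28x3 input, three 5x5 filters, 2x2 max-pooling with stride 2, and ten output classes); the ReLU non-linearity, the single fully connected layer, and the softmax output are assumptions chosen for the example rather than requirements of the disclosure.

```python
import torch
from torch import nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5),  # 28x28x3 -> 24x24x3
    nn.ReLU(),                                                # optional non-linear layer
    nn.MaxPool2d(kernel_size=2, stride=2),                    # 24x24x3 -> 12x12x3
    nn.Flatten(),
    nn.Linear(3 * 12 * 12, 10),                               # fully connected layer
    nn.Softmax(dim=1),                                        # probabilities over 10 classes
)

frame = torch.randn(1, 3, 28, 28)  # one image/frame at the input layer
probs = cnn(frame)                 # shape (1, 10)
print(probs.sum().item())          # approximately 1.0
```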
  • the first layer of the CNN 600 is the convolutional hidden layer 622a.
  • the convolutional hidden layer 622a analyzes the image data of the input layer 620.
  • Each node of the convolutional hidden layer 622a is connected to a region of nodes (pixels) of the input image called a receptive field.
  • the convolutional hidden layer 622a can be considered as one or more filters (each filter corresponding to a different activation or feature map), with each convolutional iteration of a filter being a node or neuron of the convolutional hidden layer 622a.
  • the region of the input image that a filter covers at each convolutional iteration would be the receptive field for the filter.
  • each filter (and corresponding receptive field) can be, for example, a 5x5 array
  • Each connection between a node and a receptive field for that node learns a weight and, in some cases, an overall bias such that each node learns to analyze its particular local receptive field in the input image.
  • Each node of the hidden layer 622a will have the same weights and bias (called a shared weight and a shared bias).
  • the filter has an array of weights (numbers) and the same depth as the input.
  • a filter will have a depth of 3 for the video frame example (according to three color components of the input image).
  • An illustrative example size of the filter array is 5 x 5 x 3, corresponding to a size of the receptive field of a node.
  • the convolutional nature of the convolutional hidden layer 622a is due to each node of the convolutional layer being applied to its corresponding receptive field.
  • a filter of the convolutional hidden layer 622a can begin in the top-left corner of the input image array and can convolve around the input image.
  • each convolutional iteration of the filter can be considered a node or neuron of the convolutional hidden layer 622a.
  • the values of the filter are multiplied with a corresponding number of the original pixel values of the image (e.g., the 5x5 filter array is multiplied by a 5x5 array of input pixel values at the top-left corner of the input image array).
  • the multiplications from each convolutional iteration can be summed together to obtain a total sum for that iteration or node.
  • the process is next continued at a next location in the input image according to the receptive field of a next node in the convolutional hidden layer 622a.
  • a filter can be moved by a step amount (referred to as a stride) to the next receptive field.
  • the stride can be set to 1 or other suitable amount. For example, if the stride is set to 1, the filter will be moved to the right by 1 pixel at each convolutional iteration. Processing the filter at each unique location of the input volume produces a number representing the filter results for that location, resulting in a total sum value being determined for each node of the convolutional hidden layer 622a.
  • the mapping from the input layer to the convolutional hidden layer 622a is referred to as an activation map (or feature map).
  • the activation map includes a value for each node representing the filter results at each location of the input volume.
  • the activation map can include an array that includes the various total sum values resulting from each iteration of the filter on the input volume. For example, the activation map will include a 24 x 24 array if a 5 x 5 filter is applied to each pixel (a stride of 1) of a 28 x 28 input image.
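A sketch of how a single 5x5 filter convolved with a stride of 1 over one 28x28 channel produces the 24x24 activation map described above; the filter and image values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
image = rng.standard_normal((28, 28))  # one channel of the input image
filt = rng.standard_normal((5, 5))     # 5x5 filter (size of the receptive field)
stride = 1

size = 28 - 5 + 1                      # 24 valid filter positions per dimension
activation_map = np.zeros((size, size))
for i in range(0, size, stride):
    for j in range(0, size, stride):
        patch = image[i:i + 5, j:j + 5]              # receptive field for this node
        activation_map[i, j] = np.sum(patch * filt)  # total sum for this node

print(activation_map.shape)  # (24, 24)
```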
  • the convolutional hidden layer 622a can include several activation maps in order to identify multiple features in an image. The example shown in FIG. 6 includes three activation maps. Using three activation maps, the convolutional hidden layer 622a can detect three different kinds of features, with each feature being detectable across the entire image.
  • a non-linear hidden layer can be applied after the convolutional hidden layer 622a.
  • the non-linear layer can be used to introduce non-linearity to a system that has been computing linear operations.
  • One illustrative example of a non-linear layer is a rectified linear unit (ReLU) layer.
  • the pooling hidden layer 622b can be applied after the convolutional hidden layer 622a (and after the non-linear hidden layer when used).
  • the pooling hidden layer 622b is used to simplify the information in the output from the convolutional hidden layer 622a.
  • the pooling hidden layer 622b can take each activation map output from the convolutional hidden layer 622a and generate a condensed activation map (or feature map) using a pooling function.
  • Max-pooling is one example of a function performed by a pooling hidden layer.
  • Other forms of pooling functions can be used by the pooling hidden layer 622b, such as average pooling, L2-norm pooling, or other suitable pooling functions.
  • a pooling function (e.g., a max-pooling filter, an L2-norm filter, or other suitable pooling filter) is applied to each activation map included in the convolutional hidden layer 622a.
  • three pooling filters are used for the three activation maps in the convolutional hidden layer 622a.
  • max-pooling can be used by applying a max-pooling filter (e.g., having a size of 2x2) with a stride (e.g., equal to a dimension of the filter, such as a stride of 2) to an activation map output from the convolutional hidden layer 622a.
  • the output from a max-pooling filter includes the maximum number in every sub-region that the filter convolves around.
  • each unit in the pooling layer can summarize a region of 2x2 nodes in the previous layer (with each node being a value in the activation map).
  • For example, four values (nodes) in an activation map will be analyzed by a 2x2 max-pooling filter at each iteration of the filter, with the maximum value from the four values being output as the “max” value. If such a max-pooling filter is applied to an activation map from the convolutional hidden layer 622a having a dimension of 24x24 nodes, the output from the pooling hidden layer 622b will be an array of 12x12 nodes.
  • an L2-norm pooling filter could also be used.
  • the L2-norm pooling filter includes computing the square root of the sum of the squares of the values in the 2x2 region (or other suitable region) of an activation map (instead of computing the maximum values as is done in max-pooling), and using the computed values as an output.
  • the pooling function determines whether a given feature is found anywhere in a region of the image, and discards the exact positional information. This can be done without affecting results of the feature detection because, once a feature has been found, the exact location of the feature is not as important as its approximate location relative to other features. Max-pooling (as well as other pooling methods) offers the benefit that there are many fewer pooled features, thus reducing the number of parameters needed in later layers of the CNN 600.
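A sketch of 2x2 max-pooling with a stride of 2 applied to a 24x24 activation map, producing the 12x12 condensed map described above; the comment notes how L2-norm pooling could be substituted.

```python
import numpy as np

rng = np.random.default_rng(3)
activation_map = rng.standard_normal((24, 24))  # output of one convolutional filter

pooled = np.zeros((12, 12))
for i in range(12):
    for j in range(12):
        region = activation_map[2 * i:2 * i + 2, 2 * j:2 * j + 2]  # 2x2 region
        pooled[i, j] = region.max()  # max-pooling; np.sqrt((region ** 2).sum()) would give L2-norm pooling

print(pooled.shape)  # (12, 12)
```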
  • the final layer of connections in the network is a fully-connected layer that connects every node from the pooling hidden layer 622b to every one of the output nodes in the output layer 624.
  • the input layer includes 28 x 28 nodes encoding the pixel intensities of the input image
  • the convolutional hidden layer 622a includes 3x24x24 hidden feature nodes based on application of a 5x5 local receptive field (for the filters) to three activation maps
  • the pooling hidden layer 622b includes a layer of 3x12x12 hidden feature nodes based on application of a max-pooling filter to 2x2 regions across each of the three feature maps.
  • the output layer 624 can include ten output nodes. In such an example, every node of the 3x12x12 pooling hidden layer 622b is connected to every node of the output layer 624.
  • the fully connected layer 622c can obtain the output of the previous pooling hidden layer 622b (which should represent the activation maps of high-level features) and determine the features that most correlate to a particular class.
  • the fully connected layer 622c can determine the high-level features that most strongly correlate to a particular class, and can include weights (nodes) for the high-level features.
  • a product can be computed between the weights of the fully connected layer 622c and the pooling hidden layer 622b to obtain probabilities for the different classes.
  • If the CNN 600 is being used to predict that an object in a video frame is a person, high values will be present in the activation maps that represent high-level features of people (e.g., two legs are present, a face is present at the top of the object, two eyes are present at the top left and top right of the face, a nose is present in the middle of the face, a mouth is present at the bottom of the face, and/or other features common for a person).
  • M indicates the number of classes that the CNN 600 has to choose from when classifying the object in the image.
  • Other example outputs can also be provided.
  • Each number in the M-dimensional vector can represent the probability the object is of a certain class.
  • an example of a 10-dimensional output vector representing ten different classes of objects is [0 0 0.05 0.8 0 0.15 0 0 0 0]
  • the vector indicates that there is a 5% probability that the image is the third class of object (e.g., a dog), an 80% probability that the image is the fourth class of object (e.g., a human), and a 15% probability that the image is the sixth class of object (e.g., a kangaroo).
  • the probability for a class can be considered a confidence level that the object is part of that class.
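Tying together the fully connected step and the probability interpretation described above, the following sketch flattens the 3x12x12 pooled features, multiplies them by a fully connected weight matrix, and applies a softmax (an assumption here; the description only requires a probability vector) to obtain M class probabilities.

```python
import numpy as np

rng = np.random.default_rng(4)
pooled = rng.standard_normal((3, 12, 12))       # three pooled feature maps
features = pooled.reshape(-1)                   # 3x12x12 = 432 hidden feature nodes

M = 10                                          # number of classes to choose from
W_fc = rng.standard_normal((features.size, M))  # fully connected weights

scores = features @ W_fc
probs = np.exp(scores - scores.max())
probs /= probs.sum()                            # M-dimensional probability vector

best = int(np.argmax(probs))                    # class with the highest confidence
print(best, round(float(probs[best]), 2))
```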
  • FIG. 7 illustrates an embodiment of a user equipment (UE) 700, which can be utilized as described herein above (e.g. in association with FIGS. 1-4).
  • the UE 700 can perform one or more of the functions of method 400 of FIG. 4.
  • FIG. 7 is meant only to provide a generalized illustration of various components, any or all of which may be utilized as appropriate.
  • components illustrated by FIG. 7 can be localized to a single physical device and/or distributed among various networked devices, which may be disposed at different physical locations (e.g., located at different parts of a user’s body, in which case the components may be communicatively connected via a Personal Area Network (PAN) and/or other means).
  • the UE 700 is shown comprising hardware elements that can be electrically coupled via a bus 705 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include a processing unit(s) 710 which can include without limitation one or more general-purpose processors, one or more special-purpose processors (such as digital signal processing (DSP) chips, graphics acceleration processors, application specific integrated circuits (ASICs), and/or the like), and/or other processing structure or means.
  • Location determination and/or other determinations based on wireless communication may be provided in the processing unit(s) 710 and/or wireless communication interface 730 (discussed below).
  • the UE 700 also can include one or more input devices 770, which can include without limitation a keyboard, touch screen, a touch pad, microphone, button(s), dial(s), switch(es), and/or the like; and one or more output devices 715, which can include without limitation a display, light emitting diode (LED), speakers, and/or the like.
  • the UE 700 may also include a wireless communication interface 730, which may comprise without limitation a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as a Bluetooth® device, an IEEE 802.11 device, an IEEE 802.15.4 device, a WiFi device, a WiMax device, a WAN device and/or various cellular devices, etc.), and/or the like, which may enable the UE 700 to communicate via the networks described above with regard to FIG. 1.
  • the wireless communication interface 730 may permit data and signaling to be communicated (e.g., with a network, other user devices, and/or other electronic devices).
  • the communication can be carried out via one or more wireless communication antenna(s) 732 that send and/or receive wireless signals 734.
  • the wireless communication interface 730 may comprise separate transceivers to communicate with base stations (e.g., ng-eNBs and gNBs) and other terrestrial transceivers, such as wireless devices and access points.
  • the UE 700 may communicate with different data networks that may comprise various network types.
  • a Wireless Wide Area Network (WWAN) may be a Code Division Multiple Access (CDMA) network, a Time Division Multiple Access (TDMA) network, a Frequency Division Multiple Access (FDMA) network, an Orthogonal Frequency Division Multiple Access (OFDMA) network, a Single-Carrier Frequency Division Multiple Access (SC-FDMA) network, a WiMax (IEEE 802.16) network, and so on.
  • a CDMA network may implement one or more radio access technologies (RATs) such as CDMA2000, Wideband CDMA (WCDMA), and so on.
  • CDMA2000 includes IS-95, IS-2000, and/or IS-856 standards.
  • a TDMA network may implement GSM, Digital Advanced Mobile Phone System (D-AMPS), or some other RAT.
  • An OFDMA network may employ LTE, LTE Advanced, 5G NR, and so on. 5G NR, LTE, LTE Advanced, GSM, and WCDMA are described in documents from the Third Generation Partnership Project (3GPP).
  • CDMA2000 is described in documents from a consortium named “3rd Generation Partnership Project 2” (3GPP2). 3GPP and 3GPP2 documents are publicly available.
  • a wireless local area network (WLAN) may also be an IEEE 802.11x network
  • a wireless personal area network (WPAN) may be a Bluetooth network, an IEEE 802.15x network, or some other type of network.
  • the techniques described herein may also be used for any combination of WWAN, WLAN and/or WPAN.
  • the UE 700 can further include sensor(s) 740.
  • Sensors 740 may comprise, without limitation, one or more inertial sensors and/or other sensors (e.g., accelerometer(s), gyroscope(s), camera(s), magnetometer(s), altimeter(s), microphone(s), proximity sensor(s), light sensor(s), barometer(s), and the like), some of which may be used to complement and/or facilitate the position determination described herein, in some instances.
  • Embodiments of the UE 700 may also include a GNSS receiver 780 capable of receiving signals 784 from one or more GNSS satellites using an antenna 782 (which could be the same as antenna 732). Positioning based on GNSS signal measurement can be utilized to complement and/or incorporate the techniques described herein.
  • the GNSS receiver 780 can extract a position of the UE 700, using conventional techniques, from GNSS SVs of a GNSS system, such as Global Positioning System (GPS), Galileo, Glonass, Quasi-Zenith Satellite System (QZSS) over Japan, Indian Regional Navigational Satellite System (IRNSS) over India, Beidou over China, and/or the like.
  • the GNSS receiver 780 can be used with various augmentation systems (e.g., a Satellite Based Augmentation System (SBAS)) that may be associated with or otherwise enabled for use with one or more global and/or regional navigation satellite systems, such as, e.g., WAAS, EGNOS, Multi-functional Satellite Augmentation System (MSAS), and Geo Augmented Navigation system (GAGAN), and/or the like.
  • the UE 700 may further include and/or be in communication with a memory 760.
  • the memory 760 can include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, a solid-state storage device, such as a random access memory (“RAM”), and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including without limitation, various file systems, database structures, and/or the like.
  • the memory 760 of the UE 700 also can comprise software elements (not shown in FIG. 7), including an operating system, device drivers, executable libraries, and/or other code, such as one or more application programs, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above may be implemented as code and/or instructions in memory 760 that are executable by the UE 700 (and/or processing unit(s) 710 or DSP 720 within UE 700).
  • code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • FIG. 8 illustrates an embodiment of a computer system 800, which may be utilized and/or incorporated into one or more components of a system (e.g., voice modelling system 100 of FIG. 1).
  • FIG. 8 provides a schematic illustration of one embodiment of a computer system 800 that can perform the methods provided by various other embodiments, such as the methods described in relation to FIG. 4. It should be noted that FIG. 8 is meant only to provide a generalized illustration of various components, any or all of which may be utilized as appropriate. FIG. 8, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • components illustrated by FIG. 8 can be localized to a single device and/or distributed among various networked devices, which may be disposed at different physical or geographical locations.
  • the computer system 800 may correspond to components of a training system such as user devices 105a, 105b, through 105n, training system 110, capture device 205, prediction system 210, user device 305, or prediction system 310.
  • the computer system 800 is shown comprising hardware elements that can be electrically coupled via a bus 805 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include processing unit(s) 810, which can include without limitation one or more general-purpose processors, one or more special-purpose processors (such as digital signal processing chips, graphics acceleration processors, and/or the like), and/or other processing structure, which can be configured to perform one or more of the methods described herein, including the method described in relation to FIG. 4.
  • the computer system 800 also can include one or more input devices 815, which can include without limitation a mouse, a keyboard, a camera, a microphone, and/or the like; and one or more output devices 820, which can include without limitation a display device, a printer, and/or the like.
  • the computer system 800 may further include (and/or be in communication with) one or more non-transitory storage devices 825, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, a solid-state storage device, such as a random access memory (“RAM”), and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including without limitation, various file systems, database structures, and/or the like.
  • the computer system 800 may also include a communication subsystem 830, which can include support of wireline communication technologies and/or wireless communication technologies (in some embodiments) managed and controlled by a wireless communication interface 833.
  • the communication subsystem 830 may include a modem, a network card (wireless or wired), an infrared communication device, a wireless communication device, and/or a chipset, and/or the like.
  • the communication subsystem 830 may include one or more input and/or output communication interfaces, such as the wireless communication interface 833, to permit data and signaling to be exchanged with a network, mobile devices, other computer systems, and/or any other electronic devices described herein.
  • the terms “mobile device” and “UE” are used interchangeably herein to refer to any mobile communications device such as, but not limited to, mobile phones, smartphones, wearable devices, mobile computing devices (e.g., laptops, PDAs, tablets), embedded modems, and automotive and other vehicular computing devices.
  • the computer system 800 will further comprise a working memory 835, which can include a RAM and/or ROM device.
  • Software elements, shown as being located within the working memory 835, can include an operating system 840, device drivers, executable libraries, and/or other code, such as application(s) 845, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above, such as the method described in relation to FIG. 4, may be implemented as code and/or instructions that are stored (e.g., in the working memory 835) and executable by the computer system 800 for performing various operations in accordance with the described methods.
  • code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be stored on a non-transitory computer-readable storage medium, such as the storage device(s) 825 described above.
  • the storage medium might be incorporated within a computer system, such as computer system 800.
  • the storage medium might be separate from a computer system (e.g., a removable medium, such as an optical disc), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer system 800 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 800 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.), then takes the form of executable code.
  • components that can include memory can include non-transitory machine-readable media.
  • the terms “machine-readable medium” and “computer-readable medium,” as used herein, refer to any storage medium that participates in providing data that causes a machine to operate in a specific fashion.
  • various machine-readable media might be involved in providing instructions/code to processing units and/or other device(s) for execution. Additionally or alternatively, the machine-readable media might be used to store and/or carry such instructions/code.
  • a computer-readable medium is a physical and/or tangible storage medium. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media.
  • Computer-readable media include, for example, magnetic and/or optical media, any other physical medium with patterns of holes, a RAM, a PROM, EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic, electrical, or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.
  • the term “at least one of,” if used to associate a list, such as A, B, or C, or A, B, and/or C, can be interpreted to mean any combination of A, B, and/or C, such as A, AB, AC, BC, ABC, AAB, AABBCCC, etc.
  • Illustrative aspects of the disclosure include:
  • Aspect 1 A computer-implemented method comprising: providing, by a computing system to a user device, a textual phrase for a user to vocalize; receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extracting, by the computing system, features from the vocalized phrase; and generating a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
  • Aspect 2 The computer-implemented method of aspect 1, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
  • Aspect 3 The computer-implemented method of any one of aspects 1 or 2, wherein the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
  • Aspect 4 The computer-implemented method of any one of aspects 1 to 3, wherein extracted features and known characteristics of many users are used to train the machine learning algorithm.
  • Aspect 5 The computer-implemented method of any one of aspects 1 to 4, further comprising: receiving, by the computing system, a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; providing, by the computing system, the second vocalized phrase to the voice model; and generating, by the voice model, a prediction of characteristics of the second user.
  • Aspect 6 The computer-implemented method of aspect 5, wherein the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
  • Aspect 7 The computer-implemented method of any one of aspects 1 to 6, further comprising: providing, by the computing system, a textual input and at least one desired characteristic to the voice model; and receiving, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristic.
  • Aspect 8 The computer-implemented method of aspect 7, wherein the voice model maps the desired characteristic to features of a voice to generate the predicted voice representation used to generate the audio sample.
  • Aspect 9 The computer-implemented method of any one of aspects 1 to 8, wherein the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
  • Aspect 10 The computer-implemented method of aspect 7, wherein the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
  • Aspect 11 The computer-implemented method of any one of aspects 1 to 10, wherein training the machine learning algorithm is based on a context of the vocalized phrase.
  • Aspect 12 The computer-implemented method of any one of aspects 1 to 11, wherein the context of the vocalized phrase includes background noise or environmental sounds.
  • Aspect 13 A system comprising: one or more processors; and a memory having stored thereon instructions that, upon execution of the instructions by the one or more processors, cause the one or more processors to: provide, to a user device, a textual phrase for a user to vocalize; receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extract features from the vocalized phrase; and generate a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
  • Aspect 14 The system of aspect 13, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
  • Aspect 15 The system of any one of aspects 13 or 14, wherein the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
  • Aspect 16 The system of any one of aspects 13 to 15, wherein extracted features and known characteristics of many users are used to train the machine learning algorithm.
  • Aspect 17 The system of any one of aspects 13 to 16, wherein the instructions comprise further instructions that, upon execution by the one or more processors, cause the one or more processors to: receive a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; provide the second vocalized phrase to the voice model; and generate, by the voice model, a prediction of characteristics of the second user.
  • Aspect 18 The system of aspect 17, wherein the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
  • Aspect 19 The system of any one of aspects 13 to 18, wherein the instructions comprise further instructions that, upon execution by the one or more processors, cause the one or more processors to: provide a textual input and desired characteristics to the voice model; and receive, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristics.
  • Aspect 20 The system of aspect 19, wherein the voice model maps the desired characteristics to features of a voice to generate the predicted voice representation used to generate the audio sample.
  • Aspect 21 The system of any one of aspects 13 to 20, wherein the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
  • Aspect 22 The system of aspect 21, wherein the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
  • Aspect 23 The system of any one of aspects 13 to 22, wherein training the machine learning algorithm is based on a context of the vocalized phrase.
  • Aspect 24 The system of aspect 21, wherein the context of the vocalized phrase includes background noise or environmental sounds.
  • Aspect 25 A non-transitory computer-readable storage medium having stored thereon instructions which, when executed by one or more processors, cause the one or more processors to perform any of the operations of aspects 1 to 24.
  • Aspect 26 An apparatus comprising means for performing any of the operations of aspects 1 to 24.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A voice modelling system is described for obtaining labelled data sets of audio vocalizations. Users may be prompted to vocalize known phrases, and features of the users' voices such as accent, vocabulary, pronunciation, and so forth can be extracted. Known characteristics of the users such as their age, gender, race, geographical region, education, and so forth can be captured. The labelled vocalizations along with the known characteristics of the users can be used to train a machine learning algorithm for generating voice models. The voice models may map the features of the users' voices to characteristics of the users. The trained machine learning algorithm can be deployed to predict characteristics of a person based on a vocalization captured from the person or to predict a vocalization with appropriate accent and so forth based on input of characteristics and a textual phrase.

Description

VOICE CHARACTERISTIC MACHINE LEARNING MODELLING
BACKGROUND
[0001] Machine learning algorithms, neural networks and other advanced intelligence computing systems often rely on training using data sets of labelled data. The task of labelling the data (e.g., text, audio, images, video, etc.) is time consuming, so it is difficult to generate sufficient labelled data. One way to overcome the issue of having labelled images has been to use a “captcha” system. A user is provided with image data, and is told to select, for example, all the items that include a vehicle. These captcha systems have been used as a way to confirm that a user is not a robot and further gain labelled image sets that can be used to train data image classifiers.
BRIEF SUMMARY
[0002] Techniques described herein address these and other issues by providing a voice modelling system (e.g., a voice captcha system). The voice modelling system instructs the user to say a phrase, for example, to confirm the user is not a robot. While confirming the user is not a robot is useful, the features of the user’s voice may be extracted for the known vocalization. Rather than a speech to text system, that tries to decipher what the user is saying, the system knows what the user has said (e.g., because the system provided the text for the user to say). The features of the vocalization are extracted and used, along with known characteristics of the user, to train a voice model. The voice model may map the features (e.g., phonemes, formants, and/or frequency modulations of the vocalization that may be related to the user’s vocabulary, accent, and/or temperament.) to the known characteristics of the user. In some cases, the trained voice model may later be used to predict characteristics about an unknown user. In some cases, the trained voice model may be used to generate a vocalization of textual data with specific characteristics.
[0003] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a computer-implemented method performed by a computing system.
[0004] According to at least one illustrative example, a computer-implemented method is provided. The method includes: providing, by a computing system to a user device, a textual phrase for a user to vocalize; receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extracting, by the computing system, features from the vocalized phrase; and generating a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
[0005] In another example, a system is provided that includes one or more processors and a memory having stored thereon instructions that upon execution of the instructions by the one or more processors, cause the one or more processors to: provide, to a user device, a textual phrase for a user to vocalize; receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extract features from the vocalized phrase; and generate a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
[0006] In another example, a non-transitory computer-readable medium is provided that has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: provide, to a user device, a textual phrase for a user to vocalize; receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extract features from the vocalized phrase; and generate a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
[0007] In another example, an apparatus is provided. The apparatus includes: means for providing to a user device, a textual phrase for a user to vocalize; means for receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; means for extracting, by the computing system, features from the vocalized phrase; and means for generating a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
[0008] In some aspects, the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
[0009] In some aspects, the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
[0010] In some aspects, extracted features and known characteristics of many users are used to train the machine learning algorithm.
[0011] In some aspects, the method, apparatuses, and computer-readable medium described above further comprise: receiving, by the computing system, a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; providing, by the computing system, the second vocalized phrase to the voice model; and generating, by the voice model, a prediction of characteristics of the second user.
[0012] In some aspects, the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
[0013] In some aspects, the method, apparatuses, and computer-readable medium described above further comprise: providing, by the computing system, a textual input and at least one desired characteristic to the voice model; and receiving, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristic.
[0014] In some aspects, the voice model maps the desired characteristic to features of a voice to generate the predicted voice representation used to generate the audio sample.
[0015] In some aspects, the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
[0016] In some aspects, the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
[0017] In some aspects, training the machine learning algorithm is based on a context of the vocalized phrase.
[0018] In some aspects, the context of the vocalized phrase includes background noise or environmental sounds.
[0019] In some aspects, the apparatus is, is part of, and/or includes a camera, a mobile device (e.g., a mobile telephone or so-called “smart phone” or other mobile device), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer, a vehicle or a computing device or component of a vehicle, or other device. In some aspects, the apparatus includes a camera or multiple cameras for capturing one or more images. In some aspects, the apparatus further includes a display for displaying one or more images, notifications, and/or other displayable data. In some aspects, the apparatuses described above can include one or more sensors (e.g., one or more inertial measurement units (IMUs), such as one or more gyrometers, one or more accelerometers, any combination thereof, and/or other sensor).
[0020] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all drawings, and each claim.
[0021] The foregoing, together with other features and embodiments, will become more apparent upon referring to the following specification, claims, and accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] FIG. 1 is a voice modelling system for generating a voice model by training a machine learning algorithm, according to some embodiments.
[0023] FIG. 2 is a deployed system for predicting characteristics of an unknown user based on a vocalization from the user using the voice model, according to some embodiments.
[0024] FIG. 3 is a deployed system for generating a vocalization of textual content having desired characteristics using a trained voice model, according to some embodiments.
[0025] FIG. 4 is a flow diagram of a method of training a machine learning algorithm, according to some embodiments.
[0026] FIG. 5 is a block diagram illustrating an example of a deep learning neural network, according to some embodiments.
[0027] FIG. 6 is a block diagram illustrating an example of a convolutional neural network (CNN), according to some embodiments.
[0028] FIG. 7 is a block diagram of an embodiment of a user equipment (UE), according to some embodiments.
[0029] FIG. 8 is a block diagram of an embodiment of a computer system.
[0030] Like reference symbols in the various drawings indicate like elements, in accordance with certain example implementations. In addition, multiple instances of an element may be indicated by following a first number for the element with a letter or a hyphen and a second number. For example, multiple instances of an element 105 may be indicated as 105-1, 105-2, 105-3 etc. or as 105a, 105b, 105c, etc. When referring to such an element using only the first number, any instance of the element is to be understood (e.g., element 105 in the previous example would refer to elements 105-1, 105-2, and 105-3 or to elements 105a, 105b, and 105c).
DETAILED DESCRIPTION
[0031] Several illustrative embodiments will now be described with respect to the accompanying drawings, which form a part hereof. While particular embodiments, in which one or more aspects of the disclosure may be implemented, are described below, other embodiments may be used and various modifications may be made without departing from the scope of the disclosure or the spirit of the appended claims.
[0032] As discussed above, obtaining labelled data sets for training machine learning algorithms is a difficult task. To train a neural network and machine learning models, diverse, annotated (or labelled) data is needed. While solutions for obtaining labelled image data have been successful, there are not good sources for annotated voice or audio data.
[0033] To remedy this problem, the described system can be used to generate large amounts of labelled and/or annotated voice data, which can be used to generate voice models for classifying characteristics of users based on their voices or to generate voices based on characteristics of users.
[0034] Turning to FIG. 1, a voice modelling system 100 is depicted. The voice modelling system 100 includes user devices 105a, 105b, through 105n, a training system 110, and a data set database 115. The voice modelling system 100 may include more or fewer components than illustrated in FIG. 1. In some implementations, one or more components may be cloud based. For example, training system 110 may be provided as a cloud service. Although at least three user devices 105a, 105b, through 105n are shown, there may be any number “N” user devices, where N is an integer greater than or equal to one.
[0035] User devices 105a, 105b, through 105n may be any suitable user devices including, for example, tablets, home computers, smartphones, and the like. The user devices 105a, 105b, through 105n may include some or all of the components described with respect to FIG. 8 of computer system 800, and/or the user devices 105a, 105b, through 105n may be the User Equipment (UE) 700 as described with respect to FIG. 7. The components described with respect to computer system 800 and UE 700 are excluded from FIG. 1 for ease of description. An example user device 105a may include display 120, microphone 125, audio embedder 130, characteristic collection module 135, communication module 140, and user interface (UI) subsystem 145. The user devices 105b through 105n may contain similar components that perform similar functionality to the components described for example user device 105a. The user devices 105a, 105b, through 105n may include more or fewer components to perform the functionality described herein without departing from the scope of the description.
[0036] Display 120 may be any visual display device that can display textual content. For example, display 120 may be a touchscreen, non-touchscreen, liquid crystal display (LCD), light emitting diode (LED) display, or the like.
[0037] Microphone 125 may be any microphone device that can capture audio from the user. Specifically, when the user vocalizes or reads text aloud, the microphone captures the audio of the user’s voice vocalization of the text the user read or spoke.
[0038] Audio embedder 130 may be any audio embedder that uses the captured audio from the microphone to embed the audio into an embedding space. For example, audio embedder 130 may convert the captured audio into embedded audio in the form of raw audio, compressed audio, a mel spectrogram, mel-frequency cepstral coefficients (MFCC), a constant Q transform (CQT), or a short-term Fourier Transform (STFT).
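As a hedged illustration of the embeddings listed in paragraph [0038], the snippet below computes a mel spectrogram, MFCCs, a CQT, and an STFT from a captured recording using the librosa library; the library choice, the sample rate, and the placeholder file path are assumptions, as the disclosure does not name a specific implementation.

```python
import librosa
import numpy as np

# "speech.wav" is a placeholder path for audio captured by the microphone
y, sr = librosa.load("speech.wav", sr=16000)

mel = librosa.feature.melspectrogram(y=y, sr=sr)    # mel spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # mel-frequency cepstral coefficients
cqt = np.abs(librosa.cqt(y, sr=sr))                 # constant Q transform (magnitude)
stft = np.abs(librosa.stft(y))                      # short-term Fourier transform (magnitude)
```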
[0039] Characteristic collection module 135 may collect characteristics of the owner (or, in some cases, characteristics of multiple users) of a user device 105a of the user devices 105a, 105b, through 105n. Characteristics may include the user’s age, nationality, gender, primary language, education level, health, social preferences, geo-location including residential geographic location as well as current location, and the like. Because user device 105a belongs to the user, the user’s known characteristics may be captured from the device in many contexts. For example, when setting up the user device 105a, the user may select a primary language. The user may have accounts, such as email accounts or other accounts for which the user provides other characteristic information including their age, nationality, gender, and so forth. The user device 105a may include a global positioning system transceiver such that the location of the user device 105a at the time of a vocalization may be known. In some embodiments, characteristics of the user may be collected from additional sensors of the user device 105a (not shown). For example, heart rate and other information can provide an indication of the user’s current state such as, for example, relaxed, excited, agitated, and the like. In some examples, some characteristics of the user can be collected from a wearable device in communication with user device 105a. In some cases, the wearable device can also be a user device (e.g., user device 105b) containing similar components to user device 105a. The characteristic collection module 135 may collect the underlying information (e.g., user’s heart rate) and determine the user’s state or provide the underlying information as a characteristic. The information on the characteristics of the user may be collected by the characteristic collection module 135 and provided to the communication module 140 with the embedded audio. In some embodiments, the characteristic collection module 135 may also collect information that may be used to determine context of the user’s vocalization, such as, for example, background noise or environmental sounds that may provide context for the vocalization. In some aspects, the characteristic collection module may combine the background noise or environmental sound information with geo-location information to determine a context of the vocalization. In some examples, the context may also be provided, with the characteristics and embedded audio, to the communication module 140.
[0040] Communication module 140 may be any suitable communication module for transmitting data to communication module 150 of training system 110. For example, communication module 140 may include the hardware and software needed for sending the information (e.g., user characteristics, context of the user’s vocalization, and/or embedded audio) via packets on a network. As an example, the communication module 140 may be communication subsystem 830 as described with respect to FIG. 8 and/or wireless communication interface 730 with antenna 732 for sending signals 734 containing the information as described with respect to FIG. 7.
[0041] UI subsystem 145 may be any suitable user interface system that allows the user to interface with components of user device 105a and the training system 110. For example, the user may use the display 120 to view a menu that allows the user to request a textual sample for reading aloud. As an example, the user may request a story to read to their child. The UI subsystem 145 may also allow the information from the training system 110 to be displayed on display 120. The user may be able to make selections for participation in the training system including, for example, to allow the training system to request a voice authentication (e.g., a voice “captcha”) every time or some frequency of times the user may access the user device 105a. For example, the user may access the user device 105a using a password or PIN, and the training system 110 may provide a text sample to use for voice authentication to access the user device 105a as well. The user may be able to select, using the UI subsystem 145, that voice authentication be requested only one time per day, for example. The UI subsystem 145 may also provide the context of the captured voice for a given instance of providing audio to the training system 110. For example, if the user requested a story, the UI subsystem 145 may provide the context as a story reading session for the audio captured by microphone 125 and provided via communication module 140. For a voice authentication use case of the user reading a phrase aloud to access the user device 105a or some application on the user device 105a, the UI subsystem 145 may provide the context as a voice authentication phrase, for example.
[0042] Training system 110 may be any suitable computer system or group of computer systems that can receive the information (e.g., data) from the user devices 105a, 105b, through 105n and use it to train modelling module 160. For example, training system 110 may be a computing system such as computer system 800 as described with respect to FIG. 8. In some embodiments, training system 110 may be a cloud-based system that offers the training and functionality of training system 110 as a cloud-based service.
[0043] Communication module 150 may be any suitable communication module for transmitting data to communication modules 140 of user devices 105a, 105b, through 105n. For example, communication module 150 may include the hardware and software needed for sending information packets on a network for communication with user devices 105a, 105b, through 105n. As an example, the communication module 150 may be communication subsystem 830 as described with respect to FIG. 8. In some embodiments, communication module 150 may be the same or similar to the wireless communication interface 730 with antenna 732 for sending signals 734 as described with respect to FIG. 7.
[0044] Feature extraction module 180 may receive the embedded audio from user devices 105a, 105b, through 105n and extract features from the audio. The embedded audio may be, for example, the raw audio, compressed audio, MFCC, CQT, or STFT. In some examples, the feature extraction module 180 can be trained to extract features from the above-mentioned representations, including identifying phonemes, formants, and/or frequency modulations that are related to the user’s vocabulary, accent, and/or temperament. In some cases, the feature extraction module 180 can be trained using embedded audio and labels as training data. During inference (after the feature extraction module 180 has been trained), the feature extraction module can extract features from embedded audio.
[0045] The extracted features may include a fundamental frequency modulation, harmonics, formants, phonemes, or a noise profile for the vocalization. In some cases, the feature extraction module 180 may also extract information from the data provided from communication module 150 including a context of the audio. For example, the context of reading a story may be different than speaking a phrase to gain access to the user device 105a (e.g., voice authentication). Accordingly, a context of the audio may be included in the data received from user device 105a. In some embodiments, the feature extraction module 180 may be included on the user device 105a and the extracted features provided to the training system 110 rather than the entire embedded audio. In some implementations, the user’s characteristics and/or context of the audio extracted by feature extraction module 180 can be sent to the training system 110 with the extracted features.
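A learned extractor is described next, but as a rough, hand-crafted stand-in (not the disclosed implementation), features of this kind could be computed as follows; the librosa calls and the specific statistics chosen are assumptions:

```python
# Hedged sketch: hand-crafted acoustic features loosely corresponding to
# fundamental frequency modulation, formant/harmonic structure, and a noise profile.
import librosa
import numpy as np

def extract_features(waveform: np.ndarray, sample_rate: int) -> dict:
    f0, voiced_flag, voiced_prob = librosa.pyin(
        waveform, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sample_rate
    )
    f0 = f0[~np.isnan(f0)]
    centroid = librosa.feature.spectral_centroid(y=waveform, sr=sample_rate)
    bandwidth = librosa.feature.spectral_bandwidth(y=waveform, sr=sample_rate)
    flatness = librosa.feature.spectral_flatness(y=waveform)   # high for noise-like frames
    return {
        "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
        "f0_std": float(np.std(f0)) if f0.size else 0.0,       # crude frequency-modulation proxy
        "spectral_centroid_mean": float(np.mean(centroid)),
        "spectral_bandwidth_mean": float(np.mean(bandwidth)),
        "noise_flatness_mean": float(np.mean(flatness)),
    }
```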
[0046] The feature extraction module 180 may be, in some embodiments, a neural network based encoder-decoder pair for feature extraction from raw audio that is jointly optimized with an end-to-end training paradigm to extract a data driven feature representation optimized for a given classification or generation task. The encoder may be a temporal convolutional neural network.
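One possible shape for such an encoder-decoder pair is sketched below; the layer counts, kernel sizes, and strides are arbitrary assumptions and are not taken from the disclosure:

```python
# Hedged sketch: a temporal convolutional encoder with a matching
# transposed-convolution decoder that could be jointly optimized end-to-end.
import torch
import torch.nn as nn

class TemporalConvEncoder(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.Conv1d(64, feature_dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) -> features: (batch, feature_dim, frames)
        return self.net(waveform)

class TemporalConvDecoder(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(feature_dim, 64, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=16, stride=8, padding=4),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

# Joint optimization with a task or reconstruction loss yields a data-driven
# feature representation for a downstream classification or generation task.
encoder, decoder = TemporalConvEncoder(), TemporalConvDecoder()
audio = torch.randn(2, 1, 16000)              # two one-second clips at 16 kHz
reconstruction = decoder(encoder(audio))      # same length as the input here
```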
[0047] Modelling module 160 may be any suitable machine learning algorithm, such as a neural network, support vector machine (SVM), non-negative matrix factorization (NMF), Gaussian mixture models (GMMs), Bayesian inference models, hidden Markov models (HMM), independent component analysis, independent subspace analysis, decision tree, deep clustering, random forests, or the like. For example, the modelling module 160 may be a neural network that includes an input layer, an output layer, and hidden layers. The layers may include input cells, hidden cells, output cells, recurrent cells, memory cells, kernels, convolutional cells, pool cells, or any other suitable cell types. The modelling module 160 may be any type of machine learning algorithm including a neural network such as, for example, a radial basis neural network, recurrent neural network, long short-term memory neural network, gated recurrent neural network, auto encoder (AE) neural network, variational AE neural network, denoising AE neural network, sparse AE neural network, Markov chain neural network, Hopfield neural network, Boltzmann machine neural network, convolutional neural network, deconvolutional neural network, generative adversarial network, liquid state machine neural network, extreme learning machine neural network, Kohonen network, support vector machine neural network, neural Turing machine neural network, deep residual network, or any other type of neural network. Illustrative examples of neural networks are described below with respect to FIG. 5 and FIG. 6.
[0048] In some implementations, training system 110 may train the modelling module 160 to learn latent representations for vocal feature-user characteristic mapping (e.g., a voice model). The voice model may map user characteristics to extracted features of the vocalizations. For example, modelling module 160 can be trained to map the extracted feature information relating to a user’s accent to the regional characteristic of the user. Modelling module 160 may also be trained to map the other extracted features from the user’s vocalizations (e.g., temperament, vocabulary, etc.) to the characteristics of the user (e.g., age, nationality, education level, etc.).
[0049] During training, the modelling module 160 may output characteristic information from input features extracted from voice embeddings (e.g., embedded audio from the audio embedder 130) and the loss calculation subsystem 165 may provide feedback loss values to the modelling module 160 for adjusting parameters of the modelling module 160, such as in a supervised training process using the annotated data received from user device 105a. Illustrative examples of training neural networks using supervised learning are described below with respect to FIG. 5 and FIG. 6. In some examples, the modelling module 160 can be trained to generate the most accurate mapping of the extracted features from the vocalization of the user (e.g., phonemes, formants, frequency modulations, and/or a noise profile that may be related to the user’s vocabulary, accent, and/or temperament, and/or the like) to the characteristics of the user that were collected by the characteristic collection module 135. During inference (after the modelling module 160 has been trained), the modelling module 160 may receive features extracted from embedded audio of vocalizations and determine characteristics of the user that uttered the vocalization based on the extracted features.
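The supervised loop described above could, under the assumptions noted in the comments (an arbitrary feed-forward model, an MSE loss standing in for loss calculation subsystem 165, and dummy tensors in place of real annotated data), look roughly like this:

```python
# Hedged sketch of one training step: extracted vocal features in, predicted
# characteristics out, loss fed back to adjust the model parameters.
import torch
import torch.nn as nn

NUM_FEATURES = 128          # dimensionality of the extracted vocal features (assumed)
NUM_CHARACTERISTICS = 8     # e.g., age bucket, gender, region, education level (assumed)

model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, NUM_CHARACTERISTICS),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(extracted_features: torch.Tensor, known_characteristics: torch.Tensor) -> float:
    predicted = model(extracted_features)              # forward pass
    loss = loss_fn(predicted, known_characteristics)   # compare prediction with known characteristics
    optimizer.zero_grad()
    loss.backward()                                    # loss value fed back
    optimizer.step()                                   # parameters adjusted to reduce the loss
    return loss.item()

features = torch.randn(32, NUM_FEATURES)               # dummy batch of extracted features
characteristics = torch.randn(32, NUM_CHARACTERISTICS) # dummy annotated characteristics
print(training_step(features, characteristics))
```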
[0050] In some implementations, training system 110 may train modelling module 160 to generate vocalizations (e.g., speech synthesis) based on a provided text or phrase and desired characteristics for the vocalization. The desired characteristics for the vocalization can correspond to the known user characteristics described above. The modelling module 160 can be trained to generate vocalizations by utilizing the embedded audio, the features generated by feature extraction module 180, and the known characteristics of the user. In some cases, modelling module 160 can be trained to directly generate mel spectrograms from text data and desired characteristics of the vocalization. In some implementations, the modelling module 160 can be trained for use as a parametric speech synthesizer.
[0051] In some implementations, the modelling module 160 can be trained for speech synthesis in conjunction with a neural network that is trained using annotated data to rate the accuracy of generated speech for given characteristics. Such a neural network may be pre-trained for use as a loss function for speech synthesis, or be used as the encoder for a variational autoencoder, or be jointly trained with the speech synthesizer as the discriminator in an adversarial training paradigm, for example.
[0052] The text generation module 155 may generate or select the text provided for the user to vocalize. In some embodiments, the text may be randomly generated. In some embodiments, a large database of phrases may be created in advance and, in some embodiments, grouped into various categories to ensure that the corpus of phrases the user vocalizes provides a sufficient selection for training the modelling module 160. For example, a categorization of phrases that test the user’s vocabulary, accent, pronunciation of various sounds, and so forth may be generated. This categorization may be automated by, for example, using the length of words within the phrases for testing vocabulary, using the number of letters within the phrases for testing accent and pronunciation, and the like. Then a large number of phrases may be categorized and selected such that at least some of the phrases from each category are selected for the user to ensure the provided vocalizations for that user are sufficient.
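As a toy illustration of that automated categorization (the thresholds and category names are invented for the example, not specified by the disclosure):

```python
# Hedged sketch: bucket candidate phrases by average word length (vocabulary)
# and by letter count (accent/pronunciation coverage).
from collections import defaultdict

def categorize_phrases(phrases):
    categories = defaultdict(list)
    for phrase in phrases:
        words = phrase.split()
        avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
        letter_count = sum(c.isalpha() for c in phrase)
        if avg_word_len > 7:
            categories["vocabulary"].append(phrase)
        if letter_count > 40:
            categories["accent_pronunciation"].append(phrase)
        else:
            categories["short_general"].append(phrase)
    return categories

corpus = categorize_phrases([
    "The quick brown fox jumps over the lazy dog",
    "Unquestionably extraordinary circumlocution",
])
```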
[0053] The data set creation subsystem 170 may be used to generate labeled/annotated data sets that may be stored in data set database 115 and used for training other audio classification systems in the future. For example, the features for a user and the text spoken may be collected and stored together with other samples (features and known generated text associated with the features) for a data set that is specific to a user. In some embodiments, the samples may be combined to generate user-generic data sets.

[0054] In one example use case, the user may attempt to access the user device 105a or request a story via the UI subsystem 145. The communication module 140 may provide a notification of the request to the communication module 150, which provides the request to the text generation module 155. The text generation module 155 generates a textual phrase or retrieves a story based on the request and provides the textual phrase or story text to communication module 150. The communication module 150 provides the textual phrase to communication module 140. The communication module 140 provides the textual phrase to display 120. In some examples, the user device 105a requested the textual phrase for display to the user to, for example, unlock the user’s device 105a. Having the user state a phrase to unlock the user’s device 105a may or may not be used for authentication; for example, it may simply ensure that the user is not a robot. In some implementations, an example purpose of having the user utter the phrase can be to capture the user’s vocalization of the generated phrase. The phrase may be a single word, a set of words, a story, a portion of an ebook, an entire ebook, or any other suitable length of phrase (i.e., textual content). In some cases, the user may use a storytelling application through which the user has requested a story to tell their child; in that case, the text generation module 155 may select a story and provide the story to the communication module 150, which provides the story to the communication module 140, which displays the story via the storytelling application on the display 120.
[0055] In some cases, when the user receives the text from the text generation module, the user can vocalize the displayed text, and microphone 125 can capture the vocalization. The audio embedder 130 captures the vocalization via the microphone 125 and embeds the vocalization into an embedding space (e.g., embedded audio). In some embodiments, the audio embedder 130 generates a mel spectrogram from the vocalization. In some cases, the audio embedder 130 provides the embedded audio to the characteristic collection module 135.
[0056] In some implementations, the characteristic collection module 135 may collect and attach the characteristic information of the user to the embedded audio for delivery to the communication module 140. In some examples, the audio embedder 130 may send a signal to characteristic collection module 135 to obtain the characteristics. In such examples, the audio embedder 130 may attach the characteristic information and provide the data to the communication module 140. In some cases, the communication module 140 can collect the characteristics from the characteristic collection module 135 and the embedded audio from the audio embedder 130.
[0057] In some examples, the communication module 140, having received the information including the embedded audio of the vocalization of the phrase from the user in addition to the user characteristics, provides the information to the communication module 150. In some implementations, communication module 150 provides the packet of information including the user characteristics and audio embedding to the feature extraction module 180.
[0058] As described above, the feature extraction module 180 can extract features from the embedded audio. The features may include information that can be used to identify the user’s vocabulary, accent, and/or temperament. The extracted features may also include a fundamental frequency modulation, phonemes, formants, harmonics, and/or a noise profile from the audio embedding. The feature extraction module 180 may format the user characteristics and the extracted features from the audio embedding into a format suitable for providing to the loss calculation subsystem 165 and the modelling module 160, respectively.
[0059] The feature extraction module 180 provides the extracted features of the audio embedding to the modelling module 160. The modelling module 160 uses the extracted features and a voice model to predict characteristics of the user and outputs them. The loss calculation subsystem 165 receives the predicted characteristics from the modelling module 160 as well as the known user characteristics from the feature extraction module 180 and calculates one or more loss values. The loss values are fed back into the modelling module 160 to adjust parameters of the modelling module 160 to better predict user characteristics based on the extracted features.
[0060] Additionally, in some implementations, the text generation module 155 may provide the known text that the user was asked to vocalize to the data set creation subsystem 170. The feature extraction module 180 may provide the known user characteristics and the extracted features from the user’s vocalization to the data set creation subsystem 170. In some embodiments in which the audio sample of the vocalization is provided to the training system 110 for audio embedding, the entire audio sample may be provided to data set creation subsystem 170. The data set creation subsystem 170 may store the user characteristics, the extracted features, the audio sample, and/or the text sample in the data set database 115.
[0061] There may be many users that provide vocalizations of text generated by text generation module 155. Each user device 105a, 105b, through 105n may have an associated user (or multiple associated users). Each associated user may provide many vocalizations. Accordingly, the data set creation subsystem 170 may create a data set for each user having many different text samples, vocalizations, features extracted from the vocalizations, and the user’s characteristics all associated in the data set. In some embodiments, all of the user information for many users is in a single data set.
[0062] FIG. 2 illustrates a system 200 for predicting characteristics of a user based on a captured vocalization. The system 200 may include a capture device 205 and a prediction system 210. While a single capture device 205 is shown, the system 200 may include many capture devices 205. The prediction system 210 may be a cloud-based service or a remote server, for example.
[0063] The capture device 205 may be any suitable computing system having the described components in FIG. 2 and may include more components. For example, the capture device 205 may be a computing system such as computer system 800 as described with respect to FIG. 8, and capture device 205 may therefore include components of computer system 800, which have been omitted from capture device 205 for ease of description.
[0064] Capture device 205 may include a microphone 215, audio embedder 220, communication module 230, and user interface (UI) subsystem 225. Microphone 215 may be the same as and/or perform similar functions to microphone 125 as described with respect to FIG. 1. Audio embedder 220 may be the same as and/or perform similar functions to audio embedder 130 as described with respect to FIG. 1. Communication module 230 may be the same as and/or perform similar functions to communication module 140 as described with respect to FIG. 1. UI subsystem 225 may be an optional component that may provide a user interface to a user of capture device 205. In one illustrative example, capture device may be an advertising system having a display screen. In some aspects, the advertising system may be within a vehicle such as a shared ride vehicle, within an elevator, or any other location for which targeted advertising of a person may be desirable.
[0065] Prediction system 210 may be any suitable computing system having the described components in FIG. 2 and may include more components. For example, the prediction system 210 may be a computing system, such as computer system 800 as described with respect to FIG. 8, and prediction system 210 may therefore include components of computer system 800, which have been omitted from prediction system 210 for ease of description.
[0066] Prediction system 210 may include communication module 150, feature extraction module 180, characteristics subsystem 235, and modelling module 260. In some examples, modelling module 260 can be a trained modelling module 160 as illustrated with respect to FIG. 1 above, after having been trained by training system 110. Characteristics subsystem 235 may obtain the output from modelling module 260 of predicted characteristics. Characteristics subsystem 235 may format the characteristics for use by a UI system and provide the predicted characteristics to communication module 150.
[0067] In an example use case, a user may speak, for example, on a cellular device, to themselves, or to another person within the vicinity. The microphone 215 may capture the user’s spoken words (i.e., an audio sample), and the audio embedder 220 may embed the audio sample to generate, for example, a mel spectrogram of the audio sample. The embedded audio sample can be provided to the communication module 230. The communication module 230 can transmit the embedded audio to the communication module 150. The communication module 150 provides the embedded audio to the feature extraction module 180. The feature extraction module 180 extracts features from the embedded audio. For example, the extracted features may include information related to the user’s vocabulary, accent, and/or temperament. The extracted features may also include a fundamental frequency modulation, phonemes, formants, harmonics, and/or a noise profile for the vocalization. The extracted features are input to the modelling module 260, which uses a voice model (developed during training in training system 110) to predict characteristics of the user.

[0068] The predicted characteristics are output from modelling module 260 to characteristics subsystem 235. Characteristics subsystem 235 may format the predicted characteristics and provide the information to communication module 150. Communication module 150 may provide the predicted characteristics to communication module 230, which may provide the predicted characteristics to UI subsystem 225. The UI subsystem 225 may use the predicted characteristics to generate output in the user interface. For example, the predicted characteristics may be output to a graphical user interface. As another example, advertising may be selected based on the predicted characteristics of the person.
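For orientation only, the prediction-side flow of FIG. 2 can be summarized in a few lines; the callables passed in stand for the feature extraction module 180 and the trained modelling module 260 and are assumptions, not disclosed interfaces:

```python
# Hedged sketch of the inference path: embedded audio -> features -> predicted characteristics.
import torch

def predict_characteristics(embedded_audio, feature_extractor, trained_model, label_names):
    features = feature_extractor(embedded_audio)        # stands in for feature extraction module 180
    with torch.no_grad():
        predictions = trained_model(features)           # stands in for modelling module 260
    # Characteristics subsystem 235 would then format this output for the UI or ad selection.
    return dict(zip(label_names, predictions.squeeze(0).tolist()))
```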
[0069] FIG. 3 illustrates a system 300 for generating a voice representation (e.g., a vocalization or synthesized speech) based on desired characteristics and a textual selection. The system 300 may include a user device 305 and a prediction system 310. While a single user device 305 is shown, the system 300 may include many user devices 305. The prediction system 310 may be a cloud-based service or a remote server, for example.
[0070] The user device 305 may be any suitable computing system having the described components in FIG. 3 and may include more components. For example, the user device 305 may be a computing system such as computer system 800 as described with respect to FIG. 8, and user device 305 may therefore include components of computer system 800, which have been omitted from user device 305 for ease of description.
[0071] User device 305 may include a speaker 315, UI subsystem 320, characteristic selection subsystem 325, communication module 335, and textual selection subsystem 330. Speaker 315 may be any speaker device that can output an audible sound. UI subsystem 320 may be any user interface for providing visual and audible output to a user via a display or, for example, speaker 315. Characteristic selection subsystem 325 may be a system that a user may use, via the UI subsystem 320 to select the characteristics of a desired vocalization. Textual selection subsystem 330 may be a system that a user may use, via the UI subsystem 320, to select desired text. As an example, the user device 305 may be a system in which the user may select text, using the textual selection subsystem 330, and the user may select desired characteristics for a generated voice that vocalizes the selected text using the characteristic selection subsystem 325. For example, an audio book may be selected and a user may select desired characteristics for each character for vocalizing their selected portions or quotes. For example, an elderly woman from the Southern United States would have a different generated voice from a teenage boy from Chicago, Illinois.
[0072] Prediction system 310 may be any suitable computing system having the described components in FIG. 3 and may include more components. For example, the prediction system 310 may be a computing system, such as computer system 800 as described with respect to FIG. 8 and prediction system 310 may therefore include components of computer system 800, which have been omitted from prediction system 310 for ease of description.
[0073] Prediction system 310 may include communication module 150, characteristic extraction subsystem 340, textual extraction subsystem 345, audio generation subsystem 350, and modelling module 360. In some cases, modelling module 360 can be trained modelling module 160 as illustrated in FIG. 1 above, after having been trained by training system 110. Characteristic extraction subsystem 340 may obtain the information from communication module 150 and extract the desired characteristics selected by the user using characteristic selection subsystem 325. Textual extraction subsystem 345 may obtain the information from communication module 150 and extract the text selected by the user using textual selection subsystem 330.
[0074] The audio generation subsystem 350 may obtain the selected text from the textual extraction subsystem 345 and the predicted voice features from the modelling module 360 that are output based on the selected desired voice characteristics.
[0075] In an example use case, a user may select desired text via the UI subsystem 320 using the textual selection subsystem 330 as well as the desired characteristics using the characteristic selection subsystem 325. The UI subsystem 320 may provide the selected text and desired characteristics to communication module 335. The communication module 335 may provide the selected text and desired characteristic information to the communication module 150. The communication module 150 may provide the information to the characteristic extraction subsystem 340 and the textual extraction subsystem 345. The characteristic extraction subsystem 340 may extract the selected characteristics, format the characteristics as needed by the modelling module 360, and submit the characteristic information to the modelling module 360. The modelling module 360 uses the voice models to map the desired characteristics to the features of the voice for the user with the desired characteristics. The modelling module 360 provides the predicted vocal features to the audio generation subsystem 350. Further, the textual extraction subsystem 345 may extract the selected text from the information and provide the selected text to the audio generation subsystem 350. The audio generation subsystem 350 uses the selected text and the predicted vocal features to generate an audio sample of a predicted voice representation based on the selected features to vocalize the selected text. The audio sample may be provided to the communication module 150. The communication module 150 provides the audio sample to the communication module 335. The communication module 335 provides the audio sample to the UI subsystem 320. The UI subsystem 320 outputs the audio sample to the speaker 315 for the user to hear.
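The generation path of FIG. 3 might be sketched as follows; the mapping network and the synthesize callable are hypothetical stand-ins for modelling module 360 and audio generation subsystem 350, respectively, and are not taken from the disclosure:

```python
# Hedged sketch: desired characteristics -> predicted vocal features -> synthesized audio.
import torch
import torch.nn as nn

class CharacteristicsToVoiceFeatures(nn.Module):
    """Hypothetical stand-in for modelling module 360."""
    def __init__(self, num_characteristics: int = 8, num_voice_features: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_characteristics, 128), nn.ReLU(),
            nn.Linear(128, num_voice_features),
        )

    def forward(self, characteristics: torch.Tensor) -> torch.Tensor:
        return self.net(characteristics)

def generate_voice_representation(selected_text, desired_characteristics, mapper, synthesize):
    voice_features = mapper(desired_characteristics)    # predicted vocal features
    return synthesize(selected_text, voice_features)    # analogue of audio generation subsystem 350
```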
[0076] FIG. 4 is a flow diagram of a method 400 for generating voice models by training a machine learning algorithm. Alternative embodiments may vary in function by combining, separating, or otherwise varying the functionality described in the blocks illustrated in FIG. 4. Means for performing the functionality of one or more of the blocks illustrated in FIG. 4 may comprise hardware and/or software components of a computer system, such as the computer system 800 illustrated in FIG. 8 and described in more detail below.
[0077] At block 405, the training system (e.g., training system 110) may provide a textual phrase for a user to vocalize to a user device (e.g., user device 105a). The textual phrase may be generated randomly or selected from a database of phrases by, for example, text generation module 155. Means for performing the functionality at block 405 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
[0078] At block 410, the training system 110 may receive known characteristics of the user and a vocalized phrase associated with the user (e.g., spoken by the user), the vocalized phrase being a vocalization of the textual phrase. The vocalized phrase can be received, for example, by a microphone. For example, the textual phrase may be output to a display (e.g., display 120) for the user to view, and the display may request that the user vocalize the textual phrase. For example, the display may ask the user to say the phrase to access the user device (e.g., user device 105a). Once vocalized, the user device may embed the audio sample obtained when the user speaks the phrase using the microphone (e.g., microphone 125). The embedded audio sample, which may be a mel spectrogram based on the embedding, is provided to the training system (e.g., training system 110). Means for performing the functionality at block 410 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
[0079] At block 415, the training system 110 may extract features from the vocalized phrase. For example, the audio sample, which may be embedded into a mel spectrogram (or any other suitable representation), may have features extracted including, for example, identifying information related to the user’s vocabulary, accent, and/or temperament. The extracted features may also include fundamental frequency modulations, phonemes, formants, harmonics, and/or a noise profile for the vocalization. The features may be extracted by, for example, feature extraction module 180. Means for performing the functionality at block 415 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
[0080] At block 420, the training system 110 may generate a voice model by training a machine learning algorithm (e.g., modelling module 160) using the extracted features and the known characteristics of the user. As described above, the machine learning algorithm may receive the extracted features and map the features to predict the characteristics of the user. In some embodiments, the machine learning algorithm may receive the raw audio and generate feature representations to map to the characteristics of the user. During training of the machine learning algorithm, the known characteristics of the user may be provided to a loss calculation subsystem for generating a loss value based on a comparison of the predicted characteristics and the known characteristics. The loss value may be fed back to the machine learning algorithm to adjust parameters of the machine learning algorithm to reduce the loss value. Means for performing the functionality at block 420 may include one or more software and/or hardware components of a computer system, such as a bus 805, processing unit(s) 810, memory 835, communication subsystem 830, and/or other software and/or hardware components of a computer system 800 as illustrated in FIG. 8 and described in more detail below.
[0081] As indicated in the previously described examples, the method 400 may include any of a variety of additional features, depending on desired functionality. For example, in some cases many users provide multiple vocalized phrases and known characteristics of the users (e.g., age, gender, nationality, primary language, geographic region of residence, education level, and so forth) to the training system.
[0082] Other additional features of method 400 may include deploying the voice model and receiving a vocalization to which the machine learning algorithm predicts characteristics of the user. In some embodiments, rather than predicting the characteristics of the user from an audio sample of a user’s vocalization, the deployed voice model may predict vocal features of a user based on receipt of desired characteristics of the user. For example, the voice model may be used to map features of the user’s voice (e.g., phonemes, formants, frequency modulations and/or a noise profile that are related to the user’s vocabulary, accent, and/or temperament, etc.) to characteristics of the user (age, gender, geographical region of residence, etc.). The same voice models may be used to map the known or desired characteristics of the user to features of the user’s voice.
[0083] In embodiments in which the deployed voice model predicts vocal features of a voice representation based on desired characteristics (i.e., speech synthesis), the speech synthesis may be employed using a parametric speech synthesizer where the parameters are provided by the learned feature-characteristic representation. The speech synthesis may also be achieved by directly generating mel-spectrograms from text data and conditioned by user characteristics. Speech synthesis from text data and user characteristics may also be employed by using a conditioned autoregressive or generative model generating audio samples.
[0084] Other additional features of the method 400 may include storing the vocalized phrase, the textual phrase, and extracted features of the vocalization in a data set or a database with other vocalizations and related information from the user or other users.

[0085] As described above, various aspects of the present disclosure can use machine learning systems, such as training system 110, the modelling module 160, and/or the modelling module 260 described above. FIG. 5 is an illustrative example of a deep learning neural network 500 that can be used to implement the voice modelling system described above. The deep learning neural network includes an input layer 520 that includes input data. In one illustrative example, the input layer 520 can include data representing the pixels of an input video frame. The neural network 500 also includes multiple hidden layers 522a, 522b, through 522n. The hidden layers 522a, 522b, through 522n include “n” number of hidden layers, where “n” is an integer greater than or equal to one. The number of hidden layers can be made to include as many layers as needed for the given application. The neural network 500 further includes an output layer 521 that provides an output resulting from the processing performed by the hidden layers 522a, 522b, through 522n. In one illustrative example, the output layer 521 can provide a classification for an object in an input video frame. The classification can include a class identifying the type of activity (e.g., playing soccer, playing piano, listening to piano, playing guitar, etc.).
[0086] The neural network 500 is a multi-layer neural network of interconnected nodes. Each node can represent a piece of information. Information associated with the nodes is shared among the different layers and each layer retains information as information is processed. In some cases, the neural network 500 can include a feed-forward network, in which case there are no feedback connections where outputs of the network are fed back into itself. In some cases, the neural network 500 can include a recurrent neural network, which can have loops that allow information to be carried across nodes while reading in input.
[0087] Information can be exchanged between nodes through node-to-node interconnections between the various layers. Nodes of the input layer 520 can activate a set of nodes in the first hidden layer 522a. For example, as shown, each of the input nodes of the input layer 520 is connected to each of the nodes of the first hidden layer 522a. The nodes of the first hidden layer 522a can transform the information of each input node by applying activation functions to the input node information. The information derived from the transformation can then be passed to and can activate the nodes of the next hidden layer 522b, which can perform their own designated functions. Example functions include convolutional, up-sampling, data transformation, and/or any other suitable functions. The output of the hidden layer 522b can then activate nodes of the next hidden layer, and so on. The output of the last hidden layer 522n can activate one or more nodes of the output layer 521, at which an output is provided. In some cases, while nodes (e.g., node 526) in the neural network 500 are shown as having multiple output lines, a node has a single output and all lines shown as being output from a node represent the same output value.
[0088] In some cases, each node or interconnection between nodes can have a weight that is a set of parameters derived from the training of the neural network 500. Once the neural network 500 is trained, it can be referred to as a trained neural network, which can be used to classify one or more activities. For example, an interconnection between nodes can represent a piece of information learned about the interconnected nodes. The interconnection can have a tunable numeric weight that can be tuned (e.g., based on a training data set), allowing the neural network 500 to be adaptive to inputs and able to learn as more and more data is processed.
[0089] The neural network 500 is pre-trained to process the features from the data in the input layer 520 using the different hidden layers 522a, 522b, through 522n in order to provide the output through the output layer 521. In an example in which the neural network 500 is used to identify activities being performed by a driver in frames, the neural network 500 can be trained using training data that includes both frames and labels, as described above. For instance, training frames can be input into the network, with each training frame having a label indicating the features in the frames (for the feature extraction machine learning system) or a label indicating classes of an activity in each frame. In one example using object classification for illustrative purposes, a training frame can include an image of a number 2, in which case the label for the image can be [0 0 1 0 0 0 0 0 0 0]
[0090] In some cases, the neural network 500 can adjust the weights of the nodes using a training process called backpropagation. As noted above, a backpropagation process can include a forward pass, a loss function, a backward pass, and a weight update. The forward pass, loss function, backward pass, and parameter update are performed for one training iteration. The process can be repeated for a certain number of iterations for each set of training images until the neural network 500 is trained well enough so that the weights of the layers are accurately tuned.

[0091] For the example of identifying objects in frames, the forward pass can include passing a training frame through the neural network 500. The weights are initially randomized before the neural network 500 is trained. As an illustrative example, a frame can include an array of numbers representing the pixels of the image. Each number in the array can include a value from 0 to 255 describing the pixel intensity at that position in the array. In one example, the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (e.g., red, green, and blue, or luma and two chroma components, or the like).
[0092] As noted above, for a first training iteration for the neural network 500, the output will likely include values that do not give preference to any particular class due to the weights being randomly selected at initialization. For example, if the output is a vector with probabilities that the object includes different classes, the probability value for each of the different classes may be equal or at least very similar (e.g., for ten possible classes, each class may have a probability value of 0.1). With the initial weights, the neural network 500 is unable to determine low level features and thus cannot make an accurate determination of what the classification of the object might be. A loss function can be used to analyze error in the output. Any suitable loss function definition can be used, such as a Cross-Entropy loss. Another example of a loss function includes the mean squared error (MSE), defined as E_total = Σ ½(target - output)^2. The loss can be set to be equal to the value of E_total.

[0093] The loss (or error) will be high for the first training images since the actual values will be much different than the predicted output. The goal of training is to minimize the amount of loss so that the predicted output is the same as the training label. The neural network 500 can perform a backward pass by determining which inputs (weights) most contributed to the loss of the network, and can adjust the weights so that the loss decreases and is eventually minimized. A derivative of the loss with respect to the weights (denoted as dL/dW, where W are the weights at a particular layer) can be computed to determine the weights that contributed most to the loss of the network. After the derivative is computed, a weight update can be performed by updating all the weights of the filters. For example, the weights can be updated so that they change in the opposite direction of the gradient. The weight update can be denoted as w = w_i - η(dL/dW), where w denotes a weight, w_i denotes the initial weight, and η denotes a learning rate. The learning rate can be set to any suitable value, with a high learning rate including larger weight updates and a lower value indicating smaller weight updates.
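A tiny numeric illustration of this update rule (a single weight, an MSE-style loss, and arbitrarily chosen values) is:

```python
# Gradient descent on one weight: w_new = w_init - eta * dL/dW, with
# L = 0.5 * (target - output)^2 and output = w * x.
w = 0.5                              # initial (randomly chosen) weight
eta = 0.1                            # learning rate
x, target = 2.0, 3.0                 # a single training example
for _ in range(25):
    output = w * x
    grad = (output - target) * x     # dL/dW
    w = w - eta * grad               # step in the opposite direction of the gradient
print(w)                             # converges toward 1.5, where the loss is minimized
```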
[0094] The neural network 500 can include any suitable deep network. One example includes a convolutional neural network (CNN), which includes an input layer and an output layer, with multiple hidden layers between the input and output layers. The hidden layers of a CNN include a series of convolutional, nonlinear, pooling (for downsampling), and fully connected layers. The neural network 500 can include any other deep network other than a CNN, such as an autoencoder, deep belief networks (DBNs), recurrent neural networks (RNNs), among others.
[0095] FIG. 6 is an illustrative example of a convolutional neural network (CNN) 600. The input layer 620 of the CNN 600 includes data representing an image or frame. For example, the data can include an array of numbers representing the pixels of the image, with each number in the array including a value from 0 to 255 describing the pixel intensity at that position in the array. Using the previous example from above, the array can include a 28 x 28 x 3 array of numbers with 28 rows and 28 columns of pixels and 3 color components (e.g., red, green, and blue, or luma and two chroma components, or the like). The image can be passed through a convolutional hidden layer 622a, an optional non-linear activation layer, a pooling hidden layer 622b, and fully connected hidden layers 622c to get an output at the output layer 624. While only one of each hidden layer is shown in FIG. 6, one of ordinary skill will appreciate that multiple convolutional hidden layers, non-linear layers, pooling hidden layers, and/or fully connected layers can be included in the CNN 600. As previously described, the output can indicate a single class of an object or can include a probability of classes that best describe the object in the image.
[0096] The first layer of the CNN 600 is the convolutional hidden layer 622a. The convolutional hidden layer 622a analyzes the image data of the input layer 620. Each node of the convolutional hidden layer 622a is connected to a region of nodes (pixels) of the input image called a receptive field. The convolutional hidden layer 622a can be considered as one or more filters (each filter corresponding to a different activation or feature map), with each convolutional iteration of a filter being a node or neuron of the convolutional hidden layer 622a. For example, the region of the input image that a filter covers at each convolutional iteration would be the receptive field for the filter. In one illustrative example, if the input image includes a 28x28 array, and each filter (and corresponding receptive field) is a 5x5 array, then there will be 24x24 nodes in the convolutional hidden layer 622a. Each connection between a node and a receptive field for that node learns a weight and, in some cases, an overall bias such that each node learns to analyze its particular local receptive field in the input image. Each node of the hidden layer 622a will have the same weights and bias (called a shared weight and a shared bias). For example, the filter has an array of weights (numbers) and the same depth as the input. A filter will have a depth of 3 for the video frame example (according to three color components of the input image). An illustrative example size of the filter array is 5 x 5 x 3, corresponding to a size of the receptive field of a node.
[0097] The convolutional nature of the convolutional hidden layer 622a is due to each node of the convolutional layer being applied to its corresponding receptive field. For example, a filter of the convolutional hidden layer 622a can begin in the top-left corner of the input image array and can convolve around the input image. As noted above, each convolutional iteration of the filter can be considered a node or neuron of the convolutional hidden layer 622a. At each convolutional iteration, the values of the filter are multiplied with a corresponding number of the original pixel values of the image (e.g., the 5x5 filter array is multiplied by a 5x5 array of input pixel values at the top-left corner of the input image array). The multiplications from each convolutional iteration can be summed together to obtain a total sum for that iteration or node. The process is next continued at a next location in the input image according to the receptive field of a next node in the convolutional hidden layer 622a. For example, a filter can be moved by a step amount (referred to as a stride) to the next receptive field. The stride can be set to 1 or other suitable amount. For example, if the stride is set to 1, the filter will be moved to the right by 1 pixel at each convolutional iteration. Processing the filter at each unique location of the input volume produces a number representing the filter results for that location, resulting in a total sum value being determined for each node of the convolutional hidden layer 622a.
[0098] The mapping from the input layer to the convolutional hidden layer 622a is referred to as an activation map (or feature map). The activation map includes a value for each node representing the filter results at each location of the input volume. The activation map can include an array that includes the various total sum values resulting from each iteration of the filter on the input volume. For example, the activation map will include a 24 x 24 array if a 5 x 5 filter is applied to each pixel (a stride of 1) of a 28 x 28 input image. The convolutional hidden layer 622a can include several activation maps in order to identify multiple features in an image. The example shown in FIG. 6 includes three activation maps. Using three activation maps, the convolutional hidden layer 622a can detect three different kinds of features, with each feature being detectable across the entire image.
[0099] In some examples, a non-linear hidden layer can be applied after the convolutional hidden layer 622a. The non-linear layer can be used to introduce non-linearity to a system that has been computing linear operations. One illustrative example of a non-linear layer is a rectified linear unit (ReLU) layer. A ReLU layer can apply the function f(x) = max(0, x) to all of the values in the input volume, which changes all the negative activations to 0. The ReLU can thus increase the non-linear properties of the CNN 600 without affecting the receptive fields of the convolutional hidden layer 622a.
[0100] The pooling hidden layer 622b can be applied after the convolutional hidden layer 622a (and after the non-linear hidden layer when used). The pooling hidden layer 622b is used to simplify the information in the output from the convolutional hidden layer 622a. For example, the pooling hidden layer 622b can take each activation map output from the convolutional hidden layer 622a and generate a condensed activation map (or feature map) using a pooling function. Max-pooling is one example of a function performed by a pooling hidden layer. Other forms of pooling functions can be used by the pooling hidden layer 622b, such as average pooling, L2-norm pooling, or other suitable pooling functions. A pooling function (e.g., a max-pooling filter, an L2-norm filter, or other suitable pooling filter) is applied to each activation map included in the convolutional hidden layer 622a. In the example shown in FIG. 6, three pooling filters are used for the three activation maps in the convolutional hidden layer 622a.
[0101] In some examples, max-pooling can be used by applying a max-pooling filter (e.g., having a size of 2x2) with a stride (e.g., equal to a dimension of the filter, such as a stride of 2) to an activation map output from the convolutional hidden layer 622a. The output from a max-pooling filter includes the maximum number in every sub-region that the filter convolves around. Using a 2x2 filter as an example, each unit in the pooling layer can summarize a region of 2x2 nodes in the previous layer (with each node being a value in the activation map). For example, four values (nodes) in an activation map will be analyzed by a 2x2 max-pooling filter at each iteration of the filter, with the maximum value from the four values being output as the “max” value. If such a max-pooling filter is applied to an activation map from the convolutional hidden layer 622a having a dimension of 24x24 nodes, the output from the pooling hidden layer 622b will be an array of 12x12 nodes.
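The shapes quoted above can be checked with a few lines of PyTorch (the three output channels simply stand in for the three activation maps; this snippet is illustrative and not part of the disclosure):

```python
# 5x5 convolution (stride 1, no padding) over a 28x28 input -> 24x24 maps;
# 2x2 max-pooling with stride 2 then reduces each map to 12x12.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 28, 28)                     # one 28x28 frame with 3 color components
conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5, stride=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

activations = torch.relu(conv(x))                 # shape: (1, 3, 24, 24)
pooled = pool(activations)                        # shape: (1, 3, 12, 12)
print(activations.shape, pooled.shape)
```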
[0102] In some examples, an L2-norm pooling filter could also be used. The L2-norm pooling filter includes computing the square root of the sum of the squares of the values in the 2x2 region (or other suitable region) of an activation map (instead of computing the maximum values as is done in max-pooling), and using the computed values as an output.
[0103] Intuitively, the pooling function (e.g., max-pooling, L2-norm pooling, or other pooling function) determines whether a given feature is found anywhere in a region of the image, and discards the exact positional information. This can be done without affecting results of the feature detection because, once a feature has been found, the exact location of the feature is not as important as its approximate location relative to other features. Max-pooling (as well as other pooling methods) offers the benefit that there are many fewer pooled features, thus reducing the number of parameters needed in later layers of the CNN 600.
[0104] The final layer of connections in the network is a fully-connected layer that connects every node from the pooling hidden layer 622b to every one of the output nodes in the output layer 624. Using the example above, the input layer includes 28 x 28 nodes encoding the pixel intensities of the input image, the convolutional hidden layer 622a includes 3x24x24 hidden feature nodes based on application of a 5x5 local receptive field (for the filters) to three activation maps, and the pooling hidden layer 622b includes a layer of 3x12x12 hidden feature nodes based on application of a max-pooling filter to 2x2 regions across each of the three feature maps. Extending this example, the output layer 624 can include ten output nodes. In such an example, every node of the 3x12x12 pooling hidden layer 622b is connected to every node of the output layer 624.
[0105] The fully connected layer 622c can obtain the output of the previous pooling hidden layer 622b (which should represent the activation maps of high-level features) and determine the features that most correlate to a particular class. For example, the fully connected layer 622c can determine the high-level features that most strongly correlate to a particular class, and can include weights (nodes) for the high-level features. A product can be computed between the weights of the fully connected layer 622c and the pooling hidden layer 622b to obtain probabilities for the different classes. For example, if the CNN 600 is being used to predict that an object in a video frame is a person, high values will be present in the activation maps that represent high-level features of people (e.g., two legs are present, a face is present at the top of the object, two eyes are present at the top left and top right of the face, a nose is present in the middle of the face, a mouth is present at the bottom of the face, and/or other features common for a person).
[0106] In some examples, the output from the output layer 624 can include an M-dimensional vector (in the prior example, M=10). M indicates the number of classes that the CNN 600 has to choose from when classifying the object in the image. Other example outputs can also be provided. Each number in the M-dimensional vector can represent the probability the object is of a certain class. In one illustrative example, if a 10-dimensional output vector representing ten different classes of objects is [0 0 0.05 0.8 0 0.15 0 0 0 0], the vector indicates that there is a 5% probability that the image is the third class of object (e.g., a dog), an 80% probability that the image is the fourth class of object (e.g., a human), and a 15% probability that the image is the sixth class of object (e.g., a kangaroo). The probability for a class can be considered a confidence level that the object is part of that class.
[0107] FIG. 7 illustrates an embodiment of a user equipment (UE) 700, which can be utilized as described herein above (e.g. in association with FIGS. 1-4). For example, the UE 700 can perform one or more of the functions of method 400 of FIG. 4. It should be noted that FIG. 7 is meant only to provide a generalized illustration of various components, any or all of which may be utilized as appropriate. It can be noted that, in some instances, components illustrated by FIG. 7 can be localized to a single physical device and/or distributed among various networked devices, which may be disposed at different physical locations (e.g., located at different parts of a user’s body, in which case the components may be communicatively connected via a Personal Area Network (PAN) and/or other means). [0108] The UE 700 is shown comprising hardware elements that can be electrically coupled via a bus 705 (or may otherwise be in communication, as appropriate). The hardware elements may include a processing unit(s) 710 which can include without limitation one or more general-purpose processors, one or more special-purpose processors (such as digital signal processing (DSP) chips, graphics acceleration processors, application specific integrated circuits (ASICs), and/or the like), and/or other processing structure or means. As shown in FIG. 7, some embodiments may have a separate Digital Signal Processor (DSP) 720, depending on desired functionality. Location determination and/or other determinations based on wireless communication may be provided in the processing unit(s) 710 and/or wireless communication interface 730 (discussed below). The UE 700 also can include one or more input devices 770, which can include without limitation a keyboard, touch screen, a touch pad, microphone, button(s), dial(s), switch(es), and/or the like; and one or more output devices 715, which can include without limitation a display, light emitting diode (LED), speakers, and/or the like.
[0109] The UE 700 may also include a wireless communication interface 730, which may comprise without limitation a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as a Bluetooth® device, an IEEE 802.11 device, an IEEE 802.15.4 device, a WiFi device, a WiMax device, a WAN device and/or various cellular devices, etc.), and/or the like, which may enable the UE 700 to communicate via the networks described above with regard to FIG. 1. The wireless communication interface 730 may permit data and signaling to be communicated (e.g. transmitted and received) with a network, for example, via eNBs, gNBs, ng-eNBs, access points, various base stations and/or other access node types, and/or other network components, computer systems, and/or any other electronic devices described herein. The communication can be carried out via one or more wireless communication antenna(s) 732 that send and/or receive wireless signals 734.
[0110] Depending on desired functionality, the wireless communication interface 730 may comprise separate transceivers to communicate with base stations (e.g., ng-eNBs and gNBs) and other terrestrial transceivers, such as wireless devices and access points. The UE 700 may communicate with different data networks that may comprise various network types. For example, a Wireless Wide Area Network (WWAN) may be a Code Division Multiple Access (CDMA) network, a Time Division Multiple Access (TDMA) network, a Frequency Division Multiple Access (FDMA) network, an Orthogonal Frequency Division Multiple Access (OFDMA) network, a Single-Carrier Frequency Division Multiple Access (SC-FDMA) network, a WiMax (IEEE 802.16) network, and so on. A CDMA network may implement one or more radio access technologies (RATs) such as CDMA2000, Wideband CDMA (WCDMA), and so on. Cdma2000 includes IS-95, IS-2000, and/or IS-856 standards. A TDMA network may implement GSM, Digital Advanced Mobile Phone System (D-AMPS), or some other RAT. An OFDMA network may employ LTE, LTE Advanced, 5G NR, and so on. 5G NR, LTE, LTE Advanced, GSM, and WCDMA are described in documents from the Third Generation Partnership Project (3GPP). Cdma2000 is described in documents from a consortium named “3rd Generation Partnership Project 2” (3GPP2). 3GPP and 3GPP2 documents are publicly available. A wireless local area network (WLAN) may also be an IEEE 802.11x network, and a wireless personal area network (WPAN) may be a Bluetooth network, an IEEE 802.15x network, or some other type of network. The techniques described herein may also be used for any combination of WWAN, WLAN and/or WPAN.
[0111] The UE 700 can further include sensor(s) 740. Sensors 740 may comprise, without limitation, one or more inertial sensors and/or other sensors (e.g., accelerometer(s), gyroscope(s), camera(s), magnetometer(s), altimeter(s), microphone(s), proximity sensor(s), light sensor(s), barometer(s), and the like), some of which may be used to complement and/or facilitate the position determination described herein, in some instances.
[0112] Embodiments of the UE 700 may also include a GNSS receiver 780 capable of receiving signals 784 from one or more GNSS satellites using an antenna 782 (which could be the same as antenna 732). Positioning based on GNSS signal measurement can be utilized to complement and/or incorporate the techniques described herein. The GNSS receiver 780 can extract a position of the UE 700, using conventional techniques, from GNSS SVs of a GNSS system, such as Global Positioning System (GPS), Galileo, Glonass, Quasi-Zenith Satellite System (QZSS) over Japan, Indian Regional Navigational Satellite System (IRNSS) over India, Beidou over China, and/or the like. Moreover, the GNSS receiver 780 can be used with various augmentation systems (e.g., a Satellite Based Augmentation System (SBAS)) that may be associated with or otherwise enabled for use with one or more global and/or regional navigation satellite systems, such as, e.g., WAAS, EGNOS, Multi-functional Satellite Augmentation System (MSAS), and Geo Augmented Navigation system (GAGAN), and/or the like.
[0113] The UE 700 may further include and/or be in communication with a memory 760. The memory 760 can include, without limitation, local and/or network accessible storage, a disk drive, a drive array, an optical storage device, a solid-state storage device, such as a random access memory (“RAM”), and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including without limitation, various file systems, database structures, and/or the like.
[0114] The memory 760 of the UE 700 also can comprise software elements (not shown in FIG. 7), including an operating system, device drivers, executable libraries, and/or other code, such as one or more application programs, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above may be implemented as code and/or instructions in memory 760 that are executable by the UE 700 (and/or processing unit(s) 710 or DSP 720 within UE 700). In an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0115] FIG. 8 illustrates an embodiment of a computer system 800, which may be utilized and/or incorporated into one or more components of a system (e.g., voice modelling system 100 of FIG. 1). FIG. 8 provides a schematic illustration of one embodiment of a computer system 800 that can perform the methods provided by various other embodiments, such as the methods described in relation to FIG. 4. It should be noted that FIG. 8 is meant only to provide a generalized illustration of various components, any or all of which may be utilized as appropriate. FIG. 8, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner. In addition, it can be noted that components illustrated by FIG. 8 can be localized to a single device and/or distributed among various networked devices, which may be disposed at different physical or geographical locations. In some embodiments, the computer system 800 may correspond to components of a training system such as user devices 105a, 105b, through 105n, training system 110, capture device 205, prediction system 210, user device 305, or prediction system 310.
[0116] The computer system 800 is shown comprising hardware elements that can be electrically coupled via a bus 805 (or may otherwise be in communication, as appropriate). The hardware elements may include processing unit(s) 810, which can include without limitation one or more general-purpose processors, one or more special-purpose processors (such as digital signal processing chips, graphics acceleration processors, and/or the like), and/or other processing structure, which can be configured to perform one or more of the methods described herein, including the method described in relation to FIG. 4. The computer system 800 also can include one or more input devices 815, which can include without limitation a mouse, a keyboard, a camera, a microphone, and/or the like; and one or more output devices 820, which can include without limitation a display device, a printer, and/or the like.
[0117] The computer system 800 may further include (and/or be in communication with) one or more non-transitory storage devices 825, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, a solid-state storage device, such as a random access memory (“RAM”), and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including without limitation, various file systems, database structures, and/or the like.
[0118] The computer system 800 may also include a communication subsystem 830, which can include support of wireline communication technologies and/or wireless communication technologies (in some embodiments) managed and controlled by a wireless communication interface 833. The communication subsystem 830 may include a modem, a network card (wireless or wired), an infrared communication device, a wireless communication device, and/or a chipset, and/or the like. The communication subsystem 830 may include one or more input and/or output communication interfaces, such as the wireless communication interface 833, to permit data and signaling to be exchanged with a network, mobile devices, other computer systems, and/or any other electronic devices described herein. Note that the terms “mobile device” and “UE” are used interchangeably herein to refer to any mobile communications device such as, but not limited to, mobile phones, smartphones, wearable devices, mobile computing devices (e.g., laptops, PDAs, tablets), embedded modems, and automotive and other vehicular computing devices.
[0119] In many embodiments, the computer system 800 will further comprise a working memory 835, which can include a RAM and/or ROM device. Software elements, shown as being located within the working memory 835, can include an operating system 840, device drivers, executable libraries, and/or other code, such as application(s) 845, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above, such as the method described in relation to FIG. 4, may be implemented as code and/or instructions that are stored (e.g. temporarily) in working memory 835 and are executable by a computer (and/or a processing unit within a computer such as processing unit(s) 810); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
[0120] A set of these instructions and/or code might be stored on a non-transitory computer-readable storage medium, such as the storage device(s) 825 described above. In some cases, the storage medium might be incorporated within a computer system, such as computer system 800. In other embodiments, the storage medium might be separate from a computer system (e.g., a removable medium, such as an optical disc), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 800 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 800 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.), then takes the form of executable code.
[0121] It will be apparent to those skilled in the art that substantial variations may be made in accordance with specific requirements. For example, customized hardware might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.
[0122] With reference to the appended figures, components that can include memory can include non-transitory machine-readable media. The terms “machine-readable medium” and “computer-readable medium,” as used herein, refer to any storage medium that participates in providing data that causes a machine to operate in a specific fashion. In embodiments provided hereinabove, various machine-readable media might be involved in providing instructions/code to processing units and/or other device(s) for execution. Additionally or alternatively, the machine-readable media might be used to store and/or carry such instructions/code. In many implementations, a computer-readable medium is a physical and/or tangible storage medium. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Common forms of computer-readable media include, for example, magnetic and/or optical media, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
[0123] The methods, systems, and devices discussed herein are examples. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For instance, features described with respect to certain embodiments may be combined in various other embodiments. Different aspects and elements of the embodiments may be combined in a similar manner. The various components of the figures provided herein can be embodied in hardware and/or software. Also, technology evolves and, thus, many of the elements are examples that do not limit the scope of the disclosure to those specific examples.
[0124] It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, information, values, elements, symbols, characters, variables, terms, numbers, numerals, or the like. It should be understood, however, that all of these or similar terms are to be associated with appropriate physical quantities and are merely convenient labels. Unless specifically stated otherwise, as is apparent from the discussion above, it is appreciated that throughout this Specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “ascertaining,” “identifying,” “associating,” “measuring,” “performing,” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic computing device. In the context of this Specification, therefore, a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic, electrical, or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.
[0125] The terms “and” and “or” as used herein may include a variety of meanings that are expected to depend at least in part upon the context in which such terms are used. Typically, “or” and “and/or,” if used to associate a list, such as A, B, or C, are intended to mean A, B, and C, here used in the inclusive sense, as well as A, B, or C, here used in the exclusive sense. In addition, the phrases “one or more” and “at least one of” as used herein may be used to describe any feature, structure, or characteristic in the singular or may be used to describe some combination of features, structures, or characteristics. However, it should be noted that this is merely an illustrative example and claimed subject matter is not limited to this example. Furthermore, the term “at least one of,” if used to associate a list, such as A, B, or C or A, B, and/or C, can be interpreted to mean any combination of A, B, and/or C, such as A, AB, AC, BC, ABC, AAB, AABBCCC, etc.
[0126] Having described several embodiments, various modifications, alternative constructions, and equivalents may be used without departing from the spirit of the disclosure. For example, the above elements may merely be a component of a larger system, wherein other rules may take precedence over or otherwise modify the application of the various embodiments. Also, a number of steps may be undertaken before, during, or after the above elements are considered. Accordingly, the above description does not limit the scope of the disclosure.
[0127] Illustrative aspects of the disclosure include:
[0128] Aspect 1: A computer-implemented method, comprising: providing, by a computing system to a user device, a textual phrase for a user to vocalize; receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extracting, by the computing system, features from the vocalized phrase; and generating a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user (see the non-limiting illustrative sketch following Aspect 26 below).
[0129] Aspect 2: The computer-implemented method of aspect 1, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
[0130] Aspect 3: The computer-implemented method of any one of aspects 1 or 2, wherein the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
[0131] Aspect 4: The computer-implemented method of any one of aspects 1 to 3, wherein extracted features and known characteristics of many users are used to train the machine learning algorithm.
[0132] Aspect 5: The computer-implemented method of any one of aspects 1 to 4, further comprising: receiving, by the computing system, a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; providing, by the computing system, the second vocalized phrase to the voice model; and generating, by the voice model, a prediction of characteristics of the second user.
[0133] Aspect 6: The computer-implemented method of aspect 5, wherein the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
[0134] Aspect 7: The computer-implemented method of any one of aspects 1 to 6, further comprising: providing, by the computing system, a textual input and at least one desired characteristic to the voice model; and receiving, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristic.

[0135] Aspect 8: The computer-implemented method of aspect 7, wherein the voice model maps the desired characteristic to features of a voice to generate the predicted voice representation used to generate the audio sample.
[0136] Aspect 9: The computer-implemented method of any one of aspects 1 to 8, wherein the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
[0137] Aspect 10: The computer-implemented method of aspect 7, wherein the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
[0138] Aspect 11: The computer-implemented method of any one of aspects 1 to 10, wherein training the machine learning algorithm is based on a context of the vocalized phrase.
[0139] Aspect 12: The computer-implemented method of any one of aspects 1 to 11, wherein the context of the vocalized phrase includes background noise or environmental sounds.
[0140] Aspect 13: A system, comprising: one or more processors; and a memory having stored thereon instructions that, upon execution of the instructions by the one or more processors, cause the one or more processors to: provide, to a user device, a textual phrase for a user to vocalize; receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase; extract features from the vocalized phrase; and generate a voice model by training a machine learning algorithm using the extracted features and the known characteristics of the user.
[0141] Aspect 14: The system of aspect 13, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
[0142] Aspect 15: The system of any one of aspects 13 or 14, wherein the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
[0143] Aspect 16: The system of any one of aspects 13 to 15, wherein extracted features and known characteristics of many users are used to train the machine learning algorithm.
[0144] Aspect 17: The system of any one of aspects 13 to 16, wherein the instructions comprise further instructions that, upon execution by the one or more processors, cause the one or more processors to: receive a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; provide the second vocalized phrase to the voice model; and generate, by the voice model, a prediction of characteristics of the second user.
[0145] Aspect 18: The system of aspect 17, wherein the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
[0146] Aspect 19: The system of any one of aspects 13 to 18, wherein the instructions comprise further instructions that, upon execution by the one or more processors, cause the one or more processors to: provide a textual input and desired characteristics to the voice model; and receive, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristics.
[0147] Aspect 20: The system of aspect 19, wherein the voice model maps the desired characteristics to features of a voice to generate the predicted voice representation used to generate the audio sample.
[0148] Aspect 21: The system of any one of aspects 13 to 20, wherein the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
[0149] Aspect 22: The system of aspect 21, wherein the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.

[0150] Aspect 23: The system of any one of aspects 13 to 22, wherein training the machine learning algorithm is based on a context of the vocalized phrase.
[0151] Aspect 24: The system of aspect 23, wherein the context of the vocalized phrase includes background noise or environmental sounds.
[0152] Aspect 25: A non-transitory computer-readable storage medium having stored thereon instructions which, when executed by one or more processors, cause the one or more processors to perform any of the operations of aspects 1 to 24.
[0153] Aspect 26: An apparatus comprising means for performing any of the operations of aspects 1 to 24.
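Merely by way of example, and not as a limitation of any aspect, the enrollment and prediction flow of Aspects 1 through 6 could be approximated with commonly available tooling as in the following sketch. The audio file names, the age-bracket characteristic, the particular acoustic features (pitch statistics and MFCCs standing in loosely for the fundamental frequency modulation, formant, and harmonic features of Aspect 2), and the random-forest learner are illustrative assumptions rather than requirements of the disclosure.

import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(wav_path):
    # Load the vocalized phrase and derive a compact feature vector:
    # per-frame pitch estimates summarized by mean/std plus MFCC statistics.
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # fundamental frequency track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral envelope features
    return np.concatenate([[f0.mean(), f0.std()],
                           mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical enrollment data: recordings of users vocalizing the provided
# textual phrase, each paired with one known characteristic (here, an age bracket).
recordings = ["user1.wav", "user2.wav", "user3.wav"]   # placeholder file names
known_characteristic = ["20-30", "30-40", "20-30"]

X = np.stack([extract_features(path) for path in recordings])
voice_model = RandomForestClassifier(n_estimators=100).fit(X, known_characteristic)

# Aspects 5 and 6: extract features from a second vocalized phrase and map
# them to a predicted characteristic of the second user.
second_features = extract_features("second_user.wav").reshape(1, -1)
predicted_characteristic = voice_model.predict(second_features)[0]

The reverse mapping of Aspects 7 and 8, from a textual input and a desired characteristic to a predicted voice representation, would analogously condition a speech synthesizer on the characteristic labels used here; in a deployed system the machine learning algorithm could equally be a neural network such as the CNN 600 described above, trained on the extracted features and known characteristics of many users.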

Claims

WHAT IS CLAIMED IS:
1. A computer-implemented method of generating one or more voice models, comprising:
providing, by a computing system to a user device, a textual phrase for a user to vocalize;
receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase;
extracting, by the computing system, features from the vocalized phrase; and
generating a voice model at least in part by training a machine learning algorithm using the extracted features and the known characteristics of the user.
2. The method of claim 1, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
3. The method of claim 1, wherein the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
4. The method of claim 1, wherein extracted features and known characteristics of many users are used to train the machine learning algorithm.
5. The method of claim 1, further comprising: receiving, by the computing system, a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; providing, by the computing system, the second vocalized phrase to the voice model; and generating, by the voice model, a prediction of characteristics of the second user.
6. The method of claim 5, wherein the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
7. The method of claim 1, further comprising: providing, by the computing system, a textual input and at least one desired characteristic to the voice model; and receiving, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristic.
8. The method of claim 7, wherein the voice model maps the desired characteristic to features of a voice to generate the predicted voice representation used to generate the audio sample.
9. The method of claim 1, wherein the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
10. The method of claim 9, wherein the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
11. The method of claim 1, wherein training the machine learning algorithm is based on a context of the vocalized phrase.
12. The method of claim 11, wherein the context of the vocalized phrase includes background noise or environmental sounds.
13. A system for generating one or more voice models, comprising:
one or more processors; and
a memory having stored thereon instructions that, upon execution of the instructions by the one or more processors, cause the one or more processors to:
provide, to a user device, a textual phrase for a user to vocalize;
receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase;
extract features from the vocalized phrase; and
generate a voice model at least in part by training a machine learning algorithm using the extracted features and the known characteristics of the user.
14. The system of claim 13, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
15. The system of claim 13, wherein the known characteristics of the user comprise at least one of an age of the user, a nationality of the user, a gender of the user, a primary language of the user, a geographic region of residence of the user, and/or an education level of the user.
16. The system of claim 13, wherein extracted features and known characteristics of many users are used to train the machine learning algorithm.
17. The system of claim 13, wherein the instructions comprise further instructions that, upon execution by the one or more processors, cause the one or more processors to: receive a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; provide the second vocalized phrase to the voice model; and generate, by the voice model, a prediction of characteristics of the second user.
18. The system of claim 17, wherein the voice model maps extracted features of the second vocalized phrase to characteristics to generate the prediction of characteristics of the second user.
19. The system of claim 13, wherein the instructions comprise further instructions that, upon execution by the one or more processors, cause the one or more processors to: provide a textual input and desired characteristics to the voice model; and receive, from the voice model, an audio sample comprising a predicted voice representation of a vocalization of the textual input, wherein the predicted voice representation is based at least in part on the desired characteristics.
20. The system of claim 19, wherein the voice model maps the desired characteristics to features of a voice to generate the predicted voice representation used to generate the audio sample.
21. The system of claim 13, wherein the voice model maps the extracted features of the vocalized phrase to the known characteristics of the user.
22. The system of claim 21, wherein the known characteristics of the user comprise at least one of information on a vocabulary of the user, information on an accent of the user, and/or information on a temperament of the user.
23. The system of claim 13, wherein training the machine learning algorithm is based on a context of the vocalized phrase.
24. The system of claim 23, wherein the context of the vocalized phrase includes background noise or environmental sounds.
25. A non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to:
provide, to a user device, a textual phrase for a user to vocalize;
receive, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase;
extract features from the vocalized phrase; and
generate a voice model at least in part by training a machine learning algorithm using the extracted features and the known characteristics of the user.
26. The non-transitory computer-readable medium of claim 25, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
27. The non-transitory computer-readable medium of claim 25, further comprising instructions that, when executed by one or more processors, cause the one or more processors to: receive a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; provide the second vocalized phrase to the voice model; and generate a prediction of characteristics of the second user using the voice model.
28. An apparatus for generating one or more voice models, comprising:
means for providing, to a user device, a textual phrase for a user to vocalize;
means for receiving, from the user device, known characteristics of the user and a vocalized phrase associated with the user, the vocalized phrase being a vocalization of the textual phrase;
means for extracting features from the vocalized phrase; and
means for generating a voice model at least in part by training a machine learning algorithm using the extracted features and the known characteristics of the user.
29. The apparatus of claim 28, wherein the extracted features comprise at least one of a fundamental frequency modulation, phoneme, formant, harmonics, and/or a noise profile.
30. The apparatus of claim 28, further comprising: means for receiving a second vocalized phrase associated with a second user, the second vocalized phrase being a vocalization of a second textual phrase; means for providing the second vocalized phrase to the voice model; and means for generating a prediction of characteristics of the second user using the voice model.
PCT/US2021/026476 2020-04-20 2021-04-08 Voice characteristic machine learning modelling WO2021216299A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202041016877 2020-04-20
IN202041016877 2020-04-20

Publications (1)

Publication Number Publication Date
WO2021216299A1 true WO2021216299A1 (en) 2021-10-28

Family

ID=75690720

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/026476 WO2021216299A1 (en) 2020-04-20 2021-04-08 Voice characteristic machine learning modelling

Country Status (1)

Country Link
WO (1) WO2021216299A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21722077

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21722077

Country of ref document: EP

Kind code of ref document: A1