GB2588689A - Personalized models - Google Patents

Personalized models

Info

Publication number
GB2588689A
Authority
GB
United Kingdom
Prior art keywords
model
user
dataset
updated
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1916021.7A
Other versions
GB201916021D0 (en)
GB2588689B (en)
Inventor
Gil Couto Pimentel Ramos Alberto
Chulhong Min
Kawsar Fahim
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1916021.7A priority Critical patent/GB2588689B/en
Publication of GB201916021D0 publication Critical patent/GB201916021D0/en
Publication of GB2588689A publication Critical patent/GB2588689A/en
Application granted granted Critical
Publication of GB2588689B publication Critical patent/GB2588689B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method includes receiving data relating to one or more user features from a user device 61, wherein the user device comprises a first model trained with a first dataset 21; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model 36 using the first subset of the first dataset 33; and deploying the updated first model 24 to the user device. The method is preferably performed by a server 31. It may be used to tailor a speech recognition model to a particular user based on their accent, by selecting data from the training dataset relating to similar accents and using this to update the model. Preferably, the data relating to user features is generated by a second model 63 on the user device, which is trained to produce such data based on a user dataset 62. The models can comprise neural networks.

Description

Personalized models
Field
The present specification relates to personalization of models.
Background
Training models (such as machine learning models) tailored to individuals presents a number of practical challenges. There remains a need for further improvements in this field.
Summary
In a first aspect, this specification provides an apparatus comprising means for performing: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
Some examples include means for performing: deploying a second model to the user device (e.g. deploying the second model by receiving the second model or by receiving parameters of the second model at the user device), wherein the second model is trained to generate the data relating to the one or more user features based on a user dataset.
Some examples include means for performing: training the second model. In some examples, the second model is trained using the first dataset.
In some examples, the first model is used for performing speech recognition.
Alternatively, or in addition, the first model is used for facial recognition.
In some examples, the data relating to user features comprises user vectors.
In some examples, the one or more user features are derived from user speech.
In some examples, the first model comprises a first machine learning model. In some examples, the second model may comprise a second machine learning model.
The means may comprise: at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured, with the at least one processor, to cause the performance of the apparatus.
In a second aspect, this specification provides an apparatus comprising means for performing: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
In some examples, deploying the first model comprises receiving parameters of the first model from the server or receiving the first model from the server.
Some examples include means for performing: deploying the second model.
Some examples include means for performing: generating the updated first model based on updated parameters received from the server. Alternatively, or in addition, the entire updated first model may be received.
In some examples, the first model is used for performing speech recognition. Alternatively, or in addition, the first model is used for facial recognition.
In some examples, the data relating to user features comprises user vectors.
In some examples, the one or more user features are derived from user speech.
In some examples, the first model comprises a first machine learning model. In some examples, the second model may comprise a second machine learning model.
The means may comprise: at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured, with the at least one processor, to cause the performance of the apparatus.
In a third aspect, this specification describes a method comprising: receiving data (e.g. user vectors) relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device. The user features may be derived from user speech.
Some examples include deploying a second model to the user device (e.g. deploying the second model by receiving the second model or by receiving parameters of the second model at the user device), wherein the second model is trained to generate the data relating to the one or more user features based on a user dataset.
Some examples include training the second model. In some examples, the second model is trained using the first dataset.
In some examples, the first model is used for performing speech recognition. Alternatively, or in addition, the first model is used for facial recognition.
In some examples, the first model comprises a first machine learning model. In some examples, the second model may comprise a second machine learning model.
In a fourth aspect, this specification describes an apparatus configured to perform any method as described with reference to the third aspect.
In a fifth aspect, this specification describes computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform any method as described with reference to the third aspect.
In a sixth aspect, this specification describes a method comprising: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
In some examples, deploying the first model comprises receiving parameters of the first model from the server or receiving the first model from the server.
Some examples include deploying the second model.
Some examples include generating the updated first model based on updated parameters received from the server. Alternatively, or in addition, the entire updated first model may be received.
In some examples, the first model is used for performing speech recognition.
Alternatively, or in addition, the first model is used for facial recognition.
In some examples, the data relating to user features comprises user vectors.
In some examples, the one or more user features are derived from user speech.
In some examples, the first model comprises a first machine learning model. In some examples, the second model may comprise a second machine learning model.
In a seventh aspect, this specification describes an apparatus configured to perform any method as described with reference to the sixth aspect.
In an eighth aspect, this specification describes computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform any method as described with reference to the sixth aspect.
In a ninth aspect, this specification describes a computer program comprising instructions for causing an apparatus to perform at least the following: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
In a tenth aspect, this specification describes a computer-readable medium (such as a non-transitory computer-readable medium) comprising program instructions stored thereon for performing at least the following: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
In an eleventh aspect, this specification describes a computer program comprising instructions for causing an apparatus to perform at least the following: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
In a twelfth aspect, this specification describes a computer-readable medium (such as a non-transitory computer-readable medium) comprising program instructions stored thereon for performing at least the following: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
In a thirteenth aspect, this specification describes a system comprising: a server, wherein the server comprises means for performing: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device; and at least one user device, wherein the at least one user device comprises means for performing: deploying the first model; generating said data relating to the one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying the updated first model.
In a fourteenth aspect, this specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
In a fifteenth aspect, this specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
In a sixteenth aspect, this specification describes an apparatus comprising: an input module configured to receive data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; a selection module configured to select a first subset of the first dataset based on the received data relating to said one or more user features; an update module configured to generate an updated first model using the first subset of the first dataset; and a deployment module configured to deploy the updated first model to the user device.
In a seventeenth aspect, this specification describes an apparatus comprising: a first module configured to deploy a first model, wherein the first model is trained using a first dataset; a second module configured to generate data relating to one or more user features using a second model based on a user dataset; a third module configured to send the data relating to said one or more user features to a server; and a fourth module configured to deploy an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
Brief description of the drawings
Example embodiments will now be described, by way of example only, with reference to the following schematic drawings, in which:
FIGS. 1 to 3 are block diagrams of systems in accordance with example embodiments;
FIGS. 4 and 5 are flowcharts showing algorithms in accordance with example embodiments;
FIG. 6 is a block diagram of a system in accordance with an example embodiment;
FIGS. 7 and 8 are flowcharts showing algorithms in accordance with example embodiments;
FIG. 9 is a block diagram of a system in accordance with an example embodiment;
FIGS. 10 and 11 show neural networks used in some example embodiments;
FIG. 12 is a block diagram of components of a system in accordance with an example embodiment; and
FIGS. 13A and 13B show tangible media, respectively a removable non-volatile memory unit and a Compact Disc (CD) storing computer-readable code which when run by a computer perform operations according to example embodiments.
Detailed description
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in the specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
In the description and drawings, like reference numerals refer to like elements throughout.
FIG. 1 is a block diagram of a system, indicated generally by the reference numeral 10, in accordance with an example embodiment. The system 10 comprises a server 11 and a user device 12 in two-way communication with each other. The server 11 may, for example, be a cloud server, which may provide one or more configurations to the user device 12. The user device 12 may be any device used by the user for performing various functions. For example, the user device 12 may comprise functionalities such as voice and/or speech recognition, image processing, facial recognition, fingerprint recognition, or the like. One or more models, such as machine learning models, may be used for implementing at least part of such functionalities. The models may be trained at a server, such as the server 11 (e.g. operated by an operating system developer), and then deployed at a plurality of user devices, such as the user device 12.
FIG. 2 is a block diagram of an example system, indicated generally by the reference numeral 20. The system 20 comprises the server 11 and the user device 12. The server 11 may comprise a first model 22, which may be trained using a first dataset 21. The user device 12 may comprise a deployed first model 24, such that the deployed first model 24 receives user input 23 and provides an output based on the user input 23. The server 11 may deploy the trained first model 22 at the user device 12. The deployed first model 24 may be used at the user device 12 for providing one or more outputs. For example, the first model 22 may be trained for performing speech or voice recognition using the first dataset 21, which first dataset 21 may comprise a generic dataset with speech data from a large group of people. The deployed first model 24 may then be used for speech or voice recognition at the user device 12. As such, when user speech is received at the user input 23, the deployed first model 24 may be used for understanding the user speech, and providing outputs relating to the speech recognition.
It may be appreciated that although the examples provided herein generally relate to speech or voice recognition, the principles described herein can be applied to many other applications, such as image processing, facial recognition, fingerprint recognition, or the like.
In one example, deploying a model at a user device may comprise one or more of sending a model (e.g. a trained model) from a server (such as the server 11) to the user device (such as the user device 12), sending parameters of the model from the server to the user device, and/or activating a model at a user device such that the model may be used to perform one or more actions at the user device. For example, the server may deploy a model at a user device using direct or indirect communication between the server and the user device.
A machine learning (ML) practitioner (e.g. an entity controlling the server 11) may create ML models which work well on average across potential future users, by gathering generic representative data of a target population. Performance of such models may not be ideal for some users, for example, in the case of speech recognition, in the event that the voice of a particular user for whatever reason is significantly different from the majority of the users of the model. Different users may speak differently (e.g. they may have different voices or accents), such that a model trained with a generic dataset may not work well for some users who may speak significantly differently from at least a majority of the large group of people corresponding to the speech data of the generic dataset. Furthermore, as the model may be trained based on audio data from a variety of users who may have significantly different speech characteristics, the performance of the model may represent a compromise between many users, such that it is not tailored to any particular user or user group. In some examples, the model may be updated at a user device using data from a particular individual user. However, the volume of user data collected from an individual user may not be large enough to improve the performance of the model for that individual user.
FIG. 3 is a block diagram of a system, indicated generally by the reference numeral 30, in accordance with an example embodiment. The system 30 comprises a server 31, and the server 31 comprises the first dataset 21 and the first model 22, similar to the first dataset 21 and first model 22 of FIG. 2. The training of the first model 22 may be performed at the server 31 using the first dataset 21, or may be performed elsewhere (e.g. at a remote device or another server), such that the trained first model 22 is deployed at the server 31. The server 31 further comprises a selector module 32 used for determining a first subset 33 of the first dataset (i.e. the dataset 21), a personalization module 35, and an updated first model 36. The server 31 optionally may further comprise a second model 34 (described in further detail below with reference to FIGS. 5 to 9).
The system 30 may be configured to provide a model that may be more suitable for each individual user of the model, for example, for performing tasks such as speech recognition. The first dataset 21 may be used for training the first model 22, as described above with reference to FIG. 2. The first dataset 21 may be a generic dataset with data from a large population. The selector 32 may be used for selecting a first subset 33 of the first dataset 21 based, at least partially, on user feature data of one or more users, such that the first subset of data may be more relevant to the individual one or more users. The user features may be received from one or more user devices. FIG. 3 is described further below with reference to FIG. 4.
FIG. 4 is a flowchart of an algorithm, indicated generally by the reference numeral 40, in accordance with an example embodiment. At operation 41, data relating to one or more user features may be received at the server 31 from one or more user devices. The one or more user devices may be deployed with a first model, which may be similar to the first model 22 that is trained at the server 31. At operation 42, a first subset 33 of the first dataset 21 may be selected using the selector 32, such that the selection is based, at least partially, on the received data relating to said one or more user features.
At operation 43, the personalization module 35 may use data belonging to the first subset 33 for generating an updated first model 36. At operation 44, the updated first model 36 may be deployed at one or more user devices. For example, the updated first model 36 may be deployed by sending the updated first model 36, or a plurality of parameters of the updated first model 36 to the relevant user device(s). Alternatively, or in addition, in the event that the first model 22 is already deployed at a relevant user device, the updated first model 36 may be deployed at that user device by sending one or more updated parameters of the updated first model, which updated parameters are different from the parameters of the first model, or by sending information regarding one or more differences in the parameters of the updated first model 36 and the parameters of the first model 22.
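By way of illustration only, the following sketch shows one way such a parameter-based deployment might be implemented; the helper names (parameter_diff, apply_diff) and the dictionary representation of model parameters are assumptions, not part of the specification.

```python
# Hypothetical sketch: send only the parameters of the updated first model 36
# that differ from those of the deployed first model 24.
import numpy as np

def parameter_diff(base_params: dict, updated_params: dict) -> dict:
    """Server side: keep only the parameters that changed during the update."""
    return {
        name: value
        for name, value in updated_params.items()
        if name not in base_params or not np.array_equal(base_params[name], value)
    }

def apply_diff(deployed_params: dict, diff: dict) -> dict:
    """Device side: merge the received updated parameters into the deployed model."""
    merged = dict(deployed_params)
    merged.update(diff)
    return merged
```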
In an example embodiment, the one or more user features may characterize the speech of a user of a user device from which the data relating to the user features are received. For example, users from a particular geographical area may have similar accents, and therefore user features relating to these users may be similar. As such, when the data relating to the user features indicates that the user may be from the particular geographical area (e.g. residing in that area, ethnically related to that area, born or brought up in that area, etc.), the first subset 33 may be selected such that the first subset 33 comprises data relating to a plurality of users that may have some connection (e.g. residing in that area, ethnically related to that area, born or brought up in that area, etc.) with that particular geographical area. The selection of the first subset 33 may also be dependent on one or more other user features, such as gender or age of the user. When the updated first model 36 is generated specifically based on the first subset 33, the training of the updated first model 36 may be more relevant to this particular user, and may perform better for this user, compared to the performance of the first model 22 (e.g. a generic model).
The server 31 may be capable of separately selecting first subsets for different user devices, based on the user feature data received from the respective user devices.
Therefore, the server may generate different updated first models for different user devices. However, when similar user features are received from different user devices, the first subsets selected for the different user devices may be similar, and the updated first model deployed at the different user devices may be similar.
FIG. 5 is a flowchart of an algorithm, indicated generally by the reference numeral 50, in accordance with an example embodiment. The operations of algorithm 50 may optionally be performed at a server, such as the server 31, for example, before the operations of algorithm 40. At operation 51, the first model 22 may be trained (e.g. at the server 31 or elsewhere) using the first dataset 21. At operation 52, the server 31 may deploy the first model at a user device (e.g. a user device that sends the data relating to the user features, as discussed above with reference to the operation 41). At operation 53, the server 31 may train the second model 34, for example, using the first dataset 21. The second model 34 may be trained for determining, at a user device, user feature data based on user datasets. The user feature data is explained in further detail below. At operation 54, the server 31 may deploy the second model 34 at the user device. For example, the second model 34 may be deployed by sending the second model 34 to the user device, or by sending one or more parameters of the second model to the user device.
FIG. 6 is a block diagram of a system, indicated generally by the reference numeral 60, in accordance with an example embodiment. The system 60 comprises a user device 61, and the user device 61 may be in communication with a server, such as the server 31. The user device 61 may comprise the deployed first model 24 that may receive user inputs 23, and may provide outputs based on the user inputs 23 (in a similar manner to the user device 12 described above). The deployed first model 24 may be updated based on one or more parameters received, for example, from a server (such as the server 31). The user device 61 further comprises a second model 63, which may be configured to receive one or more user datasets 62 as inputs, and provide data relating to one or more user features as outputs. The second model 63 may be (or may be related to) the second model 34 described above, which model may be deployed in the operation 54 described above. FIG. 6 is described in further detail below with reference to FIG. 7.
FIG. 7 is a flowchart of an algorithm, indicated generally by the reference numeral 70, in accordance with an example embodiment. The operations of algorithm 70 may be performed at a user device, such as the user device 61. At operation 71, the first model 24 may be deployed at the user device 61. For example, the first model, or one or more parameters of the first model, may be received from a server, such as the server 31, or any other source. At operation 72, data relating to one or more user features may be generated using the second model 63, based, at least in part, on the user dataset 62. At operation 73, the generated data relating to the one or more user features may be sent to a server, such as the server 31. At operation 74, an updated first model is deployed at the user device 61, for example, by updating one or more parameters of the deployed first model 24 based on one or more received parameters of an updated first model.
In an example embodiment, the second model 63 is trained with a generic dataset (e.g. the first dataset 21) at the server 31 for determining user feature data based on user datasets 62. For example, the user dataset 62 may comprise speech data from the user, and the user feature data may comprise data relating to one or more user features that are specific to the user of the user device 61. The user features may be in the form of user vectors, such that when user speech is received as inputs, the second model 63 is able to determine user vectors that characterize the speech of the user (e.g. how the user sounds). As the second model 63 is trained with speech data from a large population, the second model may be able to determine a user vector that accurately characterizes the user's speech. The user feature data sent to the server 31 may be an average of a plurality of user vectors (e.g. in response to multiple data points of the user dataset 62). The generated user feature data, such as the user vectors, may be similar for users that speak similarly. For example, users in a particular geographical area may have similar accents, such that the user vectors may be similar for these users, or the difference between the user vectors may be below a threshold difference.
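A minimal sketch of this averaging step is given below; the embed method and the list-of-samples representation of the user dataset 62 are assumptions made for illustration.

```python
# Hypothetical sketch: produce the user feature data sent to the server 31 by
# averaging the user vectors that the second model 63 outputs for each sample
# in the user dataset 62.
import numpy as np

def user_feature_data(second_model, user_dataset):
    vectors = [second_model.embed(sample) for sample in user_dataset]
    return np.mean(vectors, axis=0)  # a single averaged user vector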
In an example embodiment, the server 31 may use the data relating to the one or more user features for selecting the first subset 33 of the first dataset (e.g. at operation 42), and then generating the updated first model 36 (e.g. at operation 43) based on the first dataset, such that the updated first model 36 may be deployed at the user device 61 (for example by sending one or more parameters of the updated first model 36 to the user device 61). In one example embodiment, the second model may comprise a function to convert the received user speech into user vectors. With reference to the above example regarding the geographical areas, a plurality of user vectors may be defined for a plurality of geographical areas respectively. A first user vector may be determined by processing raw data from the speech of a first user. The second model may be able to determine which of the plurality of user vectors is closest to the first user vector, in order to determine a geographical area related to the first user vector. In subsequent iterations, the second model may be trained further such that distance between user vectors of users from the same or similar geographical regions is decreased, while distance between user vectors of users from different geographical regions is increased.
In one example, the second model 34, 63 may comprise a neural network f(x, parameters), such that the neural network may receive input audio x as an input, and provide one or more features (e.g. an embedding vector f(x, parameters)) as an output in response to the received input. The output may characterize the user's speech. For example, the second model 34, 63 may be trained with the first dataset 21, such that the first dataset 21 may contain audio samples, where each audio sample may be organized as (x, y, r), such that x is an audio input, y is a ground truth label and r identifies a geographical region of a person that uttered the audio x (e.g. into a recording microphone when creating the audio sample for this first dataset 21). Alternatively, or in addition, r may identify gender, age or similar personal attributes which may impact how a person sounds.
In another example, the second model 34, 63 may be trained with input audio samples that are organized as (x, r), such that only the audio input x and geographical region r is used, and the ground truth label y may be omitted. Assuming that the geographical regions are not too small, it may be expected that, on average, most people from that geographical region speak similarly, or at least more similarly compared to people from another geographical location.
In order to train the second model 34, 63, an objective function may be designed to decrease the distance between f(x_i, parameters) and f(x_j, parameters) when the corresponding regions r_i and r_j are the same. Similarly, the objective function is further designed to increase the distance between f(x_i, parameters) and f(x_j, parameters) when the corresponding regions r_i and r_j are different.
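One plausible realization of such an objective is a contrastive loss over pairs of samples, sketched below; the squared-distance and margin terms are assumptions, since the specification only requires that same-region distances decrease and different-region distances increase.

```python
# Sketch of a contrastive objective: f(x, parameters) returns an embedding
# vector, and r_i, r_j are the region labels described above.
import numpy as np

def pairwise_objective(f, parameters, x_i, x_j, r_i, r_j, margin=1.0):
    d = np.linalg.norm(f(x_i, parameters) - f(x_j, parameters))  # L2 distance
    if r_i == r_j:
        return d ** 2                   # same region: pull embeddings together
    return max(0.0, margin - d) ** 2    # different region: push apart, up to a margin
```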
In one example, as described above, the trained second model 63 may be used for providing user features (e.g. user feature data 'z') that are used by the selector module 32 for determining the first subset 33. For example, when the server 31 receives the user feature data (e.g. output of the second model 63) from the user device 61, the selector 32 may analyse a plurality of elements of the first dataset 21. For each audio sample (x, y, r) in the first dataset 21, the distance between f(x, parameters) and f(z, parameters) may be computed, and one or more audio samples with distances below a threshold may be selected to form the first subset 33 of the first dataset. The first subset 33 may then be used in the personalization module 35 for generating the updated first model 36.
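A sketch of this selection step follows, mirroring the distance computation described above; the function name and the threshold parameter are illustrative assumptions.

```python
# Hypothetical sketch of the selector 32: keep audio samples of the first
# dataset 21 whose embedding lies within a threshold distance of the embedding
# of the received user feature data z.
import numpy as np

def select_first_subset(f, parameters, first_dataset, z, threshold):
    subset = []
    for x, y, r in first_dataset:  # samples organized as (x, y, r)
        if np.linalg.norm(f(x, parameters) - f(z, parameters)) < threshold:
            subset.append((x, y, r))
    return subset
```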
In an example embodiment, the user dataset 62 may be frequently updated. For example, when the deployed first model 24 is being used for speech or voice recognition, the user dataset 62 may be updated whenever the user speaks into the user device 61, such that data relating to the user's speech and/or user's voice may be stored in the user dataset 62. As the user dataset 62 is updated, the second model 63 may output user feature data corresponding to the updated user dataset 62, and, for example, may periodically send updated user feature data to the server 31. The server may then periodically update the first model 24 at the user device 61 (e.g. by sending appropriate parameters to the user device 61 in order to update the deployed first model).
FIG. 8 is a flowchart of an algorithm, indicated generally by the reference numeral 80, in accordance with an example embodiment. Algorithm 80 comprises operations 71 to 74 of algorithm 70, and further comprises operations 81 to 85 (some or all of which may be optional operations).
At operation 81, the user device 61 may receive parameters for the deployed first model 24 from the server 31. For example, the user device 61 may initially receive parameters of the first model 22 for deploying the first model 24 at the user device 61. Subsequently, when the updated first model 36 is generated at the server 31, the parameters of the updated first model 36 are sent to the user device 61, such that the user device 61 may receive updated parameters for updating the deployed first model 24.
At operation 71, the first model 24 is deployed at the user device, as described above with reference to FIG. 7.
At operation 82, the second model 63 may be deployed at the user device 61. For example, the second model 63 may be deployed by receiving a second model, such as the second model 34, or by receiving parameters of the second model 34. The second model 34 or the parameters of the second model 34 may be received at the user device 61 from the server 31. Alternatively, or in addition, the second model or parameters of the second model may be received from any other source, for example, a remote server that may send the second model to the server 31 (such that the second model 34 is trained at the server 31), and to the user device 61 (such that the second model 63 is deployed, and used at the user device 61). In an alternative arrangement, the second model may be pre-deployed and may or may not be updatable.
At operation 72, the second model 63 is used for generating data relating to one or more user features based, at least in part, on the user dataset 62 (as described above with reference to FIG. 7).
At operation 73, the generated user features may be sent, for example, to the server 31 (as described above with reference to FIG. 7).
At operation 83, updated parameters for the deployed first model 24 may be received from the server 31, which updated parameters may be determined at the server 31 based, at least partially, on the user feature data relating to the user of the user device 61.
At operation 84, an updated first model may be generated at the user device 61 based on the received updated parameters. The updated first model may then be deployed at the user device 61 at operation 74, as described above with reference to FIG. 7. Alternatively, or in addition, the updated first model may be received from the server 31, such that the updated first model need not be generated at the user device 61. The received updated first model may then be deployed at operation 74.
At operation 85, the deployed first model 24 may be used for performing speech recognition. For example, the user input 23 may comprise user speech, and the output of the first model 24 may be based on the recognition of the user speech (e.g. determining what the user is saying, determining one or more spoken commands, etc.).
The output may also comprise recognition of the user's voice, for example, for determining the identity of the user or authenticating the user based on the user's voice. Of course, as noted elsewhere, speech processing is only one of a number of example uses of the principles described herein.
In an example embodiment, the user features provided by the second model 63 comprise user vectors. For example, the user features may be derived from the user's speech. The user features may be unique to the user's speech (e.g. voice, accent, speed, pitch, etc.). As described above, the user vectors may be similar for users related to the same geographical region. Alternatively, or in addition, the user vectors may be similar for users with the similar gender, age, ways of using the user device (e.g. time spent on the user device, amount of speech provided to the user device, etc.), etc.

FIG. 9 is a block diagram of a system, indicated generally by the reference numeral 90, in accordance with an example embodiment. The system 90 comprises the server 31 and the user device 61, as described above. The system 90 provides an illustration of the communication between the server 31 and the user device 61 in accordance with an example implementation.
At the server 31, the first dataset 21 is used for training the first model 22 and optionally the second model 34, which are then deployed at the user device 61, as the deployed first model 24 and the second model 63 (e.g. a bio-extractor model) respectively. At the user device 61, the second model 63 is used for determining data relating to one or more user features (e.g. user vectors) based on a user dataset 62 (e.g. voice audio signals).
A purpose of the second model 63 may be to learn how to cluster similar samples (e.g. similarly sounding samples) from the user dataset, for example such that speakers (e.g. sound recordings in the first dataset 21 from the people that uttered them) that sound similar may produce user features (e.g. user vectors, or bio-features) which are closer (e.g. in the Euclidean norm (L2) or another norm) to each other, whereas speakers that sound different to each other produce user features (e.g. user vectors, or bio-features) which are farther away from each other. At the user device 61, the second model 63 may process the user dataset 62 (e.g. user speech in that dataset) and produce data relating to user features (e.g. when the user interacts using voice interaction with the user device 61) which characterize the user (e.g. how the user sounds).
The user features determined by the second model 63 are sent to the server 31. The user features may comprise vectors that are completely anonymized by construction, such that the privacy of the user of the user device 61 is protected from the server 31.
At the server 31, the received user feature data may be used at the selector 32 to select a first subset 33, where the first subset 33 may correspond to similar users from within the first dataset 21 to that defined by the user dataset 62 (e.g. audio data uttered by similarly speaking users from within the first dataset 21). The first subset 33 of data on the server (e.g. speech data that sound like the user but is not necessarily uttered by the user), can then be used to train and improve one or more personalized models, such as the updated first model 36, that are then sent to the user device 61 such that the previously deployed first model can be updated, with the intention of providing an improved user experience. Such personalized models may have a higher accuracy for the specific user whose user features are used for selecting subsets of the first dataset to be used for training the personalized models. This process can then be repeated over time, where the performance of the updated deployed first model 24 is expected to improve as the server augments its datasets, and as the user features generated on the user device 61 improve with more data generated in the user dataset 62 by the user on the user device 61.
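For orientation, the repeated process can be summarized by the sketch below, in which every helper name is hypothetical and stands in for a module described above:

```python
# One round of the personalization loop between the user device 61 and the
# server 31 (all method names are illustrative placeholders).
def personalization_round(device, server):
    z = device.generate_user_features()  # second model 63 over user dataset 62
    subset = server.select_subset(z)     # selector 32 produces first subset 33
    updated = server.fine_tune(subset)   # personalization module 35 -> model 36
    device.deploy(updated)               # update the deployed first model 24
    return updated
```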
In an example embodiment, the first model (22, 24, 36) may comprise a first machine learning model. Alternatively, or in addition, the second model (34, 63) may comprise a second machine learning model. The machine learning models may be trained using supervised or unsupervised training, as described in the example embodiments above.
FIG. 10 shows a neural network, indicated generally by the reference numeral 100, used in some example embodiments. For example, the first model (22, 24, 36) may comprise a machine learning model, such as the neural network 100. The neural network 100 may be trained with inputs, including the first dataset 21, and/or the first subset 33 of the first dataset 21. The training inputs may comprise speech data (e.g. audio signals) of a plurality of users, and speech recognition data (e.g. recognition of the spoken words in the audio signals). The training may be performed at one or more of the server 31 and the user device 61 (or elsewhere). The neural network 100 comprises an input layer 101, one or more hidden layers 102, and an output layer 103. During the usage of the neural network 100, at the input layer 101, user input 23 (e.g. audio signals from the user of the user device 61) may be received as inputs. The hidden layers 102 may comprise a plurality of hidden nodes, where the processing may be performed based on the received user input 23. At the output layer 103, one or more outputs (e.g. recognition of the spoken words in the audio signals) relating to the user input may be provided. The neural network 100 may be trained offline (e.g. pre-trained before starting the use of the model), and/or may be trained online (e.g. training may continue, and the neural network 100 may be updated based on new data).
FIG. 11 shows a neural network, indicated generally by the reference numeral 110, used in some example embodiments. For example, the second model (34, 63) may comprise a machine learning model, such as the neural network 110. The neural network 110 may be trained with inputs, including the first dataset 21. The training inputs may comprise speech data (e.g. audio signals) of a plurality of users, and user feature data (e.g. user vectors indicating a geographical area relating to the speech of the respective users). The training may be performed at one or more of the server 31 and the user device 61 (or elsewhere). The neural network 110 comprises an input layer 111, one or more hidden layers 112, and an output layer 113. During the usage of the neural network 110, at the input layer 111, the user dataset 62 (e.g. audio signals from the user of the user device 61) may be received as inputs. The hidden layers 112 may comprise a plurality of hidden nodes, where the processing may be performed based on the received user dataset 62. At the output layer 113, data relating to one or more user features (e.g. user vectors) relating to the user may be provided. The neural network 110 may be trained offline (e.g. pre-trained before starting the use of the model), and/or may be trained online (e.g. training may continue, and the neural network 110 may be updated based on new data).
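For illustration, a feed-forward network of the general shape shown in FIGS. 10 and 11 (an input layer, one hidden layer and an output layer) might be sketched as below; the layer dimensions are placeholders and the random weights stand in for trained parameters.

```python
# Toy forward pass: input layer -> hidden layer (ReLU) -> output layer.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((40, 64))  # input features -> hidden nodes
W2 = rng.standard_normal((64, 16))  # hidden nodes -> outputs

def forward(x):
    h = np.maximum(0.0, x @ W1)     # hidden layer (102 or 112) with ReLU
    return h @ W2                   # e.g. recognition scores (FIG. 10) or a user vector (FIG. 11)
```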
For completeness, FIG. 12 is a schematic diagram of components of one or more of the example embodiments described previously, which hereafter are referred to generically as a processing system 300. The processing system 300 may, for example, be the apparatus referred to in the claims below.
The processing system 300 may have a processor 302, a memory 304 closely coupled to the processor and comprised of a RAM 314 and a ROM 312, and, optionally, a user input 310 and a display 318. The processing system 300 may comprise one or more network/apparatus interfaces 308 for connection to a network/apparatus, e.g. a modem which may be wired or wireless. The interface 308 may also operate as a connection to other apparatus such as device/apparatus which is not network side apparatus. Thus, direct connection between devices/apparatus without network participation is possible.
The processor 302 is connected to each of the other components in order to control operation thereof.
The memory 304 may comprise a non-volatile memory, such as a hard disk drive (HDD) or a solid state drive (SSD). The ROM 312 of the memory 304 stores, amongst other things, an operating system 315 and may store software applications 316. The RAM 314 of the memory 304 is used by the processor 302 for the temporary storage of data. The operating system 315 may contain code which, when executed by the processor, implements aspects of the algorithms 40, 50, 70 and 80 described above. Note that, in the case of a small device/apparatus, a memory of small size may be most suitable, i.e. a hard disk drive (HDD) or a solid state drive (SSD) may not always be used.
The processor 302 may take any suitable form. For instance, it may be a microcontroller, a plurality of microcontrollers, a processor, or a plurality of processors.
The processing system 300 may be a standalone computer, a server, a console, or a network thereof. The processing system 300 and any needed structural parts may be located entirely inside a device/apparatus, such as an IoT device/apparatus, i.e. embedded in a very small form factor.
In some example embodiments, the processing system 300 may also be associated with external software applications. These may be applications stored on a remote server device/apparatus and may run partly or exclusively on the remote server device/apparatus. These applications may be termed cloud-hosted applications. The processing system 300 may be in communication with the remote server device/apparatus in order to utilize the software application stored there.
FIGS. 13A and 13B show tangible media, respectively a removable memory unit 365 and a compact disc (CD) 368, storing computer-readable code which when run by a computer may perform methods according to example embodiments described above. The removable memory unit 365 may be a memory stick, e.g. a USB memory stick, having internal memory 366 storing the computer-readable code. The internal memory 366 may be accessed by a computer system via a connector 367. The CD 368 may be a CD-ROM or a DVD or similar. Other forms of tangible storage media may be used.
Tangible media can be any device/apparatus capable of storing data/information which data/information can be exchanged between devices/apparatus/network.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, "computer-readable medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing circuitry" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices/apparatus and other devices/apparatus. References to computer program, instructions, code etc. should be understood to express software for a programmable processor, or firmware such as the programmable content of a hardware device/apparatus, whether as instructions for a processor or as configured or configuration settings for a fixed function device/apparatus, gate array, programmable logic device/apparatus, etc. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams and message sequences of Figures 4, 5, 7 and 8 are examples only and that various operations depicted therein may be omitted, reordered and/or combined.
It will be appreciated that the above described example embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present specification.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described example embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.

Claims (18)

1. An apparatus comprising means for performing: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
2. An apparatus as claimed in claim 1, wherein the means are further configured to perform: deploying a second model to the user device, wherein the second model is trained to generate the data relating to the one or more user features based on a user dataset.
3. An apparatus as claimed in claim 2, wherein the means are further configured to perform: training the second model.
4. An apparatus as claimed in claim 3, wherein the second model is trained using the first dataset.
5. An apparatus comprising means for performing: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
6. An apparatus as claimed in claim 5, wherein deploying the first model comprises receiving parameters of the first model from the server or receiving the first model from the server.
7. An apparatus as claimed in claim 5 or claim 6, wherein the means are further configured to perform: deploying the second model.
8. An apparatus as claimed in any one of claims 5 to 7, wherein the means are further configured to perform: generating the updated first model based on updated parameters received from the server.
9. An apparatus as claimed in any one of the preceding claims, wherein the first model is used for performing speech recognition.
10. An apparatus as claimed in any one of the preceding claims, wherein the data relating to user features comprises user vectors.
  11. An apparatus as claimed in any one of the preceding claims, wherein the one or more user features are derived from user speech.
12. An apparatus as claimed in any one of the preceding claims, wherein the first model comprises a first machine learning model.
13. An apparatus as claimed in any one of the preceding claims, wherein the means comprise: at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured, with the at least one processor, to cause the performance of the apparatus.
14. A method comprising: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
15. A computer program comprising instructions for causing an apparatus to perform at least the following: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device.
16. A method comprising: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
17. A computer program comprising instructions for causing an apparatus to perform at least the following: deploying a first model, wherein the first model is trained using a first dataset; generating data relating to one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying an updated first model, wherein the updated first model is updated, at the server, based on a subset of the first dataset, and the subset of the first dataset is selected at the server based on the data relating to said one or more user features.
18. A system comprising: a server, wherein the server comprises means for performing: receiving data relating to one or more user features from a user device, wherein the user device comprises a first model, wherein the first model is trained with a first dataset; selecting a first subset of the first dataset based on the received data relating to said one or more user features; generating an updated first model using the first subset of the first dataset; and deploying the updated first model to the user device; and at least one user device, wherein the at least one user device comprises means for performing: deploying the first model; generating said data relating to the one or more user features using a second model based on a user dataset; sending the data relating to said one or more user features to a server; and deploying the updated first model.
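By way of non-limiting illustration only, the following is a minimal Python sketch of the flows recited in claims 1 and 5. It assumes that the data relating to user features is a single user vector (claim 10), that the second model acts as a fixed feature extractor, and that generating the updated first model means fine-tuning a simple logistic-regression classifier on the selected subset; the helper names extract_user_vector, select_subset and finetune are hypothetical and do not appear in the specification.

import numpy as np

rng = np.random.default_rng(0)

# First dataset held at the server: (feature vector, label) pairs.
first_dataset = [(rng.normal(size=8), rng.integers(0, 2)) for _ in range(500)]

def extract_user_vector(user_dataset):
    """Device side (claims 5 and 10): the second model summarises the user
    dataset as a single user vector, which is sent to the server."""
    return np.mean(np.stack(user_dataset), axis=0)

def select_subset(dataset, user_vector, k=100):
    """Server side (claim 1): select the k samples of the first dataset whose
    feature vectors lie closest to the received user vector."""
    dists = np.array([np.linalg.norm(x - user_vector) for x, _ in dataset])
    return [dataset[i] for i in np.argsort(dists)[:k]]

def finetune(weights, subset, lr=0.1, epochs=5):
    """Server side (claim 1): generate an updated first model by a few steps
    of logistic-regression gradient descent on the selected subset."""
    X = np.stack([x for x, _ in subset])
    y = np.array([t for _, t in subset])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ weights))   # sigmoid predictions
        weights -= lr * X.T @ (p - y) / len(y)   # gradient step
    return weights

# End-to-end flow: the device sends a user vector, the server selects a
# subset, fine-tunes, and deploys the updated parameters back to the device.
user_dataset = [rng.normal(loc=0.5, size=8) for _ in range(20)]
user_vector = extract_user_vector(user_dataset)             # device -> server
subset = select_subset(first_dataset, user_vector)
updated_weights = finetune(np.zeros(8), subset)
print("updated first model parameters:", updated_weights)   # server -> device

Note that in this sketch only the aggregate user vector, rather than the raw user dataset, leaves the device, and only updated parameters are returned, corresponding to the option recited in claim 8.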
GB1916021.7A 2019-11-04 2019-11-04 Personalized models Active GB2588689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1916021.7A GB2588689B (en) 2019-11-04 2019-11-04 Personalized models


Publications (3)

Publication Number Publication Date
GB201916021D0 GB201916021D0 (en) 2019-12-18
GB2588689A true GB2588689A (en) 2021-05-05
GB2588689B GB2588689B (en) 2024-04-24

Family

ID=69059076

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1916021.7A Active GB2588689B (en) 2019-11-04 2019-11-04 Personalized models

Country Status (1)

Country Link
GB (1) GB2588689B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6073096A (en) * 1998-02-04 2000-06-06 International Business Machines Corporation Speaker adaptation system and method based on class-specific pre-clustering training speakers
US20170194006A1 (en) * 2015-07-22 2017-07-06 Google Inc. Individualized hotword detection models
US20170256254A1 (en) * 2016-03-04 2017-09-07 Microsoft Technology Licensing, Llc Modular deep learning model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
M Padmanabhan et al., "Speaker clustering and transformation for speaker adaptation in speech recognition systems", IEEE Transactions on Speech and Audio Processing, vol. 6, no. 1, pp. 71-77, 1998, available from: https://api.semanticscholar.org/CorpusID:16318740, [accessed 24 April 2020] *

Also Published As

Publication number Publication date
GB201916021D0 (en) 2019-12-18
GB2588689B (en) 2024-04-24

Similar Documents

Publication Publication Date Title
US11170788B2 (en) Speaker recognition
CN108269569B (en) Speech recognition method and device
JP7023934B2 (en) Speech recognition method and equipment
CN106688034B (en) Text-to-speech conversion with emotional content
CN110473526B (en) Device and method for personalizing voice recognition model and electronic device
WO2019102884A1 (en) Label generation device, model learning device, emotion recognition device, and method, program, and storage medium for said devices
CN105810193B (en) Method and apparatus for training language model and method and apparatus for recognizing language
US9412361B1 (en) Configuring system operation using image data
US12014724B2 (en) Unsupervised federated learning of machine learning model layers
WO2018218081A1 (en) System and method for voice-to-voice conversion
CN112071330B (en) Audio data processing method and device and computer readable storage medium
KR102216160B1 (en) Apparatus and method for diagnosing disease that causes voice and swallowing disorders
JP2016110082A (en) Language model training method and apparatus, and speech recognition method and apparatus
JP2019514046A (en) System and method for speech recognition in noisy unknown channel conditions
US20160034811A1 (en) Efficient generation of complementary acoustic models for performing automatic speech recognition system combination
WO2016188593A1 (en) Speech recognition system and method using an adaptive incremental learning approach
US11670299B2 (en) Wakeword and acoustic event detection
US11132990B1 (en) Wakeword and acoustic event detection
Badino et al. Integrating articulatory data in deep neural network-based acoustic modeling
CN111081230A (en) Speech recognition method and apparatus
GB2607133A (en) Knowledge distillation using deep clustering
Ons et al. Fast vocabulary acquisition in an NMF-based self-learning vocal user interface
JP2021157145A (en) Inference device and learning method of inference device
GB2588689A (en) Personalized models
KR20200144366A (en) Generating trigger recognition models for robot