WO2014096506A1 - Method, apparatus, and computer program product for personalizing speech recognition - Google Patents


Info

Publication number
WO2014096506A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
recognition model
speech
received
model
Application number
PCT/FI2012/051285
Other languages
French (fr)
Inventor
Yongbeom PAK
Original Assignee
Nokia Corporation
Application filed by Nokia Corporation filed Critical Nokia Corporation
Priority to PCT/FI2012/051285 priority Critical patent/WO2014096506A1/en
Publication of WO2014096506A1 publication Critical patent/WO2014096506A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G10L15/07 - Adaptation to the speaker
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • An example embodiment of the present invention relates generally to speech recognition, and more particularly, to a method, apparatus and computer program product for personalizing speech recognition.
  • Speech recognition may be used to control a wide variety of devices, such as wireless phones, cars, household appliances, and other devices used in everyday life or work.
  • Speech recognition, which may be referred to as automatic speech recognition (ASR), may rely on one or more speech recognition models (SRMs).
  • acoustic models and language models can be fused together, or otherwise may be combined.
  • These SRMs are the building blocks for words and strings of words, such as phrases or sentences, and are used by a device to process speech input (e.g., recognize the speech input and derive a machine readable interpretation).
  • a speech recognition processor may receive speech samples and then may match those samples with the basic sound units in the acoustic model.
  • the speech recognition processor then may, for example, calculate the most likely words from the SRM based on the matched basic sound units, such as by using Hidden Markov Models (HMMs) and/or dynamic time warping (DTW).
  • HMM and DTW are examples of statistical models that describe speech patterns probabilistically.
  • various neural networks (NNs) and/or finite state transducers (FSTs) may also be used as SRMs.
  • Other suitable models can also be used as SRMs.
  • an unknown speech pattern is compared with known reference patterns.
  • the speech pattern is divided into several frames, and the local distance between the speech segment included in each frame and the corresponding speech segment of the reference pattern is calculated. This distance is obtained by comparing the two speech segments with each other, and it is thus a numerical measure of the differences found in the comparison. For speech segments close to each other, a smaller distance is usually obtained than for speech segments further from each other. On the basis of the local distances obtained this way, a minimum path between the beginning and end points of the word is sought by using a DTW algorithm. Thus, by DTW, a distance is obtained between the uttered word and the reference word.
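  • As an illustration of the dynamic programming just described, the following minimal sketch computes a DTW distance between an uttered pattern and a reference pattern, each given as a sequence of per-frame feature vectors. The Euclidean local distance and the unconstrained alignment path are assumptions made for illustration; the description does not prescribe them.

```python
import numpy as np

def dtw_distance(pattern, reference):
    """Accumulate local frame distances along the minimum-cost path
    between an uttered pattern and a reference word pattern."""
    n, m = len(pattern), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between a frame and the corresponding reference segment.
            local = np.linalg.norm(np.asarray(pattern[i - 1]) - np.asarray(reference[j - 1]))
            cost[i, j] = local + min(cost[i - 1, j],       # stretch the pattern
                                     cost[i, j - 1],       # stretch the reference
                                     cost[i - 1, j - 1])   # advance both
    return cost[n, m]

# The reference word yielding the smallest DTW distance is taken as the match.
```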
  • an HMM is first formed for each word to be recognized (e.g., for each reference word).
  • an observation probability is calculated for each HMM in memory, and as the recognition result, a counterpart word is obtained for the HMM with the greatest observation probability.
  • for each reference word, the probability is calculated that it is the word uttered by the speaker.
  • the above-mentioned observation probability describes the resemblance between the received speech pattern and the closest HMM (e.g., the closest reference speech pattern).
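  • A compact sketch of this scoring step, assuming discrete observation symbols and per-word HMM parameter tables (the names start, trans, and emit are hypothetical; practical recognizers work in log space with continuous densities):

```python
import numpy as np

def observation_probability(obs, start, trans, emit):
    """Forward algorithm: probability that this word's HMM produced the
    observation sequence obs (a list of symbol indices).
    start: (S,) initial state probabilities; trans: (S, S) transitions;
    emit: (S, V) per-state symbol emission probabilities."""
    alpha = start * emit[:, obs[0]]
    for symbol in obs[1:]:
        alpha = (alpha @ trans) * emit[:, symbol]
    # For long utterances, scaling would be needed to avoid underflow.
    return alpha.sum()

# Recognition result: the reference word whose HMM scores highest, e.g.
# best_word = max(models, key=lambda w: observation_probability(obs, *models[w]))
```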
  • the reference words, or word candidates, can be further weighted by the language models.
  • the recognition process can occur in a single pass-through mode with fused acoustic models and language models.
  • in a neural network (NN), interconnecting data nodes store information regarding speech patterns.
  • the nodes of the NN may be used to classify phonetic features of speech input, and may be configured so as to focus on portions of the model that may be most valuable in distinguishing words during speech recognition processes.
  • a well-designed NN will therefore minimize, in some examples, the processing time required to recognize speech inputs.
  • NNs are particularly well suited for training of larger data sets, such as data sets representing natural language.
  • in an FST-based model, speech inputs may be processed, various operations may be performed on the speech input, and a most probable output (e.g., a recognized word) may be selected.
  • FSTs may be particularly beneficial, in some examples, in phonological analysis. The reusability and flexibility of algorithms performed on FSTs make FSTs particularly useful in combining portions of SRMs, or various SRMs with one another.
  • An SRM may therefore incorporate speech recognition data from various sources, apply weights to the speech recognition data, and generate weighted FSTs for use in speech recognition tasks.
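  • A toy illustration of that fusion idea, assuming each source contributes weighted arcs keyed by state (the trust factor and the dictionary layout are inventions of this sketch, not the representation used in the description):

```python
def merge_weighted_fsts(sources):
    """Combine arc sets from several SRM portions into one weighted FST,
    scaling each source's arc weights by a per-source trust factor.
    sources: iterable of (trust, arcs), where arcs maps
    state -> [(input_sym, output_sym, weight, next_state), ...]."""
    merged = {}
    for trust, arcs in sources:
        for state, edges in arcs.items():
            bucket = merged.setdefault(state, [])
            for sym_in, sym_out, weight, nxt in edges:
                bucket.append((sym_in, sym_out, weight * trust, nxt))
    return merged
```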
  • the various types of SRMs may include speaker independent SRMs and speaker dependent SRMs.
  • Speaker independent SRMs may comprise averages of language and acoustic models collected from a large sample of users.
  • a speaker dependent SRM may be specific to the user and may be adapted by the user through training. Initial training may be performed during a first use of the SRM, and training may continue during normal use of the SRM.
  • a speaker dependent SRM comprises unique sets of electronic characteristics for the acoustic model and a unique language model for the words formed from combinations of unique basic sound units.
  • SRMs used by a device to process speech input may therefore rely on any combination of the HMM, DTW, NN, FST, and other models, as well as a blend of speaker dependent SRMs and speaker independent SRMs.
  • a method, apparatus, and computer program product are provided for personalizing a speech recognition model (SRM).
  • a method is provided for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
  • the method may further include processing received speech input using the speech recognition model, and generating a textual output.
  • the method may further include receiving a speech input, and refining a speaker dependent speech recognition model based on the speech input.
  • the method may further include verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
  • the method may further include causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
  • the terminal dependent data may comprise microphone information and/or a context.
  • the received at least one portion of a speech recognition model is received based on at least one of an individual user, a group of users, a geographic location, or a dialect.
  • the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
  • An additional method is provided including receiving at least one portion of a speaker dependent speech recognition model from a user terminal and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
  • the method may further include causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
  • Generating the at least one additional portion of the speech recognition model may be further based on at least one of an individual user, a group of users, a geographic location, or a dialect.
  • the at least one additional portion of the speech recognition model may be based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
  • An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
  • An additional apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speaker dependent speech recognition model from a user terminal, and generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
  • a computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
  • An additional computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speaker dependent speech recognition model from a user terminal, and generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
  • An apparatus comprising means for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
  • An additional apparatus comprising means for receiving at least one portion of a speaker dependent speech recognition model from a user terminal, and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
  • Figure 1 is a block diagram of a personalized speech recognition apparatus in communication with user terminals which may be configured to implement example embodiments of the present invention.
  • Figure 2 is a flowchart illustrating operations to receive and adapt an SRM on a user terminal, in accordance with one embodiment of the present invention.
  • Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a speech personalization apparatus, in accordance with one embodiment of the present invention.
  • Figure 4 is a display for training an SRM, in accordance with one embodiment of the present invention.
  • circuitry refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present.
  • This definition of 'circuitry' applies to all uses of this term herein, including in any claims.
  • the term 'circuitry' also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware.
  • the term 'circuitry' as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
  • personalized speech recognition apparatus 102 may include or otherwise be in communication with processor 20, user interface 22, communication interface 24, memory device 26, and speech personalization administrator 28.
  • Personalized speech recognition apparatus 102 may be embodied by a wide variety of devices, including mobile terminals (e.g., mobile telephones, smartphones, tablet computers, laptop computers, or the like), computers, workstations, servers, or the like, and may be implemented as a distributed system or a cloud based entity.
  • the personalized speech recognition apparatus 102 may receive and/or transmit SRMs, as well as generate additional SRMs that may be adaptable by one or more user terminals.
  • An SRM is a statistical model that describes speech patterns probabilistically, and may include a language model (words) and an acoustic model (basic sound units).
  • Example SRMs include the HMM, DTW, NN, and FST models.
  • An SRM may be provided to a user terminal to enable speech recognition capabilities (e.g., processing of input speech) on the user terminal.
  • transmittal of an SRM may include transmittal of a portion of the SRM, since an SRM in its entirety may be too large for practical transmission (and an SRM portion may also be considered an SRM).
  • the SRM portion may be incorporable into an SRM, so that the portion may then be incorporated with another portion of an SRM to provide a complete or fully functioning SRM. It will therefore be appreciated that any reference to an SRM herein may indicate a portion or portions of an SRM, but for simplicity may be referred to as an SRM.
  • the SRMs may incorporate speaker independent data, speaker dependent data, and/or terminal dependent data.
  • the speaker independent data may include averaged, normalized, or otherwise consolidated language and acoustic models collected from a large sample of users.
  • the speaker dependent data may alternatively be biased toward a particular individual, or group of users, such as a group of users speaking a particular language or dialect, or from a particular geographic region.
  • the speaker dependent data may be generated and/or refined on a user terminal by training the SRM.
  • the speaker dependent data may be generated or refined on one or more user terminals and/or devices, such that it may be shared, via the personalized speech recognition apparatus 102, between the one or more user terminals and/or devices.
  • training may include, but is not limited to, providing speech input to the user terminal, potentially updating and/or verifying the processing of the speech input, and updating the SRM accordingly.
  • the training may include the explicit dictation of special training data by a speaker, and/or implicit training through the general use of the user terminal.
  • various models, such as an HMM, may be constructed for each speaker dependent SRM to be stored.
  • a speaker dependent SRM incorporating the speaker dependent data may be communicated from the user terminal to the personalized speech recognition apparatus 102.
  • the terminal dependent data may include information regarding the user terminal itself, such as characteristics of the microphone on the user terminal to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the device), or any settings of the user terminal 110A that could impact the processing of speech input.
  • An SRM received from the personalized speech recognition apparatus 102 may be adapted on the user terminal based on the terminal dependent data, so that the particular user terminal may more accurately process speech inputs.
  • Speaker dependent SRMs including speaker dependent data may be stored on personalized speech recognition apparatus 102.
  • the speaker dependent SRM, or a portion thereof, may be further modified and/or transmitted to another device to allow that device to benefit from the speaker dependent data, thereby improving the probability of successful speech recognition on the other user terminal.
  • one or more user terminals may access or otherwise download the speaker dependent model for the purposes of providing personalized speech recognition.
  • the personalized speech recognition apparatus 102 may receive updates to the speaker dependent model.
  • the personalized speech recognition apparatus 102 may therefore further tune or otherwise modify the speaker dependent model.
  • the processor 20 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 20) may be in communication with the memory device 26 via a bus for passing information among components of the personalized speech recognition apparatus 102.
  • the memory device 26 may include, for example, one or more volatile and/or non-volatile memories.
  • the memory device 26 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 20).
  • the memory device 26 may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present invention.
  • the memory device 26 could be configured to store various SRMs, including speaker independent and speaker dependent portions.
  • the speaker dependent data may be associated with a particular user or group of users, enabling the processor 20 to identify and provide appropriate SRMs to various devices.
  • the memory device 26 could be configured to buffer input data for processing by the processor 20, and/or to store instructions for execution by the processor 20.
  • the personalized speech recognition apparatus 102 may, in some embodiments, be embodied in various devices as described above. However, in some embodiments, the personalized speech recognition apparatus 102 may be embodied as a chip or chip set.
  • the personalized speech recognition apparatus 102 may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard).
  • the structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon.
  • the personalized speech recognition apparatus 102 may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single "system on a chip."
  • a chip or chipset may constitute means for performing one or more operations described herein for personalizing speech recognition in devices.
  • the processor 20 may be embodied in a number of different ways.
  • the processor 20 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
  • the processor 20 may include one or more processing cores configured to perform independently.
  • a multi-core processor may enable multiprocessing within a single physical package.
  • the processor 20 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
  • the processor 20 may be configured to execute instructions stored in the memory device 26 or otherwise accessible to the processor 20.
  • such instructions may provide for the retrieval, transmittal, and/or processing of SRMs, including generating additional SRMs based on received updated speaker dependent SRMs.
  • the processor 20 may be configured to execute hard coded functionality.
  • the processor 20 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention, such as the personalization of SRMs.
  • when the processor 20 is embodied as an ASIC, FPGA, or the like, the processor 20 may be specifically configured hardware for conducting the operations described herein.
  • when the processor 20 is embodied as an executor of software instructions, the instructions may specifically configure the processor 20 to perform the algorithms and/or operations described herein when the instructions are executed.
  • the processor 20 may be a processor of a specific device (e.g., a user terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 20 by instructions for performing the algorithms and/or operations described herein.
  • the processor 20 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 20.
  • the communication interface 24 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the personalized speech recognition apparatus 102.
  • the communication interface 24 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network, for transmitting and receiving SRMs to and from remote devices. Additionally or alternatively, the communication interface 24 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s).
  • the communication interface 24 may alternatively or also support wired communication.
  • the communication interface 24 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
  • the personalized speech recognition apparatus 102 may include a user interface 22 that may, in turn, be in communication with the processor 20 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user.
  • the user interface 22 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms.
  • the processor 20 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like.
  • the processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 20 (e.g., memory device 26, and/or the like).
  • processor 20 may be embodied as, include, or otherwise control a speech personalization administrator 28 for providing personalized speech recognition.
  • the speech personalization administrator 28 may be embodied as various means, such as circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (for example, memory device 26) and executed by a processing device (for example, processor 20), or some combination thereof.
  • Speech personalization administrator 28 may be capable of communication with one or more of the processor 20, memory device 26, user interface 22, and communication interface 24. As such, the speech personalization administrator 28 may be configured to generate additional SRMs, adaptable by a variety of user terminals and that may be based on speaker dependent SRMs, as described above and in further detail hereinafter.
  • User terminal 110 may be embodied as a mobile terminal, such as a personal digital assistant (PDA), pager, mobile television, mobile telephone, gaming device, laptop computer, tablet computer, camera, camera phone, video recorder, audio/video player, radio, global positioning system (GPS) device, navigation device, or any combination of the aforementioned, and other types of devices capable of providing speech recognition.
  • the user terminal 110 need not necessarily be embodied by a mobile device and, instead, may be embodied in a fixed device, such as a computer, workstation, or home appliance, such as a coffee maker. Additionally or alternatively, user terminal(s) 110 may be embodied in a vehicle, or any other machine or device capable of processing voice commands.
  • user terminal 110A is illustrated in further detail, but it will be appreciated that any of the user terminals 110, such as user terminal 110B, may be configured as illustrated in and described with respect to user terminal 110A.
  • the user terminal 110 may therefore include or otherwise be in communication with processor 120, user interface 122, communication interface 124, and memory device 126.
  • the processor 120 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 120) may be in communication with the memory device 126 via a bus for passing information among components of the user terminal 110.
  • the memory device 126 may include, for example, one or more volatile and/or non-volatile memories.
  • the memory device 126 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 120).
  • the memory device 126 may be configured to store information, data, content, applications, instructions, or the like for enabling the user terminal to carry out various functions in accordance with an example embodiment of the present invention.
  • the memory device 126 could be configured to store SRMs, instructions for adapting SRMs with terminal dependent data, and instructions for training SRMs with speaker dependent data. Memory device 126 may therefore buffer input data for processing by the processor 120. Additionally or alternatively, the memory device 126 could be configured to store instructions for execution by the processor 120.
  • the processor 120 may be embodied in a number of different ways.
  • the processor 120 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a DSP, a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an ASIC, an FPGA, an MCU, a hardware accelerator, a special-purpose computer chip, or the like.
  • the processor 120 may include one or more processing cores configured to perform independently.
  • a multi-core processor may enable multiprocessing within a single physical package.
  • the processor 120 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
  • the processor 120 may be configured to execute instructions stored in the memory device 126 or otherwise accessible to the processor 120.
  • the processor 120 may be configured to adapt an SRM advantageously to the user terminal, based on terminal dependent data, such as microphone information and context, so that the SRM may account for variances across user terminals.
  • the user terminal(s) 110 may include means, such as a processor 120, for training the SRM with speech input, to generate and/or refine a speaker dependent SRM that may improve speech input processing on the user terminal (and subsequently, other user terminals).
  • the processor 120 may be configured to execute hard coded functionality.
  • the processor 120 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly.
  • when the processor 120 is embodied as an ASIC, FPGA, or the like, the processor 120 may be specifically configured hardware for conducting the operations described herein.
  • when the processor 120 is embodied as an executor of software instructions, the instructions may specifically configure the processor 120 to perform the algorithms and/or operations described herein, such as adaptation and training of SRMs and processing of speech input (e.g., using the SRMs to convert speech to text), when the instructions are executed.
  • the processor 120 may be a processor of a specific device (e.g., a mobile terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 120 by instructions for performing the algorithms and/or operations described herein.
  • the processor 120 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 120.
  • the communication interface 124 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the user terminal 110.
  • the communication interface 124 may be specifically configured for transmitting and receiving SRMs to and from the personalized speech recognition apparatus 102.
  • the communication interface 124 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface 124 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s).
  • the communication interface 124 may alternatively or also support wired communication for communication of SRMs.
  • the communication interface 124 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
  • the user terminal 110 may include a user interface 122 that may, in turn, be in communication with the processor 120 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user.
  • the user interface 122 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms.
  • the user interface 122 may therefore be configured to receive speech input, such as, via a microphone, for the purposes of speech recognition and/or training of an SRM.
  • the processor 120 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like.
  • the processor 120 and/or user interface circuitry comprising the processor 120 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 120 (e.g., memory device 126, and/or the like).
  • Network 100 may be embodied in a local area network, the Internet, any other form of a network, or in any combination thereof, including proprietary private and semi-private networks and public networks.
  • the network 100 may comprise a wire line network, wireless network (e.g., a cellular network, wireless local area network, wireless wide area network, some combination thereof, or the like), or a combination thereof, and in some example embodiments comprises at least a portion of the Internet.
  • the network 100 may be used for transmitting speaker dependent data and/or SRMs to and from devices.
  • a user terminal 110 may be directly coupled to and/or may include a personalized speech recognition apparatus 102.
  • Referring now to Figure 2, the operations for receiving and adapting an SRM on a user terminal are outlined in accordance with one example embodiment of the present invention.
  • the operations of Figure 2 may be performed by the user terminal 110A, user terminal 110B, and/or the like, for example.
  • the user terminal 110A may include means, such as the processor 120, communication interface 124, or the like, for receiving at least one portion of an SRM, wherein the at least one portion of an SRM is stored remotely and is adaptable by one or more user terminals to process input speech.
  • the user terminal 110A may receive at least one portion of an SRM from the personalized speech recognition apparatus 102, for example, including any combination of the HMM, DTW, NN, and FST models, as described above.
  • the at least one portion of an SRM may also include any combination of speaker independent data and/or speaker dependent data, and may be adaptable by the user terminal 110A to process speech input (e.g., perform speech recognition tasks).
  • the adaptation is described in further detail with respect to operation 210.
  • a user of user terminal 110A may provide logon credentials or the like, via user interface 122, communication interface 124, and/or network 100, to the personalized speech recognition apparatus 102.
  • the user terminal 110A may check for updates by communicating with the personalized speech recognition apparatus 102, and receive an SRM or portion thereof if an update is available.
  • an update may be available if a user updated the SRM, based on training, verification, or the like, on another device, such as user terminal 110B.
  • the user terminal 110A may download an SRM or portion thereof for the first time (such as during initial device setup, or factory reset), or the newly received SRM or portion thereof may include updates compared to a previous version used by user terminal 110A.
  • receipt of the SRM or portion thereof by the user terminal 110A may occur during scheduled update routines that may be unobtrusive to or unnoticed by a user. That is, the synchronization may occur seamlessly as a background system update.
  • a request for an SRM or portion thereof may be explicitly initiated on the user terminal 110A (such as by logging onto the personalized speech recognition apparatus 102 and requesting an update).
  • an update may also be initiated by the personalized speech recognition apparatus 102. For example, a user may be automatically notified that an update is available, such as by Short Message Service (SMS), so as to confirm that they would like to receive the at least one portion of an SRM on the user terminal 110A.
  • the user terminal 110A may therefore receive at least one portion of an SRM associated with the individual user (such as identified with the logon credentials). Additionally or alternatively, the SRM or portion thereof may be identified by the personalized speech recognition apparatus by other means. For example, a user of a device may provide a geographic location, via a global positioning system (GPS) device and/or manual indication of a location, for example. The user terminal 110A may therefore receive an SRM based on a geographic location and/or dialect. Having received at least one portion of an SRM, as described with respect to operation 200, the user terminal 110A may include means, such as the processor 120, for accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model.
  • the received at least one portion of an SRM may be a complete SRM, and may therefore be stored on memory device 126, and accessed by the processor 120.
  • the processor 120 may incorporate the at least one portion of an SRM to form a complete SRM.
  • the SRM may be stored and accessed on memory device 126, for example.
  • the user terminal 110A may include means, such as the processor 120, for adapting the SRM based on terminal dependent data.
  • the terminal dependent data may include information regarding the user terminal 110A itself, such as characteristics of the microphone on the user terminal 110A to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the user terminal), or any settings of the user terminal 110A that could impact the processing of speech input.
  • the processor 120 may therefore utilize the terminal dependent data in adapting the SRM for use on the user terminal 110A.
  • microphone information may be retrieved from memory device 126, or read from a microphone component of user interface 122 by processor 120, for example.
  • the microphone information may include any information relating to the microphone that may impact how speech input is recognized and/or processed according to the SRM.
  • the microphone information may comprise a microphone model identifier, or orientation of the microphone within the device.
  • the microphone may additionally or alternatively be characterized by its transduction type, such as condenser and/or dynamic, for example.
  • the user terminal 110A, using the processor 120, may therefore adapt the SRM according to microphone information to account for acoustic, phonetic, and/or other variances between microphones. For example, calculations in a DTW model may be consistently modified throughout, so that the user terminal 110A may accurately interpret sounds captured by the microphone.
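  • One plausible way to realize such a microphone-dependent modification (assumed here for illustration; the description does not specify a method) is a channel compensation applied to the feature frames before matching, such as cepstral mean subtraction, which cancels a stationary microphone coloration:

```python
import numpy as np

def compensate_channel(frames):
    """Subtract the per-coefficient mean across an utterance so that a
    stationary microphone/channel response no longer biases the distance
    calculations (e.g., in a DTW model).
    frames: array of shape (num_frames, num_coefficients)."""
    frames = np.asarray(frames, dtype=float)
    return frames - frames.mean(axis=0)
```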
  • the user terminal 110A may adapt the SRM based on the context of the user terminal.
  • Use of an SRM by a speaker phone in a vehicle may be subject to background noise, such as wind, and/or radio or other device interference.
  • the processor 120 of user terminal 110A may therefore adapt the received SRM, which in its previous state may not have accounted for such background noises, accordingly.
  • Information regarding the context or use of the user terminal 110A may be explicitly retrieved from memory device 126, for example, and/or derived from various components of the user terminal 110A, allowing processor 120 to adapt the SRM based on what contexts the user terminal 110A will most likely be used in.
  • Settings configuring various components of the user terminal 110A may be considered by the processor 120 in adapting the SRM for the user terminal 110A.
  • the settings may affect the adaptation of the SRM, and/or cause the processor 120 to adjust the settings of the user terminal 110A to tailor the device for use of the SRM.
  • An adapted SRM may be stored on memory device 126, for example.
  • the user terminal 110A may include means, such as the user interface 122, communication interface 124, and/or processor 120, for receiving a speech input.
  • the speech input may be provided by a user to user terminal 110A by using a microphone of user interface 122, for example.
  • the user terminal 110A may receive a speech input through everyday use of the user terminal and may process the speech to generate text.
  • the user terminal 110A may process received speech input using the SRM, and generate a textual output.
  • the processor 120 may process the speech input according to the SRM.
  • the processor 120 may calculate observation probabilities for the speech input based on the SRM that includes one or more HMM, DTW, NN, or FST models, for example.
  • the processor 120 may identify a reference word with the highest probability when compared to other reference words, a threshold or the like. Based on those probabilities, the processor may then select or otherwise generate the speech recognition result (e.g. a text output).
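  • In sketch form, that selection might look as follows (the names and the single scalar threshold are illustrative assumptions, not taken from the description):

```python
def pick_recognition_result(word_scores, threshold):
    """Return the reference word with the highest probability, or None
    when no word clears the threshold; the None case is where a terminal
    might flag the input for correction (compare indication 400 in Figure 4)."""
    word, score = max(word_scores.items(), key=lambda item: item[1])
    return word if score >= threshold else None
```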
  • the user terminal 110A may include means, such as the user interface 122, communication interface 124 and/or processor 120, for verifying or correcting a processing of the speech input.
  • the verification or correction could be received explicitly by a user input to the user terminal 110A, or implicitly by everyday use of the user terminal 110A.
  • the user terminal 110A may be configured to receive an explicit correction of a processed speech input.
  • in an application that uses speech recognition to prefill dictated words in a draft email message, for example, the interpretation of the speech input may be incorrect.
  • the user may correct a misinterpreted word(s) by selecting the misinterpreted word, and typing the corrected word in its place. See Figure 4.
  • a user interface 122 may display an indication 400 of a word, such as a word that is misinterpreted during the processing of input speech.
  • indication 400 may be provided by the user terminal 1 1 OA in scenarios such as those in which the SRM provided no reference word above some threshold probability, indicating that the processing of the speech input was not likely correct. Additionally or alternatively, the indication 400 may be provided explicitly by a user, by selection of the word for correction, for example.
  • User input 410 provides a means for receiving a correction of the processed speech input. In this example, the speech recognition system has interpreted the word "forest,” and a user provides the correct phrase, "for the rest.”
  • a speech input may be deemed as correct based on implicit verification.
  • a user terminal, such as user terminal 110A, may be embodied as a mobile phone and may further be operable to receive a speech input such as "call Suzanne."
  • upon automatic selection and execution of the associated command (e.g., initiating a call to a phone number saved for a contact by the name of Suzanne), and failure to receive any correction to stop the initiated phone call, the user terminal 110A, such as by the processor 120, may consider this absence of any action by the user a verification of the processed speech input.
  • the user terminal 110A may generate and/or otherwise refine a speaker dependent SRM based on the speech input.
  • the SRM may be trained using speech input received with respect to operation 220, and/or verification or correction of the processed speech input with respect to operation 230.
  • Existing SRMs on memory device 126 may therefore be tailored for use by a particular user or group of users.
  • new speaker dependent SRMs may be generated for improved speech input processing. Training can be performed, for example, by using feature vectors of the speech input (provided with respect to operation 220) and associating them with corresponding reference words, as provided by the verification and/or correction with respect to the operation 230 above.
  • a verification or correction need not be provided, but the processor 120 may identify the reference words from a script on memory device 126 (such as in an example embodiment where the speech input is received based on a script).
  • the SRM, such as an HMM, DTW, NN, FST, or the like, may therefore be expanded, or otherwise modified, to incorporate the speech input and associated reference words.
  • processed speech input and associated reference words may be further processed by processor 120, and applied to an existing SRM, to refine a speaker dependent SRM.
  • a new speaker dependent SRM may be generated.
  • the generated or refined speaker dependent SRM may be stored on memory device 126, for example.
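  • A minimal sketch of the training bookkeeping described above, associating a verified (or corrected) reference word with the utterance's feature vectors; a real system would re-estimate model parameters (e.g., HMM training via Baum-Welch) rather than store raw exemplars, and all names here are hypothetical:

```python
def refine_speaker_model(model, reference_word, feature_vectors):
    """Extend a speaker dependent model with one verified utterance:
    the feature vectors become training material associated with the
    corresponding reference word."""
    model.setdefault(reference_word, []).append(feature_vectors)
    return model

# e.g., after the user confirms the correction "for the rest" (Figure 4):
# model = refine_speaker_model(model, "for the rest", frames)
```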
  • the user terminal 110A may include means, such as communication interface 124, and/or processor 120, for causing transmission of the speaker dependent SRM to a remote storage location, such as personalized speech recognition apparatus 102, for example.
  • Transmission of the speaker dependent SRM to a remote location may allow the speaker dependent SRM to be advantageously transmitted to other user terminals, such as described in further detail with respect to Figure 3. Further, and in some examples, by transmitting the speaker dependent SRM to the remote location, one or more user terminals may provide updates to or otherwise refine the speaker dependent SRM. The speaker dependent SRM may therefore be retrieved from memory device 126, and transmitted via communication interface 124 and over network 100, for example, to the remote storage location.
  • the transmission may occur automatically following generation and/or refinement of the speaker dependent SRM with respect to operation 240.
  • a user of user terminal 110A may initiate the transmission, for example, by providing logon credentials to the personalized speech recognition apparatus 102, as described with respect to operation 200.
  • the speaker dependent SRM may then be transmitted to the personalized speech recognition apparatus 102 for storage, and subsequent retrievals.
  • Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a speech personalization apparatus 102, in accordance with one embodiment of the present invention.
  • the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for causing transmission of an SRM (or portion thereof) to a user terminal.
  • the SRM may therefore be retrieved from memory device 26, and sent over network 100, via communication interface 24, to user terminal 110A, for example.
  • the SRM that is transmitted may be an SRM that is configured for a particular device, a particular region or dialect or the like.
  • the personalized speech recognition apparatus 102 may generate the additional SRM based on an association with a group of users, such as one associated with a geographic location. For example, some geographic areas, like the southern United States, may experience regional accents that may otherwise confuse speech input processing systems. Personalized speech recognition apparatus 102 may therefore generate the additional SRM based on a particular geographic location in order to subsequently provide more accurate speech recognition functions to users in, from, or otherwise associated with the same geographic location. Similarly, an additional SRM may be generated based on a specific dialect. For example, due to varying dialects, some words may be pronounced differently than the same word in a different language, potentially causing erroneous speech input processing on a user terminal.
  • Personalized speech recognition apparatus 102 may therefore associate the speaker dependent SRM with a dialect in order to provide more accurate speech recognition functions to users whose speech is closely related to the specific dialect.
  • a user of a device may then provide indication of a particular dialect, and receive an SRM adapted for that dialect.
  • the SRM may already be adapted to a particular user.
  • the personalized speech recognition apparatus 102 may receive logon information from a user terminal, such as user terminal 110A, that indicates the identity of a particular user.
  • personalized speech recognition apparatus 102, such as via the processor 20, the communication interface 24, or the like, may cause the SRM related to the particular user to be transmitted to user terminal 110A.
  • the transmission may be initiated on the personalized speech recognition apparatus 102 in various ways, such as by receiving requests initiated explicitly (e.g., logon) or automatically (e.g., initial installation) from the user terminal 110A, and/or by automatic transmission initiated by the personalized speech recognition apparatus 102.
  • the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for receiving at least a portion of a speaker dependent SRM from the user terminal, such as user terminal 110A.
  • the received speaker dependent SRM (or portion thereof) may contain one or more updates to or refinements of the speaker dependent SRM as is described with respect to operations 240 and 250 of Figure 2.
  • the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, or the like, for generating an additional or otherwise updated SRM based on the speaker dependent SRM, wherein the additional SRM is adaptable by one or more user terminals.
  • the additional SRM is constructed based on the speaker dependent SRM and comprises the updates to or refinements of the SRM from the user terminal, as well as from one or more other user terminals.
  • the speech personalization administrator 28 may access an existing SRM on memory device 26, and modify, update, or otherwise refine the SRM with the speaker dependent SRM, or a portion of the speaker dependent SRM, accordingly. Additionally, or alternatively, a new SRM may be generated using the speaker dependent SRM.
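  • The description leaves the combination method open; as one hypothetical sketch, the apparatus might interpolate stored parameters with those in the received speaker dependent portion (the parameter dictionary and blending weight are assumptions of this sketch):

```python
def update_server_model(server_model, received_portion, mix=0.2):
    """Fold a speaker dependent portion received from a user terminal
    into the stored model by linear interpolation; parameters not yet
    present are adopted as-is. mix is an assumed blending weight."""
    updated = dict(server_model)
    for key, value in received_portion.items():
        updated[key] = (1 - mix) * updated[key] + mix * value if key in updated else value
    return updated
```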
  • the additional SRM may be, or otherwise include, an HMM, DTW, NN, or FST, for example.
  • the additional SRM may be adaptable by one or more user terminals, such as described with respect to operations 200 and 210 above.
  • the personalized speech recognition apparatus 102 may include means, such as the processor 20, communication interface 24, or the like, for causing transmission of the additional SRM to an additional device.
  • the transmission may be initiated and completed by use of similar operations described with respect to operation 300, but the SRM may this time be transmitted to a different terminal, such as user terminal 110B, for example.
  • the additional SRM may be shared between one or more user terminals, devices and/or the like.
  • the personalized speech recognition apparatus 102 may select the additional SRM to transmit to the user terminal 110B, based on a variety of factors, such as terminal dependent data, and/or user identification, for example.
  • An association of the individual user (or group of users) and speaker dependent SRM may allow the personalized speech recognition apparatus 102 to advantageously provide the SRM on demand, to various devices belonging to a user.
  • a user terminal 110A embodied as a personal computer or laptop capable of producing text from speech input, such as dictated reports or emails, may rely on an extensive SRM representing an entire natural language.
  • a speaker dependent SRM generated and/or refined on the user terminal 110A may be available on personalized speech recognition apparatus 102 for distribution to one or more other user terminals.
  • if the same user of user terminal 110A purchases a new user terminal 110B, it may be advantageous to provide portions of the speaker dependent SRM to the user terminal 110B.
  • in one example, the user terminal 110B is a coffee maker.
  • a coffee maker may not require the broad vocabulary required by the personal computer or laptop embodiment of user terminal 110A, but only portions of the speaker dependent SRM, including language relating to functions of the coffee maker (e.g., grind, brew), or language relating to measurements and/or timing.
  • the personalized speech recognition apparatus 102 may advantageously select an SRM or a portion of an SRM (as generated in operation 320) for use by the coffee maker, potentially minimizing the bandwidth required for transmitting, and the memory required for storing (on user terminal 110B), an otherwise extensive SRM.
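  • A sketch of that portion selection, assuming the SRM's lexicon can be keyed by word (the vocabulary set and all names are illustrative, not from the description):

```python
def select_srm_portion(full_lexicon, device_vocabulary):
    """Keep only the lexicon entries a constrained terminal needs,
    shrinking what must be transmitted to and stored on the device."""
    return {word: entry for word, entry in full_lexicon.items()
            if word in device_vocabulary}

# e.g., for the coffee maker example:
# portion = select_srm_portion(lexicon, {"grind", "brew", "start", "minutes"})
```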
  • the user terminal 110B may utilize the SRM in processing of speech input, thereby offering its users personalized speech recognition without, in some examples, requiring its users to retrain an SRM. That is, the speech input processing may be improved by use of the SRM, which may include a portion(s) of the speaker dependent SRM generated or refined on user terminal 110A. As such, the user of user terminal 110B may provide speech input to user terminal 110B, and experience reduced or minimized error rates in speech input processing and/or execution of associated voice commands.
  • Figure 1 illustrates an embodiment utilizing user terminals 110A and 110B, and a personalized speech recognition apparatus 102; however, other configurations are also contemplated.
  • a personalized speech recognition apparatus 102 may be locally installed on a device such as user terminal 110A and/or 110B and configured to run independently, where data may not necessarily be shared across devices or a server.
  • a user terminal such as user terminal 110A and/or 110B may provide a speaker dependent SRM to a personalized speech recognition apparatus 102, may receive an SRM from a personalized speech recognition apparatus 102, or may both provide and receive the same, respectively.
  • the personalized speech recognition apparatus 102 may be implemented in the cloud, and data may be transmitted between user terminal(s) and server(s) over network 100.
  • various user terminals may routinely receive updated SRMs, thereby continually improving speech input processing by utilizing speaker dependent SRMs generated and/or refined on other user terminals.
  • a device such as user terminal 110A and/or 110B may be shipped with SRMs preinstalled.
  • the SRM may be local to the area or country in which the user terminal is distributed. That is, the SRM may be based on a speaker dependent SRM associated with a dialect or geographic area.
  • Figures 2 and 3 are flowcharts illustrating operations performed by a user terminal 110A, user terminal 110B, and/or the like, and the personalized speech recognition apparatus 102, respectively.
  • each block of the flowchart, and combinations of blocks in the flowcharts may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions.
  • one or more of the procedures described above may be embodied by computer program instructions.
  • the computer program instructions which embody the procedures described above may be stored by a memory device 26 or 126 employing an embodiment of the present invention and executed by a processor 20 or 120.
  • any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowcharts' blocks.
  • These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture the execution of which implements the function specified in the flowcharts' blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowcharts' blocks.
  • blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
  • certain ones of the operations above may be modified or further amplified.
  • additional optional operations may be included as indicated by the blocks shown with a dashed outline in Figures 2 and 3. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
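By way of non-limiting illustration (this sketch is not part of the original disclosure), the portion selection referenced in the list above may be approximated as follows, assuming an SRM's language model can be reduced to a word-to-weight mapping; all identifiers and values are hypothetical:

```python
# Minimal sketch: selecting a domain-limited portion of a speaker dependent
# SRM for a constrained device (e.g., a coffee maker). The SRM is modeled as
# a plain dict of word -> weight; real SRMs (HMM/NN/FST) would need
# model-specific pruning. All names here are illustrative only.

FULL_SRM = {
    "grind": 0.9, "brew": 0.85, "cup": 0.8, "minutes": 0.7,
    "dictate": 0.6, "email": 0.6, "paragraph": 0.5,  # laptop-only vocabulary
}

DEVICE_DOMAINS = {
    "coffee_maker": {"grind", "brew", "cup", "minutes"},
    "laptop": None,  # None = full vocabulary
}

def select_srm_portion(full_srm, device_type):
    """Return only the vocabulary a given device class actually needs."""
    domain = DEVICE_DOMAINS.get(device_type)
    if domain is None:
        return dict(full_srm)  # send the extensive model unchanged
    return {w: p for w, p in full_srm.items() if w in domain}

print(select_srm_portion(FULL_SRM, "coffee_maker"))
# {'grind': 0.9, 'brew': 0.85, 'cup': 0.8, 'minutes': 0.7}
```

A real implementation would prune HMM states, network weights, or FST arcs rather than a flat vocabulary, but the bandwidth and memory savings follow the same principle.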

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A method, apparatus and computer program product are provided for personalizing speech recognition data. A speech recognition model (SRM) that is adaptable by a user terminal based on user terminal dependent data may be received and adapted by a user terminal. A speaker dependent SRM may be refined on the user terminal and transmitted to a remote storage location, such as a personalized speech recognition apparatus. The apparatus may cause transmission of SRMs to various user terminals, and may generate additional SRMs based on speaker dependent SRMs. Speaker dependent SRMs may be generated based on an individual, group of users, geographic location, dialect, or the like. SRMs may be based on hidden Markov models, dynamic time warping models, neural networks, finite state transducers, or the like.

Description

METHOD, APPARATUS, AND COMPUTER PROGRAM PRODUCT FOR PERSONALIZING SPEECH RECOGNITION
TECHNOLOGICAL FIELD
An example embodiment of the present invention relates generally to speech recognition, and more particularly, to a method, apparatus and computer program product for personalizing speech recognition.
BACKGROUND
The widespread use of technology, including mobile technology, in everyday life has led to an increased demand for other forms of user interaction with various devices. Devices providing a user with hands-free control capabilities, allowing users to control a device with voice commands, such as via speech recognition, while still focusing their attention on driving or other activities, are becoming increasingly popular. Speech recognition may be used to control these and other devices, such as wireless phones, cars, household appliances, and other devices used in everyday life or work. Speech recognition, which may be referred to as automatic speech recognition (ASR), may be conducted by various applications that may be operable to convert recognized speech into text (e.g., a speech-to-text system). Current ASR and/or speech-to-text systems are typically based on a speech recognition model (SRM) comprising an acoustic model and a language model. For improved efficiency, the acoustic models and language models can be fused together, or otherwise may be combined. These SRMs are the building blocks for words and strings of words, such as phrases or sentences, and are used by a device to process speech input (e.g., recognize the speech input and derive a machine readable interpretation).
By way of example, a speech recognition processor, in some examples, may receive speech samples and then may match those samples with the basic sound units in the acoustic model. The speech recognition processor then may, for example, calculate the most likely words from the SRM based on the matched basic sound units, such as by using Hidden Markov Models (HMMs) and/or dynamic time warping (DTW). HMM and DTW are examples of statistical models that describe speech patterns probabilistically. Additionally or alternatively, various neural networks (NN) and/or finite state transducers (FST) may also be used as SRMs. Other suitable models can also be used as SRMs.
In DTW, and in some additional examples, an unknown speech pattern is compared with known reference patterns. In dynamic time warping, the speech pattern is divided into several frames, and the local distance between the speech pattern included in each frame and the corresponding speech segment of the reference pattern is calculated. This distance is calculated by comparing the speech segment and the corresponding speech segment of the reference pattern with each other, and it is thus a kind of numerical value for the differences found in the comparison. For speech segments close to each other, a smaller distance is usually obtained than for speech segments further from each other. On the basis of the local distances obtained this way, a minimum path between the beginning and end points of the word is sought by using a DTW algorithm. Thus, by DTW, a distance is obtained between the uttered word and the reference word.
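As a non-limiting illustration of the distance computation described above, the following Python sketch accumulates local frame distances along a minimum-cost path; the one-dimensional frames and toy reference patterns are assumptions for brevity:

```python
# Minimal dynamic time warping sketch: local distances between frames are
# accumulated along a minimum-cost path between the beginning and end of the
# utterance. Frames are 1-D scalars here; real systems compare
# multi-dimensional feature vectors (e.g., MFCCs).

def dtw_distance(pattern, reference):
    n, m = len(pattern), len(reference)
    INF = float("inf")
    # cost[i][j] = minimum accumulated distance aligning pattern[:i] to reference[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(pattern[i - 1] - reference[j - 1])  # local frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
    return cost[n][m]

# The uttered word is matched to the closest reference word:
references = {"brew": [1.0, 1.2, 3.0], "grind": [0.2, 0.4, 0.5]}
utterance = [1.1, 1.3, 2.8]
print(min(references, key=lambda w: dtw_distance(utterance, references[w])))  # brew
```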
In speech recognition using the HMM method, an HMM model is first formed for each word to be recognized (e.g., for each reference word). When the speech recognition device receives a speech pattern, an observation probability is calculated for each HMM model in the memory, and as the recognition result, a counterpart word is obtained for the HMM model with the greatest observation probability. Thus for each reference word, the probability is calculated that it is the word uttered by the speaker. The above-mentioned observation probability describes the resemblance of the received speech pattern and the closest HMM model (e.g., the closest reference speech pattern). The reference words, or word candidates, can be further weighted by the language models. In some embodiments, the recognition process can occur in a single pass-through mode with fused acoustic models and language models.
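A minimal sketch of the per-word scoring described above, using the standard forward algorithm to compute each model's observation probability; the two-state models and parameters below are invented for illustration:

```python
# One HMM per reference word: compute each model's observation probability
# for the received pattern and return the word whose model scores highest.
# Observations are quantized symbols (0 or 1); all values are toy examples.

def forward_probability(obs, start, trans, emit):
    """Standard forward algorithm: P(obs | model)."""
    states = range(len(start))
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states]
    return sum(alpha)

# word -> (start probabilities, transition matrix, emission matrix)
models = {
    "yes": ([0.8, 0.2], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]]),
    "no":  ([0.5, 0.5], [[0.6, 0.4], [0.3, 0.7]], [[0.1, 0.9], [0.8, 0.2]]),
}

observed = [0, 0, 1]
best = max(models, key=lambda w: forward_probability(observed, *models[w]))
print(best)  # the reference word with the greatest observation probability
```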
In a NN method, interconnecting data nodes store information regarding speech patterns. The nodes of the NN may be used to classify phonetic features of speech input, and may be configured so as to focus on portions of the model that may be most valuable in distinguishing words during speech recognition processes. A well designed NN will therefore minimize, in some examples, the processing time required to recognize speech inputs. NNs are particularly well suited for training of larger data sets, such as data sets representing natural language.
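As an illustrative sketch only, a small feedforward network mapping frame features to phonetic classes might look as follows; the layer sizes and random weights are assumptions, and a practical model would be trained on large data sets as noted above:

```python
# Toy feedforward network for the NN method: feature vectors are mapped to
# phonetic classes. Weights are random for illustration; numpy is the only
# dependency.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(13, 32)), np.zeros(32)   # 13 MFCC-like inputs
W2, b2 = rng.normal(size=(32, 4)), np.zeros(4)     # 4 phonetic classes

def classify_frame(features):
    hidden = np.maximum(0, features @ W1 + b1)      # ReLU hidden layer
    logits = hidden @ W2 + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                       # softmax over classes

frame = rng.normal(size=13)                          # one speech frame
print(classify_frame(frame).round(3))
```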
In an FST method, speech inputs may be processed, various operations may be performed on the speech input, and a most probable output (e.g., a recognized word) may be selected. FSTs may be particularly beneficial, in some examples, in phonological analysis. The reusability and flexibility of algorithms performed on FSTs make FSTs particularly useful in combining portions of, or various, SRMs. An SRM may therefore incorporate speech recognition data from various sources, apply weights to the speech recognition data, and generate weighted FSTs for use in speech recognition tasks.
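A minimal sketch of weighted-FST style decoding, in which the most probable output is the lowest-cost accepting path through a transducer; the tiny arc table and weights below are illustrative assumptions:

```python
# Arcs carry (next_state, output word, weight); weights are negative log
# probabilities, so path cost is additive and the best output sequence is
# the cheapest accepting path.

import heapq

# state -> input symbol -> list of (next_state, output word, weight)
ARCS = {
    0: {"b": [(1, "brew", 0.1), (1, "blue", 0.9)]},
    1: {"k": [(2, "coffee", 0.2)]},
}
FINAL = {2}

def best_path(symbols, start=0):
    """Dijkstra-style search for the cheapest accepting path."""
    heap = [(0.0, start, 0, [])]  # (cost, state, position, outputs)
    while heap:
        cost, state, pos, out = heapq.heappop(heap)
        if pos == len(symbols):
            if state in FINAL:
                return out, cost
            continue
        for nxt, word, w in ARCS.get(state, {}).get(symbols[pos], []):
            heapq.heappush(heap, (cost + w, nxt, pos + 1, out + [word]))
    return None, float("inf")

print(best_path(["b", "k"]))  # (['brew', 'coffee'], ~0.3)
```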
The various types of SRMs may include speaker independent SRMs and speaker dependent SRMs. Speaker independent SRMs may comprise averages of language and acoustic models collected from a large sample of users. A speaker dependent SRM may be specific to the user and may be adapted by the user through training. Initial training may be performed during a first use of the SRM, and training may continue during normal use of the SRM. A speaker dependent SRM comprises unique sets of electronic characteristics for the acoustic model and a unique language model for the words formed from combinations of unique basic sound units.
It is appreciated that given the complexity of natural language, the data needed to process and understand speech may also be complex. SRMs used by a device to process speech input may therefore rely on any combination of the HMM, DTW, NN, FST, and other models, as well as a blend of speaker dependent SRMs and speaker independent SRMs.
BRIEF SUMMARY
A method, apparatus, and computer program product are provided for personalizing a speech recognition model (SRM). In one embodiment, a method is provided for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
In some embodiments, the method may further include processing received speech input using the speech recognition model, and generating a textual output. In some embodiments, the method may further include receiving a speech input, and refining a speaker dependent speech recognition model based on the speech input. In some embodiments, the method may further include verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction. In some embodiments, the method may further include causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location. The terminal dependent data may comprise microphone information and/or a context. The received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect. The received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer. An additional method is provided including receiving at least one portion of a speaker dependent speech recognition model from a user terminal and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals. In some embodiments, the method may further include causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal. Generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect. The at least one additional portion of the speech recognition model may be based on at least one of a hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
An apparatus is also provided, comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
An additional apparatus is provided comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least receive at least one portion of a speaker dependent speech recognition model from a user terminal, and generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
A computer program product is provided, comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapt the speech recognition model based on terminal dependent data.
An additional computer program product is provided, comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to receive at least one portion of a speaker dependent speech recognition model from a user terminal, generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
An apparatus is also provided, comprising means for receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech, accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model, and adapting the speech recognition model based on terminal dependent data.
An additional apparatus is provided, comprising means for receiving at least one portion of a speaker dependent speech recognition model from a user terminal, and generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
BRIEF DESCRIPTION OF THE DRAWINGS
Having thus described certain example embodiments of the present invention in general terms, reference will hereinafter be made to the accompanying drawings which are not necessarily drawn to scale, and wherein:
Figure 1 is a block diagram of a personalized speech recognition apparatus in communication with user terminals which may be configured to implement example embodiments of the present invention;
Figure 2 is a flowchart illustrating operations to receive and adapt an SRM on a user terminal, in accordance with one embodiment of the present invention;
Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a speech personalization apparatus in accordance with one embodiment of the present invention; and
Figure 4 is a display for training an SRM, in accordance with one embodiment of the present invention.
DETAILED DESCRIPTION
Some embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms "data," "content," "information," and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present invention.
Additionally, as used herein, the term 'circuitry' refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of 'circuitry' applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term 'circuitry' also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term 'circuitry' as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
As defined herein, a "computer-readable storage medium," which refers to a physical storage medium (e.g., volatile or non-volatile memory device), may be differentiated from a "computer- readable transmission medium," which refers to an electromagnetic signal.
As described below, a method, apparatus and computer program product are provided for accessing and adapting remotely stored personalized speech recognition data for use on one or more devices. Referring to Figure 1, personalized speech recognition apparatus 102 may include or otherwise be in communication with processor 20, user interface 22, communication interface 24, memory device 26, and speech personalization administrator 28. Personalized speech recognition apparatus 102 may be embodied by a wide variety of devices including mobile terminals, e.g., mobile telephones, smartphones, tablet computers, laptop computers, or the like, computers, workstations, servers or the like and may be implemented as a distributed system or a cloud based entity.
In example embodiments, the personalized speech recognition apparatus 102 may receive and/or transmit SRMs, as well as generate additional SRMs that may be adaptable by one or more user terminals. An SRM is a statistical model that describes speech patterns probabilistically, and may include a language model (words) and an acoustic model (basic sound units). Example SRMs include the HMM, DTW, NN, and FST models. An SRM may be provided to a user terminal to enable speech recognition capabilities (e.g., processing of input speech) on the user terminal. In some embodiments, transmittal of an SRM may include transmittal of a portion of the SRM, since an SRM in its entirety may be too large for practical transmission (and an SRM portion may also be considered an SRM). The SRM portion may be incorporable into an SRM, so that the portion may then be incorporated with another portion of an SRM to provide a complete or fully functioning SRM. It will therefore be appreciated that any reference to an SRM herein may indicate a portion or portions of an SRM, but for simplicity may be referred to as an SRM.
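As a non-limiting sketch of incorporating a received SRM portion into a complete SRM, assuming a simplified word-to-weight representation (real HMM/NN/FST portions would require model-specific merging):

```python
# Assemble a usable SRM from portions: a received portion (e.g., a domain
# vocabulary or refined weights) is merged with a locally stored base
# portion. The dict-of-weights representation is illustrative only.

def merge_srm_portions(base_portion, received_portion):
    """Received entries override or extend the base, yielding a complete SRM."""
    merged = dict(base_portion)
    merged.update(received_portion)
    return merged

base = {"hello": 0.5, "stop": 0.4}
received = {"brew": 0.9, "stop": 0.6}   # refined weight for "stop"
print(merge_srm_portions(base, received))
# {'hello': 0.5, 'stop': 0.6, 'brew': 0.9}
```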
In some embodiments, the SRMs may incorporate speaker independent data, speaker dependent data, and/or terminal dependent data. The speaker independent data may include averaged, normalized, or otherwise consolidated language and acoustic models collected from a large sample of users.
The speaker dependent data may alternatively be biased toward a particular individual, or group of users, such as a group of users speaking a particular language or dialect, or from a particular geographic region. The speaker dependent data may be generated and/or refined on a user terminal by training the SRM. Alternatively or additionally, the speaker dependent data may be generated or refined on one or more user terminals and/or devices, such that it may be shared, via the personalized speech recognition apparatus 102, between the one or more user terminals and/or devices.
In some example embodiments, training may include, but is not limited to, providing speech input to the user terminal, potentially updating and/or verifying the processing of the speech input, and updating the SRM accordingly. On some user terminals, the training may include the explicit dictation of special training data by a speaker, and/or implicit training through the general use of the user terminal. In some embodiments, various models, such as an HMM, may be constructed for each speaker dependent SRM to be stored. A speaker dependent SRM incorporating the speaker dependent data may be communicated from the user terminal to the personalized speech recognition apparatus 102. The terminal dependent data may include information regarding the user terminal itself, such as characteristics of the microphone on the user terminal to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the device), or any settings of the user terminal 110A that could impact the processing of speech input. An SRM received from the personalized speech recognition apparatus 102 may be adapted on the user terminal based on the terminal dependent data, so that the particular user terminal may more accurately process speech inputs.
Speaker dependent SRMs, including speaker dependent data, may be stored on personalized speech recognition apparatus 102. The speaker dependent SRM, or a portion thereof, may be further modified and/or transmitted to another device to allow the user terminal to benefit from the speaker dependent data, thereby improving the probability of successful speech recognition on another user terminal. As such, one or more user terminals may access or otherwise download the speaker dependent model for the purposes of providing personalized speech recognition.
Advantageously, for example, as the one or more user terminals provide personalized speech recognition, using the speaker dependent model, and the speech recognition result is verified or otherwise confirmed (e.g., checked by a user for errors), the personalized speech recognition apparatus 102 may receive updates to the speaker dependent model. The personalized speech recognition apparatus 102 may therefore further tune or otherwise modify the speaker dependent model.
In some embodiments, the processor 20 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 20) may be in communication with the memory device 26 via a bus for passing information among components of the personalized speech recognition apparatus 102. The memory device 26 may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device 26 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 20). The memory device 26 may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present invention. For example, the memory device 26 could be configured to store various SRMs, including speaker independent and speaker dependent portions. The speaker dependent data may be associated with a particular user or group of users, enabling the processor 20 to identify and provide appropriate SRMs to various devices. As such, the memory device 26 could be configured to buffer input data for processing by the processor 20, and/or to store instructions for execution by the processor 20. The personalized speech recognition apparatus 102 may, in some embodiments, be embodied in various devices as described above. However, in some embodiments, the personalized speech recognition apparatus 102 may be embodied as a chip or chip set. In other words, the personalized speech recognition apparatus 102 may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The personalized speech recognition apparatus 102 may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single "system on a chip." As such, in some cases, a chip or chipset may constitute means for performing one or more operations described herein for personalizing speech recognition in devices.
The processor 20 may be embodied in a number of different ways. For example, the processor 20 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor 20 may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor 20 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading. In an example embodiment, the processor 20 may be configured to execute instructions stored in the memory device 26 or otherwise accessible to the processor 20. In example embodiments, such instructions may provide for the retrieval, transmittal, and/or processing of SRMs, including generating additional SRMs based on received updated speaker dependent SRMs. Alternatively or additionally, the processor 20 may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 20 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention, such as the personalization of SRMs. Thus, for example, when the processor 20 is embodied as an ASIC, FPGA or the like, the processor 20 may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor 20 is embodied as an executor of software instructions, the instructions may specifically configure the processor 20 to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processor 20 may be a processor of a specific device (e.g., a user terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 20 by instructions for performing the algorithms and/or operations described herein. The processor 20 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 20. Meanwhile, the communication interface 24 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the personalized speech recognition apparatus 102. In this regard, the communication interface 24 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network, for transmitting and receiving SRMs to and from remote devices. Additionally or alternatively, the communication interface 24 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s).
In some environments, the communication interface 24 may alternatively or also support wired communication. As such, for example, the communication interface 24 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
In some embodiments, such as instances in which the personalized speech recognition apparatus 102 is embodied by a user device, the personalized speech recognition apparatus 102 may include a user interface 22 that may, in turn, be in communication with the processor 20 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user. As such, the user interface 22 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. Alternatively or additionally, the processor 20 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like. The processor 20 and/or user interface circuitry comprising the processor 20 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 20 (e.g., memory device 26, and/or the like). In some example embodiments, processor 20 may be embodied as, include, or otherwise control a speech personalization administrator 28 for providing personalized speech recognition. As such, the speech personalization administrator 28 may be embodied as various means, such as circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (for example, memory device 26) and executed by a processing device (for example, processor 20), or some combination thereof. Speech personalization administrator 28 may be capable of communication with one or more of the processor 20, memory device 26, user interface 22, and communication interface 24. As such, the speech personalization administrator 28 may be configured to generate additional SRMs that are adaptable by a variety of user terminals and that may be based on speaker dependent SRMs, as described above and in further detail hereinafter.
Any number of user terminal(s) 110, such as 110A and 110B, may connect to personalized speech recognition apparatus 102 via a network 100. User terminal 110 may be embodied as a mobile terminal, such as personal digital assistants (PDAs), pagers, mobile televisions, mobile telephones, gaming devices, laptop computers, tablet computers, cameras, camera phones, video recorders, audio/video players, radios, global positioning system (GPS) devices, navigation devices, or any combination of the aforementioned, and other types of devices capable of providing speech recognition. The user terminal 110 need not necessarily be embodied by a mobile device and, instead, may be embodied in a fixed device, such as a computer, workstation, or home appliance, such as a coffee maker. Additionally or alternatively, user terminal(s) 110 may be embodied in a vehicle, or any other machine or device capable of processing voice commands.
For simplicity, only user terminal 110A is illustrated in further detail, but it will be appreciated that any of the user terminals 110, such as user terminal 110B, may be configured as illustrated in and described with respect to user terminal 110A. The user terminal 110 may therefore include or otherwise be in communication with processor 120, user interface 122, communication interface 124, and memory device 126.
In some embodiments, the processor 120 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor 120) may be in communication with the memory device 126 via a bus for passing information among components of the user terminal 110. The memory device 126 may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device 126 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor 120). The memory device 126 may be configured to store information, data, content, applications, instructions, or the like for enabling the user terminal to carry out various functions in accordance with an example embodiment of the present invention. For example, the memory device 126 could be configured to store SRMs, instructions for adapting SRMs with terminal dependent data, and instructions for training SRMs with speaker dependent data. Memory device 126 may therefore buffer input data for processing by the processor 120. Additionally or alternatively, the memory device 126 could be configured to store instructions for execution by the processor 120.
The processor 120 may be embodied in a number of different ways. For example, the processor 120 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a DSP, a processing element with or without an accompanying DSP, or various other processing circuitry including integrated circuits such as, for example, an ASIC, an FPGA, an MCU, a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor 120 may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor 120 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading. In an example embodiment, the processor 120 may be configured to execute instructions stored in the memory device 126 or otherwise accessible to the processor 120. For example, the processor 120 may be configured to adapt an SRM advantageously to the user terminal, based on terminal dependent data, such as microphone information and context, so that the SRM may account for variances across user terminals. In example embodiments, the user terminal(s) 110 may include means, such as a processor 120, for training the SRM with speech input, to generate and/or refine a speaker dependent SRM that may improve speech input processing on the user terminal (and subsequently, other user terminals). Alternatively or additionally, the processor 120 may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 120 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly. Thus, for example, when the processor 120 is embodied as an ASIC, FPGA or the like, the processor 120 may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor 120 is embodied as an executor of software instructions, the instructions may specifically configure the processor 120 to perform the algorithms and/or operations, such as adaptation and training of SRMs, processing of speech input, such as by using the SRMs, for conversion to text, when the instructions are executed. However, in some cases, the processor 120 may be a processor of a specific device (e.g., a mobile terminal or network entity) configured to employ an embodiment of the present invention by further configuration of the processor 120 by instructions for performing the algorithms and/or operations described herein. The processor 120 may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processor 120.
Meanwhile, the communication interface 124 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the user terminal 110. In example embodiments, the communication interface 124 may be specifically configured for transmitting and receiving SRMs to and from the personalized speech recognition apparatus 102. In this regard, the communication interface 124 may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface 124 may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface 124 may alternatively or also support wired communication for communication of SRMs. As such, for example, the communication interface 124 may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
The user terminal 110 may include a user interface 122 that may, in turn, be in communication with the processor 120 to receive an indication of a user input and/or to cause provision of an audible, visual, mechanical or other output to the user. As such, the user interface 122 may include, for example, a keyboard, a mouse, a display, a touch screen(s), touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. The user interface 122 may therefore be configured to receive speech input, such as via a microphone, for the purposes of speech recognition and/or training of an SRM. Alternatively or additionally, the processor 120 may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as, for example, a speaker, ringer, microphone, display, and/or the like. The processor 120 and/or user interface circuitry comprising the processor 120 may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor 120 (e.g., memory device 126, and/or the like). Network 100 may be embodied in a local area network, the Internet, any other form of a network, or in any combination thereof, including proprietary private and semi-private networks and public networks. The network 100 may comprise a wire line network, wireless network (e.g., a cellular network, wireless local area network, wireless wide area network, some combination thereof, or the like), or a combination thereof, and in some example embodiments comprises at least a portion of the Internet. The network 100 may be used for transmitting speaker dependent data and/or SRMs to and from devices. As another example, a user terminal 110 may be directly coupled to and/or may include a personalized speech recognition apparatus 102. Referring now to Figure 2, the operations for receiving and adapting an SRM on a user terminal are outlined in accordance with one example embodiment of the present invention. In this regard and as described below, the operations of Figure 2 may be performed by the user terminal 110A, user terminal 110B, and/or the like, for example. As shown by operation 200, the user terminal 110A may include means, such as the processor 120, communication interface 124, or the like, for receiving at least one portion of an SRM, wherein the at least one portion of an SRM is stored remotely and is adaptable by one or more user terminals to process input speech. In other words, the user terminal 110A may receive at least one portion of an SRM from the personalized speech recognition apparatus 102, for example, including any combination of the HMM, DTW, NN, and FST models, as described above. The at least one portion of an SRM may also include any combination of speaker independent data and/or speaker dependent data, and may be adaptable by the user terminal 110A to process speech input (e.g., perform speech recognition tasks). The adaptation is described in further detail with respect to operation 210.
To receive the at least one portion of an SRM on the user terminal 110A, in an example embodiment, a user of user terminal 110A may provide logon credentials or the like, via user interface 122, communication interface 124, and/or network 100 to the personalized speech recognition apparatus 102. In some embodiments, the user terminal 110A may check for updates by communicating with the personalized speech recognition apparatus 102, and receive an SRM or portion thereof if an update is available. In some examples, an update may be available if a user updated an SRM, based on training, verification or the like, on another device, such as user terminal 110B.
In some embodiments, the user terminal 110A may download an SRM or portion thereof for the first time (such as during initial device setup, or factory reset), or the newly received SRM or portion thereof may include updates compared to a previous version used by user terminal 110A. In some embodiments, receipt of the SRM or portion thereof by the user terminal 110A may occur during scheduled update routines that may be unobtrusive to or unnoticed by a user. That is, the synchronization may occur seamlessly as a background system update. Additionally or alternatively, a request for an SRM or portion thereof may be explicitly initiated on the user terminal 110A (such as logging onto the personalized speech recognition apparatus 102 and requesting an update). In some embodiments, an update may be initiated by the personalized speech recognition apparatus 102. For example, a user may be automatically notified that an update is available, such as by Short Message Service (SMS), for example, so as to confirm that they would like to receive the at least one portion of an SRM on the user terminal 110A.
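A hedged sketch of such an update check follows; the endpoint paths, payload fields, and version scheme are hypothetical, since the disclosure does not specify a wire protocol:

```python
# Check the personalization apparatus for a newer SRM portion and download
# it if available. Only standard-library calls are used; the server URL and
# JSON fields are invented for illustration.

import json
import urllib.request

BASE_URL = "https://speech-personalization.example.com"  # hypothetical server

def check_for_srm_update(user_token, local_version):
    """Ask the apparatus whether an SRM newer than local_version exists."""
    req = urllib.request.Request(
        f"{BASE_URL}/srm/latest?since={local_version}",
        headers={"Authorization": f"Bearer {user_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        meta = json.load(resp)
    return meta if meta.get("version", local_version) > local_version else None

def download_srm_portion(meta):
    with urllib.request.urlopen(meta["download_url"]) as resp:
        return resp.read()  # serialized SRM portion, to be merged locally
```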
The user terminal 110A may therefore receive at least one portion of an SRM associated with the individual user (such as identified with the logon credentials). Additionally or alternatively, the SRM or portion thereof may be identified by the personalized speech recognition apparatus by other means. For example, a user of a device may provide a geographic location, via a Global Positioning System (GPS) device and/or manual indication of a location, for example. The user terminal 110A may therefore receive an SRM based on a geographic location and/or dialect. Having received at least one portion of an SRM, as described with respect to operation 200, the user terminal 110A may include means, such as the processor 120, for accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model. As such, the received at least one portion of an SRM may be a complete SRM, and may therefore be stored on memory device 126, and accessed by the processor 120. In some embodiments, where the at least one portion of the SRM does not provide a complete or fully functioning SRM, the processor 120 may incorporate the at least one portion of an SRM to form a complete SRM. As such the SRM may be stored and accessed on memory device 126, for example.
Having accessed an SRM, as shown by operation 208, the user terminal 110A may include means, such as the processor 120, for adapting the SRM based on terminal dependent data. The terminal dependent data may include information regarding the user terminal 110A itself, such as characteristics of the microphone on the user terminal 110A to capture the speech input, and/or a context of the user terminal (e.g., an environment the device is commonly used in, or the intended purpose of the user terminal), or any settings of the user terminal 110A that could impact the processing of speech input. The processor 120 may therefore utilize the terminal dependent data in adapting the SRM for use on the user terminal 110A.
In an example embodiment, microphone information may be retrieved from memory device 126, or read from a microphone component of user interface 122 by processor 120, for example. The microphone information may include any information relating to the microphone that may impact how speech input is recognized and/or processed according to the SRM. For example, the microphone information may comprise a microphone model identifier, or orientation of the microphone within the device. The microphone may additionally or alternatively be characterized by its transduction type, such as condenser and/or dynamic, for example. The user terminal 110A, using the processor 120, may therefore adapt the SRM according to microphone information to account for acoustic, phonetic, and/or other variances between microphones. For example, calculations in a DTW model may be consistently modified throughout, so that the user terminal 110A may accurately interpret sounds captured by the microphone.
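As an illustrative sketch of microphone-dependent adaptation, features may be compensated for a device's measured frequency response before they are scored against the SRM; the band gains and microphone identifiers below are invented:

```python
# Per-device frequency-response correction applied to features before they
# are scored against the SRM, so that distances (e.g., in a DTW model) are
# computed on compensated features. All values are illustrative.

import numpy as np

# Hypothetical per-band sensitivity (dB) measured for two microphone models.
MIC_RESPONSE_DB = {
    "condenser_x1": np.array([0.0, -1.5, -3.0, -4.0]),
    "dynamic_y2":   np.array([-2.0, 0.0, 0.5, -6.0]),
}

def compensate(features, mic_model):
    """Undo the microphone's per-band gain so features match the SRM's training conditions."""
    correction_db = -MIC_RESPONSE_DB[mic_model]
    return features * (10.0 ** (correction_db / 20.0))

frame = np.array([0.8, 0.6, 0.4, 0.2])  # toy 4-band spectral features
print(compensate(frame, "condenser_x1").round(3))
```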
In another example embodiment, the user terminal 110A may adapt the SRM based on the context of the user terminal. Use of an SRM by a speaker phone in a vehicle, for example, may be subject to background noise, such as wind, and/or radio or other device interference. The processor 120 of user terminal 110A may therefore adapt the received SRM, which in its previous state may not have accounted for such background noises, accordingly. Information regarding the context or use of the user terminal 110A may be explicitly retrieved from memory device 126, for example, and/or derived from various components of the user terminal 110A, allowing processor 120 to adapt the SRM based on what contexts the user terminal 110A will most likely be used in.
Although microphone information and context of the user terminal are provided as example terminal dependent data, it will be appreciated that numerous other terminal dependent data exist. Settings configuring various components of the user terminal 110A may be considered by the processor 120 in adapting the SRM for the user terminal 110A. In some embodiments, the settings may affect the adaptation of the SRM, and/or cause the processor 120 to adjust the settings of the user terminal 110A to tailor the device for use of the SRM. An adapted SRM may be stored on memory device 126, for example.
As shown by operation 220, the user terminal 110A may include means, such as the user interface 122, communication interface 124, and/or processor 120 for receiving a speech input. The speech input may be provided by a user to user terminal 110A by using a microphone of user interface 122, for example.
Additionally or alternatively, the user terminal 110A may receive a speech input through everyday use of the user terminal and may process the speech to generate text. The user terminal 110A may process received speech input using the SRM, and generate a textual output. In some examples, the processor 120 may process the speech input according to the SRM. For example, the processor 120 may calculate an observation probability for the speech input based on the SRM that includes one or more HMM, DTW, NN, or FST models, for example. By way of further example, the processor 120 may identify a reference word with the highest probability when compared to other reference words, a threshold, or the like. Based on those probabilities, the processor may then select or otherwise generate the speech recognition result (e.g., a text output).
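A minimal sketch of the selection step described above, emitting the best-scoring reference word only when it clears a confidence threshold; the scores and threshold value are illustrative assumptions:

```python
# Pick the reference word with the highest probability, but only emit it as
# text when the score clears a confidence threshold; otherwise the input can
# be flagged for correction, as in the next operation.

CONFIDENCE_THRESHOLD = 0.5  # illustrative value; not specified in the patent

def decode_to_text(word_scores, threshold=CONFIDENCE_THRESHOLD):
    word, score = max(word_scores.items(), key=lambda kv: kv[1])
    return word if score >= threshold else None  # None -> ask the user

print(decode_to_text({"forest": 0.42, "for the rest": 0.38}))  # None: flag it
print(decode_to_text({"call": 0.91, "tall": 0.05}))            # 'call'
```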
As shown by operation 230, the user terminal 110A may include means, such as the user interface 122, communication interface 124 and/or processor 120, for verifying or correcting a processing of the speech input. The verification or correction could be received explicitly by a user input to the user terminal 110A, or implicitly by everyday use of the user terminal 110A.
For example, the user terminal 110A may be configured to receive an explicit correction of a processed speech input. In applications employing speech recognition, such as an example application that prefills dictated words in a draft email message, the interpretation of the speech input may be incorrect. In such cases, the user may correct a misinterpreted word(s) by selecting the misinterpreted word, and typing the corrected word in its place. See Figure 4.
As is provided in Figure 4, a user interface 122 may display an indication 400 of a word, such as a word that is misinterpreted during the processing of input speech. In some examples, indication 400 may be provided by the user terminal 110A in scenarios such as those in which the SRM provided no reference word above some threshold probability, indicating that the processing of the speech input was not likely correct. Additionally or alternatively, the indication 400 may be provided explicitly by a user, by selection of the word for correction, for example. User input 410 provides a means for receiving a correction of the processed speech input. In this example, the speech recognition system has interpreted the word "forest," and a user provides the correct phrase, "for the rest."
In other examples, a speech input may be deemed correct based on implicit verification. For example, a user terminal, such as user terminal 110A, may be embodied as a mobile phone and may further be operable to receive a speech input such as "call Suzanne." Upon automatic selection and execution of the associated command (e.g., initiating a call to a phone number saved for a contact by the name of Suzanne), and failure to receive any correction to stop the initiated phone call, the user terminal 110A, such as by the processor 120, may consider this absence of any action by the user a verification of the processed speech input.
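As a non-limiting sketch of implicit verification, the absence of a cancellation or correction within a grace period may be treated as confirmation; the timing values and callback names are assumptions:

```python
# Execute the recognized command; if the user does not cancel or correct it
# within a grace period, treat the recognition result as verified so it can
# become a training pair. Timings are illustrative.

import time

def execute_with_implicit_verification(command_text, execute, grace_seconds=5.0,
                                       was_cancelled=lambda: False):
    execute(command_text)                      # e.g., start dialing Suzanne
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if was_cancelled():
            return False                       # user intervened: not verified
        time.sleep(0.1)
    return True                                # absence of action = verification

verified = execute_with_implicit_verification(
    "call Suzanne", execute=lambda cmd: print("executing:", cmd),
    grace_seconds=0.3)
print("implicitly verified:", verified)
```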
As shown by operation 240, the user terminal 110A, such as by processor 120, and memory device 126, for example, may generate and/or otherwise refine a speaker dependent SRM based on the speech input. As such, the SRM may be trained using speech input received with respect to operation 220, and/or verification or correction of the processed speech input with respect to operation 230. Existing SRMs on memory device 126 may therefore be tailored for use by a particular user or group of users. Additionally or alternatively, new speaker dependent SRMs may be generated for improved speech input processing. Training can be performed, for example, by using feature vectors of the speech input (provided with respect to operation 220) and associating them with corresponding reference words, as provided by the verification and/or correction with respect to operation 230 above. Additionally or alternatively, a verification or correction need not be provided, but the processor 120 may identify the reference words from a script on memory device 126 (such as in an example embodiment where the speech input is received based on a script).
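A simplified sketch of this training step, folding verified (feature vector, reference word) pairs into a speaker dependent model; per-word running means stand in for the HMM/NN/FST re-estimation a real system would perform:

```python
# Fold verified or corrected training pairs into a speaker dependent model.
# The model here keeps a running mean feature vector per word, which is only
# a stand-in for real model re-estimation.

import numpy as np

class SpeakerDependentSRM:
    def __init__(self):
        self.templates = {}   # word -> (mean feature vector, sample count)

    def refine(self, features, reference_word):
        """Fold one verified training pair into the model."""
        mean, n = self.templates.get(reference_word,
                                     (np.zeros_like(features), 0))
        self.templates[reference_word] = ((mean * n + features) / (n + 1), n + 1)

srm = SpeakerDependentSRM()
srm.refine(np.array([1.0, 2.0]), "brew")       # from verified speech input
srm.refine(np.array([1.2, 1.8]), "brew")       # from a corrected utterance
print(srm.templates["brew"][0])                 # [1.1 1.9]
```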
The SRM, such as an HMM, DTW, NN, FST, or the like, may therefore be expanded, or otherwise modified, to incorporate the speech input and associated reference words. In some examples, processed speech input and associated reference words may be further processed by processor 120, and applied to an existing SRM, to refine a speaker dependent SRM. In some embodiments, where an SRM is not already present on the user terminal 110A, a new speaker dependent SRM may be generated. The generated or refined speaker dependent SRM may be stored on memory device 126, for example. As shown by operation 250, the user terminal 110A may include means, such as communication interface 124, and/or processor 120, for causing transmission of the speaker dependent SRM to a remote storage location, such as personalized speech recognition apparatus 102, for example. Transmission of the speaker dependent SRM to a remote location may allow the speaker dependent SRM to be advantageously transmitted to other user terminals, such as described in further detail with respect to Figure 3. Further, and in some examples, by transmitting the speaker dependent SRM to the remote location, one or more user terminals may provide updates to or otherwise refine the speaker dependent SRM. The speaker dependent SRM may therefore be retrieved from memory device 126, and transmitted via communication interface 124 and over network 100, for example, to the remote storage location.
In some embodiments, the transmission may occur automatically following generation and/or refinement of the speaker dependent SRM with respect to operation 240. In some embodiments, a user of user terminal 110A may initiate the transmission, such as, for example, by providing logon credentials to the personalized speech recognition apparatus 102, as described with respect to operation 200. The speaker dependent SRM may then be transmitted to the personalized speech recognition apparatus 102 for storage and subsequent retrieval.
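The upload of operation 250 might, under the assumption of a simple HTTP transport, be sketched as follows. UPLOAD_URL is a placeholder rather than a real endpoint, the model is assumed to be JSON-representable, and user_token merely stands in for the logon credentials mentioned above.

```python
import json
import urllib.request

# Placeholder endpoint; a real deployment would use the address of the
# personalized speech recognition apparatus reachable over network 100.
UPLOAD_URL = "https://example.invalid/srm/upload"

def upload_speaker_dependent_srm(model_dict, user_token):
    """Serialize a (toy, JSON-representable) speaker dependent SRM and POST it
    to remote storage. The user_token stands in for logon credentials."""
    body = json.dumps({"token": user_token, "model": model_dict}).encode("utf-8")
    request = urllib.request.Request(
        UPLOAD_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```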
Figure 3 is a flowchart illustrating operations to transmit an SRM, receive a speaker dependent SRM, and generate an additional SRM, using a personalized speech recognition apparatus 102 in accordance with one embodiment of the present invention.
As shown by operation 300, the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for causing transmission of an SRM (or portion thereof) to a user terminal. The SRM may therefore be retrieved from memory device 26, and sent over network 100, via communication interface 24, to user terminal 110A, for example. In some examples, the SRM that is transmitted may be an SRM that is configured for a particular device, a particular region or dialect, or the like.
For example, the personalized speech recognition apparatus 102 may generate the additional SRM based on an association with a group of users, such as one associated with a geographic location. Some geographic areas, such as the southern United States, may exhibit regional accents that may otherwise confuse speech input processing systems. Personalized speech recognition apparatus 102 may therefore generate the additional SRM based on a particular geographic location in order to subsequently provide more accurate speech recognition functions to users in, from, or otherwise associated with the same geographic location. Similarly, an additional SRM may be generated based on a specific dialect. For example, owing to varying dialects, some words may be pronounced differently than the same word in another dialect of the same language, potentially causing erroneous speech input processing on a user terminal. Personalized speech recognition apparatus 102 may therefore associate the speaker dependent SRM with a dialect in order to provide more accurate speech recognition functions to users whose speech is closely related to the specific dialect. A user of a device may then provide an indication of a particular dialect, and receive an SRM adapted for that dialect.
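A toy sketch of such dialect- or region-keyed selection on the server side follows; the registry contents, keys, and function name are invented for illustration.

```python
# Toy registry on the server side, keyed by (language, dialect); in practice
# each entry would be built from speaker dependent SRMs received from
# terminals associated with that group.
SRM_REGISTRY = {
    ("en-US", "southern-us"): {"y'all": [["jh", "aa", "l"]]},
    ("en-US", "general"): {"you all": [["y", "uw"], ["ao", "l"]]},
}

def select_srm(language, dialect):
    """Return the SRM matching the indicated dialect, falling back to the
    general model for the language when no dialect-specific model exists."""
    return SRM_REGISTRY.get((language, dialect)) or SRM_REGISTRY[(language, "general")]

if __name__ == "__main__":
    print(select_srm("en-US", "southern-us"))  # dialect-specific model
    print(select_srm("en-US", "unknown"))      # falls back to the general model
```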
Alternatively or additionally, the SRM may already be adapted to a particular user. For example, the personalized speech recognition apparatus 102 may receive logon information from a user terminal, such as user terminal 110A, that indicates the identity of a particular user. As such, personalized speech recognition apparatus 102, such as via the processor 20, the communications interface 24 or the like, may cause the SRM related to the particular user to be transmitted to user terminal 110A.
The transmission may be initiated on the personalized speech recognition apparatus 102 in various ways, such as by receiving requests initiated explicitly (e.g., logon) or automatically (e.g., initial installation) from the user terminal 110A, and/or by automatic transmission initiated by the personalized speech recognition apparatus 102. Various other methods for initiation of transmission of the SRM are described herein.
As shown by operation 310, the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, communication interface 24, or the like, for receiving at least a portion of a speaker dependent SRM from the user terminal, such as user terminal 110A. In some examples, the received speaker dependent SRM (or portion thereof) may contain one or more updates to or refinements of the speaker dependent SRM as is described with respect to operations 240 and 250 of Figure 2.
As shown by operation 320, the personalized speech recognition apparatus 102 may include means, such as the processor 20, speech personalization administrator 28, or the like, for generating an additional or otherwise updated SRM based on the speaker dependent SRM, wherein the additional SRM is adaptable by one or more user terminals. In some examples, the additional SRM is constructed based on the speaker dependent SRM and comprises the updates to or refinements of the SRM from the user terminal, as well as from one or more other user terminals.
As such, the speech personalization administrator 28 may access an existing SRM on memory 26, and modify, update or otherwise refine the SRM with the speaker dependent SRM, or a portion of the speaker dependent SRM, accordingly. Additionally or alternatively, a new SRM may be generated using the speaker dependent SRM. The additional SRM may be, or otherwise include, an HMM, DTW, NN, or FST, for example. The additional SRM may be adaptable by one or more user terminals, such as described with respect to operations 200 and 210 above.
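Operation 320 might be sketched, for toy count-table models only, as follows. Real SRMs would merge statistical model parameters (e.g., HMM parameters) rather than pronunciation counts, and generate_additional_srm is a hypothetical name.

```python
from collections import defaultdict

def generate_additional_srm(base_srm, speaker_dependent_updates):
    """Fold speaker dependent SRM portions received from one or more user
    terminals into an additional SRM. Models here are toy tables mapping a
    word to counts of observed pronunciations."""
    merged = defaultdict(lambda: defaultdict(int))
    for word, counts in base_srm.items():
        for pronunciation, n in counts.items():
            merged[word][pronunciation] += n
    for update in speaker_dependent_updates:
        for word, counts in update.items():
            for pronunciation, n in counts.items():
                merged[word][pronunciation] += n
    return {word: dict(counts) for word, counts in merged.items()}

if __name__ == "__main__":
    base = {"brew": {"b r uw": 10}}
    updates = [{"brew": {"b r uw": 3}}, {"grind": {"g r ay n d": 2}}]
    print(generate_additional_srm(base, updates))
```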
As shown by operation 330, the personalized speech recognition apparatus 102 may include means, such as the processor 20, communication interface 24, or the like, for causing transmission of the additional SRM to an additional device. The transmission may be initiated and completed by use of operations similar to those described with respect to operation 300, but the SRM may this time be transmitted to a different terminal, such as user terminal 110B, for example. Advantageously, for example, the additional SRM may be shared between one or more user terminals, devices and/or the like.
In an example embodiment, the personalized speech recognition apparatus 102 may select the additional SRM to transmit to the user terminal 110B based on a variety of factors, such as terminal dependent data and/or user identification, for example. An association of the individual user (or group of users) with the speaker dependent SRM may allow the personalized speech recognition apparatus 102 to advantageously provide the SRM on demand to various devices belonging to a user.
For example, a user terminal 110A embodied as a personal computer or laptop capable of producing text from speech input, such as dictated reports or emails, may rely on an extensive SRM representing an entire natural language. A speaker dependent SRM generated and/or refined on the user terminal 110A may be available on personalized speech recognition apparatus 102 for distribution to one or more other user terminals. For example, if the same user of user terminal 110A purchases a new user terminal 110B, it may be advantageous to provide portions of the speaker dependent SRM to the user terminal 110B. Presume, for example, that the user terminal 110B is a coffee maker. A coffee maker may not require the broad vocabulary required by the personal computer or laptop embodiment of user terminal 110A, but only portions of the speaker dependent SRM including language relating to functions of the coffee maker (e.g., grind, brew), or language relating to measurements and/or timing. As such, upon detecting that user terminal 110B is embodied as a coffee maker, for example, the personalized speech recognition apparatus 102 may advantageously select an SRM or a portion of an SRM (as generated in operation 320) for use by the coffee maker, potentially minimizing the bandwidth required for transmitting, and the memory required for storing (on user terminal 110B), an otherwise extensive SRM.
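A sketch of such device-based portion selection follows, assuming a simple vocabulary-profile lookup; the profiles, device-type keys, and function name are hypothetical.

```python
# Hypothetical vocabulary profiles per device type; a coffee maker needs only
# language relating to its functions, measurements, and timing, while a
# laptop (None) receives the full model.
DEVICE_VOCABULARIES = {
    "coffee_maker": {"grind", "brew", "cup", "cups", "minutes", "strong"},
    "laptop": None,
}

def select_srm_portion(full_srm, device_type):
    """Return only the portion of the SRM relevant to the detected device,
    reducing transmission bandwidth and on-device storage."""
    vocabulary = DEVICE_VOCABULARIES.get(device_type)
    if vocabulary is None:
        return full_srm
    return {word: model for word, model in full_srm.items() if word in vocabulary}

if __name__ == "__main__":
    full = {"brew": "model-a", "dictate": "model-b", "cup": "model-c"}
    print(select_srm_portion(full, "coffee_maker"))  # {'brew': ..., 'cup': ...}
```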
Having received an SRM from the personalized speech recognition apparatus 102, the user terminal 110B may utilize the SRM in processing of speech input, thereby offering its users personalized speech recognition or, in other examples, relieving its users of the need to retrain an SRM. That is, the speech input processing may be improved by use of the SRM, which may include a portion or portions of the speaker dependent SRM generated or refined on user terminal 110A. As such, the user of user terminal 110B may provide speech input to user terminal 110B, and experience reduced or minimized error rates in speech input processing and/or execution of associated voice commands.
It will be appreciated that, although Figure 1 illustrates an embodiment utilizing user terminals 110A and 110B, and a personalized speech recognition apparatus 102, many other configurations exist. Indeed, a personalized speech recognition apparatus 102 may be locally installed on a device such as user terminal 110A and/or 110B and configured to run independently, where data may not necessarily be shared across devices or a server.
In some embodiments, a user terminal such as user terminal 110A and/or 110B may provide a speaker dependent SRM to a personalized speech recognition apparatus 102, may receive an SRM from a personalized speech recognition apparatus 102, or may both provide and receive the same. In such example embodiments, the personalized speech recognition apparatus 102 may be implemented in the cloud, and data may be transmitted between user terminal(s) and server(s) over network 100. In a particularly advantageous embodiment, various user terminals may routinely receive updated SRMs, thereby continually improving speech input processing by utilizing speaker dependent SRMs generated and/or refined on other user terminals.
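Such routine updating might be sketched as a simple polling loop; the interval and function names are assumptions, and a real deployment could equally use push notifications rather than polling.

```python
import time

def sync_loop(fetch_updated_srm, apply_srm, interval_s=3600):
    """Periodically pull the latest additional SRM from the cloud-hosted
    apparatus and apply it locally, so a terminal keeps benefiting from
    refinements made on other terminals. Callers supply the transport
    (fetch_updated_srm) and the local installation step (apply_srm)."""
    while True:
        model = fetch_updated_srm()  # e.g., an HTTPS GET over network 100
        if model is not None:
            apply_srm(model)
        time.sleep(interval_s)
```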
Additionally or alternatively, in some embodiments, a device such as user terminal 110A and/or 110B may be shipped with SRMs preinstalled. In some embodiments, the SRM may be localized to the area or country in which the user terminal is distributed. That is, the SRM may be based on a speaker dependent SRM associated with a dialect or geographic area.
As described above, Figures 2 and 3 are flowcharts illustrating operations performed by a user terminal 110A, user terminal 110B and/or the like, and by a personalized speech recognition apparatus 102, respectively. It will be understood that each block of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described above may be embodied by computer program instructions. In this regard, the computer program instructions which embody the procedures described above may be stored by a memory device 26 or 126 employing an embodiment of the present invention and executed by a processor 20 or 120. As will be appreciated, any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowcharts' blocks. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture, the execution of which implements the function specified in the flowcharts' blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowcharts' blocks.
Accordingly, blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or by combinations of special purpose hardware and computer instructions.
In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included as indicated by the blocks shown with a dashed outline in Figures 2 and 3. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

THAT WHICH IS CLAIMED
1. A method comprising:
receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely and is adaptable by one or more user terminals to process input speech;
accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and adapting the speech recognition model based on terminal dependent data.
2. A method according to claim 1, further comprising:
processing received speech input using the speech recognition model; and generating a textual output.
3. A method according to claim 1 or 2, further comprising:
receiving a speech input; and
refining a speaker dependent speech recognition model based on the speech input.
4. A method according to claim 3, further comprising:
verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
5. A method according to claim 3 or 4, further comprising:
causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
6. The method according to claim 1, 2, 3, or 4, wherein the terminal dependent data
comprises microphone information.
7. The method according to claim 1, 2, 3, 4 or 5, wherein the terminal dependent data
comprises a context.
8. The method according to claim 1, 2, 3, 4, 5, or 6, wherein the received at least one
portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
9. A method according to claim 1, 2, 3, 4, 5, 6 or 7, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
10. A method comprising:
receiving at least one portion of a speaker dependent speech recognition model from a user terminal; and
generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
11. A method according to claim 10, further comprising:
causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
12. A method according to claim 10 or 11, wherein generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
13. A method according to claim 10, 11 or 12, wherein the at least one additional portion of the speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
14. An apparatus comprising at least one processor and at least one memory including
computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least:
receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech;
access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and
adapt the speech recognition model based on terminal dependent data.
15. An apparatus according to claim 14, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
process received speech input using the speech recognition model; and generate a textual output.
16. An apparatus according to claim 14 or 15, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
receive a speech input; and
refine a speaker dependent speech recognition model based on the speech input.
17. An apparatus according to claim 16, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
verify or correct a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
18. An apparatus according to claim 16 or 17, wherein the at least one memory and the
computer program code are further configured to, with the processor, cause the apparatus to at least:
cause transmission of at least a portion of the speaker dependent speech recognition model to a remote storage location.
19. An apparatus according to claim 14, 15, 16, 17 or 18, wherein the terminal dependent data comprises microphone information.
20. An apparatus according to claim 14, 15, 16, 17, 18 or 19, wherein the terminal dependent data comprises a context.
21. An apparatus according to claim 14, 15, 16, 17 or 18, wherein the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
22. An apparatus according to claim 14, 15, 16, 17, 18, 19, 20 or 21, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
23. An apparatus comprising at least one processor and at least one memory including
computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least:
receive at least one portion of a speaker dependent speech recognition model from a user terminal; and
generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
24. An apparatus according to claim 23, wherein the at least one memory and the computer program code are further configured to, with the processor, cause the apparatus to at least:
cause transmission of the at least one additional portion of a speech recognition model to an additional user terminal.
25. An apparatus according to claim 23 or 24, wherein generating the at least one additional portion of a speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
26. An apparatus according to claim 23, 24 or 25, wherein the at least one additional portion of the speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
27. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to:
receive at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech;
access a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and
adapt the speech recognition model based on terminal dependent data.
28. A computer program product according to claim 27, wherein the computer-executable program code instructions further comprise program code instructions to:
process received speech input using the speech recognition model; and generate a textual output.
29. A computer program product according to claim 27 or 28, wherein the computer- executable program code instructions further comprise program code instructions to: receive a speech input; and
refine a speaker dependent speech recognition model based on the speech input.
30. A computer program product according to claim 29, wherein the computer-executable program code instructions further comprise program code instructions to:
verify or correct a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
31. A computer program product according to claim 29 or 30, wherein the computer- executable program code instructions further comprise program code instructions to: cause transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
32. A computer program product according to claim 27, 28, 29, 30 or 31, wherein the
terminal dependent data comprises microphone information.
33. A computer program product according to claim 27, 28, 29, 30, 31 or 32, wherein the terminal dependent data comprises a context.
34. A computer program product according to claim 27, 28, 29, 30, 31, 32 or 33, wherein the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
35. A computer program product according to claim 27, 28, 29, 30, 31, 32, 33 or 34, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
36. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions to:
receive at least one portion of a speaker dependent speech recognition model from a user terminal; and
generate at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of a speech recognition model is adaptable by one or more user terminals.
37. A computer program product according to claim 36, wherein the computer-executable program code instructions further comprise program code instructions to: cause transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
38. A computer program product according to claim 36 or 37, wherein generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
39. A computer program product according to claim 36, 37 or 38, wherein the at least one additional portion of the speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
40. An apparatus comprising means for:
receiving at least one portion of a speech recognition model, wherein the at least one portion of the speech recognition model is stored remotely, and is adaptable by one or more user terminals to process input speech;
accessing a speech recognition model, wherein the speech recognition model is based on at least the received at least one portion of the speech recognition model; and adapting the speech recognition model based on terminal dependent data.
41. An apparatus according to claim 40, further comprising means for:
processing received speech input using the speech recognition model; and generating a textual output.
42. An apparatus according to claim 40 or 41, further comprising means for:
receiving a speech input; and
refining a speaker dependent speech recognition model based on the speech input.
43. An apparatus according to claim 42, further comprising means for:
verifying or correcting a processing of the speech input, wherein refining the speaker dependent speech recognition model is further based on the verification or correction.
44. An apparatus according to claim 42 or 43, further comprising means for:
causing transmission of at least one portion of the speaker dependent speech recognition model to a remote storage location.
45. An apparatus according to claim 40, 41, 42, 43 or 44, wherein the terminal dependent data comprises microphone information.
46. An apparatus according to claim 40, 41, 42, 43, 44 or 45, wherein the terminal dependent data comprises a context.
47. An apparatus according to claim 40, 41, 42, 43, 44, 45 or 46, wherein the received at least one portion of a speech recognition model is received based on one of at least an individual user, group of users, geographic location, or a dialect.
48. An apparatus according to claim 40, 41, 42, 43, 44, 45, 46 or 47, wherein the received at least one portion of a speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
49. An apparatus comprising means for:
receiving at least one portion of a speaker dependent speech recognition model from a user terminal; and
generating at least one additional portion of a speech recognition model based on the received at least one portion of a speaker dependent speech recognition model, wherein the at least one additional portion of the speech recognition model is adaptable by one or more user terminals.
50. An apparatus according to claim 49, further comprising means for:
causing transmission of the at least one additional portion of the speech recognition model to an additional user terminal.
51. An apparatus according to claim 49 or 50, wherein generating the at least one additional portion of the speech recognition model is further based on one of at least an individual user, group of users, geographic location, or a dialect.
52. An apparatus according to claim 49, 50 or 51, wherein the at least one additional portion of the additional speech recognition model is based on at least one of a Hidden Markov Model, dynamic time warping model, neural network, or finite state transducer.
Publication data: WO2014096506A1, "Method, apparatus, and computer program product for personalizing speech recognition," filed 2012-12-21 as international application PCT/FI2012/051285 and published 2014-06-26.
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11893992B2 (en) 2018-09-28 2024-02-06 Apple Inc. Multi-modal inputs for voice commands
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11783815B2 (en) 2019-03-18 2023-10-10 Apple Inc. Multimodality in digital assistant systems
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11675491B2 (en) 2019-05-06 2023-06-13 Apple Inc. User configurable task triggers
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11790914B2 (en) 2019-06-01 2023-10-17 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
CN110765105A (en) * 2019-10-14 2020-02-07 Gree Electric Appliances, Inc. of Zhuhai Method, device, equipment and medium for establishing wake-up instruction database
US11914848B2 (en) 2020-05-11 2024-02-27 Apple Inc. Providing relevant data items based on context
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11508380B2 (en) 2020-05-26 2022-11-22 Apple Inc. Personalized voices for text messaging
US20210375290A1 (en) * 2020-05-26 2021-12-02 Apple Inc. Personalized voices for text messaging
US11838734B2 (en) 2020-07-20 2023-12-05 Apple Inc. Multi-device audio adjustment coordination
US11750962B2 (en) 2020-07-21 2023-09-05 Apple Inc. User identification using headphones
US11696060B2 (en) 2020-07-21 2023-07-04 Apple Inc. User identification using headphones

Similar Documents

Publication Publication Date Title
WO2014096506A1 (en) Method, apparatus, and computer program product for personalizing speech recognition
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
KR102100389B1 (en) Personalized entity pronunciation learning
EP3195310B1 (en) Keyword detection using speaker-independent keyword models for user-designated keywords
US20210264916A1 (en) Electronic device for generating personalized asr model and method for operating same
US8738375B2 (en) System and method for optimizing speech recognition and natural language parameters with user feedback
CN109710727B (en) System and method for natural language processing
US20180358019A1 (en) Dual mode speech recognition
US10705789B2 (en) Dynamic volume adjustment for virtual assistants
JP7171532B2 (en) Apparatus and method for recognizing speech, apparatus and method for training speech recognition model
CN112970059B (en) Electronic device for processing user utterance and control method thereof
CN112470217A (en) Method for determining electronic device to perform speech recognition and electronic device
WO2019213443A1 (en) Audio analytics for natural language processing
US9653073B2 (en) Voice input correction
US10170122B2 (en) Speech recognition method, electronic device and speech recognition system
KR20160028468A (en) Multi-level speech recognition
TWI682385B (en) Speech service control apparatus and method thereof
CN107544271A (en) Terminal control method, device and computer-readable recording medium
US10535337B2 (en) Method for correcting false recognition contained in recognition result of speech of user
AU2019201441B2 (en) Electronic device for processing user voice input
CN111640429B (en) Method for providing voice recognition service and electronic device for the same
CN112334978A (en) Electronic device supporting personalized device connection and method thereof
CN110942779A (en) Noise processing method, device and system
CN111261151A (en) Voice processing method and device, electronic equipment and storage medium
CN114223029A (en) Server supporting device to perform voice recognition and operation method of server

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 12890352
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 12890352
    Country of ref document: EP
    Kind code of ref document: A1