CN117275453A - Method for presenting text messages using a user-specific speech model - Google Patents


Info

Publication number
CN117275453A
Authority
CN
China
Prior art keywords
user
receiving
text message
electronic device
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310763655.3A
Other languages
Chinese (zh)
Inventor
杰瑞德·齐默尔曼 (Jared Zimmerman)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meta Platforms Technologies LLC
Original Assignee
Meta Platforms Technologies LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meta Platforms Technologies LLC filed Critical Meta Platforms Technologies LLC
Publication of CN117275453A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72436 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/72442 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality for playing music files

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A method of selectively audibly presenting a text message using user-specific synthesized speech includes: receiving, at a receiving electronic device associated with a receiving user, a first text message sent by a first sending user of a first sending electronic device. The method further includes: after receiving a request to audibly present the first text message, and in accordance with a determination that the receiving user is authorized to access user-specific synthesized speech representing the first sending user, causing, by the receiving electronic device, audible presentation of the first text message using that user-specific synthesized speech. The method further includes: receiving, at the receiving electronic device, a second text message sent by a second sending user. The method further includes: after receiving a request to audibly present the second text message at the receiving electronic device, causing audible presentation of the second text message using default speech.

Description

Method for presenting text messages using a user-specific speech model
Cross Reference to Related Applications
The present application claims the benefit of U.S. Provisional Patent Application No. 63/354,648, filed June 22, 2022, and U.S. Provisional Patent Application No. 63/407,544, filed September 16, 2022, each of which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates generally to user-specific speech models and passive data collection. In particular, the present disclosure relates to the use of user-specific speech models to present text messages, and to passive data collection for the purpose of training speech models.
Background
Over the past decade or so, text-based communication has overtaken voice-based communication to become the most common form of communication. People increasingly reach their friends and family through text messages, instant messages, and email rather than through face-to-face conversations or even telephone calls.
Although text-based communication tends to be more convenient than voice-based communication, it can feel less personal, which may lead to misunderstandings or misinterpretations. Thus, there is a need for a form of communication that preserves the convenience of text-based communication while also incorporating the more personal, resonant qualities of voice-based communication.
Relatedly, current systems for collecting user-specific voice data and video data present a number of difficulties. For example, to create a user-specific speech model, a user may need to speak into a microphone (e.g., while sitting in front of a computer) for an extended period of time to generate the data needed to train the model. Current techniques require the user to record at least ten minutes of speech data to create an accurate speech model from scratch. In some of these techniques, the amount of speech data required can be reduced by starting from a pre-trained speech model (e.g., a model trained on speech similar to the user's). In any event, the recording process can be tedious, not to mention time consuming. Thus, there is a need for a less demanding, more passive way of collecting voice data and video data.
Disclosure of Invention
The wrist-wearable devices, head-wearable devices, other types of electronic devices, and methods of using these devices (and systems including both wrist-wearable and head-wearable devices, as well as other types of electronic devices) described herein address one or more of the above drawbacks by allowing the passive creation of user-specific synthesized speech, and by authorizing other users to use that user-specific synthesized speech (e.g., to read aloud text messages received while a user is driving, where the voice reading each message represents the respective sending user). Two specific aspects are briefly summarized below.
The first method involves selectively audibly presenting a text message using user-specific synthesized speech. The first method allows a user to listen to a received text message in a voice that sounds like the sending user's voice. As described in more detail below, this approach recovers the personal quality that is typically lost in text-based communication media: it allows the receiving user to feel as if the sending user were reading the message to them. Examples of the first method are provided below in the clauses beginning with A.
(A1) In some embodiments, an example first method of selectively audibly presenting a text message using user-specific synthesized speech includes: receiving, at a receiving electronic device associated with a receiving user, a first text message sent by a first sending user of a first sending electronic device. The method further includes: after receiving a request to audibly present the first text message at the receiving electronic device, and in accordance with a determination that the receiving user is authorized to access user-specific synthesized speech representing the first sending user, causing, by the receiving electronic device, audible presentation of the first text message using the user-specific synthesized speech representing the first sending user. The method further includes: receiving, at the receiving electronic device, a second text message sent by a second sending user of a second sending electronic device. The method further includes: after receiving a request to audibly present the second text message at the receiving electronic device, causing, by the receiving electronic device, audible presentation of the second text message using default speech. (An illustrative sketch of this selection flow appears after the clauses below.)
(A2) In some embodiments of the method of A1, receiving the first text message further comprises: receiving, from the first sending electronic device, an indication that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user. Performing the method of A2 helps provide further guidance to the user about why (and whether) the user-specific synthesized speech will be provided (e.g., the receiving device may later use the indication to provide additional notifications or other information to the user). This facilitates an improved human-machine interface and sustained interaction by ensuring that the user receives additional guidance about accessing and using the user-specific synthesized speech; otherwise, the additional functionality might go unused or remain hidden in user interfaces that are difficult and cumbersome to locate and use (which can also make the device less usable due to faster battery drain).
(A3) In some embodiments of the method of any of A1 and A2, the request to audibly present the first text message by the receiving electronic device comprises: a request to audibly present a first text message using a user-specific synthesized voice representing a first sending user.
(A4) In some embodiments of the method of any of A1-A3, causing audible presentation of the first text message using the user-specific synthesized speech comprises: identifying an explicit feature of the first text message, and changing the prosody of the audible presentation based on the explicit feature of the first text message.
(A5) In some embodiments of the method of any of A1-A4, causing audible presentation of the first text message using the user-specific synthesized speech comprises: identifying an implicit characteristic of the first text message, and changing the prosody of the audible presentation based on the implicit characteristic of the first text message.
(A6) In some embodiments of the method of any of A1-A5, causing audible presentation of the first text message using the user-specific synthesized speech comprises: identifying features of additional text messages in a conversation that includes the first text message. Causing the audible presentation further includes: changing the prosody of the audible presentation based on the features of the additional text messages. Altering the prosody based on features of the additional messages allows the content of the text message to be conveyed more realistically and with more personality (which in turn encourages continued interaction and more engaging, fluent conversations with other users).
(A7) In some embodiments of the method of any of A1-A6, the user-specific synthesized speech is synthesized by the receiving electronic device.
(A8) In some embodiments of the method of any of A1-A6, the user-specific synthesized speech is synthesized by the sending electronic device.
(A9) In some embodiments of the method of any of A1-A6, the user-specific synthesized speech is synthesized by a device other than the receiving electronic device and the sending electronic device.
(A10) In some embodiments of the method of any of A7-A9, the user-specific synthesized speech is synthesized using a speech model that is encrypted such that the speech model cannot be used for purposes other than audibly presenting the first text message.
(A11) In some embodiments of the method of any of A1 to A10, determining that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user comprises: authenticating the identity of the receiving user.
(A12) In some embodiments of the method of any of A1 to A11, the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user is based on a strength of a relationship between the receiving user and the first sending user.
(A13) In some embodiments of the method of any of A1 to A12, determining that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user comprises: determining that the first sending user has granted the receiving user permission to access the user-specific synthesized speech.
(A14) In some embodiments of the method of any of A1 to A13, causing audible presentation of the second text message using default speech comprises: forgoing a determination of whether the receiving user is authorized to access user-specific synthesized speech representing the second sending user.
(A15) In some embodiments of the method of A14, the second text message comprises a text message sent on behalf of a non-human entity.
(A16) In some embodiments of the method of any of A14 and A15, the receiving user has requested that the text message be audibly presented using default speech.
(A17) In another aspect, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium includes instructions that, when executed by an electronic device, cause the electronic device to perform or cause performance of the method of any one of A1 to A16.
(A18) In yet another aspect, an electronic device is provided. In some embodiments, the electronic device comprises means for performing or causing performance of the method of any one of A1 to A16. In some embodiments, the electronic device comprises one or more processors and instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any one of A1 to A16.
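To make the selection flow of clause (A1) concrete, the following is a minimal illustrative sketch in Python. It is not the claimed implementation: the names (Message, is_authorized, synthesize_user_voice, synthesize_default_voice) and the permission store are hypothetical placeholders, and the synthesis functions stand in for whatever text-to-speech engine a device actually uses.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender_id: str  # identifier of the sending user
    text: str       # body of the text message

# Assumed permission store: (sender, receiver) pairs for which the sender has
# granted access to their user-specific synthesized speech (cf. A13).
GRANTED = {("lexie", "user_0302")}

def is_authorized(receiver_id: str, sender_id: str) -> bool:
    """Hypothetical authorization check corresponding to clauses A11-A13."""
    return (sender_id, receiver_id) in GRANTED

def synthesize_user_voice(sender_id: str, text: str) -> str:
    # Stand-in for synthesis with a user-specific speech model (see FIG. 6).
    return f"[voice of {sender_id}] {text}"

def synthesize_default_voice(text: str) -> str:
    # Stand-in for synthesis with default speech.
    return f"[default speech] {text}"

def present_audibly(receiver_id: str, message: Message) -> str:
    """Select the voice used to read a message aloud, per clause (A1)."""
    if is_authorized(receiver_id, message.sender_id):
        return synthesize_user_voice(message.sender_id, message.text)
    return synthesize_default_voice(message.text)

inbox = [
    Message("lexie", "How about dinner together tomorrow evening?"),
    Message("parcel_bot", "Your package has shipped."),  # non-human sender (cf. A15)
]
for m in inbox:
    print(present_audibly("user_0302", m))
```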
A second method is also provided that allows for the creation of user-specific synthesized speech. This second method passively collects training data, allowing everyday users to generate the large amount of data needed to, for example, create a speech model (e.g., a personalized speech model that is associated with their voice and that can mimic the way they vary their prosody). To respect and protect the user's privacy, the passive voice-collection method determines whether the user has opted in to enable the functionality. Examples of the second method are provided below in the clauses beginning with B.
(B1) In some embodiments, a method of passively collecting training data includes: receiving an opt-in instruction, the opt-in instruction indicating that the user consents to the collection, via one or more electronic devices, of training data associated with the user. The method further includes: after receiving the opt-in instruction, and in accordance with a determination, based on data collected by the one or more electronic devices, that the user is speaking, providing audio data that includes the user's speech as a training input for a speech model specific to the user. (An illustrative sketch of this flow appears after the clauses below.)
(B2) In some embodiments of the method of B1, receiving the opt-in instruction from the user comprises: presenting the user with an option to opt in to the collection of training data, or determining that the user has opted in to the collection of training data.
(B3) In some embodiments of the method of any of B1-B2, the data collected by the one or more electronic devices comprises video data.
(B4) In some embodiments of the method of B3, the video data is acquired using one or more cameras of the electronic device.
(B5) In some embodiments of the method of any of B1-B4, the data collected by the one or more electronic devices comprises audio data.
(B6) In some embodiments of the method of B5, the audio data is collected using microphones of one or more electronic devices.
(B7) In some embodiments of the method of any of B3 to B6, data collected by one or more electronic devices is processed to allow detection of a feature of a user while the user is speaking.
(B8) In some embodiments of the method of B7, the features of the user include prosodic features associated with the user while the user is speaking, and the prosodic features are provided as additional training inputs to the user-specific speech model.
(B9) In some embodiments of the method of B7, the characteristics of the user include a facial expression associated with the user while the user is speaking, and the method further comprises: the facial expression of the user is provided as an additional training input to the user-specific speech model.
(B10) In some embodiments of the method of any of B3-B9, determining that the user is speaking comprises at least one of: comparing the audio data with pre-authenticated audio data associated with the user's identity, and comparing the video data with pre-authenticated video data associated with the user's identity.
(B11) In some embodiments of the method of any one of B1 to B10, the method further comprises: after receiving the opt-in instruction, and in accordance with a determination, based on data collected by the one or more electronic devices, that the user is speaking, providing video data that includes the user speaking as a training input to the user-specific speech model.
(B12) In some embodiments of the method of any of B1 to B11, providing audio data comprises: first audio data is collected from a first electronic device of the one or more electronic devices. Furthermore, providing the audio data includes: second audio data is collected from a second electronic device of the one or more electronic devices. In addition, providing audio data includes: the first audio data and the second audio data are provided as at least one respective training input for a user-specific speech model.
(B13) In some embodiments of the method of any of B1-B12, the user-specific speech model is created locally on one or more electronic devices.
(B14) In some embodiments of the method of any of B1-B13, the user-specific speech model is created on a device on the same local network as the one or more electronic devices.
(B15) In some embodiments of the method of any of B1-B14, the user-specific speech model is not transmitted to or created at a server on the network that is remote from the one or more electronic devices.
(B16) In some embodiments of the method of any of B1-B15, the user-specific speech model is used to generate user-specific synthesized speech.
(B17) In some embodiments of the method of any one of B1 to B16, the method further comprises: one or more non-human sound samples are provided as training inputs to a speech model specific to the user.
(B18) In another aspect, a non-transitory computer readable storage medium is provided. The non-transitory computer-readable storage medium includes instructions that, when executed by an electronic device, cause the electronic device to perform or cause to perform the method of any one of B1 to B11.
(B19) In yet another aspect, an electronic device is provided. In some embodiments, the electronic device comprises means for performing or causing performance of the method of any one of B1 to B17. In some embodiments, the electronic device comprises one or more processors and instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any one of B1 to B17.
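The passive-collection flow of clauses (B1), (B10), and (B11) can be sketched as follows. This is an illustrative sketch only: UserSpeechModel, is_user_speaking, and the capture format are assumptions, and a real implementation would compare captured audio and video against pre-authenticated data rather than the trivial check shown here.

```python
class UserSpeechModel:
    """Hypothetical container that accumulates training inputs (cf. B1, B11)."""
    def __init__(self) -> None:
        self.training_inputs = []

    def add_training_input(self, **sample) -> None:
        self.training_inputs.append(sample)

def is_user_speaking(audio_frame, video_frame) -> bool:
    # Stand-in for comparing captured data against pre-authenticated audio/video
    # associated with the user's identity (cf. B10).
    return bool(audio_frame)

def collect_passively(opted_in, audio_frame, video_frame, model):
    if not opted_in:
        return  # no opt-in instruction received: collect nothing (cf. B1)
    if is_user_speaking(audio_frame, video_frame):
        model.add_training_input(audio=audio_frame)      # audio training input (B1)
        if video_frame is not None:
            model.add_training_input(video=video_frame)  # video training input (B11)

model = UserSpeechModel()
collect_passively(opted_in=True, audio_frame=b"\x00\x01", video_frame=None, model=model)
print(len(model.training_inputs))  # 1
```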
Additional examples are explained in more detail below.
Drawings
A more particular description may be had by reference to the features of various embodiments, some of which are illustrated in the appended drawings. The accompanying drawings illustrate pertinent example features of the present disclosure. The description may admit to other useful features as well, as one of skill in the art will appreciate upon reading this disclosure.
FIGS. 1A and 1B illustrate example setup screens that may be used to configure access and use of user-specific synthesized speech for a messaging application, according to some embodiments.
Fig. 2A and 2B illustrate examples of a text message being read aloud in a hands-free environment using user-specific synthesized speech, according to some embodiments.
Fig. 3A-3D illustrate a wrist-wearable device (e.g., a smart watch) reading text messages to a user using user-specific synthesized speech or default speech, in accordance with some embodiments.
FIG. 4 illustrates a user interface of a messaging application, in accordance with some embodiments, and includes a schematic representation showing how a message may be read aloud using user-specific synthesized speech.
FIG. 5 illustrates an example communication network that can be used to facilitate access to a speech model for creating various user-specific synthesized speech, in accordance with some embodiments.
FIG. 6 illustrates a high-level architectural diagram of how input data is provided to create a user-specific synthesized speech model that may then be used to generate speech segments, in accordance with some embodiments.
Fig. 7 illustrates a method (on the right) of reading a text message aloud using user-specific synthesized speech, alongside user interface elements (on the left) that are caused to be presented when an electronic device (e.g., a wrist-wearable device) performs the described method, in accordance with some embodiments.
Fig. 8 is a detailed flow diagram of a method of audibly presenting a text message selectively using user-specific synthesized speech according to some embodiments.
FIG. 9 illustrates an example information overlay that may be presented to provide information to a user regarding passive speech collection, according to some embodiments.
FIG. 10 illustrates a user engaged in a video call through a computer, according to some embodiments.
FIG. 11 is a detailed flow diagram of a method of passively collecting training data for creating a user-specific speech model according to some embodiments.
FIG. 12 illustrates an example communication network that may be used to collect data (e.g., audio data, video data) for training a speech model, in accordance with some embodiments.
In accordance with common practice, like reference numerals refer to like features throughout the specification and drawings.
Detailed Description
Numerous details are described herein to provide a thorough understanding of the example embodiments shown in the drawings. However, some embodiments may be practiced without many of these specific details, and the scope of the claims is limited only by those features and aspects specifically recited in the claims. In addition, well-known processes, components, and materials have not been described in exhaustive detail so as to avoid obscuring pertinent aspects of the embodiments described herein.
FIGS. 1A and 1B illustrate example setup screens that may be used to configure access to and use of user-specific synthesized speech for a messaging application, according to some embodiments. As shown in fig. 1A, in some embodiments, a video conferencing device 0104 (e.g., Facebook Portal) displays a setup screen 0106 to a user 0102. The setup screen 0106 may include an option 0110 (e.g., a slider, a button, etc.) for enabling the personalized speech synthesis functionality. In addition, the setup screen 0106 may include a link 0108 that points to more information about the personalized speech synthesis functionality. As shown in fig. 1A, the link may include text such as "More information". Alternatively, the link may include text such as "Learn more", "What is personalized speech synthesis?", or "Help".
For example, if user 0102 selects link 0108, device 0104 can display an information overlay 0112, as shown in FIG. 1B. The information overlay 0112 may include a description 0114 of the personalized speech synthesis functionality. The information overlay 0112 may also include a notice 0116 identifying the types of data (e.g., audio data, text message data, contacts, etc.) that are collected to enable the personalized speech synthesis functionality. Further, the information overlay 0112 may include options 0118 (e.g., a slider, buttons, etc.) for enabling the personalized speech synthesis functionality. In some embodiments, the option 0118 displayed within the information overlay 0112 is a single option (e.g., an "Enable" button). In other embodiments, options 0118 include finer-grained controls that let user 0102 selectively enable certain aspects of the personalized speech synthesis functionality. For example, options 0118 may allow user 0102 to enable the personalized speech synthesis functionality for messages received from various other users. As another example, options 0118 may allow user 0102 to enable the personalized speech synthesis functionality for messages sent to other users. Finally, the information overlay 0112 may include an option 0120 (e.g., an "X" button) for closing the overlay.
In addition to displaying the information overlay 0112 after the user selects link 0108, the device may also display the information overlay 0112 after the user selects option 0110 to enable the personalized speech synthesis functionality. For example, if user 0102 has never previously enabled the personalized speech synthesis functionality, device 0104 can display the information overlay 0112 to inform user 0102 what the functionality entails, which allows the user to make an informed decision about whether to use it or decline it. In some embodiments, the functionality is disabled by default, and the user must affirmatively opt in before it is used. Similarly, if user 0102 has not enabled the personalized speech synthesis functionality for a long period of time (e.g., six weeks, two months, one year, etc.), device 0104 can display the information overlay 0112 to remind user 0102 what the functionality entails (and, in some embodiments, a renewed opt-in can be requested to confirm that the user wants to continue using the functionality). Alternatively, device 0104 may display the information overlay 0112 each time the user selects option 0110 to enable the personalized speech synthesis functionality.
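A minimal sketch of the opt-in logic described above (disabled by default, an explanatory overlay on first enablement, and a renewed prompt after a long period of disuse) might look like the following; the six-week threshold and the function name are assumptions chosen for illustration, not values taken from this disclosure.

```python
from datetime import datetime, timedelta

REPROMPT_AFTER = timedelta(weeks=6)  # assumed re-confirmation interval

def should_show_info_overlay(ever_enabled: bool, last_enabled_at, now: datetime) -> bool:
    """Decide whether to show information overlay 0112 before enabling the feature."""
    if not ever_enabled or last_enabled_at is None:
        return True  # feature is off by default; explain it on first enablement
    return now - last_enabled_at > REPROMPT_AFTER  # long unused: remind and re-confirm
```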
Fig. 2A and 2B illustrate examples of a text message being read aloud using user-specific synthesized speech in a hands-free environment, according to some embodiments. In some embodiments, an interface 0204 of a vehicle (e.g., a car, a boat, etc.) notifies a user 0202 of a message (e.g., an unread message, an unanswered message, etc.). For example, as shown in FIG. 2A, the notification says "You have a message from Lexie."
Interface 0204 may notify the user of the message by presenting audio via a speaker (e.g., a speaker of the vehicle, headphones of user 0202, etc.). Alternatively, interface 0204 may notify user 0202 via a display (e.g., a screen of interface 0204). Interface 0204 may notify user 0202 in response to a prompt from user 0202; for example, user 0202 might ask, "Do I have any unread messages?" Alternatively, interface 0204 may notify user 0202 without a prompt from user 0202 (e.g., periodically, such as every two hours, or after each message is received).
In some embodiments, notification 0208 includes a prompt that is used to determine whether interface 0204 should read (e.g., audibly present) the text of the message to user 0202. For example, as shown in fig. 2A, the notification asks "Do you want me to read [the message]?" If reply 0206 from user 0202 indicates that user 0202 wants the message read back to them, interface 0204 can audibly present the message to user 0202.
In some embodiments, interface 0204 reads the message to the user using default speech (e.g., a generic male voice, a generic female voice, a generic gender-neutral voice, etc.). This is shown by the default reading 0210 of FIG. 2A. As with notification 0208, default reading 0210 is shown using a computer-like font to visually indicate that the reading is performed using default speech. (Note that reply 0206 of user 0202 uses a different font than notification 0208 and default reading 0210, which visually indicates that it is spoken in a different voice.) The default reading 0210 also includes an image of a robot to provide another visual indication that the reading is performed using default speech.
In other embodiments, interface 0204 reads the message to the user using personalized speech (e.g., speech that sounds like the sending user's voice, which may be referred to as user-specific synthesized speech). This is shown in the personalized reading 0212 of FIG. 2B. Unlike notification 0208 and default reading 0210 of fig. 2A, personalized reading 0212 is depicted using an expressive font to indicate that the reading is performed using personalized speech (e.g., the personalized speech may be generated using the architecture shown in fig. 6 and one or more of the devices shown in fig. 5). Personalized reading 0212 also includes an image of the sender to further visually indicate that the reading is performed using personalized speech.
Fig. 3A-3D illustrate a wrist-wearable device 0304 (e.g., a smart watch) reading text messages 0308, 0315, and 0320 aloud to user 0302 using user-specific synthesized speech or default speech, in accordance with some embodiments. As shown in fig. 3A, in some embodiments, user 0302 requests that his smart watch 0304 read his unread messages to him. Dialog 0310 indicates this request; it may be a verbal instruction provided by the user to smart watch 0304, such as "Read my unread messages to me." In some embodiments, user 0302 requests that his smart watch 0304 read his most recent message to him. In some embodiments, user 0302 requests that his smart watch 0304 read a particular type of message (e.g., email, instant message, etc.) to him.
In fig. 3A, user 0302 requests that his smart watch 0304 read the messages to him. However, in some embodiments, user 0302 instead requests that a smart phone, a smart speaker, a video conferencing device (e.g., Facebook Portal), or a computer (e.g., a laptop computer) read the messages to him. Further, fig. 3A shows user 0302 making the request verbally. In some embodiments, user 0302 makes the request by means other than speech (e.g., by touch input, keyboard input, mouse input, motion input, etc.).
In some embodiments, display 0306 of smart watch 0304 indicates that user 0302 has unread messages. For example, in fig. 3A, display 0306 indicates that the user has three unread messages. In some embodiments, display 0306 also displays at least a portion of the text of each unread message 0308. For example, in fig. 3A, display 0306 shows a message from Lexie that reads "How about dinner together tomorrow evening?" Display 0306 also includes a timestamp that reads "10:49 AM". The timestamp may indicate the time when the message 0308 was received, or the time when the message was sent.
In some embodiments, after user 0302 requests that his smart watch 0304 read his unread messages to him, the smart watch 0304 indicates that user 0302 has an unread message 0308 from a sending user. In fig. 3A, dialog 0312 indicates this; it reads: "You have a message from Lexie. It says: …". Similarly, dialog 0316 of fig. 3B and dialog 0322 of fig. 3C indicate that the user has other unread messages.
In some embodiments, after user 0302 requests that his smart watch 0304 read his unread messages to him, the smart watch reads (e.g., presents, plays, etc.) the first unread message 0308 to the user. Dialog 0314 indicates this. As with personalized reading 0212 of fig. 2B, dialog 0314 uses an expressive font to visually indicate that the smart watch reads the content of the first unread message 0308 using personalized speech. In contrast, dialog 0312 uses a computer-like font to visually indicate that smart watch 0304 notifies user 0302 of the first unread message 0308 using default speech. Dialog 0314 also includes an image of the sending user to further visually indicate that the content of the first unread message 0308 is read in personalized speech. As also shown in fig. 3A, the introductory portion (e.g., the content of dialog 0312) may be read in default speech, followed by the message content (e.g., the content of dialog 0314) read in user-specific synthesized speech.
Turning now to fig. 3B, in some embodiments, after smart watch 0304 reads the first unread message 0308, the smart watch reads the second unread message 0315 to the user. Dialog 0318 indicates this. As with dialog 0314 of fig. 3A, dialog 0318 uses an expressive font to visually indicate that the smart watch reads the content of the second unread message 0315 using personalized speech. In contrast, dialog 0316 uses a computer-like font to visually indicate that smart watch 0304 notifies user 0302 of the second unread message 0315 using default speech. Note also that the font of dialog 0318 differs from the font of dialog 0314 in fig. 3A. This provides a visual indication that the personalized speech associated with the sender of the second unread message 0315 is different from the personalized speech associated with the sender of the first unread message 0308. Dialog 0318 also includes an image of the sending user to further visually indicate that the content of the second unread message 0315 is read in personalized speech. As should be appreciated, the dialog boxes discussed herein are provided to explain the context and are not displayed as separate user interfaces or other elements in the systems discussed herein. Instead, when a message is read aloud, the user simply hears the appropriate voice and does not necessarily see an additional visual element resembling the depicted dialog boxes. In some embodiments, a visual element may be displayed on a home screen or on a connected device (e.g., a screen in an automobile) to indicate that message content is being read aloud; such an element may be similar to or different from the dialog boxes discussed herein as illustrative examples.
Turning now to fig. 3C, in some embodiments, after smart watch 0304 reads the second unread message 0315, the smart watch reads the third unread message 0320 to the user. Dialog 0324 indicates this. Unlike dialog 0314 of fig. 3A and dialog 0318 of fig. 3B, dialog 0324 uses a computer-like font to visually indicate that the smart watch reads the content of the third unread message 0320 using default speech. Dialog 0324 also includes an image of a robot to provide another visual indication that the reading is performed using default speech.
In some embodiments, there are a number of possible reasons why the content of the third unread message 0320 is read in default speech rather than in the personalized speech associated with its sender. For example, the sender of the third unread message 0320 may not have enabled the personalized speech synthesis functionality for the messages they send. That is, the sender of the third unread message 0320 may not have authorized any device to read their messages in the personalized speech associated with the sender. The sender of the third unread message 0320 may also have explicitly prohibited certain users, rather than all users, from accessing their associated user-specific synthesized speech. In some embodiments, a user may have multiple associated voice models, and certain voice models may be made available to certain users while those same users are prohibited from accessing other voice models (e.g., a first voice model may be authorized for the user's spouse, significant other, or close friends, while a second voice model may be authorized for less close acquaintances of the user, such as work colleagues, for whom a more professional-sounding voice may be appropriate).
As another example, the sender of the third unread message 0320 may not have enabled the personalized speech synthesis functionality for user 0302 (i.e., the receiving user). This may be because the sender of the third unread message 0320 is not comfortable with user 0302 hearing messages from them read in the personalized speech associated with them. As another example of when the system may use default speech instead of user-specific synthesized speech, a personalized speech model for the sender of the third unread message 0320 may not yet have been synthesized and made ready for use by recipients of their messages. For example, there may not be enough input data to allow user-specific synthesized speech to be created for the sender of the third unread message. In this case, the message may be read in default speech instead of in the user-specific synthesized speech of the sender of the third message. Here, a lack of data may mean having too little data for some phonemes, or having data of too low quality for some phonemes.
As yet another example, the sender of the third unread message 0320 may have enabled the personalized speech synthesis functionality only for a group of people that does not include user 0302. For example, the sender of the third unread message 0320 may have enabled the personalized speech synthesis functionality only for their immediate family (and, in this example, user 0302 is not a member of the sender's immediate family). Alternatively, the sender of the third unread message 0320 may have enabled the personalized speech synthesis functionality only for people in the sender's affinity network (and user 0302 is not in the sender's affinity network).
As used herein, an "affinity network" describes a group of people (e.g., an interpersonal network) for which a user has minimal trust. The user's one or more affinity networks may include nodes representing people and edges representing relationships between people (e.g., connections between nodes). In some embodiments, the affinity network is used to determine whether the receiving user should be allowed to listen to the message sent by the sending user with synthesized speech representing the sending user's speech. For example, if the receiving user is in the sending user's affinity network and/or the strength of one or more edges between the receiving user and the sending user meets a minimum strength threshold (e.g., the receiving user and the sending user have been friends on Facebook for at least six months, the receiving user and the sending user have sent more than 50 SMS messages to each other), the receiving user may be allowed to listen to the message from the sending user with the sending user's voice (i.e., the receiving user is allowed to access user-specific synthesized voice representing the sending user's voice).
Affinity networks may be defined based on characteristics associated with contacts (e.g., frequency of contacts, duration of contacts, nature of contacts, contact style, etc.). For example, a user's affinity network may be defined to include persons who are contacted with the user on a regular basis (e.g., a predetermined number of times per day, daily, a predetermined number of times per week, weekly, etc.). As another example, a user's affinity network may be defined to include people that have been in contact with the user for a minimum amount of time (e.g., one month, one year, etc.). As yet another example, a user's affinity network may be defined to include people that have contacted the user or have been contacted by the user at least a certain number of times (e.g., 10 times, 15 times, etc.). As yet another example, a user's affinity network may be defined to include people that are in contact with the user (e.g., send text messages to each other, call each other, etc.).
Affinity networks may also be defined based on social networks. For example, a user's affinity network may be defined to include people explicitly connected to the user (e.g., friends with the user, part of a user contact list) on a social media website (e.g., facebook, whatsApp). As another example, a user's affinity network may be defined to include persons implicitly connected to the user (e.g., persons focusing on and also focused by the user) on a social media website (e.g., instragram).
Affinity networks may also be defined based on real world relationships. For example, a user's affinity network may be defined to include the user's family members (e.g., siblings, parents, children, etc.). As another example, a user's affinity network may be defined to include the user's colleagues (e.g., people working in the same organization, the same group, the same office, etc.). As yet another example, a user's affinity network may be defined to include people with whom the user regularly interacts (e.g., members of a gym where the user is located, members of a volunteer organization where the user is located, etc.).
A user's affinity network may also be defined based on one or more of the foregoing factors, as well as other factors not mentioned here. For example, an affinity network may be defined based on both social-network connections and frequency of contact. In addition, each edge of the user's affinity network may be weighted based on the various factors that contribute to the affinity network. For example, in an affinity network based on both social-network connections and real-world relationships, the strength of an edge between siblings who are friends on a social networking site may be higher than the strength of an edge between non-family members who are friends on the same site.
A user's affinity network may be further defined based on the user's specific preferences (e.g., privacy preferences). For example, a user may indicate that they wish only a few people to be able to choose to play messages from them in their synthesized voice. Accordingly, the user's affinity network may be defined to include only members of the user's immediate family and people the user contacts multiple times per day. As another example, another user may indicate that they wish a large number of people to be able to choose to play messages from them in their synthesized voice. Accordingly, that user's affinity network may be defined to include their immediate and extended family, as well as everyone with whom they have an explicit or implicit connection on a social media website. Alternatively, a user's preferences may define the edge strength required before a receiving user can play messages from the user in the user's voice.
In this regard, affinity networks need not be symmetric. Whether a first user is in a second user's affinity network may have no bearing on whether the second user is in the first user's affinity network. For example, if the first user's affinity network is relatively small (e.g., limited to family, plus people who have been friends with the first user on a social media website for more than one year), the first user's affinity network may not include an acquaintance of the first user. However, if that acquaintance's affinity network is relatively large (e.g., including anyone with whom the acquaintance has sent or received more than 15 text messages), the acquaintance's affinity network may include the first user. Likewise, the strength of the edge between the first user and the second user in the first user's affinity network may differ from the strength of the edge between the first user and the second user in the second user's affinity network. Further, edge strength may be unidirectional: for example, the strength of the edge from the first user to the second user in the first user's affinity network may differ from the strength of the edge from the second user to the first user in that same affinity network.
As discussed elsewhere, a user may indicate whether they wish their device to play messages from other users in those users' synthesized voices. This indication may likewise be based on the user's affinity network. For example, the user's device may read messages in another user's synthesized voice only if that other user is in the user's affinity network (and assuming the other user has also authorized the user to access the option of hearing messages from them in their synthesized voice). As another example, the user's device may read messages in a corresponding other user's synthesized voice only if the strength of the edge between the user and that other user meets a minimum strength threshold.
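As a concrete illustration of the affinity-network checks described above, the sketch below models an affinity network as a directed graph with weighted edges, so that membership and edge strength can be asymmetric and unidirectional. The class, the helper function, and the 0.5 threshold are illustrative assumptions rather than the disclosed implementation.

```python
class AffinityNetwork:
    """Directed, weighted graph: edge (a -> b) has a strength in [0, 1]."""
    def __init__(self) -> None:
        self.edges = {}

    def set_edge(self, from_user: str, to_user: str, strength: float) -> None:
        self.edges[(from_user, to_user)] = strength

    def strength(self, from_user: str, to_user: str) -> float:
        return self.edges.get((from_user, to_user), 0.0)

def may_use_senders_voice(network: AffinityNetwork, sender: str, receiver: str,
                          min_strength: float = 0.5) -> bool:
    """The receiver may hear the sender's synthesized voice only if the edge from
    the sender to the receiver in the sender's network is strong enough."""
    return network.strength(sender, receiver) >= min_strength

# Asymmetry: the sender may trust the receiver more than the receiver trusts the sender.
net = AffinityNetwork()
net.set_edge("lexie", "user_0302", 0.8)  # e.g., siblings who message daily
net.set_edge("user_0302", "lexie", 0.3)
assert may_use_senders_voice(net, "lexie", "user_0302")      # personalized speech allowed
assert not may_use_senders_voice(net, "user_0302", "lexie")  # default speech instead
```

Evaluating the check against the sending user's network reflects the point above that the sender controls who may access their synthesized voice, and that edge strength may differ in each direction.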
Finally, turning to fig. 3D, in some embodiments, after each of the unread messages 0308, 0315, and 0320 has been read, the display 0306 of the smart watch indicates that there are no unread messages. Furthermore, in some embodiments, smart watch 0304 provides audio that further indicates that there are no unread messages. This is shown in dialog 0326, which states: "No unread messages have been received."
FIG. 4 illustrates a user interface 0402 of a messaging application, in accordance with some embodiments, and includes a schematic representation showing how a message can be read aloud using user-specific synthesized speech. As shown in FIG. 4, in some embodiments, the user interface 0402 includes the name of another user 0404. For example, in FIG. 4, user interface 0402 includes the name "Lexie", which indicates that messages 0412, 0416, and 0418 in user interface 0402 were sent by another user 0404 named Lexie. Alternatively, in some embodiments, the name of the one or more other users 0404 may include multiple names (e.g., "Lexie, Rebekah, and Magnus"), or the name may be a group name (e.g., "Johnson family"). In these embodiments, the displayed name of the other users 0404 may be independent of the name of any particular user whose messages are displayed in user interface 0402.
In some embodiments, the user interface 0402 includes a phone call button 0406 and a video call button 0408. In some embodiments, the user interface 0402 includes an attachment button 0422, a camera button 0424, a picture button 0426, or a microphone button 0428. The microphone button may allow a user of the device (on which the user interface 0402 is displayed) to record audio or to send recorded audio. Of course, sending recorded audio is different from sending a text message to another device that then reads the text message in personalized speech associated with the sending user. Even though the end result may be similar, the latter allows the sending user to conveniently type the message rather than speak it (e.g., if the sending user is in a quiet space), while still allowing the receiving user to conveniently listen to the received message rather than read through it.
In some other embodiments, microphone button 0428 allows a user to create a text message through speech-to-text technology. For example, a user may speak a message, the device transcribes the message into text, and the text is then sent to the other user 0404. In some embodiments, the user may also create a text message by typing in text entry field 0430. In some embodiments, text entry field 0430 includes an emoji button. In some embodiments, the text entry field is adjacent to a "like" button 0432. The emoji button may allow the user to input emojis (e.g., a smiling face, a sad face, a laughing face, etc.). In addition, the "like" button 0432 may allow the user to send a thumbs-up emoji. Alternatively, the "like" button 0432 may allow the user to "like" (i.e., indicate agreement with or approval of) a message displayed in the user interface (e.g., another user's message).
In some embodiments, user interface 0402 includes messages. The messages may include messages received from a first sending user, such as messages 0412, 0416, and 0418. The messages may also include messages sent by a second sending user, such as messages 0410, 0414, and 0420. The messages may also include messages (not shown) sent by the receiving user. The messages may include text and emojis, but may also include non-text content (e.g., images, videos, GIFs, stickers, audio, etc.).
In some embodiments, the device reads messages 0412, 0416, and 0418 received from the first sending user using personalized synthesized speech associated with the first sending user. In some embodiments, the device reads messages 0410, 0414, and 0420 received from the second sending user using personalized synthesized speech associated with the second sending user. In some embodiments, the device alters the prosody (e.g., intonation, pace, rhythm, emotion, etc.) with which it reads a message based on the content of the message. For example, the content of message 0412 indicates that the first sending user 0404 was recently "diagnosed with COVID" and felt "not good". Dialog 0434, corresponding to message 0412, includes an image of the first sending user looking tired and sad. This indicates that the device changes its prosody when reading message 0412 in order to convey the emotion (e.g., tiredness, sadness, etc.) indicated by the content of message 0412. Similarly, dialogs 0436 and 0438 include images of the other user looking less tired and recovering well, respectively. This indicates that the device changes its prosody when reading the corresponding messages 0416 and 0418 in order to convey the emotion implied by the content of messages 0416 and 0418. Likewise, dialogs 0440, 0442, and 0444 include images corresponding to the respective prosody with which the device may read the respective messages 0410, 0414, and 0420.
As shown, message 0418 includes a smiley-face emoji. In some embodiments, the device recognizes the expression implied by the emoji and changes the prosody of its presentation of the message accordingly. In other embodiments, the device simply reads the emoji as text. For example, when presenting audio generated from the content of message 0418, the device may say: "Feeling much better! Smiley face."
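The prosody adjustments described for FIG. 4 (and in clauses A4 and A5) can be thought of as mapping explicit features of a message, such as emojis, punctuation, and sentiment-bearing words, to prosody hints that are passed to the synthesizer. The marker lists and hint values below are illustrative assumptions, not the disclosed mapping.

```python
HAPPY_MARKERS = ("\U0001F642", ":)", "better!")
SAD_MARKERS = ("sick", "not feeling", "sad")

def prosody_hints(text: str) -> dict:
    """Derive rough prosody hints from explicit features of a message."""
    hints = {"rate": 1.0, "pitch": 0.0, "emotion": "neutral"}
    lowered = text.lower()
    if any(marker in lowered for marker in SAD_MARKERS):
        hints.update(emotion="tired", rate=0.9)    # slower, flatter delivery
    if any(marker.lower() in lowered for marker in HAPPY_MARKERS):
        hints.update(emotion="happy", pitch=0.2)   # brighter delivery
    return hints

print(prosody_hints("Feeling much better! \U0001F642"))
# e.g. {'rate': 1.0, 'pitch': 0.2, 'emotion': 'happy'}
```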
As mentioned above, the messages in user interface 0402 may include non-text messages as well as text messages. In some embodiments, the device provides a description of the content of a non-text message. In some embodiments, the description of the non-text message is provided in the synthesized speech associated with the other user. For example, if the other user sends the user a sunset image, the device may read: "I sent you a photo of a sunset. The sky is purple and red. It's beautiful." As another example, if the other user sends an embedded video or a link to an online video, the device may read a description of the online video.
FIG. 5 illustrates an example communication network that can be used to facilitate access to speech models for creating various user-specific synthesized voices, in accordance with some embodiments. As shown in fig. 5, in some embodiments, a server 0512 is connected to a plurality of devices through a communication network 0510 (e.g., a Wi-Fi network, a Bluetooth network, etc.). The server may be connected to a computer 0502 (e.g., a laptop computer, a desktop computer, etc.). The server may be connected to a video conferencing device 0504 (e.g., Facebook Portal). The server may be connected to a smart watch 0506. And the server may be connected to a smart phone 0508.
Fig. 6 is a high-level architectural diagram 0602 showing how input data may be provided to create a user-specific synthesized speech model (also referred to as a user-specific speech model, or simply a speech model) 0606, which may then be used to generate speech segments, according to some embodiments. As shown in fig. 6, in some embodiments, the input 0604 to the user-specific speech model 0606 includes user audio (e.g., recordings of the user speaking) and user text (e.g., text messages typed by the user, text messages sent from the user's device, etc.). In addition, the output 0608 of the user-specific speech model 0606 includes speech segments (e.g., user text read in synthetic speech that sounds like the user).
In some embodiments, input 0604 of speech model 0606 comprises audio of a person speaking. Input 0604 may also include a transcription of audio; alternatively, the speech model 0606 may create a transcription of the audio by analyzing the audio (e.g., using speech recognition). Input 0604 may also include a language identifier (e.g., identifying the language spoken in audio); alternatively, the speech model 0606 can generate a language identifier by identifying the language associated with the audio. For example, if the language spoken by the person in audio is English, input 0604 may include an identifier of the English language (e.g., a string read as "English" or "ENG," a numeric identifier of the English language, etc.). Input 0604 may also include a phoneme identifier; alternatively, the speech model 0606 may generate a phoneme identifier based on an audio, transcript, or language identifier. For example, if the language spoken by the person in audio is English, input 0604 may include identifiers of phonemes of the English language (e.g., identifiers of phonemes of "b", "d", "f", etc.). In some embodiments, words spoken in a first language (e.g., english) may be used to generate phonemes in a different second language (e.g., spanish), such that a user-specific speech model may be trained exclusively using input in the first language, but the user-specific speech model may then be used to generate a spoken corpus based on phonemes generated for the different second language.
As used herein, "phoneme" refers to the smallest phonetic unit that distinguishes one word from another. For example, an "n" tone is the smallest phonetic unit that distinguishes the english word "ran" from "ram" and "rap". Thus, an "n" tone is a phoneme of english. In summary, the English language includes 40 phones (e.g., 42 or 44), such as the pronunciation of "b" and "d", and "a" sounds differently in the words "cat", "management" and "ball". Some languages have fewer phones (e.g., spanish), while others have more phones (e.g., literacy); however, each language has a limited number of phonemes.
After receiving the audio, the speech model 0606 separates the audio into phonemes using the audio and the transcription. The speech model may also use the language identifier or the phoneme identifiers to separate the audio into phonemes. The separated audio allows the speech model 0606 to learn how the person pronounces the various phonemes. For example, the speech model 0606 may learn how the person pronounces phonemes in a particular context (e.g., when the person is speaking quickly, speaking about a particular topic, speaking at a particular volume, or speaking to a particular audience, such as the person's family or the person's colleagues). As another example, the speech model 0606 may learn how the person's facial expressions correspond to different pronunciations of phonemes (e.g., where the input 0604 includes video data corresponding to the audio data).
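The following is a minimal sketch of one way the separation into per-phoneme segments might be performed, assuming phoneme-level timing information has already been obtained (e.g., from a forced aligner); the function and parameter names are hypothetical and not part of the disclosed embodiments.

```python
from typing import Dict, List, Tuple

def split_audio_by_phoneme(
    samples: List[float],
    sample_rate: int,
    alignment: List[Tuple[str, float, float]],  # (phoneme, start_sec, end_sec)
) -> Dict[str, List[List[float]]]:
    """Group audio samples by phoneme label using precomputed time alignments."""
    segments: Dict[str, List[List[float]]] = {}
    for phoneme, start, end in alignment:
        begin = int(start * sample_rate)
        stop = int(end * sample_rate)
        segments.setdefault(phoneme, []).append(samples[begin:stop])
    return segments
```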
While separating the audio, the speech model 0606 can determine that the audio does not include every phoneme associated with the language. For example, the audio may include only 36 of the approximately 40 phonemes associated with the English language. In some embodiments, the speech model 0606 may generate estimated phonemes (e.g., audio approximating the manner in which the person would pronounce the missing phonemes) based on phonemes identified in the audio (e.g., the "nearest neighbor" phonemes) and/or preprogrammed information about the phonemes of the language. For example, the speech model 0606 may estimate how a user would pronounce the "b" phoneme based on how the user pronounces the "d" phoneme (similarly, phonemes of one language may be used to approximate phonemes of another language).
In some embodiments, training the speech model 0606 includes: generating audio with the speech model 0606 based on the transcription of the provided audio, and then comparing the generated audio to the provided audio. If the generated audio differs from the provided audio, training the speech model 0606 may further include adjusting the weights associated with the pronunciation of each phoneme. In some embodiments, training the speech model 0606 includes: generating audio based on a transcription of audio that is not provided as input 0604 to the speech model 0606, and then comparing the generated audio with the audio associated with the provided transcription. As described above, the weights associated with the pronunciation of each phoneme may be adjusted to further train and refine the speech model 0606.
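A minimal sketch of the compare-and-adjust step described above is shown below; it assumes per-phoneme error contributions are already available, and all names are illustrative assumptions rather than the disclosed training procedure.

```python
from typing import Dict, List

def train_step(
    weights: Dict[str, float],          # per-phoneme pronunciation weights
    generated_audio: List[float],
    reference_audio: List[float],
    phoneme_errors: Dict[str, float],   # assumed per-phoneme contribution to the mismatch
    learning_rate: float = 0.01,
) -> Dict[str, float]:
    """Hypothetical single training step: if the generated audio differs from the
    provided audio, adjust the weight associated with each phoneme's pronunciation."""
    mismatch = sum(abs(g - r) for g, r in zip(generated_audio, reference_audio))
    if mismatch > 0:
        for phoneme, err in phoneme_errors.items():
            weights[phoneme] -= learning_rate * err
    return weights
```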
In some embodiments, the speech model 0606 may convert an input 0604 (e.g., raw input) into an output 0608 such as a personalized audio speech segment that matches the speech of the person for whom the model was trained. In short, once the model has been sufficiently trained, the model can be used to convert input 0604 into output 0608, e.g., a speech segment of text read in the user's voice.
Fig. 7 illustrates a method (on the right) of reproducing a text message using user-specific synthesized speech, alongside user interface elements (on the left) that are caused to be presented when an electronic device (e.g., a wrist-wearable device) performs the described method, in accordance with some embodiments.
As shown in fig. 7, in some embodiments, notification 0702 (e.g., a notification on a smartphone, tablet, smart watch, etc.) indicates receipt of a text message (e.g., an SMS, an instant message, an email, etc.). Notification 0702 may include an option 0706 (e.g., a button, a slider, an icon, etc.) for causing the content (e.g., text, emoticons, etc.) of the text message to be presented in synthesized speech associated with the sender. The notification 0702 may also include an image 0708 associated with the sender of the message (e.g., a photograph of the sender, an avatar of the sender, an emoticon of the sender, etc.). The notification 0702 may also include a name 0704 associated with the sender of the message (e.g., the sender's name, the sender's user name, the sender's alias, etc.). Notification 0702 may also include a representation 0710 of the content of the message (e.g., a portion of the content of the message, the entire content of the message, etc.).
As indicated by touch indicator 0712 in fig. 7, in some embodiments, the user may select option 0706 by touching the option 0706 for causing the content of a text message to be presented in synthesized speech associated with the sender. In some embodiments, the user may select option 0706 by: clicking on the option 0706 (e.g., with a mouse), tapping the option 0706 (e.g., with a stylus), or requesting the option 0706 (e.g., with a voice command). In some embodiments, after the user selects option 0706, the content of the text message is presented to the user with synthesized speech associated with the sender of the message. This is represented by audible presentation 0714 in fig. 7.
As mentioned above, fig. 7 also shows a flow chart of a method 0726 of selectively using user-specific synthesized speech to audibly present a text message, according to some embodiments. The operations of method 0726 may be performed by one or more processors of an electronic device (e.g., computer 0502, video conferencing device 0504, smart watch 0506, smart phone 0508, or server 0512 in fig. 5; or a head-mounted virtual reality device, etc.). At least some of the operations shown in fig. 7 correspond to instructions stored in a computer memory or a computer-readable storage medium. The operations of method 0726 may be performed by a single device alone. Alternatively, the operations of method 0726 may be performed in conjunction with one or more processors or hardware components of another device (e.g., computer 0502, video conferencing device 0504, smart watch 0506, smart phone 0508, or server 0512 in fig. 5; or a head-mounted virtual reality device, etc.) communicatively coupled to the device (e.g., through communication network 0510 of fig. 5), or in conjunction with instructions stored in a memory or computer-readable medium of the other device communicatively coupled to the device.
The method 0726 of fig. 7 includes: receiving (0716) a text message from a sending user. The method 0726 further comprises: receiving (0718) a request to audibly present the text message. The method 0726 further comprises: determining (0720) whether the receiving user is authorized to access synthesized speech representing the sending user. The method 0726 further comprises: in accordance with a determination that the receiving user is not authorized to access the synthesized speech, causing (0722) an audible presentation of the text message using a default speech. The method 0726 further comprises: in accordance with a determination that the receiving user is authorized to access the synthesized speech, causing (0724) an audible presentation of the text message using the synthesized speech representing the sending user. The steps of method 0726 are discussed in more detail below with respect to fig. 8 and the corresponding steps therein.
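The decision logic of method 0726 might be sketched as follows; the tts object and its methods are hypothetical placeholders for whichever speech-synthesis component a given device uses, not an API defined by the embodiments above.

```python
def present_text_message(message, receiving_user, tts):
    """Hypothetical sketch of method 0726: choose between user-specific and
    default synthesized speech when audibly presenting a received text message."""
    # (0716)/(0718): the message has been received and audible presentation requested
    if tts.is_authorized(receiving_user, message.sender):            # (0720)
        audio = tts.synthesize(message.text, voice=message.sender)   # (0724)
    else:
        audio = tts.synthesize(message.text, voice="default")        # (0722)
    return audio
```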
Fig. 8 is a detailed flow diagram of a method 0816 of selectively using user-specific synthesized speech to audibly present a text message, according to some embodiments. As with method 0726 of fig. 7, the operations of method 0816 may be performed by one or more processors of an electronic device (e.g., computer 0502, video conferencing device 0504, smart watch 0506, smart phone 0508, or server 0512 in fig. 5; or a head-mounted virtual reality device, etc.). At least some of the operations shown in fig. 8 correspond to instructions stored in a computer memory or a computer-readable storage medium. The operations of method 0816 may be performed by a single device alone. Alternatively, the operations of method 0816 may be performed in conjunction with one or more processors or hardware components of another device (e.g., computer 0502, video conferencing device 0504, smart watch 0506, smart phone 0508, or server 0512 in fig. 5; or a head-mounted virtual reality device, etc.) communicatively coupled to the device (e.g., through communication network 0510 of fig. 5), or in conjunction with instructions stored in a memory or computer-readable medium of the other device communicatively coupled to the device.
In some embodiments, method 0816 includes: receiving (0802), at a receiving electronic device associated with a receiving user, a first text message sent by a first sending user of a first sending electronic device. An example of this is shown in FIG. 4, which shows received messages 0412, 0416, and 0418 displayed by user interface 0402. Similarly, fig. 7 shows a notification 0702 of a received message, the notification 0702 comprising a user interface element 0706 that allows the receiving user to indicate that the message should be presented in an audible form.
The text message may be a message entered by a user of the sending electronic device using a keyboard (e.g., a virtual keyboard, a physical keyboard, an on-screen keyboard, etc.). The text message may also be a message entered by a user of the sending electronic device using voice-to-text input. The text message may include both text and emoticons. The text message may also include a reaction (e.g., a "like" reaction, a "heart" reaction, etc.). The text message may be an instant message (e.g., a message sent through a social media platform, etc.), a short message (e.g., SMS, MMS, etc.), or an email.
In some embodiments, method 0816 includes: after receiving (0804) a request to audibly present the first text message by the receiving electronic device: in accordance with a determination (0810) that the receiving user is authorized (0810-yes) to access user-specific synthesized speech representing the first sending user, causing (0814), by the receiving electronic device, audible presentation of the first text message using the user-specific synthesized speech representing the first sending user.
The aforementioned received request may come from the receiving user. For example, the request may include a receiving user pressing a button on the receiving device, a receiving user speaking a command into the receiving device, or a receiving user pressing a button of a device (e.g., smart speaker, wireless headset, car, etc.) connected to the receiving device, or speaking a command into a device connected to the receiving device. Alternatively, the request may also be automatically generated by the receiving electronic device, or a device coupled to the receiving electronic device on behalf of the receiving user, based on a determination that audible presentation criteria are met. For example, if a receiving device receives a text message while the user is currently driving, the receiving device may automatically generate a request to audibly present the text message.
The foregoing causing operation (0814) may include: the receiving electronic device uses its own speaker for audible presentation. This causing operation may also include: data is provided to one or more associated speakers (e.g., headphones, car speakers, bluetooth speakers, etc.) that subsequently render an audible presentation of the text message. This causing operation may also include: the prosody of the audible presentation is changed based on the content of the text message.
The user-specific synthesized speech representing the sending user may be synthesized speech that approximates, is similar to, or sounds like the sending user's speech. Alternatively, the user-specific synthesized speech representing the first sending user may be synthesized speech selected or created by the first sending user (e.g., speech selected by the sending user that matches the sending user's preferred gender identity).
A user-specific synthesized voice representing a sending user may be received together with the text message. The user-specific synthesized speech may instead be received separately from the text message (e.g., downloaded from a server). Alternatively, the user-specific synthesized speech may be synthesized on the receiving device using a speech model associated with the sending user (e.g., for devices with higher processing capabilities). The speech model may be encrypted on the receiving device such that it is inaccessible to the receiving user. The determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user may include authentication (e.g., a password, a biometric scan, etc.). The determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user may include: the first sending user allowing any receiving user to access the user-specific synthesized speech representing the first sending user. "Accessing" the user-specific synthesized speech may include: listening to a presentation of the text of a text message rendered with the user-specific synthesized speech.
In some embodiments, method 0816 includes: after receiving (0804) a request to audibly present the first text message by the receiving electronic device: in accordance with a determination (0810) that the receiving user is not authorized (0810-no) to access the user-specific synthesized speech representing the first sending user, causing (0812), by the receiving electronic device, audible presentation of the first text message using a default speech. Audible presentation of a text message (e.g., one sent by a second sending user) using the default speech may also be caused in accordance with a determination (0808) that the receiving user prefers the text message to be presented using the default speech (0808-yes). The default speech may be one of a number of default voices, each of which corresponds to a different language, gender, accent, tone, etc. (e.g., a U.S. English female voice, a Spanish male voice, a U.K. English male voice, etc.). Audible presentation using the default speech may also be caused based on a determination (0806) that the first received message was sent on behalf of a non-human entity (0806-yes).
The above-described techniques help ensure that the model for generating user-specific synthesized speech is used only at appropriate points in time and based on specific authorization to do so. Since the computational cost of running speech synthesis models can be high (which can also deplete power resources), ensuring that these models are used only at appropriate points in time can be critical to making efficient use of the limited computational and power resources of the electronic device. These techniques also help the receiving user feel personally closer to the sender of the text message (and feel greater empathy), both because the message is read in the sender's voice and because the presentation may reflect the emotion of the message's content (e.g., a sad message read in a sad voice). This facilitates continued user interaction between the receiving user and the receiving electronic device. A receiving user may choose not to use audible presentation when messages are read with a default voice; however, when the audible presentation is personalized, the receiving user may be more likely to use the audible presentation functionality than on a device that provides audible presentation only in a default voice, thereby facilitating continued user interaction with the audible presentation functionality on the receiving electronic device.
In some embodiments, receiving the first text message further comprises: receiving, from the first sending electronic device, an indication that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user. The first text message may include metadata indicating that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user. The indication may be sent directly from the first sending electronic device to the receiving device, or indirectly from the first sending electronic device to the receiving device. For example, the first sending electronic device may send the indication to another electronic device (e.g., server 0512 in fig. 5), which then sends the indication to the receiving electronic device.
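One possible (hypothetical) way to carry such an indication is as message metadata, as sketched below; the field names are illustrative only and are not defined by the embodiments above.

```python
# Hypothetical metadata accompanying a received text message.
incoming_message = {
    "sender_id": "user_1234",
    "text": "On my way!",
    "voice_authorization": {
        "recipient_may_use_sender_voice": True,   # the indication described above
        "granted_via": "affinity_network",        # or "explicit_grant", etc.
    },
}

def recipient_authorized(message: dict) -> bool:
    """Read the hypothetical authorization indication from message metadata."""
    return message.get("voice_authorization", {}).get(
        "recipient_may_use_sender_voice", False
    )
```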
The receiving user being authorized to access the user-specific synthesized speech representing the first sending user may be based on the receiving user being in an affinity network of the first sending user. As an example, whether the sending user and the receiving user are part of an affinity network may be based on the number of friends they share in a social network (e.g., 10, 15, or 50 common Facebook friends), the amount of time the users have been connected through the social network (e.g., how long they have been friends), the number of interactions between the users through the social network (e.g., the number of messages sent), or any combination of the above.
In some embodiments, the request to audibly present the first text message by the receiving electronic device includes: a request to audibly present the first text message using the user-specific synthesized speech representing the first sending user. The request to audibly present the first text message using the user-specific synthesized speech representing the first sending user may comprise a one-time request. For example, the user may select a setting (e.g., "Personalize my messages") indicating that the user wishes every text message received from the first sending user to be presented using the user-specific synthesized speech representing the first sending user. Alternatively, the request may be generated each time the user receives a text message from the first sending user and indicates a wish to present that text message using the user-specific synthesized speech. Alternatively, the request may be generated in response to a periodic query to the user. For example, the electronic device may prompt the user periodically (e.g., once a week, once a month, etc.), or in response to receiving a predetermined number of messages (e.g., every ten messages, every hundred messages, etc.) from the first sending user.
In some embodiments, causing audible presentation of the first text message using the user-specific synthesized speech includes: identifying an explicit feature of the first text message (e.g., a statement that the originator of the text message is ill); and changing the prosody of the audible presentation based on the explicit feature of the first text message. For example, message 0412 in FIG. 4 explicitly states that the sending user is ill and uncomfortable; thus, the message 0412 may be presented such that the prosody of the presentation conveys fatigue, discomfort, or sadness. As another example, the content of message 0438 in fig. 4 explicitly states that the sending user is no longer ill and feels better; thus, the message 0438 may be presented such that the prosody of the presentation conveys relief or happiness. Changing the prosody of the audible presentation of the first text message may comprise: changing the speed, intonation, tone, etc. with which certain words or syllables of the message are presented.
In some embodiments, causing audible presentation of the first text message using the user-specific synthesized speech includes: identifying an implicit feature of the first text message (e.g., inferring that the originator of the text message is angry based on the message being written in uppercase letters); and changing the prosody of the audible presentation based on the implicit feature of the first text message. The implicit feature of the first text message may be based on the first text message being written in uppercase (e.g., "I'M SORRY", "IM SORRY", etc.). The implicit feature of the first text message may also be based on non-text content of the first text message, such as an emoticon (e.g., a heart emoticon, a dizzy-face emoticon, etc.) or an image (e.g., a picture, a GIF, etc.). The implicit feature of the first text message may also be based on a time of transmission or a time of receipt of the first text message. For example, if the first text message is sent shortly after (e.g., a few seconds after) another text message, presenting the first text message may include: reading the first text message as if it were combined with that other text message.
In some embodiments, causing audible presentation of the first text message using the user-specific synthesized speech includes: identifying a feature of an additional text message in the conversation that includes the first text message; and altering the prosody of the audible presentation based on characteristics (e.g., explicit characteristics, implicit characteristics) of the additional text message. The characteristics of the additional text message may include any of the characteristics discussed above with respect to the first text message.
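A minimal sketch of prosody adjustment based on explicit and implicit message features is shown below; the keyword lists and prosody parameters are illustrative assumptions, not part of the embodiments above.

```python
def choose_prosody(text: str) -> dict:
    """Hypothetical mapping from message features to prosody parameters
    (speaking rate and pitch offset) for the audible presentation."""
    prosody = {"rate": 1.0, "pitch": 0.0}
    lowered = text.lower()
    if any(word in lowered for word in ("sick", "sorry", "sad")):
        prosody.update(rate=0.9, pitch=-2.0)   # explicit feature: slower, lower delivery
    if text.isupper():
        prosody.update(rate=1.1, pitch=1.5)    # implicit feature: emphatic delivery
    return prosody
```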
In some embodiments, the location where the speech model is stored is determined based on computational constraints, bandwidth constraints, or security concerns for the different devices that may be selected to perform speech synthesis operations (e.g., creating a speech model, using a speech model, etc.). Regarding computational constraints, synthesizing speech using a speech model may require a minimum amount of computational power (e.g., processor speed, amount of memory required to perform the speech synthesis operations). Thus, a device storing the speech model (e.g., the sending device, the receiving device, or an intermediate server) may need to meet one or more minimum computational power requirements to be selected as the device performing the speech synthesis operations. Furthermore, the device with the greatest computing power may be able to synthesize speech the fastest, which may favor storing the speech model on that device.
Regarding bandwidth, transmitting the speech model may require less bandwidth than transmitting synthesized speech. For example, if the speech model is stored at the sending device, the sending device may also need to send corresponding synthesized speech to the receiving device each time the sending device sends a text message to the receiving device. However, if the speech model is stored at the receiving device, the sending device will not need to send synthesized speech with each text message. Instead, the sending device will only need to send the speech model to the receiving device (or may not need to send the speech model at all if the speech model has previously been sent to the receiving device; in some embodiments, a speech-model transmission check is performed to determine whether the speech model needs to be sent in whole or in part, e.g., when the speech model has been partially updated). The receiving device will then be able to synthesize speech for each text message received from the sending user of the sending device. This can significantly reduce the bandwidth burden on the sending device and the receiving device. Thus, in some embodiments, bandwidth constraints may favor storing the speech model downstream of the sending device (e.g., at the receiving device or an intermediate server).
Regarding security, storing the speech model at a location outside of the sending device may raise security concerns. For example, a sending user may not wish the receiving user to have access to the speech model, because access to the speech model could allow the receiving user to synthesize audio based on text that was not authored by the sending user. Thus, security concerns may favor storing the speech model at a location external to the receiving device (e.g., at the sending device or an intermediate server).
The determination of where to store the speech model may be performed on a device-by-device or user-by-user basis. For example, as discussed elsewhere with respect to the affinity network, certain metrics may indicate a high degree of trust between the sending user and the receiving user. In cases where trust between the sending user and the receiving user is particularly high, security may be less of a concern, and the speech model may be stored, for example, at the receiving user's device. As another example, the user may be presented with the option to prioritize speed, bandwidth, or security, and a decision as to where to store the speech model may be made based on one or more selections by the user. As yet another example, the determination of where to store the speech model may require balancing the three factors above according to known circumstances about the users and the devices (e.g., the receiving device is a shared device, the sending user prioritizes security above all else, the receiving user has a high download speed).
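The following sketch illustrates one hypothetical way to weigh the compute, bandwidth, and security considerations above when choosing where to store the speech model; the device scores and weighting values are illustrative assumptions.

```python
def choose_model_location(devices: list, prioritize: str = "balanced") -> dict:
    """Pick a storage location for the speech model by weighing compute,
    bandwidth, and security scores (assumed to be in [0, 1]) for each device."""
    weights = {
        "speed":     {"compute": 0.7, "bandwidth": 0.2, "security": 0.1},
        "bandwidth": {"compute": 0.2, "bandwidth": 0.7, "security": 0.1},
        "security":  {"compute": 0.1, "bandwidth": 0.2, "security": 0.7},
        "balanced":  {"compute": 1 / 3, "bandwidth": 1 / 3, "security": 1 / 3},
    }[prioritize]

    def score(device: dict) -> float:
        return sum(weights[k] * device.get(k, 0.0) for k in weights)

    return max(devices, key=score)

# Example usage with illustrative scores:
# choose_model_location([
#     {"name": "sending_device", "compute": 0.4, "bandwidth": 0.3, "security": 0.9},
#     {"name": "receiving_device", "compute": 0.6, "bandwidth": 0.9, "security": 0.3},
#     {"name": "server", "compute": 0.9, "bandwidth": 0.6, "security": 0.6},
# ], prioritize="bandwidth")
```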
In some embodiments, the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user comprises: authenticating the identity of the receiving user. The determination that the receiving user is authorized to access the user-specific synthesized speech may include: identifying the identity of the user using voice recognition technology or facial recognition technology. The determination may also include: requiring the user to verify his or her identity, for example by entering a password or completing a biometric (e.g., fingerprint, iris, etc.) scan. The determination may also include: determining whether the first sending user has consented to the receiving user accessing the user-specific synthesized speech associated with the first sending user.
In some embodiments, the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user is based on a strength of a relationship between the receiving user and the first sending user. The strength of the relationship may be determined by analyzing an affinity network that includes the first sending user and the receiving user, such that the number of edges or nodes between these respective users in the affinity network can be used to evaluate the strength of the relationship (e.g., fewer than 5 edges connecting the respective users may be considered a strong relationship, while more than 5 edges between the users may indicate a weak relationship).
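A minimal sketch of an edge-count check over an affinity network is shown below, using the five-edge threshold mentioned above as an illustrative cutoff; the adjacency-list graph representation is hypothetical.

```python
from collections import deque

def relationship_is_strong(graph: dict, sender: str, receiver: str,
                           max_edges: int = 5) -> bool:
    """Breadth-first search over a hypothetical affinity network: fewer than
    max_edges edges between the two users is treated as a strong relationship."""
    if sender == receiver:
        return True
    seen, frontier = {sender}, deque([(sender, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_edges:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor == receiver:
                return depth + 1 < max_edges
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return False
```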
In some embodiments, the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user comprises: determining that the first sending user has granted permission for the receiving user to access the user-specific synthesized speech. Determining that the first sending user has granted permission for the receiving user to access the user-specific synthesized speech representing the first sending user may include: determining that the first sending user has explicitly authorized the receiving user to access the user-specific synthesized speech. Alternatively, determining that the first sending user has granted permission for the receiving user to access the user-specific synthesized speech may include: determining that the first sending user has generally authorized everyone in a particular category (e.g., all people; all contacts saved on the first sending user's phone; all Facebook friends, all Instagram friends, all WhatsApp friends; etc.).
In some embodiments, causing audible presentation of the second text message using the default speech includes: forgoing the determination of whether the receiving user is authorized to access user-specific synthesized speech representing the second sending user. If the second sending user is a non-human entity, it may be assumed that there is no speech model corresponding to the speech of the second sending user (because the non-human entity has no voice of its own). This assumption may reduce the processing time required to perform the methods of the present disclosure. In addition, the receiving user may also indicate that he or she does not wish the receiving electronic device to audibly present the first text message using the user-specific synthesized speech representing the first sending user, in which case it is not necessary to determine whether the receiving user is authorized to access the user-specific synthesized speech.
In some embodiments, the second text message comprises a text message sent on behalf of a non-human entity. That is, the second sending user is associated with a non-human entity. The text message sent on behalf of the non-human entity may include: a text message sent on behalf of a company advertising its services; or a text message sent on behalf of an office (e.g., a doctor's office, a dental office, a law office, etc.) reminding a customer (e.g., a patient, a potential client, etc.) of an appointment. In some embodiments, the receiving user has requested that the text message be audibly presented using the default speech.
FIG. 9 illustrates an example information overlay 0908 that can be presented to provide information to a user regarding passive speech collection, according to some embodiments. As discussed below with respect to fig. 10-12, passive speech collection may include collecting audio data (e.g., including audio data of someone speaking) for the purpose of providing the audio data to a machine learning model. The machine learning model may be configured to receive a text input and generate output audio that sounds like a reproduction of the text input in the user's voice (e.g., audio that sounds as if the user is reading the text input). Further, the machine learning model may be configured to receive an audio input (e.g., an audio input that includes a user utterance) and generate output text that transcribes the audio input (e.g., transcribes a portion of the audio input that includes the user utterance). Further, the machine learning model may be configured to combine an audio or text input (e.g., an audio or text input corresponding to what the user speaks or writes) with an additional audio input (e.g., mechanical sounds such as the turning of a machine or the hum of an automobile engine) and generate an audio output conveying the user's language from the first input together with the sound from the additional audio input. Further, the machine learning model may be configured to modulate the user's speech (e.g., to make the user's speech sound more masculine, more feminine, etc.).
Similar to fig. 1B, fig. 9 also shows a user 0902, a video conference device 0904, and a setup screen 0906. As with the information overlay 0112 in FIG. 1B, the information overlay 0908 in FIG. 9 may be displayed in response to the user 0902 selecting a link (e.g., a link on the setup screen 0906) or an option that enables the passive speech collection function. For example, if the user 0902 has never previously enabled the passive speech collection function, the device 0904 may display the information overlay 0908 to inform the user 0902 of what the passive speech collection function entails. Similarly, if the user 0902 has not enabled the passive speech collection function for a long period of time (e.g., six weeks, two months, one year, etc.), the device 0904 may display the information overlay 0908 to remind the user 0902 of what the passive speech collection function entails. Alternatively, device 0904 may display the information overlay 0908 whenever the user selects an option that enables passive speech collection.
As shown in fig. 9, the information overlay 0908 may include a notice 0912 identifying the various devices (e.g., smartphones, smartwatches, tablets, computers, audiovisual systems of vehicles, artificial reality devices (which may include an augmented reality device such as a pair of smart glasses, or a virtual reality device such as a virtual reality headset), etc.) from which data is collected in order to enable the passive voice collection functionality. In addition, the information overlay 0908 may also include a notice 0914 identifying the various types of data (e.g., audio data, video data, etc.) that are collected in order to enable the passive voice collection functionality. Further, the information overlay 0908 may include a description of the passive voice collection function.
Information overlay 0908 may also include an option 0916 (e.g., a slider bar, a button, etc.) that enables the passive voice collection functionality. In some embodiments, option 0916 of information overlay 0908 is a single option (e.g., an "enable" button). In other embodiments, option 0916 includes finer-grained controls that allow user 0902 to selectively enable certain aspects of the passive speech collection function. For example, option 0916 may allow user 0902 to enable data collection for some of the user's devices (e.g., smart speakers) but not others (e.g., smartphones, laptops, etc.). As another example, option 0916 may allow user 0902 to enable data collection at some times of the day (e.g., morning, afternoon) and disable data collection at other times of the day (e.g., evening). As yet another example, option 0916 may allow user 0902 to enable only certain types of data collection (e.g., audio data).
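The finer-grained controls described above might map to per-user preferences such as the following hypothetical sketch; the keys and values are illustrative only.

```python
# Hypothetical per-user passive-collection preferences.
passive_collection_prefs = {
    "enabled": True,
    "devices": {"smart_speaker": True, "smartphone": False, "laptop": False},
    "hours": [(6, 12), (12, 18)],          # collect in the morning and afternoon only
    "data_types": {"audio": True, "video": False},
}

def collection_allowed(prefs: dict, device: str, hour: int, data_type: str) -> bool:
    """Check whether passive collection is permitted for this device, time, and data type."""
    return (
        prefs["enabled"]
        and prefs["devices"].get(device, False)
        and any(start <= hour < end for start, end in prefs["hours"])
        and prefs["data_types"].get(data_type, False)
    )
```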
Finally, the information overlay may include an option 0910 to close the overlay (e.g., an "X" button). In some embodiments, when the user 0902 closes the information overlay 0908 (e.g., closes the overlay by selecting option 0910), then the video conferencing device 0904 displays a setup screen, such as setup screen 0106 shown in fig. 1A.
Fig. 10 illustrates a user 1002 engaged in a video call via a computer 1008, according to some embodiments. As shown, in some embodiments, the user 1002 is engaged in a video call using the computer 1008, as indicated by the user interface displayed on the screen of the computer 1008. This is also indicated by dialog boxes 1012 and 1014, which represent the audible speech of the user and of another user engaged in the video call, respectively.
The figure also includes a smart speaker 1004, a teleconferencing device 1006 (e.g., facebook portal), and a smartphone 1010. Assuming the user has selected to enable the passive voice collection function, the smart speaker 1004, teleconferencing device 1006, and smartphone 1010 may collect the user's audio data or video data. For example, the smart speaker 1004 and the smart phone 1010 may collect audio data of the user. As another example, the teleconferencing apparatus 1006 and computer 1008 may collect audio data and video data of the user.
Other devices capable of collecting audio data and/or video data include virtual reality devices and artificial reality devices (e.g., a headset or a pair of smart glasses), and audio-visual vehicle interfaces (e.g., microphones of a car's infotainment system that also includes one or more associated displays in the car). A device need not be able to collect audio data and video data simultaneously in order to collect data for training a machine learning model. For example, the input of the machine learning model may include audio data collected by a first electronic device and video data collected by a second electronic device different from the first electronic device. Any device capable of collecting audio data or video data may be used for data collection purposes (provided the user has been given, and has exercised, the ability to opt each particular device into the process of collecting audio data and/or video data).
In some embodiments, selecting the input of the machine learning model includes: data (e.g., audio recordings, video recordings) is selected from a plurality of data sources (e.g., recordings from a plurality of electronic devices) based on attributes of the individual data. For example, the audio data with the highest fidelity may be selected as the input to the machine learning model. As another example, video data having a best image of a user's face may be selected as input to a machine learning model. As yet another example, the video data with the best illumination may be selected as an input to a machine learning model. In some embodiments, the plurality of audio recordings or the plurality of video recordings are provided as inputs to a machine learning model.
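A minimal sketch of selecting among candidate recordings based on such attributes is shown below; the attribute names and weights are illustrative assumptions rather than part of the embodiments above.

```python
def select_training_recording(recordings: list) -> dict:
    """Choose the candidate recording with the best overall quality, preferring
    the attributes discussed above (audio fidelity, face visibility, lighting)."""
    def quality(rec: dict) -> float:
        return (
            0.5 * rec.get("audio_fidelity", 0.0)
            + 0.3 * rec.get("face_visibility", 0.0)
            + 0.2 * rec.get("lighting", 0.0)
        )
    return max(recordings, key=quality)
```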
In some embodiments, the collected audio data and video data of the user are sent to an electronic device for presentation as training data to a machine learning model that, after training, produces an output similar to the user-specific speech model 0606 of fig. 6. In some embodiments, the data is filtered prior to, or in conjunction with, training to exclude any data that includes audio or video of users other than user 1002.
In some embodiments, a neural network may be used to train the speech model 0606. Further, in some embodiments, deep learning may be used to train the speech model 0606. Further, in some embodiments, reinforcement learning may be used to train the speech model 0606. In some embodiments, the speech model 0606 may be trained using supervised learning, unsupervised learning, or self-supervised learning, or any combination of the three example types of learning. Self-supervised learning is particularly advantageous in training models to accurately predict missing portions of input data, which is useful in modeling personalized speech synthesis. Finally, in some embodiments, the speech model 0606 may be trained using a set of multiple learning algorithms including, but not limited to, those listed above.
Fig. 10 provides an example of the convenience of the method 1114 discussed with respect to fig. 11. The user need not speak in front of the computer to generate training data, but rather the user can simply use his device as usual (e.g., make a video call, as shown in fig. 10 and discussed briefly above).
FIG. 11 is a detailed flow diagram of a method of passively collecting training data for creating a user-specific speech model, according to some embodiments. As with method 0726 of fig. 7, the operations of method 1114 may be performed by one or more processors of an electronic device (e.g., smart speaker 1204, computer 1206, video conferencing device 1208, smart watch 1210, smart phone 1212, or server 1214 in fig. 12; or a head-mounted virtual reality device, etc.). At least some of the operations shown in fig. 11 correspond to instructions stored in a computer memory or a computer-readable storage medium. The operations of method 1114 may be performed by a single device alone. Alternatively, the operations of method 1114 may be performed in conjunction with one or more processors or hardware components of another device (e.g., smart speaker 1204, computer 1206, video conferencing device 1208, smart watch 1210, smart phone 1212, or server 1214 in fig. 12; or a head-mounted virtual reality device, etc.) communicatively coupled to the device (e.g., through communication network 1202 of fig. 12), or in conjunction with instructions stored in a memory or computer-readable medium of the other device communicatively coupled to the device.
In some embodiments, the method 1114 includes: receiving (1102) a selection-enabling instruction, wherein the selection-enabling instruction indicates that the user agrees to the collection, via one or more electronic devices, of training data associated with the user. The selection-enabling instruction may constitute an opt-in by the user. The selection-enabling instruction may include the user responding to a prompt on the one or more electronic devices. For example, the selection-enabling instruction may include the user selecting the enable button 0916 in fig. 9. The selection-enabling instruction may also include the user responding to a prompt on another device separate from the one or more electronic devices. The training data may include audio data, such as an audio recording of the user speaking. The training data may also include video data, such as a video recording of the user speaking. The one or more electronic devices may include a cellular telephone, a computer, or a video conferencing device (e.g., smart speaker 1204, computer 1206, video conferencing device 1208, smart watch 1210, or smart phone 1212 in fig. 12). The one or more electronic devices may include a microphone or a camera. The selection-enabling instruction may include a consent, an approval, or a confirmation of the opt-in.
The user's consent may also be a specific consent to the passive collection of background or training data, without the need to obtain a specific request from the user each time certain hardware components (e.g., a microphone or camera) of the one or more electronic devices are used to collect video data and/or audio data that is then used as training data.
In some embodiments, the method 1114 further comprises: after receiving (1102) the selection enable instruction: in accordance with determining (1106) that a user is speaking (1106-yes) based on data collected by one or more electronic devices, audio data comprising the user speaking is provided (1108) as training input to a speech model specific to the user. Determining that the user is speaking may include: the identity of the user is identified using voice recognition technology or facial recognition technology (e.g., audio data or video data collected by one or more electronic devices is processed using voice recognition technology or facial recognition technology, respectively). The determining may further include: the user is required to verify his identity, for example by entering a password or completing a biometric (e.g. fingerprint) scan. Alternatively, the method 1114 may include: the identity of the user is authenticated (1104) before determining (1106) whether the user is speaking.
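A minimal sketch of this gating logic is shown below; the recognizer and model interfaces are hypothetical placeholders, not components defined by the embodiments above.

```python
def maybe_collect_training_audio(opted_in: bool, audio_frame, recognizer, model) -> bool:
    """Hypothetical gating for passive collection (method 1114): only after the
    user has opted in (1102) and is determined to be speaking (1106) is the audio
    provided as a training input (1108)."""
    if not opted_in:
        return False
    if not recognizer.user_is_speaking(audio_frame):   # e.g., voice recognition
        return False
    model.add_training_input(audio_frame)              # (1108)
    return True
```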
The techniques of the present disclosure can greatly reduce the burden of collecting training data (e.g., voice data or video data). Currently, collecting training data for creating a speech model typically requires a user to speak into a microphone (or camera) for an extended period of time. Using the passive data collection methods of the present disclosure, however, data collection may be performed while the user is doing something he or she does every day, such as making a video call.
In some embodiments, receiving (1102) the selection-enabling instruction from the user includes: presenting the user with an option to opt in to the collection of training data, or determining that the user has already opted in to the collection of training data. Repeatedly asking the user to opt in to the collection of training data may be annoying to the user. Thus, rather than requiring the user to repeatedly opt in, the method may include maintaining the user's opt-in until the user removes the selection. Receiving (1102) the selection-enabling instruction from the user may then include: determining that the user has already opted in to the collection of training data (i.e., the user is not required to opt in again).
In some embodiments, the data collected by the one or more electronic devices includes video data. In some embodiments, the video data is acquired using one or more cameras of the electronic device. The camera may be external to one or more electronic devices, such as a webcam. The camera may also be wired to the one or more electronic devices or connected to the one or more electronic devices wirelessly (e.g., via Wi-Fi). The camera may also be located inside one or more electronic devices, such as a camera of a smart phone, a camera of a laptop computer, or a camera of a video conferencing device (e.g., a Facebook portal).
In some embodiments, the data collected by the one or more electronic devices includes audio data. The audio data may be collected using microphones of one or more electronic devices. The microphone may be external to one or more electronic devices. The microphone may also be wired to the one or more electronic devices or wirelessly connected (e.g., via bluetooth) to the one or more electronic devices. The microphone may also be located inside one or more electronic devices, such as a microphone of a smart phone, a microphone of a laptop computer, or a microphone of a video telephony device (e.g., facebook portal).
In some embodiments, data collected by one or more electronic devices is processed to allow detection of a user's characteristics while the user is speaking. The user's characteristics may be prosodic characteristics associated with the user (e.g., the user's intonation, volume, etc.). The user's features may also be facial expressions of the user (e.g., frowning, smiling mouth, etc.).
In some embodiments, the characteristics of the user include prosodic characteristics associated with the user while the user is speaking. In some embodiments, the prosodic features are provided as additional training inputs to the user-specific speech model. The speech model may be one or more speech models, each of which is trained to reflect a particular prosodic feature (or set of prosodic features).
In some embodiments, the characteristics of the user include facial expressions associated with the user while the user is speaking. In some embodiments, the method 1114 further comprises: providing the facial expressions of the user as additional training inputs to the user-specific speech model. The speech model may be one or more speech models, each of which is trained to reflect a particular facial expression (or set of facial expressions).
In some embodiments, determining (1106) that the user is speaking includes at least one of: (A) Comparing the audio data with pre-authentication audio data associated with the user identity; and (B) comparing the video data with pre-authentication video data associated with the user identity. Some embodiments may also (or alternatively) authenticate the identity of the user through a biometric scan.
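One hypothetical way to compare collected audio against pre-authenticated audio is via speaker-embedding similarity, as sketched below; the embedding representation and threshold are illustrative assumptions.

```python
def matches_enrolled_user(embedding: list, enrolled_embedding: list,
                          threshold: float = 0.75) -> bool:
    """Compare a speaker embedding of collected audio with a pre-authenticated
    (enrolled) embedding; cosine similarity above a threshold counts as a match."""
    dot = sum(a * b for a, b in zip(embedding, enrolled_embedding))
    norm_a = sum(a * a for a in embedding) ** 0.5
    norm_b = sum(b * b for b in enrolled_embedding) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return False
    return dot / (norm_a * norm_b) >= threshold
```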
Different devices may also be used to ascertain, using audio data and/or video data, which user is speaking and when that user is likely to be speaking. For example, a first electronic device may be better positioned to collect video data that may be used to help determine that a user's mouth is moving, while another electronic device may be in a better position to collect audio data while the user is known to be speaking. In this manner, multiple devices may be used together to facilitate the passive (and user-authorized) collection of video data and audio data discussed herein. In some cases, the electronic devices may also use the authentication data to infer which device should be used to collect which type of data (e.g., as the user moves around a room, different devices may be used to collect audio data and/or video data, depending on which device is receiving data at the highest level of fidelity, or using some other metric to make the evaluation).
In some embodiments, the method 1114 further comprises: upon receiving (1102) a selection enabling instruction, video data including a user speaking is provided (1108) as training input for a speech model specific to the user in accordance with a determination (1106) that the user is speaking (1106-yes) based on data collected by one or more electronic devices.
In some embodiments, providing the audio data includes: collecting first audio data from a first electronic device of the one or more electronic devices, and collecting second audio data from a second electronic device of the one or more electronic devices. In some embodiments, providing the audio data further comprises providing the first audio data and the second audio data as at least one respective training input to the user-specific speech model. The first audio data and the second audio data may be combined into one training input. Alternatively, the first audio data and the second audio data may be provided as two separate training inputs.
In some embodiments, the user-specific speech model is created locally on the one or more electronic devices. To protect the speech model, the speech model may be created (and stored) locally on the one or more electronic devices. Alternatively, the speech model may be stored on a server. To further protect the speech model, the speech model may also be encrypted or stored such that it is inaccessible except for the purpose of performing the methods of the present disclosure.
In some embodiments, the user-specific speech model is created on the following device: the device is on the same local network as the one or more electronic devices. In some embodiments, the user-specific speech model is not transmitted to, or created at, a server on the network that is remote from the one or more electronic devices. In some embodiments, a user-specific speech model is used to generate user-specific synthesized speech. This may be the same user-specific synthesized speech previously discussed in connection with method 0816 in fig. 8, such that the features of the user-specific synthesized speech are also applicable thereto (e.g., a model trained using passively collected data may then be used to create the user-specific synthesized speech for use in connection with the reproduction of text messages). Additionally or alternatively, a user-specific speech model may be used to identify specific speech and then perform user-specific functions at the device (e.g., in conjunction with a user of a digital assistant).
In some embodiments, the method 1114 includes: providing one or more non-human sound samples as training inputs to the user-specific speech model. The non-human sounds may include, for example, glass breaking, metal being pressed, or a musical instrument being played. For example, the method may be used to construct a robot's voice for a movie or video game, where the robot's voice is composed of metal-squeezing sounds, metal-moving sounds, or metal-bending sounds. Alternatively, the method may be used to construct the voice of an anthropomorphized piano, where the piano's voice is composed of sounds made by a piano. The one or more speech models may be trained using non-human sounds rather than user-specific speech data. For example, the method may be used to construct speech for an avatar in a video game or in a virtual reality space, where the user corresponding to the avatar prefers a voice that sounds non-human.
FIG. 12 illustrates an example communication network that may be used to collect data (e.g., audio data, video data) for training a speech model, in accordance with some embodiments. As shown in fig. 12, in some embodiments, the server 1214 is connected to multiple devices through a communication network 1202 (e.g., a Wi-Fi network, a bluetooth network, etc.). The server may be connected to a smart speaker 1204. The server may be connected to a computer 1206 (e.g., a notebook computer, desktop computer, etc.). The server may be connected to a video conferencing device 1208 (e.g., facebook portal, etc.). The server may be connected to a smart watch 1210. And the server may be connected to a smart phone 1212.
Any data collection performed by the devices described herein, and/or any device configured to perform or cause performance of the different embodiments described above with reference to fig. 1A-11 (hereinafter "device") is performed with the consent of the user and in a manner consistent with all applicable privacy laws. The user is provided with the option of allowing the device to collect data, and the option of limiting or rejecting the device from collecting data. The user can choose to enable or disable any data collection at any time. In addition, the user is provided with the option of requesting deletion of any collected data.
It will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the claims. As used in the description of the embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used herein, the term "if" may be interpreted to mean "when … …" or "upon" or "in response to determining" or "in accordance with determining" or "in response to detecting" that the stated precedent of the condition is true, depending on the context. Similarly, the phrase "if it is determined that the condition precedent is true" or "if it is true" or "when the condition precedent is true" may be interpreted to mean "determined" or "in response to a determination" or "according to a determination", "detected" or "in response to a detection" that the condition precedent is true, depending on the context.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of operation and the practical application, thereby enabling others skilled in the art to practice them.

Claims (20)

1. A method of audibly presenting a text message selectively using user-specific synthesized speech, the method comprising:
receiving, at a receiving electronic device associated with a receiving user, a first text message sent by a first sending user of a first sending electronic device;
upon receiving a request to audibly present the first text message by the receiving electronic device:
causing, by the receiving electronic device, an audible presentation of the first text message using the user-specific synthesized speech representing the first sending user in accordance with a determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user;
receiving, at the receiving electronic device associated with the receiving user, a second text message sent by a second sending user of a second sending electronic device; and
upon receiving a request to audibly present the second text message by the receiving electronic device:
an audible presentation of the second text message is caused by the receiving electronic device using a default voice.
2. The method of claim 1, wherein receiving the first text message further comprises: an indication is received from the first sending electronic device that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user.
3. The method of claim 1, wherein the request to audibly present the first text message by the receiving electronic device comprises: a request to audibly present the first text message using the user-specific synthesized speech representing the first sending user.
4. The method of claim 1, wherein causing audible presentation of the first text message using the user-specific synthesized speech comprises:
Identifying an explicit feature of the first text message; and
changing a prosody of the audible presentation based on the explicit feature of the first text message.
5. The method of claim 1, wherein causing audible presentation of the first text message using the user-specific synthesized speech comprises:
identifying implicit features of the first text message; and
changing a prosody of the audible presentation based on the implicit features of the first text message.
6. The method of claim 1, wherein causing audible presentation of the first text message using the user-specific synthesized speech comprises:
identifying a feature of an additional text message in a conversation that includes the first text message; and
changing a prosody of the audible presentation based on the feature of the additional text message.
7. The method of claim 1, wherein the user-specific synthesized speech is synthesized by the receiving electronic device.
8. The method of claim 1, wherein the user-specific synthesized speech is synthesized by the first sending electronic device.
9. The method of claim 1, wherein the user-specific synthesized speech is synthesized by a device other than the receiving electronic device and the first sending electronic device.
10. The method of claim 1, wherein the user-specific synthesized speech is synthesized using a speech model that is encrypted such that the speech model cannot be used for purposes other than audibly presenting text messages.
11. The method of claim 1, wherein the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user comprises: authenticating the identity of the receiving user.
12. The method of claim 1, wherein the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user is based on a strength of a relationship between the receiving user and the first sending user.
13. The method of claim 1, wherein the determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user comprises: determining that the first sending user has granted permission to the receiving user to access the user-specific synthesized speech.
14. The method of claim 1, wherein causing audible presentation of the second text message using the default voice comprises: forgoing a determination of whether the receiving user is authorized to access user-specific synthesized speech representing the second sending user.
15. The method of claim 14, wherein the second text message comprises a text message sent on behalf of a non-human entity.
16. The method of claim 14, wherein the receiving user has requested that a text message be audibly presented using the default voice.
17. A non-transitory computer-readable storage medium comprising instructions that, when executed by a receiving electronic device associated with a receiving user, cause the receiving electronic device to:
receive a first text message sent by a first sending user of a first sending electronic device;
upon receiving a request to audibly present the first text message by the receiving electronic device:
cause an audible presentation of the first text message using the user-specific synthesized speech representing the first sending user in accordance with a determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user;
receive a second text message sent by a second sending user of a second sending electronic device; and
upon receiving a request to audibly present the second text message by the receiving electronic device:
cause an audible presentation of the second text message using a default voice.
18. The non-transitory computer-readable storage medium of claim 17, wherein receiving the first text message further comprises: receiving, from the first sending electronic device, an indication that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user.
19. The non-transitory computer-readable storage medium of claim 17, wherein the request to audibly present the first text message by the receiving electronic device comprises: a request to audibly present the first text message using the user-specific synthesized speech representing the first sending user.
20. A receiving electronic device associated with a receiving user and comprising one or more processors, wherein the one or more processors include instructions that, when executed, cause the receiving electronic device to:
receive a first text message sent by a first sending user of a first sending electronic device;
upon receiving a request to audibly present the first text message by the receiving electronic device:
cause an audible presentation of the first text message using the user-specific synthesized speech representing the first sending user in accordance with a determination that the receiving user is authorized to access the user-specific synthesized speech representing the first sending user;
receive a second text message sent by a second sending user of a second sending electronic device; and
upon receiving a request to audibly present the second text message by the receiving electronic device:
cause an audible presentation of the second text message using a default voice.
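
The claims above describe a selection procedure: play back an incoming text message in the sending user's user-specific synthesized voice when the receiving user is authorized to access it, and fall back to a default voice otherwise. The following Python sketch is illustrative only and is not part of the claimed method; the helper structures (voice_models, permissions) and the permission model are assumptions made for this example, loosely mirroring claims 1, 13, and 14.

# Illustrative sketch (not the claimed implementation): choosing between a
# sender's user-specific synthesized speech and a default voice.
from dataclasses import dataclass

@dataclass
class TextMessage:
    sender_id: str
    body: str

def select_voice(message: TextMessage,
                 receiving_user_id: str,
                 voice_models: dict[str, str],
                 permissions: dict[str, set[str]]) -> str:
    """Return the identifier of the voice model to use for audible playback."""
    # Claim 13: authorization modeled here as an explicit grant by the sending user.
    authorized = receiving_user_id in permissions.get(message.sender_id, set())
    if authorized and message.sender_id in voice_models:
        # Claim 1: user-specific synthesized speech representing the sending user.
        return voice_models[message.sender_id]
    # Claim 14: otherwise use a default voice; for some messages (e.g. those sent
    # on behalf of a non-human entity, claim 15) the authorization check can be
    # skipped entirely.
    return "default-voice"

# Example with hypothetical data: an authorized recipient gets the sender's model.
message = TextMessage(sender_id="alice", body="Running late, see you at 7!")
voice = select_voice(message, "bob",
                     voice_models={"alice": "alice-voice-model"},
                     permissions={"alice": {"bob"}})
# voice == "alice-voice-model"; an unauthorized recipient would get "default-voice".

Keeping the authorization check next to the voice lookup makes the default voice the safe fallback whenever either the grant or the model is missing.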
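
Claims 4 through 6 further describe changing the prosody of the audible presentation based on explicit features of the message, implicit features, and features of other messages in the same conversation. The sketch below shows one way such features could be mapped to coarse prosody parameters; the chosen features and numeric adjustments are assumptions for illustration, not the patented prosody model.

# Minimal sketch, assuming simple surface features; illustrates claims 4-6 only.
import re

def prosody_parameters(message_body: str, conversation: list[str]) -> dict[str, float]:
    """Derive coarse prosody parameters (rate, pitch, volume) from text features."""
    prosody = {"rate": 1.0, "pitch": 0.0, "volume": 1.0}

    # Explicit features (claim 4): exclamation marks or shouted all-caps words.
    if "!" in message_body or re.search(r"\b[A-Z]{3,}\b", message_body):
        prosody["pitch"] += 0.2
        prosody["volume"] += 0.2

    # Implicit features (claim 5): very short messages are read slightly faster.
    if len(message_body.split()) <= 3:
        prosody["rate"] += 0.1

    # Features of additional messages in the conversation (claim 6): a rapid
    # back-and-forth also nudges the speaking rate up.
    if len(conversation) >= 5:
        prosody["rate"] += 0.1

    return prosody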
CN202310763655.3A 2022-06-22 2023-06-26 Method for presenting text messages using a user-specific speech model Pending CN117275453A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/354,648 2022-06-22
US63/407,544 2022-09-16
US202318154778A 2023-01-13 2023-01-13
US18/154,778 2023-01-13

Publications (1)

Publication Number Publication Date
CN117275453A 2023-12-22

Family

ID=89203373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310763655.3A Pending CN117275453A (en) 2022-06-22 2023-06-26 Method for presenting text messages using a user-specific speech model

Country Status (1)

Country Link
CN (1) CN117275453A (en)

Similar Documents

Publication Publication Date Title
US20240054117A1 (en) Artificial intelligence platform with improved conversational ability and personality development
US10614203B2 (en) Robot-human interactive device which performs control for authenticating a user, robot, interaction method, and recording medium storing program
US11430424B2 (en) Generating a voice model for a user
US20200279553A1 (en) Linguistic style matching agent
US12002464B2 (en) Systems and methods for recognizing a speech of a speaker
US9053096B2 (en) Language translation based on speaker-related information
JP6819672B2 (en) Information processing equipment, information processing methods, and programs
US8934652B2 (en) Visual presentation of speaker-related information
US20200319860A1 (en) Computer System and Method for Content Authoring of a Digital Conversational Character
US20130144619A1 (en) Enhanced voice conferencing
CN111542814A (en) Method, computer device and computer readable storage medium for changing responses to provide rich-representation natural language dialog
CN114391145A (en) Personal assistant with adaptive response generation AI driver
US20220231873A1 (en) System for facilitating comprehensive multilingual virtual or real-time meeting with real-time translation
JP2019208138A (en) Utterance recognition device and computer program
US11699043B2 (en) Determination of transcription accuracy
CN114566187B (en) Method of operating a system comprising an electronic device, electronic device and system thereof
CN115088033A (en) Synthetic speech audio data generated on behalf of human participants in a conversation
CN111696538A (en) Voice processing method, apparatus and medium
JP2010034695A (en) Voice response device and method
CN113194203A (en) Communication system, answering and dialing method and communication system for hearing-impaired people
McDonnell et al. “Easier or Harder, Depending on Who the Hearing Person Is”: Codesigning Videoconferencing Tools for Small Groups with Mixed Hearing Status
JP2022531994A (en) Generation and operation of artificial intelligence-based conversation systems
EP4297018A1 (en) Techniques for presenting textual messages using a user-specific voice model
CN117275453A (en) Method for presenting text messages using a user-specific speech model
KR102605178B1 (en) Device, method and computer program for generating voice data based on family relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination