JP2024508033A

JP2024508033A - Instantaneous learning in text-to-speech during dialog

Info

Publication number: JP2024508033A
Application number: JP2023553359A
Authority: JP
Inventors: Vijayaditya Peddinti; Bhuvana Ramabhadran; Andrew Rosenberg; Mateusz Golebiewski
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2021-03-03
Filing date: 2022-02-28
Publication date: 2024-02-21
Also published as: EP4285358A1; US20220284882A1; US11676572B2; US20230274727A1; CN116964662A; KR20230150377A; WO2022187168A1

Abstract

A method for instantaneous learning in text-to-speech (TTS) during dialog includes receiving a user pronunciation (202) of a particular word present in a question spoken by a user. The method also includes receiving a TTS pronunciation (204) of the same particular word present in a TTS input, where the TTS pronunciation of the particular word differs from the user pronunciation of the particular word. The method also includes obtaining user pronunciation-related features (210) and TTS pronunciation-related features (230) associated with the particular word. The method also includes generating a pronunciation decision (250) that selects whichever of the user pronunciation or the TTS pronunciation of the particular word is associated with the highest confidence. The method also includes providing TTS audio that includes a synthesized speech representation of a response to the question, using the user pronunciation or the TTS pronunciation for the particular word.

Description

The present disclosure relates to instantaneous learning in text-to-speech during dialog.

Users frequently interact with voice-enabled devices, such as smartphones, smart watches, and smart speakers, through digital assistants. These digital assistants provide a dialog with users, allowing them to complete tasks and obtain answers to their questions entirely through natural conversational interactions. Ideally, during a dialog between a user and a digital assistant, the user should be able to communicate, via spoken questions directed to the voice-enabled device running the digital assistant, as if the user were talking to another person. The digital assistant provides these spoken questions to an automatic speech recognizer (ASR) system, which processes and recognizes the spoken request so that an action can be performed. Additionally, the digital assistant uses a text-to-speech (TTS) system to convert a textual representation of a response to the question into synthesized speech for audible output from the user's voice-enabled device. Often, during a digital assistant dialog, there is lexical overlap between the spoken question and the corresponding TTS response, whereby the user pronunciation of a word in the spoken question differs from the TTS pronunciation of the same word present in the digital assistant's response to the question when that response is audibly output as synthesized speech.

"Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", J. Shen et al., available at, e.g., https://arxiv.org/abs/1712.05884

One aspect of the present disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations for selecting whether a user pronunciation of a particular word or a text-to-speech pronunciation of the particular word is more reliable for use in text-to-speech audio. The operations include receiving a user pronunciation of a particular word present in a question spoken by a user. The operations also include receiving a TTS pronunciation of the same particular word present in a text-to-speech (TTS) input. The TTS input includes a textual representation of a response to the question, and the TTS pronunciation of the particular word differs from the user pronunciation of the particular word. The operations also include obtaining user pronunciation-related features associated with the user pronunciation of the particular word and obtaining TTS pronunciation-related features associated with the TTS pronunciation of the particular word. The operations also include generating, as output from a pronunciation decision model configured to receive the user pronunciation-related features and the TTS pronunciation-related features as input, a pronunciation decision that selects whichever of the user pronunciation of the particular word or the TTS pronunciation of the particular word is associated with the highest confidence for use in TTS audio. The operations also include providing, for audible output from a user device associated with the user, the TTS audio that includes a synthesized speech representation of the response to the question, using the one of the user pronunciation or the TTS pronunciation for the particular word that was selected by the pronunciation decision output from the pronunciation decision model.
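The selection step described above can be illustrated with a minimal Python sketch. It is not the disclosed model: the names `PronunciationDecision`, `decide_pronunciation`, and `toy_score` are hypothetical, the scorer is a placeholder for the pronunciation decision model, and the tie-breaking rule is an assumption the source does not state.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class PronunciationDecision:
    """Outcome of choosing between two competing pronunciations."""
    source: str        # "user" or "tts"
    confidence: float  # confidence of the selected pronunciation


def decide_pronunciation(
    user_features: Sequence[float],
    tts_features: Sequence[float],
    score: Callable[[Sequence[float], Sequence[float]], Tuple[float, float]],
) -> PronunciationDecision:
    """Select the pronunciation associated with the highest confidence.

    `score` stands in for the pronunciation decision model: it maps both
    feature sets to (user_confidence, tts_confidence).
    """
    user_conf, tts_conf = score(user_features, tts_features)
    if user_conf > tts_conf:
        return PronunciationDecision("user", user_conf)
    # Assumption: ties fall back to the TTS pronunciation.
    return PronunciationDecision("tts", tts_conf)


def toy_score(user_features, tts_features):
    """Hypothetical scorer: the mean of each feature vector."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0
    return mean(user_features), mean(tts_features)
```

A trained model would replace `toy_score`; the surrounding control flow only shows that one of exactly two candidates is returned together with its confidence.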

Implementations of the disclosure may include one or more of the following optional features. In some implementations, the operations further include receiving audio data corresponding to the question spoken by the user and processing the audio data, using an automatic speech recognizer (ASR), to generate a transcription of the question. In these implementations, receiving the user pronunciation of the particular word includes at least one of: extracting the user pronunciation of the particular word from an intermediate state of the ASR while the ASR processes the audio data; extracting, from the audio data, a user acoustic representation of the particular word that conveys the user pronunciation of the particular word; or processing the audio data to generate a user phoneme representation that conveys the user pronunciation of the particular word. The user pronunciation-related features associated with the user pronunciation of the particular word may also include one or more confidence features associated with the ASR recognizing the particular word in the audio data.

In some examples, the user pronunciation-related features associated with the user pronunciation of the particular word include at least one of: a geographic area of the user when the question was spoken by the user; linguistic demographic information associated with the user; or a frequency of using the user pronunciation when pronouncing the particular word in previous questions spoken by the user and/or other users. Receiving the TTS pronunciation of the particular word may include: receiving, as input to a TTS system, the TTS input that includes the textual representation of the response to the question; generating, as output from the TTS system, an initial sample of TTS audio that includes an initial synthesized speech representation of the response to the question; and extracting a TTS acoustic representation of the particular word from the initial sample of TTS audio, the TTS acoustic representation conveying the TTS pronunciation of the particular word.

Optionally, receiving the TTS pronunciation of the particular word may include processing the textual representation of the response to the question to generate a TTS phoneme representation that conveys the TTS pronunciation of the particular word. In some examples, the TTS pronunciation-related features associated with the TTS pronunciation of the particular word include at least one of: a verified preferred pronunciation for the particular word; an unverified pronunciation for the particular word estimated using pronunciation mining from one or more auxiliary information sources; a pronunciation variant feature indicating whether any other variants for pronouncing the particular word exist; or a pronunciation complexity feature indicating a likelihood of user mispronunciation of the particular word.
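One way to picture the two feature groups listed above is as plain data containers that are flattened into a single model input. The following Python sketch is illustrative only: the class names, field names, value ranges, and the `to_vector` encoding are all assumptions, and the categorical fields (geographic area, linguistic demographics) are left out of the vector because the source does not say how they would be encoded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UserPronunciationFeatures:
    asr_confidence: float                          # ASR confidence for the recognized word
    geographic_area: Optional[str] = None          # where the question was spoken
    linguistic_demographics: Optional[str] = None  # e.g. a dialect label (hypothetical)
    usage_frequency: float = 0.0                   # share of earlier questions using this pronunciation


@dataclass
class TtsPronunciationFeatures:
    has_verified_preferred: bool = False  # a verified preferred pronunciation exists
    has_unverified_mined: bool = False    # an unverified pronunciation mined from auxiliary sources exists
    has_other_variants: bool = False      # pronunciation variant feature
    complexity: float = 0.0               # assumed likelihood of user mispronunciation, in [0, 1]


def to_vector(user: UserPronunciationFeatures,
              tts: TtsPronunciationFeatures) -> List[float]:
    """Flatten the numeric and boolean features into one input vector."""
    return [
        user.asr_confidence,
        user.usage_frequency,
        float(tts.has_verified_preferred),
        float(tts.has_unverified_mined),
        float(tts.has_other_variants),
        tts.complexity,
    ]
```

Keeping the user-side and TTS-side features in separate containers mirrors the text, which treats them as two distinct inputs to the pronunciation decision model.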

In some implementations, after generating the pronunciation decision that selects one of the user pronunciation of the particular word or the TTS pronunciation of the particular word, the operations further include receiving explicit feedback from the user indicating which of the user pronunciation or the TTS pronunciation the user prefers for pronouncing the particular word in subsequent TTS outputs, and updating the pronunciation decision model based on the explicit feedback from the user. Here, when the explicit feedback from the user indicates that the user prefers the user pronunciation of the particular word, the operations further include updating the TTS system to use the user pronunciation of the particular word when generating TTS audio that includes the particular word. In some examples, after providing the TTS audio for audible output from the user device, the operations further include: receiving audio data corresponding to a subsequent question, spoken by the user or another user, that includes the particular word; determining implicit user feedback indicating whether or not the user or the other user pronounced the particular word in the subsequent question in the same way as the one of the user pronunciation or the TTS pronunciation that was selected by the pronunciation decision; and updating the pronunciation decision model based on the implicit user feedback.
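The implicit feedback signal described above can be sketched as a comparison between the selected pronunciation and the pronunciation heard in the subsequent question. In this hedged Python sketch, pronunciations are assumed to be phoneme sequences, the phoneme strings in the usage note are illustrative, and `PreferenceCounts` is a hypothetical tally that merely stands in for whatever update the pronunciation decision model actually receives.

```python
from typing import Dict, List, Sequence


def implicit_feedback(selected: Sequence[str], next_spoken: Sequence[str]) -> bool:
    """True when the word in the subsequent question matches the selected pronunciation."""
    return list(selected) == list(next_spoken)


class PreferenceCounts:
    """Toy stand-in for a model update: per-word agreement tallies."""

    def __init__(self) -> None:
        self._counts: Dict[str, List[int]] = {}

    def update(self, word: str, agreed: bool) -> None:
        """Record one explicit or implicit feedback event for `word`."""
        agree, total = self._counts.get(word.lower(), [0, 0])
        self._counts[word.lower()] = [agree + int(agreed), total + 1]

    def agreement_rate(self, word: str) -> float:
        """Fraction of feedback events that agreed with the selected pronunciation."""
        agree, total = self._counts.get(word.lower(), [0, 0])
        return agree / total if total else 0.0
```

For example, if the selected pronunciation of "Bexar" were `["B", "EH", "R"]` and a later question contained the same sequence, `implicit_feedback` would return `True` and `update("Bexar", True)` would raise the agreement rate for that word. Exact sequence equality is a simplification; a real system might compare acoustic representations or allow near matches.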

Another aspect of the present disclosure provides a system that includes data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations for selecting whether a user pronunciation of a particular word or a text-to-speech pronunciation of the particular word is more reliable for use in text-to-speech audio. The operations include receiving a user pronunciation of a particular word present in a question spoken by a user. The operations also include receiving a TTS pronunciation of the same particular word present in a text-to-speech (TTS) input. The TTS input includes a textual representation of a response to the question, and the TTS pronunciation of the particular word differs from the user pronunciation of the particular word. The operations also include obtaining user pronunciation-related features associated with the user pronunciation of the particular word and obtaining TTS pronunciation-related features associated with the TTS pronunciation of the particular word. The operations also include generating, as output from a pronunciation decision model configured to receive the user pronunciation-related features and the TTS pronunciation-related features as input, a pronunciation decision that selects whichever of the user pronunciation of the particular word or the TTS pronunciation of the particular word is associated with the highest confidence for use in TTS audio. The operations also include providing, for audible output from a user device associated with the user, the TTS audio that includes a synthesized speech representation of the response to the question, using the one of the user pronunciation or the TTS pronunciation for the particular word that was selected by the pronunciation decision output from the pronunciation decision model.

This aspect may include one or more of the following optional features. In some implementations, the operations further include receiving audio data corresponding to the question spoken by the user and processing the audio data, using an automatic speech recognizer (ASR), to generate a transcription of the question. In these implementations, receiving the user pronunciation of the particular word includes at least one of: extracting the user pronunciation of the particular word from an intermediate state of the ASR while the ASR processes the audio data; extracting, from the audio data, a user acoustic representation of the particular word that conveys the user pronunciation of the particular word; or processing the audio data to generate a user phoneme representation that conveys the user pronunciation of the particular word. The user pronunciation-related features associated with the user pronunciation of the particular word may include one or more confidence features associated with the ASR recognizing the particular word in the audio data.

In some examples, the user pronunciation-related features associated with the user pronunciation of the particular word include at least one of: a geographic area of the user when the question was spoken by the user; linguistic demographic information associated with the user; or a frequency of using the user pronunciation when pronouncing the particular word in previous questions spoken by the user and/or other users. Receiving the TTS pronunciation of the particular word may include: receiving, as input to a TTS system, the TTS input that includes the textual representation of the response to the question; generating, as output from the TTS system, an initial sample of TTS audio that includes an initial synthesized speech representation of the response to the question; and extracting a TTS acoustic representation of the particular word from the initial sample of TTS audio, the TTS acoustic representation conveying the TTS pronunciation of the particular word.

Optionally, receiving the TTS pronunciation of the particular word may include processing the textual representation of the response to the question to generate a TTS phoneme representation that conveys the TTS pronunciation of the particular word. In some examples, the TTS pronunciation-related features associated with the TTS pronunciation of the particular word include at least one of: a verified preferred pronunciation for the particular word; an unverified pronunciation for the particular word estimated using pronunciation mining from one or more auxiliary information sources; a pronunciation variant feature indicating whether any other variants for pronouncing the particular word exist; or a pronunciation complexity feature indicating a likelihood of user mispronunciation of the particular word.

In some implementations, after generating the pronunciation decision that selects one of the user pronunciation of the particular word or the TTS pronunciation of the particular word, the operations further include receiving explicit feedback from the user indicating which of the user pronunciation or the TTS pronunciation the user prefers for pronouncing the particular word in subsequent TTS outputs, and updating the pronunciation decision model based on the explicit feedback from the user. Here, when the explicit feedback from the user indicates that the user prefers the user pronunciation of the particular word, the operations further include updating the TTS system to use the user pronunciation of the particular word when generating TTS audio that includes the particular word. In some examples, after providing the TTS audio for audible output from the user device, the operations further include: receiving audio data corresponding to a subsequent question, spoken by the user or another user, that includes the particular word; determining implicit user feedback indicating whether or not the user or the other user pronounced the particular word in the subsequent question in the same way as the one of the user pronunciation or the TTS pronunciation that was selected by the pronunciation decision; and updating the pronunciation decision model based on the implicit user feedback.

The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

FIG. 1 is a schematic diagram of an example speech environment.
FIG. 2 is a schematic diagram of an example pronunciation decision model for determining whether a user pronunciation of a particular word or a text-to-speech pronunciation of the particular word is more reliable for use in text-to-speech audio.
FIG. 3 is a schematic diagram of an example text-to-speech model for generating TTS audio.
FIG. 4 is a schematic diagram of an example dialog in which the user pronunciation of a particular word differs from, and is more reliable than, the TTS pronunciation of the same word.
FIG. 5 is a flowchart of an example arrangement of operations for a method of instantaneous learning in text-to-speech during a dialog between a user and a digital assistant.
FIG. 6 is a schematic diagram of an example computing device that may be used to implement the systems and methods described herein.

Like reference symbols in the various drawings indicate like elements.

During a dialog between a user and a digital assistant, such as when the user issues a spoken question to the digital assistant to perform an action and the digital assistant audibly outputs a text-to-speech (TTS) response related to the question, there is often lexical overlap between the spoken question and the TTS response. For example, the user may speak the voice search question "Where is Bexar?", and the corresponding audio data is passed to an automatic speech recognizer (ASR) for conversion into a corresponding transcription, which is interpreted by a natural language understanding (NLU) module and provided as a search query to a search engine to retrieve a result. Using the retrieved result, the digital assistant may generate a TTS input that includes a textual representation of a response to the question. Here, the textual representation of the response is "Bexar is located in Texas", which is provided as the TTS input to a TTS system for generating corresponding TTS audio that includes synthesized speech for audible output from a device associated with the user.
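The lexical overlap in the "Bexar" example can be made concrete with a short Python sketch. The function name `overlapping_words` and the simple lowercase regular-expression tokenizer are assumptions made for illustration; the source does not specify how overlap is detected.

```python
import re
from typing import List


def overlapping_words(transcription: str, response_text: str) -> List[str]:
    """Return words that occur in both texts, in transcription order, without repeats."""
    def tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z']+", text.lower())

    response_vocab = set(tokenize(response_text))
    seen = set()
    overlap = []
    for word in tokenize(transcription):
        if word in response_vocab and word not in seen:
            seen.add(word)
            overlap.append(word)
    return overlap


print(overlapping_words("Where is Bexar?", "Bexar is located in Texas"))
# prints ['is', 'bexar']
```

Function words such as "is" also overlap, so a practical system would likely restrict attention to words whose user and TTS pronunciations actually differ.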

In some cases, the user pronounces a word in the spoken question differently from the way the TTS system would pronounce the same word present in the textual representation of the response to the question provided as the TTS input. The difference between the competing pronunciations can arise for a variety of reasons, such as, to name a few: a lack of normalization between the ASR system and the TTS system, which are generally trained separately and often fall out of sync with respect to updates; rare words/terms (e.g., contact names, newly coined words, etc.) that the TTS system has not been trained to pronounce; words that the user unknowingly mispronounces but that the ASR system is robust enough to recognize accurately and the TTS system pronounces correctly; and words that the user intentionally mispronounces to steer the ASR system.

Implementations herein are directed toward identifying when the user pronunciation of a term spoken by the user in a question differs from the TTS pronunciation of the same term present in a response to the question, and using a pronunciation decision model to determine which of the user pronunciation or the TTS pronunciation to use for pronouncing the word in TTS audio that conveys the response to the user as synthesized speech. As described in greater detail below, the pronunciation decision model may be trained and continuously updated to select between competing pronunciations of the same word based on TTS pronunciation-related features and/or user pronunciation-related features. That is, based on the user pronunciation-related features associated with the user pronunciation of the particular word and the TTS pronunciation-related features associated with the TTS pronunciation of the particular word, the pronunciation decision model may estimate a confidence of the user pronunciation of the particular word and a confidence of the TTS pronunciation of the same particular word to determine which of the user pronunciation or the TTS pronunciation is more reliable.
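Identifying when the two pronunciations of a shared term differ can be sketched as follows, assuming each pronunciation is available as a phoneme sequence keyed by word. The function name `competing_pronunciations` and the phoneme strings in the usage note are hypothetical and not taken from the source.

```python
from typing import Dict, List, Sequence

Phonemes = Sequence[str]


def competing_pronunciations(user_prons: Dict[str, Phonemes],
                             tts_prons: Dict[str, Phonemes]) -> List[str]:
    """Words present in both maps whose phoneme sequences differ, sorted alphabetically."""
    shared = user_prons.keys() & tts_prons.keys()
    return sorted(
        word for word in shared
        if list(user_prons[word]) != list(tts_prons[word])
    )
```

With a user map `{"bexar": ["B", "EH", "R"], "is": ["IH", "Z"]}` and a TTS map that pronounces "bexar" as `["B", "EH", "K", "S", "AA", "R"]` and "is" identically, only "bexar" would be handed to the pronunciation decision model.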

Referring to FIG. 1, in some implementations, a speech environment 100 includes a user 10 speaking a question 12 to a voice-enabled device 110 (also referred to as the device 110 or the user device 110). Specifically, the user 10 is in a dialog with a digital assistant 115 (also referred to as a "digital assistant interface") executing on the user device 110, and the question 12 spoken by the user 10 requests that the digital assistant 115 perform an operation. The user 10 (i.e., the speaker of the question 12) may speak the question 12 to solicit a response from the digital assistant 115 or to have the digital assistant 115 execute a task specified by the question 12. The device 110 is configured to capture sounds from one or more users 10 within the speech environment 100. A voice-enabled system of, or associated with, the device 110 (e.g., the digital assistant 115) may field the question 12, perform one or more operations specified by the question 12, and provide a response to the question as synthesized speech 154 for audible output from the device 110.

Here, the device 110 captures audio data 14 corresponding to the spoken question 12 by the user 10. The device 110 may correspond to any computing device associated with the user 10 and capable of receiving audio data 14. Some examples of user devices 110 include, but are not limited to, mobile devices (e.g., mobile phones, tablets, laptops, e-book readers, etc.), computers, wearable devices (e.g., smart watches), music players, casting devices, smart appliances (e.g., smart televisions), Internet of Things (IoT) devices, remote controls, smart speakers, and the like. The device 110 includes data processing hardware 112 and memory hardware 114 in communication with the data processing hardware 112 and storing instructions that, when executed by the data processing hardware 112, cause the data processing hardware 112 to perform one or more operations related to speech and/or text processing. In some examples, the device 110 includes one or more applications (i.e., software applications), where each application may utilize one or more speech processing systems/models 140, 150, 200 associated with the device 110 to perform various functions within the application. For example, the device 110 executes a digital assistant 115 configured to communicate synthesized playback audio 154 (also referred to as synthesized speech 154) to the user 10 to converse with the user 10 and assist with the performance of various tasks.

The device 110 further includes an audio subsystem with an audio capture device (e.g., a microphone) 116 for capturing and converting audio data 14 within the speech environment 100 into electrical signals, and a speech output device (e.g., a speaker) 118 for communicating an audible audio signal (e.g., a synthesized playback signal 154 from the device 110). While the device 110 implements a single audio capture device 116 in the example shown, the device 110 may implement an array of audio capture devices 116 without departing from the scope of the present disclosure, whereby one or more audio capture devices 116 in the array may not physically reside on the device 110 but may be in communication with the audio subsystem (e.g., peripherals of the device 110). For example, the device 110 may correspond to a vehicle infotainment system that leverages an array of microphones positioned throughout the vehicle. Similarly, the speech output device 118 may include one or more speakers that reside on the device 110, are in communication with the device 110, or a combination in which one or more speakers reside on the device 110 and one or more other speakers are physically removed from the device 110 but in communication with the device 110.

Furthermore, the device 110 may be configured to communicate with a remote system 130 via a network 120. The remote system 130 may include remote resources 132, such as remote data processing hardware 134 (e.g., remote servers or CPUs) and/or remote memory hardware 136 (e.g., remote databases or other storage hardware). The device 110 may use the remote resources 132 to perform various functionality related to speech processing and/or synthesized playback communication. For example, the device 110 is configured to perform speech recognition using an automatic speech recognition (ASR) system 140 and/or text-to-speech conversion using a text-to-speech (TTS) system 150. Additionally, a pronunciation decision model 200 may generate a pronunciation decision 250 that selects between a user pronunciation 202 and a TTS pronunciation 204 for a particular word that is present both in the question 12 spoken by the user 10 and in a TTS input 152 corresponding to a textual representation of a response to the question 12. The TTS system 150 produces TTS audio 154 that includes a synthesized speech representation of the response to the question 12, and the TTS audio 154 pronounces the particular word using whichever of the user pronunciation 202 or the TTS pronunciation 204 was selected by the pronunciation decision 250 output from the pronunciation decision model 200.

These systems/models 140, 150, 200 may reside on the device 110 (referred to as on-device systems) or may reside remotely (e.g., on the remote system 130) while in communication with the device 110. In some examples, some of the systems 140, 150, 200 reside locally or on-device while others reside remotely. In other words, any of the systems 140, 150, 200 may be local, remote, or both, in any combination. For example, when a system 140, 150, 200 is rather large in size or processing requirements, it may reside on the remote system 130. Conversely, when the device 110 can support the size or processing requirements of one or more of the systems 140, 150, 200, those systems may reside on the device 110 using the data processing hardware 112 and/or the memory hardware 114. Optionally, one or more of the systems 140, 150, 200 may reside both locally/on-device and remotely. For example, one or more of the systems 140, 150, 200 may execute on the remote system 130 by default when a connection to the network 120 between the device 110 and the remote system 130 is available, but execute locally on the device 110 instead when the connection is lost or the network 120 is unavailable.
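
The default-remote, fall-back-to-local behavior described above can be sketched as follows. This is a minimal illustration only: the function names `run_with_fallback`, `remote_tts`, `local_tts`, and `flaky_remote` are hypothetical stand-ins, and the disclosure does not specify how availability is detected.

```python
from typing import Callable


def run_with_fallback(remote: Callable[[str], str],
                      local: Callable[[str], str],
                      payload: str,
                      network_available: bool) -> str:
    """Prefer the remote system when the network is up; otherwise run on-device."""
    if network_available:
        try:
            return remote(payload)
        except ConnectionError:
            # Assumption: a dropped connection surfaces as ConnectionError,
            # in which case we fall through to the on-device copy.
            pass
    return local(payload)


def remote_tts(text: str) -> str:
    # Hypothetical stand-in for a system running on the remote system 130.
    return f"remote:{text}"


def local_tts(text: str) -> str:
    # Hypothetical stand-in for the same system running on the device 110.
    return f"local:{text}"


def flaky_remote(text: str) -> str:
    # Simulates losing the connection in the middle of a request.
    raise ConnectionError("network lost")
```

Calling `run_with_fallback(flaky_remote, local_tts, "Bexar is in Texas", True)` returns the on-device result, which mirrors the case where the connection is lost while a request is in flight.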

In the illustrated example, the ASR system 140 receives, as input, the audio data 14 corresponding to the question 12 and processes the audio data 14 to generate, as output, a transcription 142 of the question 12. The ASR system 140 may include natural language understanding (NLU) functionality to perform question interpretation (e.g., semantic analysis) on the transcription 142. The transcription 142 includes a grapheme representation, expressed as a sequence of text, that the digital assistant 115 may use to generate a response to the question 12 and, more specifically, to generate the TTS input 152 corresponding to the textual representation of the response to the question 12. Continuing the earlier example, the user 10 may speak the question "Where is Bexar?", which is captured and converted into corresponding audio data 14 for processing by the ASR system 140 to generate the corresponding transcription 142. While the standard pronunciation of the word Bexar in the Texas context is "[…]", the user 10 unknowingly mispronounced the word Bexar as "[…]" by pronouncing the "X" explicitly, based on its English spelling. Notably, the ASR system 140 is robust and accurately recognizes the user's mispronunciation of the word Bexar as the city in Texas.

After generating the transcription 142 (and performing any NLU functionality), the digital assistant 115 may use the transcription 142 to determine a response to the question 12. For example, to determine the location of a city name (e.g., Bexar), the digital assistant 115 may pass the transcription 142 ("Where is Bexar?"), or a search string that includes identifying portions of the transcription (e.g., "where" and "Bexar"), to a search engine. The search engine may then return one or more search results, which the digital assistant 115 interprets to generate the TTS input 152 that includes the textual representation of the response to the question 12. Notably, the word "Bexar" is present both in the transcription 142 of the question 12 and in the TTS input 152, so the transcription 142 and the textual representation of the TTS input 152 share the same grapheme representation of the word "Bexar". The textual representation may include a sequence of graphemes/characters in a particular natural language. The sequence of graphemes/characters may include letters, digits, punctuation marks, and/or other special characters.
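
Because the transcription 142 and the TTS input 152 share grapheme representations, candidate words for a pronunciation decision can be found with a plain text comparison. The sketch below is an assumption-laden illustration: the helper name `shared_words` and the simple ASCII-letter tokenizer are not from the disclosure, and a production system would need language-aware tokenization.

```python
from __future__ import annotations

import re


def shared_words(transcription: str, tts_input: str) -> set[str]:
    """Return words whose grapheme form appears in both texts (case-insensitive)."""

    def tokenize(text: str) -> set[str]:
        # Assumption: words are runs of ASCII letters and apostrophes.
        return set(re.findall(r"[a-z']+", text.lower()))

    return tokenize(transcription) & tokenize(tts_input)


# Example texts taken from the running example in the description.
overlap = shared_words("Where is Bexar?", "Bexar is in Texas")
```

Here `overlap` contains "bexar" (and the function word "is"). Only words whose user and TTS pronunciations actually differ would go on to a pronunciation decision 250.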

The TTS system 150 may convert the TTS input 152 into corresponding TTS audio 154 (e.g., synthesized speech) that the device 110 will audibly output to communicate the response to the question 12 to the user 10. Prior to audible output of the TTS audio 154, the TTS system 150 and/or the pronunciation decision model 200 may identify that the user pronunciation 202 of the particular word present in the question 12 differs from the TTS pronunciation 204 of the same particular word present in the TTS input 152. In the illustrated example, the user pronunciation 202 of the word Bexar is "[…]", while the TTS pronunciation 204 of the word Bexar is "[…]". In some implementations, the pronunciation decision model 200 obtains user pronunciation-related features (UPF) 210 associated with the user pronunciation 202 of the particular word and TTS pronunciation-related features (TTSPF) 230 associated with the TTS pronunciation 204 of the particular word. As described in more detail below with reference to FIG. 2, many different UPFs 210 and TTSPFs 230 may be obtained and form the basis for generating the pronunciation decision 250. In the present example of FIG. 1, the pronunciation decision model 200 obtains a UPF 210 that includes user statistics indicating that users frequently mispronounce Bexar, as well as a TTSPF 230 indicating that Bexar is widely pronounced as "[…]" in the Texas context. Accordingly, because users commonly mispronounce Bexar and the TTS pronunciation 204 is widely adopted when spoken in the Texas context, the pronunciation decision model 200 may estimate a confidence for the user pronunciation 202 of Bexar that is lower than the confidence estimated for the TTS pronunciation 204 of Bexar. Notably, the TTS pronunciation 204 of "[…]" for the word Bexar present in the textual representation of the TTS response is not verified, but is instead learned by the TTS system 150 through pronunciation mining from one or more auxiliary information sources. For example, the auxiliary information sources may include audio and/or video web sources that can be mined for pronunciations of various terms in various contexts for use by the TTS system 150.

When the TTS system 150 generates the TTS audio 154 from the TTS input 152, the TTS audio 154 includes synthesized speech that approximates how a human would pronounce the words formed by the sequence of graphemes/characters that define the TTS input 152, which includes the textual representation of the response to the question 12. Continuing the example, the pronunciation decision 250 selects the TTS pronunciation 204 of the particular word Bexar, so the TTS system 150 keeps using the TTS pronunciation 204 in the TTS audio 154 and provides the TTS audio 154 for audible output from the user device 110. The TTS audio 154 is therefore audibly output as a synthesized speech representation of the response to the question 12, and it uses the TTS pronunciation 204 of "[…]" for the textual representation of the word Bexar present in the TTS input 152.

In some examples, after providing the TTS audio 154 for audible output using the selected one of the user pronunciation 202 or the TTS pronunciation 204 for the particular word, the pronunciation decision model 200 and/or the TTS system 150 prompts the user 10 to provide explicit feedback indicating whether the user 10 agrees with the pronunciation decision 250 or instead prefers the other of the user pronunciation 202 or the TTS pronunciation 204, that is, the one the pronunciation decision 250 did not select. More specifically, the explicit feedback indicating a preferred pronunciation for the particular word may be used as a TTSPF 230 obtained by the pronunciation decision model 200 when making a subsequent pronunciation decision 250 between competing user and TTS pronunciations of the particular word. The user 10 may be prompted for explicit feedback when the model 200 estimates confidences for the user and TTS pronunciations 202, 204 that are close to one another (e.g., within a threshold confidence range) and/or when the confidences fail to satisfy a confidence threshold. The explicit feedback may be used to update the pronunciation decision model 200 and/or the TTS system 150. Before the pronunciation decision model 200 and/or the TTS system 150 is updated, additional verification steps may be performed to confirm that the preferred pronunciation for the particular word is a valid variant and/or that the explicit feedback is not an adversarial user correction.
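
The prompting condition described above (confidences that are close to one another, and/or a confidence that fails a threshold) can be sketched as a small predicate. The function name `should_prompt_for_feedback` and the default values `closeness=0.1` and `min_conf=0.6` are illustrative assumptions, and applying the threshold to the higher of the two confidences is one possible reading of the text rather than something the disclosure specifies.

```python
def should_prompt_for_feedback(user_conf: float,
                               tts_conf: float,
                               closeness: float = 0.1,
                               min_conf: float = 0.6) -> bool:
    """Decide whether to ask the user which pronunciation they prefer."""
    # Condition 1: the two confidences fall within a threshold range of each other.
    too_close = abs(user_conf - tts_conf) <= closeness
    # Condition 2 (assumed reading): even the winning confidence is below threshold.
    too_weak = max(user_conf, tts_conf) < min_conf
    return too_close or too_weak
```

With these hypothetical defaults, confidences of 0.55 and 0.50 trigger a prompt because they are close, while 0.9 and 0.2 do not.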

In an alternative implementation of the above example, if the TTSPF 230 indicates that the user pronunciation 202 of "[…]" for the word Bexar was previously designated by the user 10 as the preferred pronunciation, the pronunciation decision model 200 would likely estimate a higher confidence for the user pronunciation 202 than for the TTS pronunciation 204, and as a result the TTS system 150 would modify the TTS audio 154 to adopt the user pronunciation 202. In this implementation, if the TTS system 150 is a local/custom TTS system 150 specific to the user, the TTSPF 230 indicating the user pronunciation 202 of "[…]" as the preferred pronunciation for the word Bexar may result in retraining the TTS system 150 so that it initially generates the TTS audio 154 by pronouncing the word Bexar as "[…]" whenever the word is present in the TTS input 152.

The TTS system 150 may include any type of TTS system capable of converting input text (e.g., the TTS input) into a corresponding synthesized speech representation as TTS audio 154. For example, the TTS system 150 may use a parametric TTS model, or a TTS model 300 (the TTS model of FIG. 3) that uses a deep neural network (e.g., an attention-based Tacotron network) to generate the TTS audio 154 for audible output as synthesized speech. In some implementations, the TTS model processes embeddings, which are encoded representations of speech features (e.g., features of the TTS input 152), to generate an audio waveform (e.g., a time-domain audio waveform that defines the amplitude of the audio signal over time) representing the TTS audio 154. Once the TTS audio 154 is generated, the TTS system 150 communicates it to the device 110 so that the device 110 can output the TTS audio 154 as a synthesized speech representation of the response to the question 12. For example, the device 110 audibly outputs, from one or more speakers 118, the TTS audio 154 of "Bexar is in Texas" using the TTS pronunciation 204 of "[…]" selected by the pronunciation decision 250 output from the model 200. Here, the TTS model 300 of the TTS system 150 is configured to control speech-related attributes of the synthesized speech 154. In other words, the TTS model 300 is configured to simulate the voice of a human speaker in terms of naturalness while also being able to generate diverse synthesized speech by modeling fine-grained latent features. Although FIG. 1 depicts an example of the TTS system 150 in the context of a digital assistant 115 application, the TTS system 150 (e.g., using the TTS model 300) is applicable to other text-to-speech scenarios, such as voice search, navigation, or reading documents aloud.

In some implementations, when the TTS model 300 associated with the TTS system 150 resides on the user device 110 but is trained as a global model shared across users within a geographic region, federated learning techniques are used to update/retrain the model 300. For example, the user device 110 may retrain/update the parameters of an on-device version of the TTS model 300 that executes locally, and then share the updated parameters or training losses with a server in order to update the global TTS model 300. In doing so, the global TTS model 300 can be retrained/updated using parameter updates/training losses received from multiple users, without requiring users across the user population to share their audio data, question content, or other information they may not wish to share.
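
The disclosure does not name a specific federated learning algorithm. As one possible illustration, the sketch below uses weighted federated averaging, a common choice swapped in here as an assumption. Each device sends only a parameter delta, and the server combines the deltas without seeing audio data or question content. The names `federated_average` and `apply_update`, and the use of flat Python lists for parameters, are hypothetical.

```python
from __future__ import annotations


def federated_average(client_updates: list[list[float]],
                      client_weights: list[float]) -> list[float]:
    """Weighted mean of per-client parameter deltas (one list of floats per client)."""
    total = sum(client_weights)
    n_params = len(client_updates[0])
    return [
        sum(w * update[i] for update, w in zip(client_updates, client_weights)) / total
        for i in range(n_params)
    ]


def apply_update(global_params: list[float], avg_delta: list[float]) -> list[float]:
    """Server-side step: add the averaged delta to the global parameters."""
    return [p + d for p, d in zip(global_params, avg_delta)]


# Two hypothetical devices; the second trained on three times as many examples.
avg = federated_average([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
new_global = apply_update([0.0, 1.0], avg)
```

Weighting by the number of local examples is a design choice of this sketch; equal weights would also be a valid instance of the same idea.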

Referring to FIG. 3, in some examples the TTS model 300 includes an encoder-decoder network architecture having an encoder 302 and a decoder 304. In some implementations, the encoder-decoder 302, 304 structure corresponds to the sequence-to-sequence recurrent neural network (RNN) of Tacotron 2 (described, e.g., in Shen, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, available at https://arxiv.org/pdf/1712.05884.pdf, which is incorporated herein by reference). In some configurations, the encoder 302 is configured to receive, as input, the TTS input 152 or an embedding (e.g., a character embedding) corresponding to the TTS input 152, and to generate, as output, a context vector Vc for each mel-frequency spectrogram that the decoder 304 will later generate. The context vector Vc may be of fixed length and generally defines features that appear at particular positions corresponding to the sequence of characters forming the textual representation of the TTS input 152. In some configurations, the TTS input 152 includes a grapheme sequence that is first converted into a corresponding phoneme sequence (e.g., via a normalization engine such as a grapheme-to-phoneme model) before being input to the encoder 302.

The encoder 302 may include one or more convolutional layers followed by a bidirectional long short-term memory (LSTM) layer. The neurons in each convolutional layer may receive input from a small subset of the neurons in the previous layer. In this regard, neuron connectivity allows the convolutional layers to learn filters that activate when particular hidden features appear at positions in the sequence of characters corresponding to the textual representation of the TTS input 152. In some implementations, the filters in each convolutional layer may span a series of characters (e.g., four, five, or six characters). Each convolutional layer may be followed by batch normalization and a rectified linear unit (ReLU). When the encoder 302 includes one or more convolutional layers, a bidirectional LSTM layer may follow these convolutional layers. Here, the bidirectional LSTM is configured to process the hidden features generated by the final convolutional layer in order to generate a sequential feature representation of the sequence of characters corresponding to the TTS input 152. The sequential feature representation may include a sequence of feature vectors.

In some implementations, the encoder 302 also includes an attention network configured to receive the sequential feature representation produced within the encoder 302 and to process it to generate a context vector Vc for each decoder output step. That is, the attention network can generate a fixed-length context vector Vc for each frame of the mel-frequency spectrogram that the decoder 304 will later generate. A frame is a unit of the mel-frequency spectrogram based on a small portion of the input signal (e.g., a 10-millisecond sample). The architecture of the attention network may vary depending on the particular TTS system 150. Examples of attention networks include additive attention networks, location-sensitive attention networks, Gaussian mixture model (GMM) attention networks (e.g., to improve generalization to long utterances), forward attention networks, stepwise monotonic attention networks, and dynamic convolution attention networks. With an attention network, the TTS model 300 may be able to generate an output sequence (e.g., a sequence of output log-mel spectrogram frames) based on additional inputs (e.g., with a speech embedding e) that receive particular attention weights in order to generate the context vector Vc.
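
To make the idea of a fixed-length context vector per decoder step concrete, the sketch below computes one context vector as an attention-weighted sum of encoder feature vectors. It deliberately uses plain dot-product scoring, which is a simplification chosen for brevity and is not one of the attention variants listed above. The names `softmax` and `context_vector` and the toy inputs are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(decoder_state, encoder_features):
    """Return (context, weights) for one decoder output step.

    encoder_features: one feature vector per input position.
    decoder_state: query vector with the same dimensionality.
    """
    # Score each input position against the current decoder state.
    scores = [sum(q * k for q, k in zip(decoder_state, feat))
              for feat in encoder_features]
    weights = softmax(scores)
    dim = len(encoder_features[0])
    # The context has length `dim` no matter how many positions there are.
    context = [sum(w * feat[d] for w, feat in zip(weights, encoder_features))
               for d in range(dim)]
    return context, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = context_vector([0.0, 0.0], features)
```

The length of `context` depends only on the feature dimensionality, not on the number of input characters, which is what allows the decoder to consume one fixed-length vector per spectrogram frame.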

The decoder 304 is configured as a neural network (e.g., an autoregressive recurrent neural network) to generate an output audio signal As (e.g., an output sequence of mel-frequency spectrograms) of expressive speech that includes the intended speech-related attributes (e.g., the TTS pronunciation for each word present in the TTS input 152, the intended prosody, and/or speech characteristics). For example, based on the context vectors Vc, the decoder 304 predicts a representation of the speech signal (e.g., mel frames or spectrogram frames) from the encoded representation generated by the encoder 302. That is, the decoder 304 is configured to receive, as input, one or more context vectors Vc and may generate, for each context vector Vc, a corresponding frame of a mel-frequency spectrogram, where a mel-frequency spectrogram is a frequency-domain representation of sound. In some examples, the decoder 304 includes an architecture similar to that of Tacotron 2. In other words, the decoder 304 may include an architecture having a pre-net, a long short-term memory (LSTM) subnetwork, a linear projection, and a convolutional post-net.

In some configurations, the TTS model 300 also includes a speech synthesizer 306 (also referred to as synthesizer 306). The synthesizer 306 can be any network configured to receive a mel-frequency spectrogram and to generate output samples of the TTS audio 154 as synthesized speech based on the mel-frequency spectrogram. In some other implementations, the synthesizer 306 includes a vocoder. For example, the speech synthesizer 306 may include a WaveRNN vocoder (e.g., as described in "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions" by J. Shen et al., available at, e.g., https://arxiv.org/abs/1712.05884). Here, the WaveRNN vocoder may generate a 16-bit signal sampled at 24 kHz, conditioned on the spectrograms predicted by the TTS model 300. In some other implementations, the synthesizer 306 is a trainable spectrogram-to-waveform inverter. After the synthesizer 306 generates the waveform, an audio subsystem can use the waveform to generate the TTS audio 154 and provide it for audible playback as synthesized speech (e.g., on the device 110), or can provide the generated waveform to another system so that the other system can generate and play back the TTS audio 154. Generally speaking, the synthesizer 306 has little or no effect on the resulting pronunciation, prosody, and/or style of the synthesized speech 154. In practice it affects only the audio fidelity of the TTS audio 154, because the synthesizer 306 converts a representation of the speech signal (e.g., the mel frames or spectrogram frames output by the decoder 304) into a waveform.
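
The 16-bit, 24 kHz output format mentioned above can be illustrated without any vocoder. The sketch below does not implement WaveRNN or any spectrogram inversion. It only shows how floating-point waveform samples might be packed as signed 16-bit PCM at a 24 kHz sample rate, with a sine tone standing in for vocoder output. The helper name `to_pcm16` and the little-endian byte order are assumptions.

```python
import math
import struct

SAMPLE_RATE_HZ = 24_000  # sample rate stated in the description


def to_pcm16(samples):
    """Clip float samples to [-1, 1] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


# 10 ms of a 440 Hz tone as a placeholder for synthesized speech samples.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
        for n in range(SAMPLE_RATE_HZ // 100)]
pcm = to_pcm16(tone)
```

At 24 kHz, 10 ms of audio is 240 samples, so `pcm` holds 480 bytes (two bytes per 16-bit sample).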

FIG. 2 shows a schematic view of an example pronunciation decision model 200 that generates a pronunciation decision 250 selecting which of the user pronunciation 202 of a particular word or a different TTS pronunciation 204 of the same particular word is more reliable for use in the TTS audio 154. The pronunciation decision model 200 may execute on the user device 110, the remote system 130, or a combination thereof. In some implementations, the pronunciation decision model 200 is user-specific in that it is customized to make pronunciation decisions 250 specific to the user 10, based on learning from dialogues/interactions between the user 10 and the digital assistant 115. In other implementations, the pronunciation decision model 200 is shared by multiple users and learns on the fly to base its pronunciation decisions 250 on the dialogues that all users 10 have with their digital assistants 115.

The pronunciation decision model 200 may receive the user pronunciation 202 for the particular word using various techniques. For example, the user pronunciation 202 of the particular word may be extracted from an intermediate state of the ASR system 140 while the ASR system 140 processes the audio data 14 corresponding to the question 12 spoken by the user. In another implementation, the model 200 receives the user pronunciation 202 of the particular word by extracting a user acoustic representation of the particular word from the audio data. Here, the user acoustic representation conveys the user pronunciation 202 of the particular word. In yet another implementation, the audio data 14 corresponding to the question 12 is processed to generate a user phoneme representation that conveys the user pronunciation of the particular word. The present disclosure is not limited to any specific technique for obtaining user pronunciations of words.

The pronunciation decision model 200 may likewise receive the TTS pronunciation 204 for the particular word using various techniques. In one example, the pronunciation decision model 200 receives a phoneme representation that conveys the TTS pronunciation 204 of the particular word. For example, the TTS system 150 may include a grapheme-to-phoneme model (e.g., the normalization engine of FIG. 3) that converts the grapheme representation of the TTS input 152 into a corresponding phoneme representation. In another example, a TTS acoustic representation (e.g., a waveform) of the particular word is extracted from an initial sample of TTS audio that is generated by the TTS system 150 but not audibly output. Here, the acoustic representation conveys the TTS pronunciation 204 of the particular word. The TTS acoustic representation may also be extracted from the audio signal As (e.g., the output sequence of mel-frequency spectrograms) generated as output from the decoder 304 of the TTS model 300 of FIG. 3. In yet another example, the TTS pronunciation 204 is extracted from an intermediate state of the TTS system 150 while it processes the textual representation of the TTS input 152 for conversion into speech. For example, the extracted intermediate state of the TTS system 150 may include the sequence of context vectors Vc output from the encoder 302 of the TTS model 300 of FIG. 3.
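
When both pronunciations are available as phoneme representations, one simple way to check whether they differ is an edit distance over the two phoneme sequences. The disclosure does not prescribe a comparison method, so the Levenshtein distance used here is an assumption, as are the helper names. The example word and its ARPAbet-style phonemes are illustrative and are not taken from the source.

```python
def phoneme_edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pronunciations_differ(user_phonemes, tts_phonemes) -> bool:
    """True when the two phoneme sequences are not identical."""
    return phoneme_edit_distance(user_phonemes, tts_phonemes) > 0


# Illustrative example: two common pronunciations of "tomato".
user = ["T", "AH", "M", "EY", "T", "OW"]
tts = ["T", "AH", "M", "AA", "T", "OW"]
distance = phoneme_edit_distance(user, tts)
```

A distance rather than a plain equality test leaves room for a tolerance, for example treating very small differences as the same pronunciation, although whether that is desirable depends on the application.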

For a particular word present in both the question 12 spoken by the user 10 and the TTS input 152 that includes the textual representation of the response to the question 12, the TTS system 150 or the pronunciation decision model 200 may identify when the user pronunciation 202 and the TTS pronunciation 204 of the particular word differ from one another and therefore compete over which will ultimately be used to pronounce the particular word in the TTS audio 154 provided for audible output as synthesized speech. The pronunciation decision model 200 is trained to generate a pronunciation decision 250 based on one or more user pronunciation-related features (UPFs) 210 and/or one or more TTS pronunciation-related features (TTSPFs) 230 associated with the particular word. Here, the pronunciation decision 250 selects whichever of the user pronunciation 202 or the TTS pronunciation 204 is associated with the highest confidence for use in pronouncing the particular word in the TTS audio 154 that conveys the response to the question 12. In some examples, the pronunciation decision model 200 includes a user pronunciation confidence estimator 220 and a TTS pronunciation confidence estimator 240. In these examples, the user pronunciation confidence estimator 220 estimates a user pronunciation confidence 222 for the user pronunciation 202 of the particular word, indicating a likelihood that the user pronunciation 202 is preferred. The TTS pronunciation confidence estimator 240, in turn, estimates a TTS pronunciation confidence 242 for the TTS pronunciation 204 of the particular word, indicating a likelihood that the TTS pronunciation 204 is preferred. In some implementations, the pronunciation decision model 200 generates the pronunciation decision 250 as a log-likelihood ratio between the user pronunciation confidence 222 and the TTS pronunciation confidence 242.
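The log-likelihood-ratio form of the pronunciation decision 250 can be illustrated with a minimal Python sketch. It assumes the two confidences are probabilities in (0, 1]. The function name `pronunciation_decision`, the `eps` clamp, and the tie-break toward the TTS pronunciation are hypothetical choices made for illustration and are not specified in this disclosure.

```python
import math


def pronunciation_decision(user_confidence: float,
                           tts_confidence: float,
                           eps: float = 1e-9) -> tuple[str, float]:
    """Return (selected, llr), where llr = log(user_confidence / tts_confidence)."""
    # Clamp both confidences into [eps, 1.0] so the logarithm is defined
    # (eps is an illustrative assumption).
    user = min(max(user_confidence, eps), 1.0)
    tts = min(max(tts_confidence, eps), 1.0)
    llr = math.log(user) - math.log(tts)
    # A positive ratio favors the user pronunciation; ties fall back to the
    # TTS pronunciation (an assumption made for this sketch).
    selected = "user" if llr > 0.0 else "tts"
    return selected, llr


selected, llr = pronunciation_decision(0.9, 0.3)
print(selected, round(llr, 3))
```

Working in the log domain makes the sign of the ratio the decision and its magnitude a rough measure of how decisive the evidence is.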

The UPFs 210 associated with the user pronunciation 202 of the particular word may include a number of different features that convey information useful to the pronunciation decision model 200 in ascertaining the reliability (or unreliability) of the pronunciation of the particular word as spoken by the user 10 in the question 12. For instance, the transcription 142 of the question 12 may convey the word type of the particular word and the context of the particular word in view of the other words recognized in the transcription. Additionally, linguistic demographic information associated with the user 10 may convey the user's proficiency in a particular language. For instance, a user 10 who speaks American English as a native language may be prone to mispronouncing the name of an Italian restaurant, whereas a bilingual speaker of English and Italian is unlikely to mispronounce Italian names or words. The UPFs 210 may provide any type of demographic information associated with the user 10, such as age and gender. The geographic region in which the user 10 is located when the question 12 is spoken may also serve as a UPF 210 associated with the user pronunciation 202 of the particular word. For instance, for a question 12 requesting navigation instructions to a street address with an unusual name, the TTS system 150 may tend to mispronounce the unusual name, because the name may be scarce or absent in the training examples used to train the TTS system 150. Accordingly, a UPF 210 specifying that the user's geographic region is near the requested street address serves as a strong indicator that the user pronunciation 202 of the unusual name may be more reliable than the TTS pronunciation 204 of the unusual name. The user may explicitly grant access to the user's geographic region and may deny that access at any time. In this scenario, the geographic region of the user 10 provided by the UPFs 210 may be combined with a TTSPF 230 indicating that the location/dialect on which the TTS system 150 was trained lies outside the geographic region associated with the requested street address, thereby further increasing the user pronunciation confidence 222 and/or decreasing the TTS pronunciation confidence 242.

Moreover, historical information providing the frequency with which the user pronunciation 202 is used to pronounce the particular word in previous questions spoken by the user and/or other users may serve as a UPF 210 provided as input to the pronunciation decision model 200. For instance, multiple users across a geographic region frequently using the same user pronunciation 202 for a particular word may indicate an emerging pronunciation and/or an unknown dialectal variation on which the TTS system 150 and/or the ASR system 140 has not been trained. In addition to improving the confidence of the pronunciation decision 250 generated by the model 200, knowledge of emerging pronunciations and/or unknown dialectal variations in pronouncing the particular word may be used to update the TTS system 150 so that it learns to produce TTS audio 154 reflecting the user pronunciation 202 of the particular word.

In some examples, establishing that the user 10 has used the user pronunciation 202 of the particular word in previous questions 12 with at least a threshold frequency serves as implicit user feedback indicating the user pronunciation 202 as the preferred pronunciation of the particular word. For instance, following one or more previous dialog sessions in which the decision model 200 selected the TTS pronunciation 204 over a different user pronunciation 202 to pronounce the particular word in the TTS audio 154, subsequent questions 12 in which the user 10 and/or other users continued to use the user pronunciation 202 to pronounce the particular word may indicate that the user 10 and/or the other users prefer the user pronunciation 202 over the TTS pronunciation 204. This implicit feedback may further be used to train/update the TTS system 150.
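The threshold-frequency idea behind implicit feedback can be sketched in Python as follows. This is only an illustration under stated assumptions: the history format (pairs of word and pronunciation label), the function name `implicit_preference`, and the `min_uses` and `min_share` thresholds are hypothetical and do not come from this disclosure.

```python
from collections import Counter


def implicit_preference(history, word, min_uses=3, min_share=0.6):
    """Return the pronunciation implicitly preferred for `word`, or None.

    `history` is an iterable of (word, pronunciation) pairs observed in
    previous questions (a hypothetical log format).
    """
    counts = Counter(pron for w, pron in history if w == word)
    total = sum(counts.values())
    if total < min_uses:
        return None  # too few observations to infer a preference
    pronunciation, uses = counts.most_common(1)[0]
    # Require the dominant pronunciation to clear the share threshold.
    return pronunciation if uses / total >= min_share else None


history = [("word", "user_variant")] * 4 + [("word", "tts_variant")]
print(implicit_preference(history, "word"))
```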

In some additional implementations, the UPFs 210 associated with the user pronunciation 202 of the particular word include one or more ASR confidence features indicating a likelihood that the recognition of the particular word in the audio data 14 by the ASR system 140 is correct. The pronunciation decision model 200 may obtain the ASR confidence features directly from the ASR system 140 used to generate the transcription 142 of the question 12. The ASR confidence features may include, without limitation, a likelihood/posterior score associated with recognizing the particular word, a confidence score (obtained by the ASR system 140 or by an auxiliary confidence model) indicating a likelihood that the particular word was correctly recognized, and any recognition confusability associated with recognizing the word. Here, recognition confusability may be ascertained from a posterior word lattice of candidate recognition results for the transcription 142 and/or from the entropy of the posterior distribution over possible recognition results of a neural network model of the ASR system 140.
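As an illustration of the entropy-based confusability measure mentioned above, the following Python sketch computes the Shannon entropy of a posterior over candidate recognition results. The function name `recognition_confusability` and the plain-list input are assumptions for this sketch; how an actual ASR system exposes its posteriors is not specified here.

```python
import math


def recognition_confusability(posteriors):
    """Shannon entropy (in nats) of a posterior over candidate recognitions.

    Higher entropy means probability mass is spread across more candidates,
    i.e. the recognition of the word is more confusable.
    """
    if any(p < 0.0 for p in posteriors):
        raise ValueError("posteriors must be non-negative")
    total = sum(posteriors)
    if total <= 0.0:
        raise ValueError("posteriors must contain positive mass")
    probs = [p / total for p in posteriors]  # normalize defensively
    return -sum(p * math.log(p) for p in probs if p > 0.0)


peaked = recognition_confusability([0.97, 0.02, 0.01])
flat = recognition_confusability([0.4, 0.3, 0.3])
print(peaked < flat)
```

A peaked posterior (low entropy) would support a high ASR confidence feature, while a flat one would signal that the recognized word, and hence the user pronunciation, deserves less trust.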

In one example, a previous misrecognition of the particular word by the ASR system 140 when pronounced using the TTS pronunciation 204, followed by a subsequent successful recognition of the particular word when pronounced using the user pronunciation 202 instead, may indicate that the user pronunciation 202 is unreliable, since the user 10 may have chosen the user pronunciation 202 simply to spoof the ASR system 140 and avoid the misrecognition error. Such knowledge may be used to update the ASR system 140 to become robust in recognizing the TTS pronunciation 204 of the particular word. Additionally or alternatively, knowledge that a previous misrecognition of the particular word by the ASR system 140 when pronounced using the user pronunciation 202 was followed by a subsequent recognition of the particular word when pronounced using the TTS pronunciation 204 instead may be used to update the ASR system 140 to become robust in recognizing the user pronunciation 202.

Similar to the UPFs 210, the TTSPFs 230 associated with the TTS pronunciation 204 of the particular word may include a number of different features that convey information useful to the pronunciation decision model 200 in ascertaining the reliability (or unreliability) of using the TTS pronunciation 204 to pronounce the particular word present in the TTS input 152 received by the TTS system 150. For instance, the TTS input 152 that includes the textual representation of the response to the question 12 may convey the word type of the particular word and the context of the particular word in view of the other words present in the TTS input 152. Additionally, the geographic region/dialect on which the TTS system 150 was trained may be provided to the pronunciation decision model 200 as a TTSPF 230.

In some examples, a TTSPF 230 indicates whether the TTS pronunciation 204 includes a verified preferred pronunciation for the particular word. A TTS pronunciation 204 verified as a preferred pronunciation may serve as a strong indicator for increasing the TTS pronunciation confidence 242 estimated by the TTS pronunciation confidence estimator 240 for the TTS pronunciation 204 and decreasing the user pronunciation confidence 222 estimated by the user pronunciation confidence estimator 220 for the user pronunciation 202. The preferred pronunciation for a particular word may be manually verified during training of the TTS system 150 by training the TTS system 150 on training sample pairs that map audio of the preferred pronunciation of the particular word to a corresponding grapheme representation. For instance, if a user 10 who does not speak Swedish as a native language is traveling in Sweden and speaks a question 12 for directions to a Swedish city but mispronounces the city name, a TTSPF 230 indicating that the TTS system 150 is trained on Swedish and that the TTS pronunciation 204 of the city name is verified will increase the confidence that the TTS pronunciation 204 is more reliable than the user pronunciation 202.

Preferred pronunciations may also include user-verified preferred pronunciations, in which the user 10 provides a preferred pronunciation for a word specified by the user and configures the TTS system 150 to use the preferred pronunciation to pronounce the word in TTS audio 154. This scenario is common for custom names or contact names that may have unique pronunciations that the TTS system 150 would otherwise pronounce differently based solely on the grapheme representation of the name. Likewise, the ASR system 140 may have difficulty accurately recognizing these contact names because of their unique pronunciations. To provide a preferred pronunciation for a particular word (e.g., a name or other proper noun), the user may speak the word using the preferred pronunciation, and the TTS system 150 may map the preferred pronunciation to the corresponding grapheme representation so that the TTS system 150 pronounces the word using the preferred pronunciation whenever the corresponding grapheme representation is present in the TTS input 152.

In other examples, the user 10 provides a preferred pronunciation for the particular word via explicit feedback during a previous dialog session with the digital assistant 115. In these examples, after audible output of TTS audio 154 conveying a response to a question in which the particular word is pronounced differently from how the user 10 pronounced it in the spoken question, the digital assistant 115 may prompt the user 10 to provide explicit feedback indicating which pronunciation the user prefers for pronouncing the word in subsequent TTS audio 154. In another scenario, without being prompted, a user 10 who is not satisfied with the pronunciation of a particular word in the TTS audio 154 may provide explicit feedback by speaking a response such as "You didn't say that correctly" or "You mean [using the preferred pronunciation]?". A TTSPF 230 indicating that the TTS pronunciation 204 of the particular word includes a user-verified preferred pronunciation may seem at odds with the user consciously choosing a different user pronunciation 202 of the particular word when speaking the question 12. However, the user 10 may intentionally mispronounce the particular word, or at least pronounce it inconsistently with the pronunciation the user 10 prefers, in an effort to spoof the ASR system 140 and avoid a misrecognition error.

In some implementations, a TTSPF 230 indicates whether the TTS pronunciation 204 includes an unverified pronunciation for the particular word. An unverified pronunciation may increase the confidence in the reliability of the TTS pronunciation, but to a lesser degree than a verified pronunciation. The TTS pronunciation 204 for a particular word may be unverified when the pronunciation is learned/estimated by the TTS system 150 via pronunciation mining from one or more auxiliary information sources. For instance, in the example above, the TTS system 150 may learn the TTS pronunciation 204 for the word/term Bexar as an unverified pronunciation estimated using pronunciation mining from audio and/or video information sources that correctly use the standard pronunciation " " in the Texas context.

Additionally or alternatively, other TTSPFs 230 input to the pronunciation decision model 200 for making the pronunciation decision 250 include, without limitation, pronunciation variant features and pronunciation complexity features. A pronunciation variant feature may indicate whether any other variants exist for pronouncing the particular word. These variants of the pronunciation of the particular word may be learned by the pronunciation decision model 200 over time. Different pronunciations may be mapped to different contexts. For instance, the majority of speakers may use one pronunciation of a particular word, while speakers in a particular geographic region exclusively pronounce the same word differently. A pronunciation complexity feature, on the other hand, may indicate a likelihood of user mispronunciation of the particular word. The TTS system 150 may provide a pronunciation complexity feature indicating a strong likelihood of user mispronunciation by identifying particular phoneme sequences/representations that users may find difficult to pronounce. The TTSPFs 230 may further include mispronunciation statistics indicating how often the user population mispronounces the particular word in questions.

FIG. 4 provides a schematic view 400 of an example dialog between the user 10 and the digital assistant 115 executing on the user device 110, in which the user pronunciation 202 of a particular word present in the question 12 differs from the TTS pronunciation 204 of the same word present in the TTS input 152 conveying the response to the question 12. In the example shown, the user 10 speaks the question 12 "Play songs by A$AP Rocky" directed to the digital assistant 115. In this example, the user pronunciation 202 of the word "A$AP" is A-S-A-P, and the ASR system 140 is robust and correctly recognizes the user pronunciation 202 as "A$AP". Here, the ASR system 140 generates the transcription 142 of the question 12 and applies NLU functionality to interpret the question 12 as a request for the digital assistant 115 to have a music player application audibly output songs by the artist named A$AP Rocky. The digital assistant 115 also generates a response to the question 12 confirming that the digital assistant 115 understood the question 12 and is in the process of fulfilling it. Here, the digital assistant 115 generates a textual representation of the response to the question 12, "Playing latest songs by A$AP Rocky", which is provided as the TTS input 152 to the TTS system 150 for conversion into TTS audio 154.

In this scenario, however, the TTS system 150 does not know how to pronounce the word "A$AP" present in the TTS input 152, and therefore no TTS pronunciation 204 exists for the word "A$AP". Based on the UPFs 210, which include one or more ASR confidence features indicating strong confidence in recognizing the word A$AP as pronounced using the user pronunciation 202, the pronunciation decision model 200 may conclude that the user pronunciation confidence 222 is high and therefore generate a pronunciation decision 250 that selects the user pronunciation 202 of A-S-A-P to pronounce the word A$AP in the TTS audio 154. In the example shown, the TTS audio 154 is audibly output from the user device 110 and includes a synthesized speech representation of the response to the question 12 that uses the user pronunciation 202 to pronounce the word "A$AP".

If the pronunciation decision 250 is associated with a confidence that fails to satisfy some threshold, the model 200 and/or the TTS system 150 may determine that there is some doubt about using the user pronunciation 202 and therefore prompt the user 10 to indicate whether the user agrees with the pronunciation decision 250 or prefers a different pronunciation. The user's response to the prompt may include explicit feedback as described above. The prompt may be an audible prompt in which the TTS system 150 outputs synthesized speech asking the user 10 whether the user agrees with the pronunciation decision or prefers a different pronunciation. The prompt may additionally or alternatively include a visual prompt that displays a notification on a screen in communication with the user device 110. Here, the user may provide a user input indication selecting a graphical element that indicates the preferred pronunciation or, at a minimum, indicates satisfaction or dissatisfaction with the pronunciation decision 250.
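The threshold check that triggers a confirmation prompt might look like the following Python sketch. The `Decision` type, the function name `confirmation_prompt`, the default threshold of 0.7, and the prompt wording are all hypothetical; the disclosure only states that a prompt may be issued when the confidence fails to satisfy some threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    selected: str      # "user" or "tts"
    confidence: float  # confidence associated with the selected pronunciation


def confirmation_prompt(decision: Decision, word: str,
                        threshold: float = 0.7) -> Optional[str]:
    """Return prompt text when the confidence fails the threshold, else None."""
    if decision.confidence >= threshold:
        return None  # confident enough: use the decision without asking
    # Illustrative wording only; the prompt could equally be a visual notification.
    return f'Did I pronounce "{word}" the way you prefer?'


print(confirmation_prompt(Decision("user", 0.4), "A$AP"))
```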

When the pronunciation decision 250 selects the user pronunciation 202 as more reliable than the TTS pronunciation 204 (or no TTS pronunciation 204 is available), scenarios may exist in which the TTS system 150 is unable to produce TTS audio 154 that uses the user pronunciation 202 to pronounce the particular word, or in which the TTS audio 154 would otherwise have a synthesis quality that fails to satisfy quality standards. This may occur when the user pronunciation 202 of the particular word is spoken in an accent/dialect different from the accent/dialect on which the TTS system 150 (and the associated TTS model 300) was trained. The TTS system 150 may employ a variety of different techniques to produce TTS audio 154 using the user pronunciation 202 on the fly. In some examples, the TTS system 150 uses an acoustic representation of the particular word extracted from the audio data 14 corresponding to the question 12 spoken by the user. In these examples, the acoustic representation of the particular word extracted from the audio data 14 may be inserted into the TTS audio 154. Alternatively, the TTS system 150 may further extract a phoneme representation from the acoustic representation of the particular word and use the phoneme representation to produce the TTS audio 154 with the user pronunciation 202 of the particular word.

In additional examples, the TTS system 150 obtains a latent representation derived from the portion of the audio data 14 corresponding to the spoken question 12 that includes the user pronunciation 202 of the particular word. In these examples, the latent representation guides the TTS system 150 (and the associated TTS model) to produce TTS audio 154 that pronounces the particular word using the user pronunciation 202. In yet another example, voice conversion techniques are applied to the portion of the audio data 14 corresponding to the spoken question 12 that includes the user pronunciation 202, to produce TTS audio 154 that uses the user pronunciation 202 to pronounce the particular word in the synthesized voice.

A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an "application", an "app", or a "program". Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

The non-transitory memory may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and phase change memory (PCM), as well as disks or tapes.

FIG. 5 is a flowchart of an example arrangement of operations for a method 500 of selecting one of a user pronunciation 202 or a different TTS pronunciation 204 of a word present in both a spoken question 12 and a TTS input 152 that includes a textual representation of a response to the question 12. At operation 502, the method 500 includes receiving a user pronunciation 202 of a particular word present in a question 12 spoken by a user 10. At operation 504, the method 500 includes receiving a TTS pronunciation 204 of the same particular word present in a TTS input 152. The TTS input 152 includes a textual representation of a response to the question 12, and the TTS pronunciation 204 of the particular word differs from the user pronunciation 202 of the particular word. At operation 506, the method 500 includes obtaining user pronunciation-related features 210 associated with the user pronunciation 202 of the particular word. At operation 508, the method 500 includes obtaining TTS pronunciation-related features 230 associated with the TTS pronunciation 204 of the particular word. At operation 510, the method 500 includes generating, as output from a pronunciation decision model 200 configured to receive the user pronunciation-related features 210 and the TTS pronunciation-related features 230 as input, a pronunciation decision 250 that selects whichever of the user pronunciation 202 of the particular word or the TTS pronunciation 204 of the particular word is associated with the highest confidence for use in TTS audio 154. At operation 512, the method 500 includes providing, for audible output from a user device 110 associated with the user 10, the TTS audio 154 that includes a synthesized speech representation of the response to the question 12 using whichever of the user pronunciation 202 or the TTS pronunciation 204 for the particular word was selected by the pronunciation decision 250 output from the pronunciation decision model 200.
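The flow of operations 502 through 510 can be outlined in a short Python sketch. It is a simplified illustration, not the claimed implementation: the two estimators are passed in as plain callables, the feature names (`asr_confidence`, `verified`) and pronunciation labels are hypothetical, and operation 512 (synthesizing the TTS audio 154) is left out.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Features = Mapping[str, float]
Estimator = Callable[[Features], float]


@dataclass
class PronunciationDecision:
    source: str          # "user" or "tts"
    pronunciation: str   # the selected pronunciation
    confidence: float    # confidence of the selected pronunciation


def run_method_500(user_pronunciation: str,   # operation 502
                   tts_pronunciation: str,    # operation 504
                   upf: Features,             # operation 506
                   ttspf: Features,           # operation 508
                   user_estimator: Estimator,
                   tts_estimator: Estimator) -> PronunciationDecision:
    # Operation 510: score both candidates and keep the higher-confidence one.
    # Operation 512 (audio synthesis with the selection) is outside this sketch.
    user_conf = user_estimator(upf)
    tts_conf = tts_estimator(ttspf)
    if user_conf > tts_conf:
        return PronunciationDecision("user", user_pronunciation, user_conf)
    return PronunciationDecision("tts", tts_pronunciation, tts_conf)


decision = run_method_500(
    "user-variant", "tts-variant",
    {"asr_confidence": 0.9}, {"verified": 0.0},
    user_estimator=lambda f: f["asr_confidence"],   # toy estimator
    tts_estimator=lambda f: 0.5 + 0.4 * f["verified"],  # toy estimator
)
print(decision.source, decision.pronunciation)
```

Passing the estimators as callables keeps the selection logic separate from however the confidences are actually modeled.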

FIG. 6 is a schematic view of an example computing device 600 (e.g., system 600) that may be used to implement the systems and methods described in this document. The computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

The computing device 600 includes a processor 610 (e.g., data processing hardware 610), memory 620 (e.g., memory hardware 620), a storage device 630, a high-speed interface/controller 640 connecting to the memory 620 and a high-speed expansion port 650, and a low-speed interface/controller 660 connecting to a low-speed bus 670 and the storage device 630. Each of the components 610, 620, 630, 640, 650, and 660 is interconnected using various buses and may be mounted on a common motherboard or in other manners as appropriate. The processor 610 (i.e., the data processing hardware 112 of the user device 110 or the data processing hardware 134 of the remote system 130) can process instructions for execution within the computing device 600, including instructions stored in the memory 620 or on the storage device 630, to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 680 coupled to the high-speed interface 640. The memory 620 and the storage device 630 may include the memory hardware 114 of the user device 110 or the memory hardware 136 of the remote system 130. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 620 stores information non-transitorily within the computing device 600. The memory 620 may be a computer-readable medium, a volatile memory unit, or a non-volatile memory unit. The non-transitory memory 620 may be a physical device used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 600. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM) / programmable read-only memory (PROM) / erasable programmable read-only memory (EPROM) / electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM), as well as disks or tapes.

The storage device 630 is capable of providing mass storage for the computing device 600. In some implementations, the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 620, the storage device 630, or memory on the processor 610.

The high-speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low-speed controller 660 manages lower bandwidth-intensive operations. Such allocation of duties is illustrative only. In some implementations, the high-speed controller 640 is coupled to the memory 620, to the display 680 (e.g., through a graphics processor or accelerator), and to the high-speed expansion port 650, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 660 is coupled to the storage device 630 and a low-speed expansion port 690. The low-speed expansion port 690, which may include various communication ports (e.g., USB, Bluetooth (registered trademark), Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 600 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 600a or multiple times in a group of such servers 600a, as a laptop computer 600b, or as part of a rack server system 600c.

Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer-readable medium, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), an LCD (liquid crystal display) monitor, or a touch screen, for displaying information to the user, and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback, and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.
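By way of a non-limiting illustration, the selection performed by the pronunciation decision model 200 might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names `Candidate`, `estimate_confidence`, and `pronunciation_decision`, the feature names, the weights, the weighted-sum form of the confidence estimators, and the tie-breaking rule are all hypothetical and do not appear in the disclosure. The sketch only shows the stated behavior of selecting whichever of the user pronunciation 202 or the TTS pronunciation 204 is associated with the higher confidence.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Candidate:
    """One candidate pronunciation of the particular word (hypothetical type)."""
    source: str         # "user" (202) or "tts" (204)
    pronunciation: str  # placeholder for a phoneme or acoustic representation
    confidence: float   # user confidence (222) or TTS confidence (242), in [0, 1]


def estimate_confidence(features: Mapping[str, float],
                        weights: Mapping[str, float]) -> float:
    """Hypothetical stand-in for estimators 220/240.

    Assumption: a weighted sum of feature values clamped to [0, 1].
    The disclosure does not specify this form.
    """
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return max(0.0, min(1.0, score))


def pronunciation_decision(user: Candidate, tts: Candidate) -> Candidate:
    """Return the candidate with the higher confidence (decision 250).

    Assumption: ties fall back to the TTS pronunciation.
    """
    return user if user.confidence > tts.confidence else tts


# Illustrative, made-up feature values and weights.
user_features = {"asr_confidence": 0.9, "usage_frequency": 0.6}
user_weights = {"asr_confidence": 0.5, "usage_frequency": 0.5}
tts_features = {"confirmed_preferred": 0.0, "mined_unconfirmed": 1.0}
tts_weights = {"confirmed_preferred": 0.9, "mined_unconfirmed": 0.4}

user_candidate = Candidate(
    "user", "pronunciation-A", estimate_confidence(user_features, user_weights))
tts_candidate = Candidate(
    "tts", "pronunciation-B", estimate_confidence(tts_features, tts_weights))

decision = pronunciation_decision(user_candidate, tts_candidate)
print(decision.source, decision.pronunciation)
```

In this made-up example the user pronunciation has the higher confidence, so it would be the one used in the TTS audio 154. Explicit or implicit user feedback could then be used to adjust the hypothetical weights, in the spirit of the model updates described for the pronunciation decision model 200.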

10 User
12 Question
14 Audio data
100 Speech environment
110 Voice-enabled device
112 Data processing hardware
114 Memory hardware
115 Digital assistant
116 Audio capture device
118 Speech output device
120 Network
130 Remote system
132 Remote resources
134 Remote data processing hardware
136 Remote memory hardware
140 Automatic speech recognition (ASR) system
142 Transcription
150 Text-to-speech (TTS) system
152 TTS input
154 TTS audio
200 Pronunciation decision model
202 User pronunciation
204 TTS pronunciation
210 User pronunciation-related features (UPF)
220 User pronunciation confidence estimator
222 User pronunciation confidence
230 TTS pronunciation-related features (TTSPF)
240 TTS pronunciation confidence estimator
242 TTS pronunciation confidence
250 Pronunciation decision
300 TTS model
302 Encoder
304 Decoder
306 Synthesizer
600 Computing device
600a Standard server
600b Laptop computer
600c Rack server system
610 Processor
620 Memory
630 Storage device
640 High-speed interface/controller
650 High-speed expansion port
660 Low-speed interface/controller
670 Low-speed bus
680 Display
690 Low-speed expansion port

Claims

1. A computer-implemented method (500) that, when executed on data processing hardware (610), causes the data processing hardware (610) to perform operations comprising:
receiving a user pronunciation (202) of a particular word present in a question (12) spoken by a user;
receiving a text-to-speech (TTS) pronunciation (204) of the same particular word present in a TTS input (152), the TTS input (152) comprising a textual representation of a response to the question (12), and the TTS pronunciation (204) of the particular word being different from the user pronunciation (202) of the particular word;
obtaining user pronunciation-related features (210) associated with the user pronunciation (202) of the particular word;
obtaining TTS pronunciation-related features (230) associated with the TTS pronunciation (204) of the particular word;
generating, as output from a pronunciation decision model (200) configured to receive the user pronunciation-related features (210) and the TTS pronunciation-related features (230) as input, a pronunciation decision (250) that selects, for use in TTS audio (154), the one of the user pronunciation (202) of the particular word or the TTS pronunciation (204) of the particular word that is associated with a highest confidence; and
providing, for audible output from a user device (110) associated with the user, the TTS audio (154) comprising a synthesized speech representation of the response to the question (12) using the one of the user pronunciation (202) for the particular word or the TTS pronunciation (204) for the particular word that was selected by the pronunciation decision (250) output from the pronunciation decision model (200).

2. The method (500) of claim 1, wherein the operations further comprise:
receiving audio data (14) corresponding to the question (12) spoken by the user; and
processing, using an automatic speech recognizer (ASR) (140), the audio data (14) to generate a transcription (142) of the question (12).

3. The method (500) of claim 2, wherein receiving the user pronunciation (202) of the particular word comprises at least one of:
extracting the user pronunciation (202) of the particular word from an intermediate state of the ASR (140) while processing the audio data (14) using the ASR (140);
extracting, from the audio data (14), a user acoustic representation of the particular word, the user acoustic representation conveying the user pronunciation (202) of the particular word; or
processing the audio data (14) to generate a user phoneme representation conveying the user pronunciation (202) of the particular word.

4. The method (500) of claim 2 or 3, wherein the user pronunciation-related features (210) associated with the user pronunciation (202) of the particular word comprise one or more confidence features associated with the ASR (140) recognizing the particular word in the audio data (14).

5. The method (500) of any one of claims 1 to 4, wherein the user pronunciation-related features (210) associated with the user pronunciation (202) of the particular word comprise at least one of:
a geographic region of the user when the question (12) was spoken by the user;
linguistic demographic information associated with the user; or
a frequency of using the user pronunciation (202) when pronouncing the particular word in previous questions (12) spoken by the user and/or other users.

6. The method (500) of any one of claims 1 to 5, wherein receiving the TTS pronunciation (204) of the particular word comprises:
receiving, as input to a TTS system (150), the TTS input (152) comprising the textual representation of the response to the question (12);
generating, as output from the TTS system (150), an initial sample of TTS audio (154) comprising an initial synthesized speech representation of the response to the question (12); and
extracting, from the initial sample of the TTS audio (154), a TTS acoustic representation of the particular word, the TTS acoustic representation conveying the TTS pronunciation (204) of the particular word.

7. The method (500) of any one of claims 1 to 6, wherein receiving the TTS pronunciation (204) of the particular word comprises processing the textual representation of the response to the question (12) to generate a representation conveying the TTS pronunciation (204) of the particular word.

8. The method (500) of any one of claims 1 to 7, wherein the TTS pronunciation-related features (230) associated with the TTS pronunciation (204) of the particular word comprise at least one of:
a confirmed preferred pronunciation for the particular word;
an unconfirmed pronunciation for the particular word that is inferred using pronunciation mining from one or more auxiliary sources;
a pronunciation variant feature indicating whether any other variants for pronouncing the particular word exist; or
a pronunciation complexity feature indicating a likelihood of user mispronunciation of the particular word.

9. The method (500) of any one of claims 1 to 8, wherein the operations further comprise, after generating the pronunciation decision (250) that selects the one of the user pronunciation (202) of the particular word or the TTS pronunciation (204) of the particular word:
receiving explicit feedback from the user indicating which of the user pronunciation (202) of the particular word or the TTS pronunciation (204) of the particular word the user prefers for pronouncing the particular word in subsequent TTS output; and
updating the pronunciation decision model (200) based on the explicit feedback from the user.

10. The method (500) of claim 9, wherein the operations further comprise, when the explicit feedback from the user indicates that the user prefers the user pronunciation (202) of the particular word, updating a TTS system (150) to use the user pronunciation (202) of the particular word when generating TTS audio (154) that includes the particular word.

11. The method (500) of any one of claims 1 to 10, wherein the operations further comprise, after providing the TTS audio (154) for audible output from the user device (110):
receiving audio data (14) corresponding to a subsequent question (12) spoken by the user or another user that includes the particular word;
determining implicit user feedback indicating whether the user or the other user pronounced the particular word in the subsequent question (12) in the same way as the one of the user pronunciation (202) for the particular word or the TTS pronunciation (204) for the particular word that was selected by the pronunciation decision (250); and
updating the pronunciation decision model (200) based on the implicit user feedback.

12. The method (500) of claim 11, wherein the operations further comprise, when the implicit user feedback indicates that the particular word in the subsequent question (12) was pronounced in the same way as the one of the user pronunciation (202) for the particular word or the TTS pronunciation (204) for the particular word that was selected by the pronunciation decision (250), updating the TTS system (150) based on the implicit user feedback.

13. A system (600) comprising:
data processing hardware (610); and
memory hardware (620) in communication with the data processing hardware (610), the memory hardware (620) storing instructions that, when executed on the data processing hardware (610), cause the data processing hardware (610) to perform operations comprising:
receiving a user pronunciation (202) of a particular word present in a question (12) spoken by a user;
receiving a text-to-speech (TTS) pronunciation (204) of the same particular word present in a TTS input (152), the TTS input (152) comprising a textual representation of a response to the question (12), and the TTS pronunciation (204) of the particular word being different from the user pronunciation (202) of the particular word;
obtaining user pronunciation-related features (210) associated with the user pronunciation (202) of the particular word;
obtaining TTS pronunciation-related features (230) associated with the TTS pronunciation (204) of the particular word;
generating, as output from a pronunciation decision model (200) configured to receive the user pronunciation-related features (210) and the TTS pronunciation-related features (230) as input, a pronunciation decision (250) that selects, for use in TTS audio (154), the one of the user pronunciation (202) of the particular word or the TTS pronunciation (204) of the particular word that is associated with a highest confidence; and
providing, for audible output from a user device (110) associated with the user, the TTS audio (154) comprising a synthesized speech representation of the response to the question (12) using the one of the user pronunciation (202) for the particular word or the TTS pronunciation (204) for the particular word that was selected by the pronunciation decision (250) output from the pronunciation decision model (200).

14. The system (600) of claim 13, wherein the operations further comprise:
receiving audio data (14) corresponding to the question (12) spoken by the user; and
processing, using an automatic speech recognizer (ASR) (140), the audio data (14) to generate a transcription (142) of the question (12).

15. The system (600) of claim 14, wherein receiving the user pronunciation (202) of the particular word comprises at least one of:
extracting the user pronunciation (202) of the particular word from an intermediate state of the ASR (140) while processing the audio data (14) using the ASR (140);
extracting, from the audio data (14), a user acoustic representation of the particular word, the user acoustic representation conveying the user pronunciation (202) of the particular word; or
processing the audio data (14) to generate a user phoneme representation conveying the user pronunciation (202) of the particular word.

16. The system (600) of claim 14 or 15, wherein the user pronunciation-related features (210) associated with the user pronunciation (202) of the particular word comprise one or more confidence features associated with the ASR (140) recognizing the particular word in the audio data (14).

17. The system (600) of any one of claims 13 to 16, wherein the user pronunciation-related features (210) associated with the user pronunciation (202) of the particular word comprise at least one of:
a geographic region of the user when the question (12) was spoken by the user;
linguistic demographic information associated with the user; or
a frequency of using the user pronunciation (202) when pronouncing the particular word in previous questions (12) spoken by the user and/or other users.

18. The system (600) of any one of claims 13 to 17, wherein receiving the TTS pronunciation (204) of the particular word comprises:
receiving, as input to a TTS system (150), the TTS input (152) comprising the textual representation of the response to the question (12);
generating, as output from the TTS system (150), an initial sample of TTS audio (154) comprising an initial synthesized speech representation of the response to the question (12); and
extracting, from the initial sample of the TTS audio (154), a TTS acoustic representation of the particular word, the TTS acoustic representation conveying the TTS pronunciation (204) of the particular word.

19. The system (600) of any one of claims 13 to 18, wherein receiving the TTS pronunciation (204) of the particular word comprises processing the textual representation of the response to the question (12) to generate a representation conveying the TTS pronunciation (204) of the particular word.

20. The system (600) of any one of claims 13 to 19, wherein the TTS pronunciation-related features (230) associated with the TTS pronunciation (204) of the particular word comprise at least one of:
a confirmed preferred pronunciation for the particular word;
an unconfirmed pronunciation for the particular word that is inferred using pronunciation mining from one or more auxiliary sources;
a pronunciation variant feature indicating whether any other variants for pronouncing the particular word exist; or
a pronunciation complexity feature indicating a likelihood of user mispronunciation of the particular word.

21. The system (600) of any one of claims 13 to 20, wherein the operations further comprise, after generating the pronunciation decision (250) that selects the one of the user pronunciation (202) of the particular word or the TTS pronunciation (204) of the particular word:
receiving explicit feedback from the user indicating which of the user pronunciation (202) of the particular word or the TTS pronunciation (204) of the particular word the user prefers for pronouncing the particular word in subsequent TTS output; and
updating the pronunciation decision model (200) based on the explicit feedback from the user.

22. The system (600) of claim 21, wherein the operations further comprise, when the explicit feedback from the user indicates that the user prefers the user pronunciation (202) of the particular word, updating a TTS system (150) to use the user pronunciation (202) of the particular word when generating TTS audio (154) that includes the particular word.

23. The system (600) of any one of claims 13 to 22, wherein the operations further comprise, after providing the TTS audio (154) for audible output from the user device (110):
receiving audio data (14) corresponding to a subsequent question (12) spoken by the user or another user that includes the particular word;
determining implicit user feedback indicating whether the user or the other user pronounced the particular word in the subsequent question (12) in the same way as the one of the user pronunciation (202) for the particular word or the TTS pronunciation (204) for the particular word that was selected by the pronunciation decision (250); and
updating the pronunciation decision model (200) based on the implicit user feedback.

24. The system (600) of claim 23, wherein the operations further comprise, when the implicit user feedback indicates that the particular word in the subsequent question (12) was pronounced in the same way as the one of the user pronunciation (202) for the particular word or the TTS pronunciation (204) for the particular word that was selected by the pronunciation decision (250), updating the TTS system (150) based on the implicit user feedback.