JP2023006055A

JP2023006055A - Program, information processing device, and method

Info

Publication number: JP2023006055A
Application number: JP2021108439A
Authority: JP
Inventors: 尚吾早川; Shogo Hayakawa; 中順井上; Nakamasa Inoue
Original assignee: Coefont; Coefont Co Ltd
Current assignee: Coefont; Coefont Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-01-18
Anticipated expiration: 2041-06-30
Also published as: JP7012935B1

Abstract

To make it possible to estimate reading and accent according to context. SOLUTION: There is provided a program for operating a computer (10) provided with a processor (11), which causes the processor to execute the steps of: acquiring learning data including language data and spoken language data representing the language data in a spoken language defined to simultaneously represent reading and accent (S101); learning, using the learning data, a translation model that outputs the spoken language data when the language data is input (S102); and outputting the learned translation model (S103). SELECTED DRAWING: Figure 5

Description

本開示は、プログラム、情報処理装置、方法に関する。 The present disclosure relates to programs, information processing apparatuses, and methods.

従来から、言語データから音声を合成する技術が開発されている。特許文献１には、「音声合成装置１の音響特徴量推定部４２は、発話内容を表す文章を当該発話内容の読み方を表す文字又は文字列、及び、韻律を表す韻律記号と発話に与える特徴を表す発話スタイル記号との一方又は両方を用いた文字列により記述したテキストデータを、テキストデータから音響特徴量を生成する音響特徴量生成モデルに入力し、音響特徴量を推定する。ボコーダ部４３は、推定された音響特徴量を用いて音声波形を推定する。音響特徴量生成モデルは、ＤＮＮを用いたエンコーダ及びデコーダを有する。エンコーダは、ＲＮＮにより、テキストデータが示す発話内容に文章内における当該発話内容の前後の文字列を考慮した文字列の特徴量を生成する。デコーダは、ＲＮＮにより、エンコーダが生成した特徴量と過去に生成した音響特徴量とに基づいて発話内容に対応する音響特徴量を生成する」技術が開示されている。 Conventionally, techniques for synthesizing speech from language data have been developed. Patent Document 1 discloses the following technique: "The acoustic feature estimation unit 42 of the speech synthesizer 1 inputs text data, in which a sentence representing the utterance content is described by a character string using characters or character strings representing the reading of the utterance content together with one or both of prosodic symbols representing prosody and utterance style symbols representing features given to the utterance, into an acoustic feature generation model that generates acoustic features from text data, and estimates the acoustic features. The vocoder unit 43 estimates a speech waveform using the estimated acoustic features. The acoustic feature generation model has an encoder and a decoder that use DNNs. The encoder uses an RNN to generate character-string features for the utterance content indicated by the text data, taking into account the character strings before and after that utterance content in the sentence. The decoder uses an RNN to generate acoustic features corresponding to the utterance content based on the features generated by the encoder and the acoustic features generated in the past."

また、テキストから音声合成を行うために、入力されたテキストに対して、テキストの読みとアクセントとをそれぞれ推定する技術がある。例えば、テキスト「マレーシアの水」について、読み「まれーしあのみず」を推定するモデルや、アクセント「１２２１１１１２」（アクセント表現）を推定するモデルがある（https://sites.google.com/site/suzukimasayuki/accent）。 There is also a technique that, in order to synthesize speech from text, separately estimates the reading and the accent of the input text. For example, for the text 「マレーシアの水」 ("Malaysian water"), there is a model that estimates the reading 「まれーしあのみず」 and a model that estimates the accent "12211112" (accent representation) (https://sites.google.com/site/suzukimasayuki/accent).

特開第２０２０－０３４８８３号公報 Japanese Patent Application Laid-Open No. 2020-034883

しかし、先行技術では、読みとアクセントを別々に推定することはできるが、これはテキストに対して形態素解析を行うことで単語と読み方を推定し、当該単語の既知のアクセントを当てはめることにより行われている。このため、文脈に沿ったアクセントを推定することが難しい、という問題があった。また、アクセントの正解データがまだ存在してない新語については、読みとアクセントとを推定することができない、という問題があった。 However, although the prior art can estimate the reading and the accent separately, it does so by performing morphological analysis on the text to estimate the words and their readings, and then applying the known accent of each word. This makes it difficult to estimate an accent that fits the context. In addition, for new words for which no correct accent data yet exists, the reading and accent cannot be estimated.

本開示の目的は、文脈に沿った読みとアクセントとを推定できるようにすることである。 The purpose of the present disclosure is to make it possible to estimate reading and accent in accordance with context.

そこで、文脈に沿った読みとアクセントとを推定することができる技術を提供する。 Therefore, a technology is provided that can estimate the reading and accent according to the context.

本開示に係るプログラムは、プロセッサを備えるコンピュータを動作させるためのプログラムであって、前記プロセッサに、言語データと、読みとアクセントとを同時に表すように定義した発話言語により前記言語データを表現した発話言語データとを含む学習データを取得するステップと、前記学習データを用いて、言語データを入力すると、前記発話言語データを出力する翻訳モデルを学習するステップと、学習した前記翻訳モデルを出力するステップと、を実行させる。 A program according to the present disclosure is a program for operating a computer having a processor, and causes the processor to execute the steps of: acquiring learning data including language data and spoken language data that expresses the language data in a spoken language defined so as to represent reading and accent simultaneously; using the learning data to learn a translation model that outputs the spoken language data when the language data is input; and outputting the learned translation model.

本開示によれば、文脈に沿った読みとアクセントとを推定することができる。 According to the present disclosure, contextual reading and accent can be estimated.

情報処理システム１の構成を示すブロック図である。 FIG. 1 is a block diagram showing the configuration of the information processing system 1.
情報処理装置１０の機能構成を示すブロック図である。 FIG. 2 is a block diagram showing the functional configuration of the information processing device 10.
翻訳モデルの構成例を示す図である。 FIG. 3 is a diagram showing a configuration example of the translation model.
ユーザ端末２０に表示される画面の例を示す図である。 FIG. 4 is a diagram showing an example of a screen displayed on the user terminal 20.
情報処理装置１０による学習処理を行う流れの一例を示すフローチャートである。 FIG. 5 is a flowchart showing an example of the flow of learning processing by the information processing device 10.
情報処理装置１０による音声合成処理を行う流れの一例を示すフローチャートである。 FIG. 6 is a flowchart showing an example of the flow of speech synthesis processing by the information processing device 10.
発話言語表現の例を示す図である。 FIG. 7 is a diagram showing an example of a spoken language expression.

以下、図面を参照しつつ、本開示の実施形態について説明する。以下の説明では、同一の部品には同一の符号を付してある。それらの名称及び機能も同じである。従って、それらについての詳細な説明は繰り返さない。 Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. In the following description, the same parts are given the same reference numerals. Their names and functions are also the same. Therefore, a detailed description thereof will not be repeated.

＜本開示の概要＞
本開示は、ユーザが入力したテキストデータについて、音声を合成し、合成音声を再生するプログラム、情報処理装置、及び方法について説明する。また、本開示は、テキストデータから、本開示に係る発話言語データを推定する翻訳モデルを学習するプログラム等についても説明する。 <Summary of this disclosure>
The present disclosure describes a program, an information processing device, and a method for synthesizing speech from text data input by a user and reproducing synthesized speech. The present disclosure also describes a program, etc., for learning a translation model for estimating spoken language data according to the present disclosure from text data.

＜１．情報処理システム１の構成＞
図１を用いて、本開示に係る情報処理システム１について説明する。本開示に係る情報処理システム１は、ユーザが入力したテキストデータについて、音声を合成し、合成音声を再生する。 <1. Configuration of Information Processing System 1>
An information processing system 1 according to the present disclosure will be described with reference to FIG. The information processing system 1 according to the present disclosure synthesizes speech for text data input by a user, and reproduces synthesized speech.

図１は、情報処理システム１の構成を示す図である。情報処理システム１は、情報処理装置１０と、ユーザ端末２０と、ネットワーク３０とを備える。 FIG. 1 is a diagram showing the configuration of the information processing system 1. The information processing system 1 includes an information processing device 10, a user terminal 20, and a network 30.

本開示に係る情報処理装置１０は、翻訳モデルを学習する学習処理、音声を合成する音声合成処理等を実行するための装置である。情報処理装置１０は、例えば、ラップトップパソコン又はラックマウント型若しくはタワー型等のコンピュータ等である。情報処理装置１０は、複数の情報処理装置１０等により構成されてもよい。情報処理システム１を実現することに要する複数の機能の配分の仕方は、各ハードウェアの処理能力、情報処理システム１に求められる仕様等に鑑みて適宜決定することができる。 The information processing device 10 according to the present disclosure is a device for executing learning processing for learning a translation model, speech synthesis processing for synthesizing speech, and the like. The information processing apparatus 10 is, for example, a laptop computer, a rack-mount type computer, a tower type computer, or the like. The information processing device 10 may be configured by a plurality of information processing devices 10 and the like. The method of distributing a plurality of functions required for realizing the information processing system 1 can be appropriately determined in consideration of the processing capability of each piece of hardware, the specifications required for the information processing system 1, and the like.

情報処理装置１０は、プロセッサ１１と、メモリ１２と、ストレージ１３と、通信ＩＦ１４と、入出力ＩＦ１５とを含んで構成される。 The information processing device 10 includes a processor 11 , a memory 12 , a storage 13 , a communication IF 14 and an input/output IF 15 .

プロセッサ１１は、プログラムに記述された命令セットを実行するためのハードウェアであり、演算装置、レジスタ、周辺回路などにより構成される。 The processor 11 is hardware for executing an instruction set described in a program, and is composed of an arithmetic unit, registers, peripheral circuits, and the like.

メモリ１２は、プログラム、及び、プログラム等で処理されるデータ等を一時的に記憶するためのものであり、例えばＤＲＡＭ（Dynamic Random Access Memory）等の揮発性のメモリである。 The memory 12 temporarily stores programs and data processed by the programs, and is a volatile memory such as a DRAM (Dynamic Random Access Memory).

ストレージ１３は、データを保存するための記憶装置であり、例えばフラッシュメモリ、ＨＤＤ（Hard Disc Drive）、ＳＳＤ（Solid State Drive）である。 The storage 13 is a storage device for storing data, and is, for example, a flash memory, HDD (Hard Disc Drive), or SSD (Solid State Drive).

通信ＩＦ１４は、情報処理装置１０が外部の装置と通信するため、信号を入出力するためのインタフェースである。通信ＩＦ１４は、インターネット、広域イーサネット等のネットワーク３０に有線又は無線により接続する。 The communication IF 14 is an interface for inputting/outputting signals so that the information processing device 10 communicates with an external device. The communication IF 14 is wired or wirelessly connected to a network 30 such as the Internet or wide area Ethernet.

入出力ＩＦ１５は、入力操作を受け付けるための入力装置（例えば、マウス等のポインティングデバイス、キーボード）、及び、情報を提示するための出力装置（ディスプレイ、スピーカ等）とのインタフェースとして機能する。 The input/output IF 15 functions as an interface with an input device (for example, a pointing device such as a mouse, a keyboard) for receiving input operations, and an output device (display, speaker, etc.) for presenting information.

ユーザ端末２０は、例えば、ラップトップパソコン、スマートフォン、タブレット等のコンピュータである。 The user terminal 20 is, for example, a computer such as a laptop computer, a smart phone, or a tablet.

情報処理装置１０及びユーザ端末２０は、ネットワーク３０を介して相互に通信可能に構成される。 The information processing device 10 and the user terminal 20 are configured to be able to communicate with each other via the network 30 .

＜１．２．情報処理装置１０の構成＞
図２は、情報処理装置１０の機能構成を示すブロック図である。図２に示すように、情報処理装置１０は、通信部１１０と、記憶部１２０と、制御部１３０とを含む。 <1.2. Configuration of Information Processing Device 10>
FIG. 2 is a block diagram showing the functional configuration of the information processing device 10. As shown in FIG. 2, the information processing device 10 includes a communication unit 110, a storage unit 120, and a control unit 130.

通信部１１０は、情報処理装置１０が外部の装置と通信するための処理を行う。 The communication unit 110 performs processing for the information processing device 10 to communicate with an external device.

記憶部１２０は、情報処理装置１０が使用するデータ及びプログラムを記憶する。記憶部１２０は、学習データＤＢ１２１、モデルＤＢ１２２等を記憶する。 The storage unit 120 stores data and programs used by the information processing apparatus 10 . The storage unit 120 stores a learning data DB 121, a model DB 122, and the like.

学習データＤＢ１２１は、学習データを保持するデータベースである。学習データは、言語データと、読みとアクセントとを同時に表すように定義した発話言語により言語データを表現した発話言語データとを含む。言語データは、音声合成の対象となる言語データであり、例えばテキストデータ、音声データ等である。学習データについて詳細は後述する。 The learning data DB 121 is a database that holds learning data. The learning data includes linguistic data and spoken language data expressing the linguistic data in a spoken language defined so as to simultaneously represent pronunciation and accent. The linguistic data is linguistic data to be subjected to speech synthesis, such as text data and voice data. Details of the learning data will be described later.

モデルＤＢ１２２は、翻訳モデルと、翻訳モデルのパラメータとを保持するデータベースである。モデルＤＢ１２２が保持する翻訳モデルのパラメータは、後述の学習部１３３により翻訳モデルが学習される度に更新される。また、モデルＤＢ１２２は、学習部１３３により翻訳モデルが学習される前には、初期値のパラメータが保持する。 The model DB 122 is a database that holds translation models and translation model parameters. The parameters of the translation model held by the model DB 122 are updated each time the translation model is learned by the learning unit 133, which will be described later. In addition, the model DB 122 holds initial parameters before the translation model is learned by the learning unit 133 .

制御部１３０は、情報処理装置１０のプロセッサ１１がプログラムに従って処理を行うことにより、受信制御部１３１、送信制御部１３２、学習部１３３、入力部１３４、翻訳部１３５、及び合成部１３６に示す機能を発揮する。 The control unit 130 provides the functions of a reception control unit 131, a transmission control unit 132, a learning unit 133, an input unit 134, a translation unit 135, and a synthesis unit 136 as the processor 11 of the information processing device 10 performs processing according to the program.

受信制御部１３１は、情報処理装置１０が外部の装置から通信プロトコルに従って信号を受信する処理を制御する。 The reception control unit 131 controls processing for the information processing device 10 to receive a signal from an external device according to a communication protocol.

送信制御部１３２は、情報処理装置１０が外部の装置に対し通信プロトコルに従って信号を送信する処理を制御する。 The transmission control unit 132 controls processing for transmitting a signal from the information processing device 10 to an external device according to a communication protocol.

学習部１３３は、学習データを用いて、言語データを入力すると、発話言語データを出力する翻訳モデルを学習する。 The learning unit 133 uses learning data to learn a translation model that outputs spoken language data when language data is input.

具体的には、学習部１３３は、まず、学習データＤＢ１２１から、学習データを取得する。学習データは、言語データと、読みとアクセントとを同時に表すように定義した発話言語により言語データを表現した発話言語データとを含む。 Specifically, the learning unit 133 first acquires learning data from the learning data DB 121 . The learning data includes linguistic data and spoken language data expressing the linguistic data in a spoken language defined so as to simultaneously represent pronunciation and accent.

言語データは、音声合成の対象となる言語データであり、例えばテキストデータ、音声データ等である。本開示では、言語データが、テキストデータである場合を例に説明する。なお、言語データが、音声データである場合、情報処理装置１０は、音声解析により、音声データをテキストデータに変換する構成とすればよい。 The language data is the language data to be subjected to speech synthesis, such as text data or voice data. The present disclosure describes, as an example, the case where the language data is text data. If the language data is voice data, the information processing device 10 may be configured to convert the voice data into text data by voice analysis.

発話言語は、読みとアクセントとを同時に表すように定義したものである。従来、音声特徴量を抽出前の読み及びアクセントについては、言語データを読みのみで表現したものと、言語データをアクセントのみで表現したものとを組み合わせることにより表現していた。例えば、従来は、テキスト「マレーシアの水」について、読み「まれーしあのみず」と、アクセント「１２２１１１１２」とを表していた。このアクセントの１は、下がった音、２は上がった音に対応する。しかし、これでは、文脈に沿ったアクセントを推定することが難しい。また、読みとアクセントとが分かれていることにより、翻訳モデルの学習効率が低下してしまう。そこで、本開示の発話言語は、言語データを読みのみで表現したものと、言語データをアクセントのみで表現したものとを別々に含まずに、読みとアクセントとを表すものとして新たに定義した。 The spoken language is defined so as to represent reading and accent simultaneously. Conventionally, the reading and accent prior to speech feature extraction were expressed by combining a representation of the language data by reading only with a representation of the language data by accent only. For example, the text 「マレーシアの水」 was conventionally represented by the reading 「まれーしあのみず」 and the accent "12211112", where 1 corresponds to a lowered sound and 2 to a raised sound. With this representation, however, it is difficult to estimate an accent that fits the context. In addition, because the reading and the accent are separated, the learning efficiency of the translation model decreases. The spoken language of the present disclosure is therefore newly defined to represent reading and accent together, without separately including a reading-only representation and an accent-only representation of the language data.

具体的には、本開示の発話言語は、言語データの１音について、当該１音の読みと、当該１音のアクセントとを同一の記号で一度に表すように定義した。定義した発話言語は、下記の法則を持つ。
・ひらがな、「ー」は、アクセントの「１（下がる）」に対応する。
・カタカナ、「～」は、アクセントの「２（上がる）」に対応する。 Specifically, the spoken language of the present disclosure is defined so that, for each sound of the language data, the reading and the accent of that sound are represented together by a single symbol. The defined spoken language follows the rules below.
・Hiragana and 「ー」 correspond to the accent "1 (down)".
・Katakana and 「～」 correspond to the accent "2 (up)".

例えば、上記テキスト「マレーシアの水」について、読み（まれーしあのみず）とアクセント（１２２１１１１２）があったとき、
・「ま」の対応するアクセントは、「１（下がる）」なので、ひらがなの「ま」
・「れ」の対応するアクセントは、「２（上がる）」なので、カタカナの「レ」
・「ー」の対応するアクセントは、「２（上がる）」なので、波線の「～」
・「し」の対応するアクセントは、「１（下がる）」なので、ひらがなの「し」
・「あ」の対応するアクセントは、「１（下がる）」なので、ひらがなの「あ」
・「の」の対応するアクセントは、「１（下がる）」なので、ひらがなの「の」
・「み」の対応するアクセントは、「１（下がる）」なので、ひらがなの「み」
・「ず」の対応するアクセントは、「２（上がる）」なので、カタカナの「ズ」
となる。よって、当該発話言語では、「マレーシアの水」は、「まレ～しあのみズ」となる。このように、発話言語は、読みを、アクセントに応じて２つの表現方法で１音ごとに使い分ける。なお、これは日本語に限定されず、他の言語であれば、例えば、読みを表す国際音声記号をアクセントに応じて、アクセント記号を付与したり、反転させたりすることで、使い分けるようにすればよい。 For example, given the reading (まれーしあのみず) and the accent (12211112) for the above text 「マレーシアの水」:
・The accent corresponding to 「ま」 is "1 (down)", so it is written as the hiragana 「ま」
・The accent corresponding to 「れ」 is "2 (up)", so it is written as the katakana 「レ」
・The accent corresponding to 「ー」 is "2 (up)", so it is written as the wavy line 「～」
・The accent corresponding to 「し」 is "1 (down)", so it is written as the hiragana 「し」
・The accent corresponding to 「あ」 is "1 (down)", so it is written as the hiragana 「あ」
・The accent corresponding to 「の」 is "1 (down)", so it is written as the hiragana 「の」
・The accent corresponding to 「み」 is "1 (down)", so it is written as the hiragana 「み」
・The accent corresponding to 「ず」 is "2 (up)", so it is written as the katakana 「ズ」
Thus, in this spoken language, 「マレーシアの水」 becomes 「まレ～しあのみズ」. In this way, the spoken language switches, sound by sound, between two notations for the reading depending on the accent. This is not limited to Japanese; for other languages, the notations may be distinguished by, for example, adding an accent mark to, or inverting, the International Phonetic Alphabet symbol representing the reading, depending on the accent.
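The mapping rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation described in the disclosure: the function name `to_spoken_language` is hypothetical, and it assumes that the reading is given in hiragana plus 「ー」 with exactly one accent digit per character, as in the 「マレーシアの水」 example.

```python
# Hypothetical helper: merge a reading and an accent string into the
# single-string "spoken language" notation described above.
HIRAGANA_FIRST = 0x3041   # ぁ
HIRAGANA_LAST = 0x3096    # ゖ
KANA_OFFSET = 0x60        # katakana code point minus hiragana code point


def to_spoken_language(reading: str, accent: str) -> str:
    """Return the spoken-language string for a reading and its accent digits."""
    if len(reading) != len(accent):
        # Assumption: one accent digit per character of the reading.
        raise ValueError("reading and accent must have the same length")
    out = []
    for ch, digit in zip(reading, accent):
        if digit == "1":
            # Accent 1 (down): keep hiragana and the long-vowel mark as they are.
            out.append(ch)
        elif digit == "2":
            # Accent 2 (up): long-vowel mark becomes a wavy line,
            # hiragana becomes the corresponding katakana.
            if ch == "ー":
                out.append("～")
            elif HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST:
                out.append(chr(ord(ch) + KANA_OFFSET))
            else:
                raise ValueError(f"unsupported character in reading: {ch!r}")
        else:
            raise ValueError(f"unsupported accent digit: {digit!r}")
    return "".join(out)


print(to_spoken_language("まれーしあのみず", "12211112"))  # まレ～しあのみズ
```

Because each output character carries both the sound and its pitch, the target sequence is as long as the reading alone, rather than a reading sequence plus a separate accent sequence.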

また、発話言語は、感情表現と、韻律支持記号とを含めて定義してよい。感情表現は、例えば、Unicodeで定義される絵文字(顔以外を含む)で表現すればよい。また、韻律支持記号は、例えば、感嘆符を「！」、疑問符を「？」、「あげる」を「↑」、「さげる」を「↓」、「左に押す」を「←」等、任意の記号で表現すればよい。感情表現と、韻律支持記号とを含ませることにより、翻訳モデルによる、アクセント推定が、文脈に沿ったものになり、かつ、感情表現も可能となる。 Spoken language may also be defined to include emotional expressions and prosodic support symbols. Emotional expressions may be expressed, for example, by Unicode-defined pictograms (including non-faces). In addition, the prosody supporting symbols are arbitrary, such as "!" for exclamation mark, "?" for question mark, "↑" for "raise", "↓" for "lower", "←" for "push left", and so on. It can be expressed by the symbol of By including emotional expressions and prosody-supporting symbols, the accent estimation by the translation model is in line with the context, and emotional expressions are also possible.

翻訳モデルは、学習データを用いて、言語データを入力すると、発話言語データを出力する。図３は、翻訳モデルの構成例を示す図である。翻訳モデルは、第１分割結合部１５１と、翻訳部１５２と、第２分割結合部１５３とを含む。 The translation model uses learning data to output spoken language data when language data is input. FIG. 3 is a diagram showing a configuration example of a translation model. The translation model includes a first split-joint part 151 , a translation part 152 and a second split-joint part 153 .

第１分割結合部１５１は、言語データを、トークン列に分割する。具体的には、第１分割結合部１５１は、まず、テキストデータが入力されると、テキストデータに対し、予め用意した単語非対応辞書を用いて、トークンリストを生成する。単語非対応辞書は、単語と読みとが必ずしも対応していない辞書である。従来の辞書は、それぞれの単語に対して、読みやアクセント情報が付与されていた。一方、本開示の単語非対応辞書では、単語に対して、その読みが登録されているとは限らない。また、本開示の単語非対応辞書は、読みに対応する単語があるとも限らない。 The first splitting/joining unit 151 splits the language data into a token string. Specifically, when text data is input, the first splitting/joining unit 151 first generates a token list for the text data using a word non-corresponding dictionary prepared in advance. The word non-corresponding dictionary is a dictionary in which words and readings do not necessarily correspond. In a conventional dictionary, reading and accent information is attached to each word. In the word non-corresponding dictionary of the present disclosure, by contrast, a reading is not necessarily registered for a word, nor is there necessarily a word corresponding to a reading.

具体的には、当該単語非対応辞書は、自然言語における単語のリストである単語リストと、発話言語における単語のリストである発話単語リストからなる。自然言語の単語リストは、漢字仮名交じり文に関する単語のリストであり、当該単語の読み及びアクセントに関する情報が紐づいていないものである。例えば、自然言語の単語リストは、「新聞」、「会社」、「テレビ」等の一般的な単語が登録されている。 Specifically, the word non-corresponding dictionary consists of a word list, which is a list of words in the natural language, and a spoken word list, which is a list of words in the spoken language. The natural language word list is a list of words from text written in mixed kanji and kana, and carries no information on the reading or accent of those words. For example, general words such as 「新聞」 (newspaper), 「会社」 (company), and 「テレビ」 (television) are registered in the natural language word list.

発話言語の発話単語リストは、発話言語の文章を適度な長さに区切った文字列を単語とする発話単語のリストである。発話単語リストは、発話において頻出するものが登録されており、自然言語の単語とは独立したものである。例えば、発話単語リストは、「かっタ？」、「ごメんね（顔文字）」、「ッしょ」等の一般的な単語とは異なる発話単語が登録されている。ここで、（顔文字）は、顔文字として発話言語で取り扱う記号が入る。この単語非対応辞書を導入することで、自然な言い回しや感情の表現が、適度にまとまった形で登録できる。この発話単語リストを用いることにより、当該翻訳モデルにおいて、文脈に応じたアクセントの推定を実現することができる。なお、発話単語リストは、手作業で頻度が高い発話単語を登録する、尤度最大化アルゴリズム等により自動的に最適化した発話単語を登録する、又は、その両者を用いることにより発話単語を登録することにより、生成される。 The spoken word list of the spoken language is a list of spoken words, each of which is a character string obtained by dividing spoken language sentences into pieces of moderate length. The spoken word list registers strings that appear frequently in utterances and is independent of natural language words. For example, it registers spoken words that differ from ordinary words, such as 「かっタ？」, 「ごメんね（顔文字）」, and 「ッしょ」. Here, （顔文字） stands for a symbol that the spoken language handles as an emoticon. Introducing this word non-corresponding dictionary allows natural phrasing and emotional expressions to be registered in moderately sized units. Using the spoken word list allows the translation model to estimate accent according to context. The spoken word list is generated by manually registering frequently occurring spoken words, by registering spoken words automatically optimized by a likelihood maximization algorithm or the like, or by both.

トークンは、文章や単語を構成する最小の要素である。第１分割結合部１５１は、トークンリストから、適切なトークン列を選択することにより、テキストデータをトークン列に分割する。そして、第１分割結合部１５１は、トークン列を、翻訳部１５２に出力する。 Tokens are the smallest elements that make up a sentence or word. The first splitting/joining unit 151 splits the text data into token strings by selecting an appropriate token string from the token list. Then, first splitting/coupling section 151 outputs the token string to translation section 152 .
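The disclosure does not state how an "appropriate" token string is selected from the token list. As one possible illustration, the sketch below uses greedy longest-match against a word list with a single-character fallback; both the selection rule and the function name `tokenize` are assumptions made for this example, and the word list reuses the sample entries 「新聞」, 「会社」, and 「テレビ」 mentioned earlier.

```python
# Hypothetical tokenizer sketch: greedy longest-match over a word list,
# falling back to single characters for anything not in the list.
def tokenize(text: str, word_list: set[str]) -> list[str]:
    max_len = max((len(w) for w in word_list), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in word_list:
                tokens.append(piece)
                i += n
                break
    return tokens


word_list = {"新聞", "会社", "テレビ"}
tokens = tokenize("会社のテレビ", word_list)
print(tokens)           # ['会社', 'の', 'テレビ']
print("".join(tokens))  # joining the tokens restores the original text
```

Joining the tokens with no separator gives back the input, which corresponds to the joining direction of the splitting/joining unit.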

また、第１分割結合部１５１は、翻訳部１５２からトークン列が入力されると、トークン列を結合したテキストデータを出力する。 Also, when a token string is input from the translation unit 152, the first splitting/joining unit 151 outputs text data in which the token string is joined.

翻訳部１５２は、第１分割結合部１５１からテキストデータのトークン列が入力されると、発話言語のトークン列に翻訳する。翻訳部１５２は、例えば、ＲＮＮ（ＲｅｃｕｒｒｅｎｔＮｅｕｒａｌＮｅｔｗｏｒｋ）、ＬＳＴＭ（ＬｏｎｇＳｈｏｒｔＴｅｒｍＭｅｍｏｒｙ）、又は、Ａｔｔｅｎｔｉｏｎ機構のみを用いるＥｎｃｏｄｅｒ－Ｄｅｃｏｄｅｒモデルである。Ａｔｔｅｎｔｉｏｎ機構のみを用いるＥｎｃｏｄｏｒ－Ｄｅｃｏｄｅｒモデルは、例えば、Ｔｒａｎｓｆｏｍｅｒと呼ばれるモデルが開示されている（https://arxiv.org/pdf/1706.03762.pdf）。本開示では、Ｔｒａｎｓｆｏｍｅｒを採用する場合を例に説明する。ＴｒａｎｓｆｏｍｅｒのようなＥｎｃｏｄｅｒ－Ｄｅｃｏｄｅｒモデルは、ＲＮＮ、ＣＮＮといった構造を用いない。Ｅｎｃｏｄｅｒ層と、Ｄｅｃｏｄｅｒ層とは、任意の階層を用いることができる。翻訳部１５２は、発話言語のトークン列を、第２分割結合部１５３に出力する。 When the token string of text data is input from the first splitting/joining unit 151, the translation unit 152 translates it into a token string of the spoken language. The translation unit 152 is, for example, an RNN (Recurrent Neural Network), an LSTM (Long Short Term Memory), or an Encoder-Decoder model that uses only an Attention mechanism. As an encoder-decoder model using only the attention mechanism, for example, a model called Transformer is disclosed (https://arxiv.org/pdf/1706.03762.pdf). In the present disclosure, a case where Transformer is adopted will be described as an example. Encoder-Decoder models like Transformer do not use structures such as RNN and CNN. Arbitrary layers can be used for the Encoder layer and the Decoder layer. Translating unit 152 outputs the token string of the spoken language to second splitting/coupling unit 153 .

第２分割結合部１５３は、翻訳部１５２から入力された発話言語のトークン列を、当該トークン列を結合した発話言語データを出力する。 The second dividing/combining unit 153 outputs spoken language data obtained by combining the token strings of the spoken language input from the translation unit 152 .

また、第２分割結合部１５３は、発話言語データを、トークン列に分割する。具体的には、第２分割結合部１５３は、まず、発話言語データが入力されると、発話言語データに対し、自然言語の辞書を用いて、トークンリストを生成する。このトークンリストの生成は、翻訳モデルの学習時に行われ、第１分割結合部１５１のトークンリストの生成とは独立して行われる。第２分割結合部１５３は、トークンリストから、適切なトークン列を選択することにより、テキストデータをトークン列に分割する。そして、第２分割結合部１５３は、トークン列を、翻訳部１５２に出力する。 Also, the second splitting/joining unit 153 splits the spoken language data into token strings. Specifically, when the spoken language data is input, the second splitting/coupling unit 153 first generates a token list for the spoken language data using a natural language dictionary. The generation of this token list is performed when the translation model is learned, and is performed independently of the generation of the token list of the first splitter/joiner 151 . The second splitting/joining unit 153 splits the text data into token strings by selecting an appropriate token string from the token list. Then, second splitting and joining section 153 outputs the token string to translation section 152 .

学習部１３３は、テキストデータを第１分割結合部１５１に、当該テキストデータに対応する発話言語データを第２分割結合部１５３にそれぞれ入力する。学習部１３３は、テキストデータのトークン列が、翻訳部１５２により翻訳された発話言語データのトークン列を取得する。また、学習部１３３は、第２分割結合部１５３により分割された発話言語データのトークン列を取得する。学習部１３３は、翻訳部１５２から出力されたトークン列と、第２分割結合部１５３により出力されたトークン列とが一致するように、翻訳モデルのパラメータを学習する。翻訳モデルは、Ａｔｔｅｎｔｉｏｎ機構を用いているため、自然言語の文脈において発話言語として表現する際にどこに着目すればよいかを自動で学習することができる。学習方法は、Ｔｒａｎｓｆｏｍｅｒと同様の方法、その他任意の手法を用いることができる。例えば、学習部１３３は、この一連の処理を各学習データに行うことで、翻訳モデルのパラメータを学習する。 The learning unit 133 inputs the text data to the first splitting/joining unit 151 and inputs the spoken language data corresponding to the text data to the second splitting/joining unit 153 . The learning unit 133 acquires the token string of spoken language data obtained by translating the token string of the text data by the translation unit 152 . Also, the learning unit 133 acquires the token string of the spoken language data divided by the second splitting/combining unit 153 . The learning unit 133 learns the parameters of the translation model so that the token string output from the translation unit 152 and the token string output from the second splitting/joining unit 153 match each other. Since the translation model uses an attention mechanism, it can automatically learn where to focus when expressing spoken language in the context of a natural language. As a learning method, a method similar to Transformer or any other method can be used. For example, the learning unit 133 learns the parameters of the translation model by performing this series of processes on each piece of learning data.
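The text only says that the parameters are learned so that the two token strings match. A common way to measure such a match when training Transformer-style models is token-level cross-entropy, which is assumed here purely for illustration. The function name `sequence_cross_entropy` and the toy probability tables are hypothetical.

```python
import math


# Hypothetical loss sketch: average negative log-probability that the model
# assigns to each reference token (lower means a closer match).
def sequence_cross_entropy(predicted: list[dict[str, float]],
                           reference: list[str]) -> float:
    if not reference or len(predicted) != len(reference):
        raise ValueError("need one predicted distribution per reference token")
    total = 0.0
    for dist, token in zip(predicted, reference):
        # Clamp to avoid log(0) when the reference token got no probability.
        prob = max(dist.get(token, 0.0), 1e-12)
        total += -math.log(prob)
    return total / len(reference)


reference = ["ま", "レ", "～"]  # token string from the second splitting/joining unit
good = [{"ま": 0.9, "マ": 0.1}, {"レ": 0.8, "れ": 0.2}, {"～": 0.9, "ー": 0.1}]
bad = [{"ま": 0.1, "マ": 0.9}, {"レ": 0.2, "れ": 0.8}, {"～": 0.1, "ー": 0.9}]
print(sequence_cross_entropy(good, reference) < sequence_cross_entropy(bad, reference))  # True
```

In this toy setup, predicting the katakana or hiragana form of a sound is the same decision as predicting its accent, so a single loss over the spoken-language tokens covers both reading and accent.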

そして、学習部１３３は、学習した翻訳モデルのパラメータを、モデルＤＢ１２２に記憶する。このように、学習部１３３は、自然言語の単語リストと、読みとアクセントとを同時に表すように定義した発話言語により言語データを表現した発話言語のリストとが非対応な状況下で、単語非対応辞書と、発話言語データとを翻訳モデルの学習に用いることで、文字数が少なくなることによる学習効率を向上させると共に、文脈に沿った読み及びアクセントを推定することができる。 Then, the learning unit 133 stores the parameters of the learned translation model in the model DB 122. In this way, in a situation where the natural language word list and the list of the spoken language (which expresses language data in a spoken language defined so as to represent reading and accent simultaneously) do not correspond to each other, the learning unit 133 uses the word non-corresponding dictionary and the spoken language data to learn the translation model. This improves learning efficiency because fewer characters are needed, and makes it possible to estimate reading and accent in accordance with context.

入力部１３４は、言語データの入力を受け付ける。具体的には、ユーザ端末２０から、受信制御部１３１がテキストデータを受信することにより、入力部１３４がテキストデータの入力を受け付ける。 The input unit 134 receives input of language data. Specifically, when the reception control unit 131 receives the text data from the user terminal 20, the input unit 134 accepts input of the text data.

図４は、ユーザ端末２０に表示される画面の例を示す図である。図４に示すように、画面１６０は、テキストボックス１６１と、ボタン１６２とを含む。 FIG. 4 is a diagram showing an example of a screen displayed on the user terminal 20. As shown in FIG. 4, the screen 160 includes a text box 161 and a button 162.

テキストボックス１６１は、テキストデータを入力するためのテキストボックスである。 A text box 161 is a text box for entering text data.

ボタン１６２は、テキストボックス１６１に入力されたテキストデータを情報処理装置１０に送信し、情報処理装置１０から合成音声を受信し、当該合成音声を再生するためのボタンである。 The button 162 is a button for transmitting text data input to the text box 161 to the information processing apparatus 10, receiving synthetic speech from the information processing apparatus 10, and reproducing the synthetic speech.

このように、本開示の入力部１３４は、ユーザにより音声合成したいテキストデータの入力を受け付ける。なお、ネットワークを介さない構成としてもよい。 In this way, the input unit 134 of the present disclosure receives input of text data that the user wants to synthesize into speech. In addition, it is good also as a structure which does not go through a network.

翻訳部１３５は、言語データを、予め学習された翻訳モデルを用いて、言語データの読みとアクセントとを同時に表すように定義した発話言語により言語データを表現した発話言語データに翻訳する。 The translation unit 135 uses a translation model learned in advance to translate the language data into spoken language data expressing the language data in a spoken language defined so as to simultaneously express the pronunciation and accent of the language data.

具体的には、翻訳部１３５は、モデルＤＢ１２２から、翻訳モデルと、学習済みのパラメータとを取得する。次に、翻訳部１３５は、テキストデータを、翻訳モデルに入力することにより、発話言語データを得る。そして、翻訳部１３５は、発話言語データを、合成部１３６に出力する。 Specifically, the translation unit 135 acquires a translation model and learned parameters from the model DB 122 . Next, the translation unit 135 obtains spoken language data by inputting the text data into the translation model. Then, translation section 135 outputs the spoken language data to synthesis section 136 .

合成部１３６は、発話言語データに基づいて、言語データの音声特徴量を抽出し、当該音声特徴量に基づいて、音声合成を行う。 The synthesizing unit 136 extracts the speech feature amount of the language data based on the spoken language data, and performs speech synthesis based on the speech feature amount.

具体的には、合成部１３６は、まず、発話言語データを任意の音声特徴量推定モデルに入力することで、音声特徴量を取得する。音声特徴量推定モデルは、読みとアクセントとを入力すると、音声特徴量を出力するモデルである。音声特徴量推定モデルは、例えば、ＤＮＮ等である。音声特徴量は、例えば、メルスペクトログラム等である。合成部１３６は、使用する音声特徴量推定モデルの入力形式に合わせて、発話言語データの読み及びアクセントを、それぞれ抽出する構成としてもよい。 Specifically, the synthesizing unit 136 first acquires speech feature values by inputting the spoken language data into an arbitrary speech feature value estimation model. The speech feature estimation model is a model that outputs speech features when reading and accent are input. The speech feature quantity estimation model is, for example, DNN. The speech feature amount is, for example, a mel spectrogram. The synthesizing unit 136 may be configured to extract the pronunciation and accent of the spoken language data according to the input format of the speech feature quantity estimation model to be used.
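When the speech feature estimation model expects reading and accent as separate inputs, the spoken language data has to be split back into the two. The sketch below is one possible way to do that, assuming the hiragana/katakana rules defined earlier; the function name `from_spoken_language` is hypothetical, and any other symbols (for example emotion or prosody marks) are simply skipped in this simplified version.

```python
# Hypothetical helper: recover (reading, accent) from a spoken-language string.
HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096   # ぁ .. ゖ
KATAKANA_FIRST, KATAKANA_LAST = 0x30A1, 0x30F6   # ァ .. ヶ
KANA_OFFSET = 0x60                               # katakana minus hiragana


def from_spoken_language(spoken: str) -> tuple[str, str]:
    reading, accent = [], []
    for ch in spoken:
        code = ord(ch)
        if ch == "ー":                       # long vowel, accent 1 (down)
            reading.append("ー")
            accent.append("1")
        elif ch == "～":                     # long vowel, accent 2 (up)
            reading.append("ー")
            accent.append("2")
        elif HIRAGANA_FIRST <= code <= HIRAGANA_LAST:
            reading.append(ch)               # hiragana, accent 1 (down)
            accent.append("1")
        elif KATAKANA_FIRST <= code <= KATAKANA_LAST:
            reading.append(chr(code - KANA_OFFSET))  # katakana, accent 2 (up)
            accent.append("2")
        # Other symbols are ignored in this sketch.
    return "".join(reading), "".join(accent)


print(from_spoken_language("まレ～しあのみズ"))  # ('まれーしあのみず', '12211112')
```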

次に、合成部１３６は、音声特徴量から、任意のボコーダを用いて、音声を合成する。ボコーダは、音声特徴量から、音声波形を生成するものである。ボコーダは、音声波形が、所定の人、キャラクター、動物等を再現するように、予め学習されたものであってもよい。そして、合成部１３６は、音声を合成した合成音声を、送信制御部１３２に、ユーザ端末２０に対し送信させる。 Next, the synthesis unit 136 synthesizes speech from the speech features using an arbitrary vocoder. A vocoder generates a speech waveform from speech features. The vocoder may be trained in advance so that the speech waveform reproduces a predetermined person, character, animal, or the like. The synthesis unit 136 then causes the transmission control unit 132 to transmit the synthesized speech to the user terminal 20.

＜２．動作＞
以下では、情報処理システム１における処理について図面を参照しながら説明する。 <2. Operation>
Processing in the information processing system 1 will be described below with reference to the drawings.

<2.1. Learning Processing>
FIG. 5 is a flowchart showing an example of the flow of learning processing performed by the information processing apparatus 10. The information processing apparatus 10 executes this processing at an arbitrary timing. The arbitrary timing is, for example, the timing at which a learning start signal is received from the operator of the information processing apparatus 10.

In step S101, the learning unit 133 acquires, from the learning data DB 121, learning data including language data and spoken language data that expresses the language data in a spoken language defined so as to represent reading and accent simultaneously.

In step S102, the learning unit 133 learns the translation model using the learning data.

In step S103, the learning unit 133 stores the parameters of the learned translation model in the model DB 122 and ends the processing. In this learning processing, the information processing apparatus 10 uses, for learning the translation model, spoken language data that expresses language data in a spoken language defined so as to represent reading and accent simultaneously. As a result, the information processing apparatus 10 can learn a translation model capable of estimating reading and accent in accordance with context, while improving learning efficiency because fewer characters are needed.
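The flow of steps S101 to S103 can be sketched as follows. This is a minimal Python illustration, not the implementation of the disclosure. The dictionaries standing in for the learning data DB 121 and the model DB 122, the keys `"pairs"` and `"translation_model"`, and the function names are all hypothetical. The toy `fit_lookup` trainer merely memorizes pairs and stands in for training an actual translation model.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]     # (language data, spoken language data)
Params = Dict[str, str]    # stand-in for learned model parameters


def fit_lookup(pairs: List[Pair]) -> Params:
    """Toy trainer (assumption): memorizes each pair.

    A real system would train a sequence-to-sequence translation
    model here instead of building a lookup table.
    """
    return {source: target for source, target in pairs}


def run_learning(learning_db: Dict[str, List[Pair]],
                 model_db: Dict[str, Params],
                 fit: Callable[[List[Pair]], Params] = fit_lookup) -> None:
    pairs = learning_db["pairs"]             # S101: acquire learning data
    params = fit(pairs)                      # S102: learn the translation model
    model_db["translation_model"] = params   # S103: store the learned parameters
```

Passing the trainer in as `fit` keeps the three steps visible while leaving the choice of model open.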

<2.2. Speech Synthesis Processing>
FIG. 6 is a flowchart showing an example of the flow of speech synthesis processing performed by the information processing apparatus 10. The information processing apparatus 10 executes this processing at an arbitrary timing. The arbitrary timing is, for example, the timing at which text data is received from the user terminal 20.

In step S111, the reception control unit 131 receives text data from the user terminal 20, whereby the input unit 134 accepts input of the text data.

In step S112, the translation unit 135 acquires the translation model and its learned parameters from the model DB 122.

In step S113, the translation unit 135 obtains spoken language data by inputting the text data into the translation model.

In step S114, the synthesis unit 136 acquires speech features by inputting the spoken language data into an arbitrary speech feature estimation model.

In step S115, the synthesis unit 136 synthesizes speech from the speech features using an arbitrary vocoder.

In step S116, the synthesis unit 136 causes the transmission control unit 132 to transmit the synthesized speech to the user terminal 20, and ends the processing. In this way, the information processing apparatus 10 uses a translation model learned in advance to translate language data into spoken language data that expresses the language data in a spoken language defined so as to represent the reading and accent of the language data simultaneously. As a result, the information processing apparatus 10 can estimate reading and accent in accordance with context.
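The chain of steps S113 to S115 can be sketched in Python as a composition of three callables. This is an illustrative sketch under stated assumptions: the type aliases, the function name `synthesize`, and the representation of speech features as lists of floats are hypothetical. Because the description allows an arbitrary speech feature estimation model and an arbitrary vocoder, each stage is passed in as a parameter rather than fixed.

```python
from typing import Callable, List, Sequence

Features = List[List[float]]                   # e.g. mel-spectrogram frames
Translate = Callable[[str], str]               # text -> spoken language data
EstimateFeatures = Callable[[str], Features]   # spoken language data -> features
Vocode = Callable[[Features], Sequence[float]] # features -> waveform samples


def synthesize(text: str,
               translate: Translate,
               estimate_features: EstimateFeatures,
               vocode: Vocode) -> Sequence[float]:
    """Run the synthesis chain on text received in S111.

    S112 (loading the translation model and parameters) is assumed to
    have been done by the caller, who supplies a ready `translate`.
    """
    spoken = translate(text)                # S113: text -> spoken language data
    features = estimate_features(spoken)    # S114: spoken language data -> features
    waveform = vocode(features)             # S115: features -> speech waveform
    return waveform                         # S116: the caller transmits this result
```

Transmission to the user terminal 20 in step S116 is left to the caller, since it depends on the communication layer.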

<3. Summary>
Conventionally, a dictionary of words whose readings are already known has been used to estimate reading and accent. Accent estimation using such a dictionary has the following problems: accents cannot be estimated for new words that are not in the dictionary, it copes poorly with context-dependent accent changes, and it cannot handle natural accents that include emotional expression. For these reasons, conventional reading and accent estimation could not achieve natural speech synthesis.

As described above, according to the present disclosure, learning data is acquired that includes language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously. Using the learning data, a translation model that outputs spoken language data when language data is input is learned, and the learned translation model is output. This makes it possible to learn a translation model capable of estimating reading and accent in accordance with context.

In addition, according to the present disclosure, language data is translated, using a translation model learned in advance, into spoken language data that expresses the language data in a spoken language defined so as to represent the reading and accent of the language data simultaneously. This makes it possible to estimate reading and accent in accordance with context.

Furthermore, speech features of the language data are extracted based on the spoken language data, and synthesized speech is obtained by performing speech synthesis based on the speech features. This makes it possible to perform smooth speech synthesis in accordance with context.

<Other Modifications>
Although embodiments according to the disclosure have been described above, they can be implemented in various other forms, and various omissions, substitutions, and changes can be made. These embodiments and modifications, as well as versions with omissions, substitutions, and changes, fall within the technical scope of the claims and their equivalents.

For example, each function of the information processing apparatus 10 may be implemented in another apparatus. For example, each DB of the storage unit 120 may be constructed as an external database.

In the above disclosure, the spoken language is defined by reading and accent, but it is not limited to this. For example, the spoken language can also be used to express, in addition to reading, strength, emotion, and the like. FIG. 7 shows an example of spoken language expressions. For example, the spoken language can express strength in three levels, 0 to 2. In this case, the spoken language can be written using not only hiragana and katakana but also substitute characters (ateji) corresponding to "2", as shown on the right side of FIG. 7.

<Appendix>
The matters described in the above embodiments are appended below.
(Appendix 1) A program for operating a computer (10) comprising a processor (11), the program causing the processor to execute: a step (S101) of acquiring learning data including language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously; a step (S102) of learning, using the learning data, a translation model that outputs the spoken language data when language data is input; and a step (S103) of outputting the learned translation model.

(Appendix 2) The program according to (Appendix 1), wherein the spoken language does not separately include a representation of the language data by reading only and a representation of the language data by accent only.

(Appendix 3) The program according to (Appendix 1) or (Appendix 2), wherein, for one sound of the language data, the spoken language represents the reading of the one sound and the accent of the one sound with the same symbol.

(Appendix 4) The program according to (Appendix 2) or (Appendix 3), wherein the translation model is an Encoder-Decoder model using only an Attention mechanism.

(Appendix 5) The program according to any one of (Appendix 1) to (Appendix 4), wherein, in the learning step, the translation model is learned using a first word list that is a list of natural language words, a second word list that is a list of spoken language words, and the learning data, the first word list is a list in which readings and accents are not assigned to the natural language words, and the second word list is a list of words in the spoken language.

(Appendix 6) A program for operating a computer (10) comprising a processor (11), the program causing the processor to execute: a step (S111) of accepting input of language data; a step (S112) of translating the language data, using a translation model learned in advance, into spoken language data expressing the language data in a spoken language defined so as to represent the reading and accent of the language data simultaneously; and a step of outputting the spoken language data, wherein the translation model outputs the spoken language data when language data is input.

(Appendix 7) The program according to (Appendix 6), further causing the processor to execute: a step (S114) of extracting speech features of the language data based on the spoken language data; and a step (S115) of obtaining synthesized speech by performing speech synthesis based on the speech features, wherein the synthesized speech is output in the outputting step (S116).

(Appendix 8) An information processing device (10) comprising a processor (11), the information processing device executing: a step (S101) of acquiring learning data including language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously; a step (S102) of learning, using the learning data, a translation model that outputs the spoken language data when language data is input; and a step (S103) of outputting the learned translation model.

(Appendix 9) A method in which a computer (for example, the information processing apparatus 10) executes: a step (S101) of acquiring learning data including language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously; a step (S102) of learning, using the learning data, a translation model that outputs the spoken language data when language data is input; and a step (S103) of outputting the learned translation model.

1 information processing system, 10 information processing device, 11 processor, 12 memory, 13 storage, 14 communication IF, 15 input/output IF, 20 user terminal, 30 network, 110 communication unit, 120 storage unit, 121 learning data DB, 122 model DB, 130 control unit, 131 reception control unit, 132 transmission control unit, 133 learning unit, 134 input unit, 135 translation unit, 136 synthesis unit.

Claims

1. A program for operating a computer comprising a processor, the program causing the processor to execute:
a step of acquiring learning data including language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously;
a step of learning, using the learning data, a translation model that outputs the spoken language data when language data is input; and
a step of outputting the learned translation model.

2. The program according to claim 1, wherein the spoken language does not separately include a representation of the language data by reading only and a representation of the language data by accent only.

3. The program according to claim 1 or 2, wherein, for one sound of the language data, the spoken language represents the reading of the one sound and the accent of the one sound with the same symbol.

4. The program according to claim 2, wherein said translation model is an Encoder-Decoder model using only an Attention mechanism.

5. The program according to any one of claims 1 to 4, wherein
in the learning step, the translation model is learned using a first word list that is a list of natural language words, a second word list that is a list of spoken language words, and the learning data,
the first word list is a list in which readings and accents are not assigned to the natural language words, and
the second word list is a list of words in the spoken language.

6. A program for operating a computer comprising a processor, the program causing the processor to execute:
a step of accepting input of language data;
a step of translating the language data, using a translation model learned in advance, into spoken language data expressing the language data in a spoken language defined so as to represent the reading and accent of the language data simultaneously; and
a step of outputting the spoken language data,
wherein the translation model outputs the spoken language data when language data is input.

7. The program according to claim 6, further causing the processor to execute:
a step of extracting speech features of the language data based on the spoken language data; and
a step of obtaining synthesized speech by performing speech synthesis based on the speech features,
wherein the synthesized speech is output in the outputting step.

8. An information processing device comprising a processor, the information processing device executing:
a step of acquiring learning data including language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously;
a step of learning, using the learning data, a translation model that outputs the spoken language data when language data is input; and
a step of outputting the learned translation model.

9. A method in which a computer executes:
a step of acquiring learning data including language data and spoken language data expressing the language data in a spoken language defined so as to represent reading and accent simultaneously;
a step of learning, using the learning data, a translation model that outputs the spoken language data when language data is input; and
a step of outputting the learned translation model.