JP2024054051A - System and method for training an acoustic model

Info

Publication number: JP2024054051A
Application number: JP2022192811A
Authority: JP
Inventors: 竜之介大道; 慶二郎才野; 方成西村
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2022-10-04
Filing date: 2022-12-01
Publication date: 2024-04-16
Also published as: JP2024054052A; JP2024054058A; JP2024054053A

Abstract

[Problem] To make various kinds of training easy to carry out by allowing the data used to train an acoustic model to be selected from a plurality of pieces of training data.
[Solution] An acoustic model training system includes a first device that is connectable to a network and used by a first user, and a server connectable to the network. Under the control of the first user, the first device uploads a plurality of sound waveforms to the server, selects one or more sound waveforms as a first waveform set from the plurality of sound waveforms that have already been uploaded or are to be uploaded, and transmits to the server a first execution instruction for a first training job for an acoustic model that generates acoustic features. Based on the first execution instruction from the first device, the server starts executing the first training job using the selected first waveform set and provides the first device with the trained acoustic model trained by the first training job.
[Selected drawing] FIG. 4

Description

One embodiment of the present invention relates to a system and method for training an acoustic model.

Sound synthesis technology that synthesizes the voice of a specific singer or the performance sound of a specific instrument is known. In particular, sound synthesis technology that uses machine learning (e.g., Patent Documents 1 and 2) requires a sufficiently trained acoustic model in order to output natural-sounding synthesized sound in that specific voice or performance sound based on score data and audio data input by a user.

Patent Document 1: JP 2020-076843 A
Patent Document 2: International Publication No. 2022/080395

However, sufficiently training an acoustic model required labeling an enormous amount of voice and performance sound with linguistic features, which took a great deal of time and money. As a result, only well-funded companies could train acoustic models, and the variety of acoustic models was limited.

One object of an embodiment of the present invention is to make various kinds of training easy to carry out by allowing the data used to train an acoustic model to be selected from a plurality of pieces of training data.

An acoustic model training system according to an embodiment of the present invention includes a first device that is connectable to a network and used by a first user, and a server connectable to the network. Under the control of the first user, the first device uploads a plurality of sound waveforms to the server, selects one or more sound waveforms as a first waveform set from the plurality of sound waveforms that have already been uploaded or are to be uploaded, and transmits to the server a first execution instruction for a first training job for an acoustic model that generates acoustic features. Based on the first execution instruction from the first device, the server starts executing the first training job using the selected first waveform set and provides the first device with the trained acoustic model trained by the first training job.

An acoustic model training method according to an embodiment of the present invention causes one or more computers to provide a first user with an interface for selecting, from a plurality of pre-stored sound waveforms, one or more sound waveforms to be used in executing a first training job for an acoustic model that generates acoustic features.

According to an embodiment of the present invention, various kinds of training can be carried out easily because the data used to train an acoustic model can be selected from a plurality of pieces of training data.

FIG. 1 is a diagram showing the overall configuration of an acoustic model training system in one embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a server in one embodiment of the present invention.
FIG. 3 is a block diagram showing the concept of an acoustic model in one embodiment of the present invention.
FIG. 4 is a sequence diagram showing an acoustic model training method and a sound synthesis method in one embodiment of the present invention.
FIG. 5 is a diagram showing an example of a GUI used in an acoustic model training method in one embodiment of the present invention.
FIG. 6 is a sequence diagram showing an acoustic model training method and a sound synthesis method in one embodiment of the present invention.
FIG. 7 is a diagram showing an example of a GUI for disclosing acoustic model information and requesting a trial listening in one embodiment of the present invention.
FIG. 8 is a sequence diagram showing an acoustic model training method and a sound synthesis method in one embodiment of the present invention.
FIG. 9 is a diagram showing an example of a GUI for setting public information when training an acoustic model in one embodiment of the present invention.
FIG. 10 is a flowchart showing an acoustic model training method in one embodiment of the present invention.
FIG. 11 is a sequence diagram showing a method for recording sound waveforms used to train an acoustic model in one embodiment of the present invention.
FIG. 12 is a diagram showing a data structure managed by a server in one embodiment of the present invention.
FIG. 13 is a diagram showing data transmitted to a server when training an acoustic model in one embodiment of the present invention.
FIG. 14 is a flowchart showing an acoustic model training method in one embodiment of the present invention.
FIG. 15 is a flowchart showing a method for recommending pieces of music suitable for training an acoustic model in one embodiment of the present invention.

An acoustic model training system and method according to an embodiment of the present invention are described in detail below with reference to the drawings. The embodiments shown below are examples of modes for carrying out the present invention, and the present invention should not be construed as limited to these embodiments. In the drawings referred to in this embodiment, identical parts or parts with similar functions are given the same or similar reference signs (signs that merely append A, B, etc. to a number), and repeated description of them may be omitted.

In the following embodiments, "score data" is data that includes information on the pitch and intensity of notes, information on the phonemes of notes, information on the sounding period of notes, and information on performance symbols. For example, score data is data representing at least one of the score and the lyrics of a piece of music. Score data may be data representing the time series of notes that make up the piece, or data representing the time series of language that makes up the piece.
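For illustration only, score data of this kind could be held in a structure like the one below. This is a minimal sketch: the class and field names (`Note`, `ScoreData`, `pitch`, and so on), the MIDI-style numbering, and the units are assumptions made for this example and are not specified in this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Note:
    """One note of score data (hypothetical field names and units)."""
    pitch: int          # assumed MIDI-style note number, e.g. 60 = middle C
    intensity: int      # assumed MIDI-style velocity, 0-127
    phoneme: str        # phoneme(s) on this note; empty for instruments
    onset_s: float      # start of the sounding period, in seconds
    duration_s: float   # length of the sounding period, in seconds


@dataclass
class ScoreData:
    """Time series of notes plus performance symbols for one piece."""
    notes: List[Note] = field(default_factory=list)
    performance_symbols: List[str] = field(default_factory=list)

    def total_duration_s(self) -> float:
        """End time of the last-sounding note, or 0.0 for an empty score."""
        return max((n.onset_s + n.duration_s for n in self.notes), default=0.0)


score = ScoreData(
    notes=[Note(60, 90, "la", 0.0, 0.5), Note(62, 80, "la", 0.5, 1.0)],
    performance_symbols=["legato"],
)
print(score.total_duration_s())
```

A lyrics-only variant would simply leave `pitch` and `intensity` unused, which is one reason a real format would likely separate the note series from the language series.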

A "sound waveform" is waveform data of a sound, and the sound source that emits that sound is identified by a sound source ID. For example, a sound waveform is at least one of waveform data of singing and waveform data of an instrument sound. For example, sound waveforms include waveform data of a singer's singing voice and of an instrument's performance sound captured via an input device such as a microphone. The sound source ID identifies the timbre of that singer's singing or the timbre of that instrument's performance sound. Among sound waveforms, one that is input in order to generate a synthetic sound waveform using an acoustic model is called a "synthesis sound waveform," and one that is used to train an acoustic model is called a "training sound waveform." When there is no need to distinguish between the two, both are simply called "sound waveforms."

The "acoustic model" has an input for the score features of score data and an input for the acoustic features of a sound waveform. As the acoustic model, for example, the acoustic model described in International Publication No. 2022/080395, which has a score encoder 111, an acoustic encoder 121, a switching unit 131, and an acoustic decoder 133, is used. This acoustic model is a sound synthesis model used by a sound synthesis program for generating a new synthetic sound waveform: by processing a sound source ID together with either the score features of input score data or the acoustic features of an input sound waveform, it generates the acoustic features of a target sound waveform having the timbre indicated by that sound source ID. The sound synthesis program supplies score features generated from the score data of a piece of music, together with a sound source ID, to the acoustic model, obtains acoustic features of that piece in the timbre indicated by the sound source ID, and converts those acoustic features into a sound waveform. Alternatively, the sound synthesis program supplies acoustic features generated from a sound waveform of a piece of music, together with a sound source ID, to the acoustic model, obtains new acoustic features of that piece in the timbre indicated by the sound source ID, and converts those new acoustic features into a sound waveform. A predetermined number of sound source IDs are prepared for each acoustic model. In other words, each acoustic model selectively generates acoustic features of the timbre indicated by the sound source ID from among a predetermined number of timbres.
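The two input paths and the role of the sound source ID can be pictured with a toy data-flow sketch. Everything here is an assumption made for illustration: the class `ToyAcousticModel`, its method `generate`, and the stand-in lambda functions are hypothetical, and the callables merely take the place of trained neural networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Features = List[float]


@dataclass
class ToyAcousticModel:
    """Data-flow sketch only: the callables stand in for neural networks."""
    score_encoder: Callable[[Features], Features]
    acoustic_encoder: Callable[[Features], Features]
    acoustic_decoder: Callable[[Features, int], Features]
    source_ids: Tuple[int, ...]  # the predetermined set of sound source IDs

    def generate(self, features: Features, source_id: int,
                 from_score: bool) -> Features:
        """Return acoustic features in the timbre selected by source_id."""
        if source_id not in self.source_ids:
            raise ValueError(f"unknown sound source ID: {source_id}")
        # Switching step: pick the encoder that matches the input type.
        encoder = self.score_encoder if from_score else self.acoustic_encoder
        return self.acoustic_decoder(encoder(features), source_id)


# Stand-in functions so that the sketch runs; a real model would learn these.
model = ToyAcousticModel(
    score_encoder=lambda f: [x * 2.0 for x in f],
    acoustic_encoder=lambda f: [x + 1.0 for x in f],
    acoustic_decoder=lambda h, sid: [x + sid for x in h],
    source_ids=(0, 1, 2),
)
from_score = model.generate([1.0, 2.0], source_id=1, from_score=True)
from_audio = model.generate([1.0, 2.0], source_id=2, from_score=False)
```

Both paths end in the same decoder, so the timbre is chosen by the sound source ID alone, regardless of whether the input was score data or a sound waveform.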

The acoustic model is a generative model with a predetermined architecture that uses machine learning, for example a convolutional neural network (CNN) or a recurrent neural network (RNN). Acoustic features represent the characteristics of sound production in the frequency spectrum of the waveform of a natural or synthesized sound; acoustic features that are close to each other mean that the timbre of the singing voice or performance sound, and the way it changes over time, are similar.

In training an acoustic model, the variables of the acoustic model are changed so that the acoustic model generates acoustic features similar to those of the referenced sound waveform. The training uses, for example, the training program P2, the score data D1 (training score data), and the training acoustic data D2 (training sound waveforms) described in International Publication No. 2022/080395. Through basic training using waveforms of a plurality of sounds corresponding to a plurality of sound source IDs, the variables of the acoustic model (the score encoder, the acoustic encoder, and the acoustic decoder) are changed so that the model can generate acoustic features of synthesized sounds in the plurality of timbres corresponding to those sound source IDs. Furthermore, by supplementarily training that trained acoustic model using sound waveforms of another timbre corresponding to a new (unused) sound source ID, the acoustic model becomes able to generate acoustic features of the timbre indicated by the new sound source ID. Specifically, an acoustic model that has been trained with sound waveforms of the voices of XXX (multiple people) is further given supplementary training, using a new sound source ID, with sound waveforms of the voice of YYY (one person); the variables of the acoustic model (at least those of the acoustic decoder) are thereby changed so that the model can generate acoustic features of YYY's voice. A unit of training of an acoustic model that corresponds to a new sound source ID, as described above, is called a "training job." In other words, a training job means a series of training processes executed by a training program.
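The idea of reserving an unused sound source ID for a supplementary training job can be sketched as simple bookkeeping. The class `SourceIdTable`, its method `assign_new`, and the rule of taking the lowest free ID are hypothetical choices for this example; the document only states that a predetermined number of IDs exists per model and that a new, unused ID is used.

```python
from typing import Dict, Optional


class SourceIdTable:
    """Tracks which of a model's fixed sound source ID slots are in use."""

    def __init__(self, num_ids: int) -> None:
        # None marks an unused slot; a string names the timbre assigned to it.
        self.slots: Dict[int, Optional[str]] = {i: None for i in range(num_ids)}

    def assign_new(self, timbre_name: str) -> int:
        """Bind timbre_name to the lowest unused ID and return that ID."""
        for source_id, owner in self.slots.items():
            if owner is None:
                self.slots[source_id] = timbre_name
                return source_id
        raise RuntimeError("no unused sound source ID left in this model")


table = SourceIdTable(num_ids=4)
for singer in ("XXX-1", "XXX-2"):    # timbres covered by basic training
    table.assign_new(singer)
new_id = table.assign_new("YYY")     # target of the supplementary training job
```

Each call to `assign_new` for a new timbre corresponds to one training job in the sense defined above.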

A "program" refers to an instruction or group of instructions executed by a processor in a computer equipped with a processor and memory. "Computer" is a general term for the entity that executes a program. For example, when a program is executed by a server (or a client), the "computer" refers to that server (or client). When a "program" is executed by distributed processing between a server and a client, the "computer" includes both the server and the client. In that case, the "program" includes "a program executed by the server" and "a program executed by the client." Similarly, when a "program" is processed in a distributed manner among multiple servers, the "computer" includes the multiple servers, and the "program" includes the programs executed by each server.

[1. First embodiment]
[1-1. Overall system configuration]
FIG. 1 is a diagram showing the overall configuration of an acoustic model training system in one embodiment of the present invention. As shown in FIG. 1, the acoustic model training system 10 includes a server 100 (Server), a communication terminal 200 (TM1), and a communication terminal 300 (TM2). The server 100 and the communication terminals 200 and 300 can each be connected to a network 400. The communication terminals 200 and 300 can each communicate with the server 100 via the network 400. The communication terminal 200 may be referred to as the "first device." The user who uses the communication terminal 200 may be referred to as the "first user."

In this embodiment, the server 100 is a computer that functions as a sound synthesizer and trains an acoustic model. The server 100 includes a storage 110. FIG. 1 illustrates a configuration in which the storage 110 is directly connected to the server 100, but the configuration is not limited to this. For example, the storage 110 may be connected to the network 400 directly or via another computer, and data may be transmitted and received between the server 100 and the storage 110 via the network 400.

The communication terminal 200 is a terminal that selects training sound waveforms for training an acoustic model and transmits to the server 100 an instruction to execute the training. The communication terminal 300 is a terminal that is different from the communication terminal 200 and can access the server 100. As described in detail later, the communication terminal 300 is a terminal for viewing, or listening to a preview of, public information about an acoustic model that is being trained. The communication terminals 200 and 300 include, for example, mobile communication terminals such as smartphones or tablets, and stationary communication terminals such as desktop personal computers.

The network 400 is the Internet as provided by a typical World Wide Web (WWW) service, a WAN (Wide Area Network), or a LAN (Local Area Network) such as an in-house LAN.

[1-2. Configuration of the server used for sound synthesis]
FIG. 2 is a block diagram showing the configuration of a server in one embodiment of the present invention. As shown in FIG. 2, the server 100 includes a control unit 101, a RAM (Random Access Memory) 102, a ROM (Read Only Memory) 103, a user interface (UI) 104, a communication interface 105, and a storage 110. The sound synthesis technology of this embodiment is realized by the functional units of the server 100 working together.

The control unit 101 includes a central processing unit (CPU), a graphics processing unit (GPU), and storage devices such as registers and memory connected to the CPU and GPU. The control unit 101 uses the CPU and GPU to execute programs temporarily stored in the memory, thereby realizing the functions of the server 100. Specifically, the control unit 101 performs arithmetic processing in response to various request signals from the communication terminal 200 and provides content data to the communication terminals 200 and 300.

The RAM 102 temporarily stores the control programs, acoustic models (each consisting of an architecture and variables), content data, and the like that are needed for arithmetic processing. The RAM 102 is also used, for example, as a data buffer that temporarily holds various data received from external devices such as the communication terminal 200 until the data is stored in the storage 110. General-purpose memory such as SRAM (Static Random Access Memory) or DRAM (Dynamic Random Access Memory) may be used as the RAM 102.

The ROM 103 stores various programs, various acoustic models, parameters, and the like for realizing the functions of the server 100. The programs, acoustic models, parameters, and the like stored in the ROM 103 are read out by the control unit 101 as needed and executed or used.

Under the control of the control unit 101, the user interface 104 displays various images, such as a graphical user interface (GUI), on its display and accepts input from the user of the server 100.

Under the control of the control unit 101, the communication interface 105 connects to the network 400 and transmits and receives information to and from other communication devices connected to the network 400, such as the communication terminals 200 and 300.

The storage 110 is a recording device (recording medium), such as a non-volatile memory or a hard disk drive, that can retain information permanently and allows it to be rewritten. The storage 110 stores programs, acoustic models, and information such as the parameters needed to execute the programs. As shown in FIG. 2, the storage 110 stores, for example, a sound synthesis program 111, a training job 112, score data 113, and sound waveforms 114. Programs and data used for general sound synthesis can be used for these; for example, the sound synthesis program P1, the training program P2, the score data D1, and the acoustic data D2 described in International Publication No. 2022/080395 may be used, respectively.

As described above, the sound synthesis program 111 is a program for generating a synthetic sound waveform from score data or a sound waveform. When the control unit 101 executes the sound synthesis program 111, it uses the acoustic model 120 to generate the synthetic sound waveform. This synthetic sound waveform corresponds to the acoustic data D3 described in International Publication No. 2022/080395. The training program for the acoustic model 120 that the control unit 101 executes in the training job 112 is, for example, the program for training the encoders and the acoustic decoder described in International Publication No. 2022/080395. The score data is data that defines a piece of music. A sound waveform is waveform data of a voice or a performance sound, for example waveform data representing a singer's singing voice or the performance sound of an instrument.

[1-3. Functional configuration of the server used for sound synthesis]
FIG. 3 is a block diagram showing the concept of an acoustic model in one embodiment of the present invention. As described above, the acoustic model 120 is a machine learning model used in the sound synthesis technology that the control unit 101 of FIG. 2 carries out when it reads and executes the sound synthesis program 111. The acoustic model 120 generates acoustic features. As an input signal, the control unit 101 inputs to the acoustic model 120 either the score features 123 of the score data 113 of a desired piece of music or the acoustic features 124 of a sound waveform 114. When the acoustic model 120 is used to process a sound source ID and the score features 123, acoustic features 129 of a synthesized sound of the piece are generated. Based on the acoustic features 129, the control unit 101 synthesizes and outputs a synthetic sound waveform 130 in which the piece is sung by the singer, or played on the instrument, identified by the sound source ID. Alternatively, when the acoustic model 120 is used to process a sound source ID and the acoustic features 124, acoustic features 129 of a synthesized sound of the piece are generated. Based on the acoustic features 129, the control unit 101 synthesizes and outputs a synthetic sound waveform 130 in which the sound waveform of the piece has been converted to the timbre of the singing voice of the singer, or the performance sound of the instrument, identified by the sound source ID.

The acoustic model 120 is a generative model that uses machine learning, and it is trained by the control unit 101 while the control unit 101 is executing a training program (that is, while it is executing the training job 112). The control unit 101 trains the acoustic model 120 using a new (unused) sound source ID and training sound waveforms, and determines the variables of the acoustic model 120 (at least those of the acoustic decoder). Specifically, the control unit 101 generates training acoustic features from the training sound waveforms and gradually and repeatedly changes the above variables so that, when the new sound source ID and the training acoustic features are input to the acoustic model 120, the acoustic features from which the synthetic sound waveform 130 is generated approach the training acoustic features. The training sound waveforms may, for example, be uploaded (transmitted) from the communication terminal 200 or the communication terminal 300 to the server 100 and stored in the storage 110 as user data, or they may be stored in the storage 110 in advance by the administrator of the server 100 as reference data. In the following description, storing in the storage 110 may be referred to as storing in the server 100.
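The phrase "gradually and repeatedly changes the variables" can be illustrated with a deliberately tiny example. This is a sketch under strong assumptions, not the training program of the cited publication: the encoder output is treated as fixed, the only trainable decoder variable is a per-dimension offset tied to the new sound source ID, and plain gradient descent on a mean squared error is assumed as the update rule. The names `train_bias` and `mean_squared_error` are hypothetical.

```python
from typing import List

Frame = List[float]


def mean_squared_error(bias: Frame, encoded: List[Frame],
                       targets: List[Frame]) -> float:
    """Average squared distance between generated and training features."""
    total = 0.0
    for enc, tgt in zip(encoded, targets):
        total += sum((e + b - t) ** 2 for e, b, t in zip(enc, bias, tgt))
    return total / len(targets)


def train_bias(encoded: List[Frame], targets: List[Frame],
               steps: int = 200, lr: float = 0.1) -> Frame:
    """Fit the offset for the new source ID; the encoder stays frozen."""
    dim = len(targets[0])
    bias = [0.0] * dim  # decoder variable for the new sound source ID
    for _ in range(steps):
        grad = [0.0] * dim
        for enc, tgt in zip(encoded, targets):
            for d in range(dim):
                # Derivative of the squared error with respect to bias[d].
                grad[d] += 2.0 * (enc[d] + bias[d] - tgt[d]) / len(targets)
        # Small step against the gradient: the "gradual, repeated" change.
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias


encoded = [[0.0, 0.0], [1.0, 1.0]]   # frozen encoder output per frame
targets = [[1.0, 2.0], [2.0, 3.0]]   # training acoustic features per frame
loss_before = mean_squared_error([0.0, 0.0], encoded, targets)
bias = train_bias(encoded, targets)
loss_after = mean_squared_error(bias, encoded, targets)
```

In this toy setting the loss falls from `loss_before` toward zero as the offset converges; a real acoustic decoder has far more variables, but the loop has the same shape.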

[1-4. Sound synthesis method]
FIG. 4 is a sequence diagram showing an acoustic model training method and a sound synthesis method in one embodiment of the present invention. The acoustic model training method shown in FIG. 4 is an example in which the communication terminal 200 uploads training sound waveforms to the server 100. As described above, however, the training sound waveforms may instead be stored in the server 100 in advance by other means. The training job in the sequence shown in FIG. 4 may be referred to as the "first training job." Each step of the process TM1 on the communication terminal 200 side and each step of the process Server on the server 100 side is actually executed by the control unit of the communication terminal 200 and the control unit 101 of the server 100, respectively; here, to keep the description simple, the communication terminal 200 and the server 100 are described as the entities that execute the steps. Unless otherwise noted, the same applies to the descriptions of the sequence diagrams and flowcharts that follow.

As shown in FIG. 4, the communication terminal 200 (first device) first uploads (transmits) one or more training sound waveforms to the server 100, based on an instruction from the first user, who has logged in to the first user's account on the server 100 (step S401). The server 100 stores the training sound waveforms transmitted in S401 in the first user's storage area (step S411). One sound waveform or a plurality of sound waveforms may be uploaded to the server 100, and a plurality of sound waveforms may be stored separately in a plurality of folders in the first user's storage area. Steps S401 and S411 above are preparatory steps for executing the training job described below.

Next, the steps for executing a training job are described. The communication terminal 200 requests the server 100 to execute a training job (step S402). In response to the request of S402, the server 100 provides the communication terminal 200 with a graphical user interface (GUI) for selecting the sound waveforms to be used in the training job from among the sound waveforms that have already been stored (and the sound waveforms that are to be stored) (step S412).

通信端末２００は、その表示器にＳ４１２で提供されたＧＵＩを表示し、第１ユーザは、そのＧＵＩを用いて、記憶領域（乃至所望のフォルダ）にアップロードされた複数の音波形から一以上の訓練用音波形を波形セット１４９（図５参照）として選択する（ステップＳ４０３）。Ｓ４０３で波形セット１４９（訓練用音波形）が選択された後に、第１ユーザからの指示に応じて、通信端末２００は、訓練ジョブの実行開始を指示する（ステップＳ４０４）。 The communication terminal 200 displays the GUI provided in S412 on its display, and the first user uses the GUI to select, as a waveform set 149 (see FIG. 5), one or more training sound waveforms from the multiple sound waveforms uploaded to the storage area (or to a desired folder) (step S403). After the waveform set 149 (training sound waveforms) is selected in S403, the communication terminal 200, in response to an instruction from the first user, instructs the server 100 to start executing the training job (step S404).

Ｓ４０４における通信端末２００（第１デバイス）からの指示に基づいて、サーバ１００は、選択された波形セット１４９を用いて訓練ジョブの実行を開始する（ステップＳ４１３）。換言すると、Ｓ４１３において、Ｓ４１２で提供されたＧＵＩを介した第１ユーザの指示に基づいて訓練ジョブが実行される。 Based on the instruction from the communication terminal 200 (first device) in S404, the server 100 starts executing the training job using the selected waveform set 149 (step S413). In other words, in S413, the training job is executed based on the instruction of the first user via the GUI provided in S412.
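
The flow of steps S401 to S413 can be illustrated with a minimal Python sketch. This is not the implementation described in the publication: the class `TrainingServer`, its methods, and the dictionary layout of a job are hypothetical names chosen for illustration, and the "server" here is an in-memory object rather than a networked service.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingServer:
    """Toy stand-in for server 100; every name here is hypothetical."""
    storage: dict = field(default_factory=dict)  # user -> {waveform name: samples}
    jobs: list = field(default_factory=list)     # training jobs that have been started

    def upload(self, user, name, samples):
        # S401 / S411: store an uploaded waveform in the user's storage area.
        self.storage.setdefault(user, {})[name] = samples

    def list_candidates(self, user):
        # S412: the waveform names a selection GUI would offer to the user.
        return sorted(self.storage.get(user, {}))

    def start_training_job(self, user, waveform_set):
        # S404 / S413: accept the execution instruction and start the job.
        stored = self.storage.get(user, {})
        missing = [name for name in waveform_set if name not in stored]
        if not waveform_set or missing:
            raise ValueError(f"invalid waveform set, missing: {missing}")
        job = {"id": len(self.jobs) + 1, "user": user,
               "waveform_set": list(waveform_set), "state": "running"}
        self.jobs.append(job)
        return job


server = TrainingServer()
for letter in ("A", "B", "Z"):
    server.upload("user1", f"waveform_{letter}", [0.0, 0.1, -0.1])

candidates = server.list_candidates("user1")
# S403: the user selects waveforms A and B as the waveform set.
job = server.start_training_job("user1", candidates[:2])
```

The sketch validates the waveform set when the execution instruction arrives, which is one possible design; the publication also allows selected waveforms to be uploaded after the instruction.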

訓練には、選択された波形セット１４９中の各波形が全部使用されるのではなく、無音区間やノイズ区間などを除いた有用区間のみを含む前処理済み波形セットが使用される。また、訓練される音響モデル１２０（ベース音響モデル）として、音響デコーダが未訓練の音響モデル１２０を用いてもよいが、複数の基本訓練済みの音響モデル１２０のうち、波形セット１４９の波形の音響特徴量に近い音響特徴量の生成を学習した音響デコーダを含む音響モデル１２０を選択して用いれば、訓練ジョブにかかる時間やコストを低減できる。何れの音響モデル１２０を選ぶとしても、楽譜エンコーダと音響エンコーダは、基本訓練済みのものを用いる。 Training does not use every waveform in the selected waveform set 149 in its entirety. Instead, it uses a preprocessed waveform set that contains only the useful sections, with silent sections, noise sections, and the like removed. An acoustic model 120 whose acoustic decoder is untrained may be used as the acoustic model 120 to be trained (the base acoustic model). However, the time and cost of the training job can be reduced by selecting, from among multiple acoustic models 120 that have undergone basic training, one whose acoustic decoder has learned to generate acoustic features close to those of the waveforms in the waveform set 149. Whichever acoustic model 120 is selected, a score encoder and an acoustic encoder that have already undergone basic training are used.
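
The publication does not say how useful sections are detected. As one possible illustration, the following Python sketch keeps frames whose mean energy exceeds a threshold. The function name `useful_sections`, the frame length, and the threshold are all assumptions, and noise sections are not handled.

```python
def useful_sections(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample ranges whose mean frame energy exceeds threshold.

    A simple energy gate used as a stand-in for the unspecified preprocessing;
    it removes silence only, not noise.
    """
    sections = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i                    # a useful section begins here
        elif start is not None:
            sections.append((start, i))      # the section ended at this silent frame
            start = None
    if start is not None:
        sections.append((start, len(samples)))
    return sections


# 4 silent samples, 8 voiced, 4 silent, 4 voiced
waveform = [0.0] * 4 + [0.5, -0.5, 0.4, -0.4] * 2 + [0.0] * 4 + [0.3, -0.3, 0.3, -0.3]
sections = useful_sections(waveform)  # [(4, 12), (16, 20)]
```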

ベース音響モデルは、第１ユーザが選択した波形セット１４９に基づいて、サーバ１００が決定してもよい。又は、第１ユーザが、複数の訓練済み音響モデルのいずれかをベース音響モデルとして選択して、第１実行指示にそのベース音響モデルを示す指定データを含めてもよい。音響デコーダに供給する音源ＩＤ（例えば、歌手ＩＤ、楽器ＩＤなど）としては、未使用の新たな音源ＩＤを用いる。ここで、新たな音源ＩＤとしてどの音源ＩＤを使用されたかを、ユーザは必ずしも知らなくてよい。ただ、訓練済みモデルを使用して音声合成する際には、自動的に、その新たな音源ＩＤが用いられる。 The base acoustic model may be determined by the server 100 based on the waveform set 149 selected by the first user. Alternatively, the first user may select one of a plurality of trained acoustic models as the base acoustic model and include designation data indicating the base acoustic model in the first execution instruction. A new, unused sound source ID is used as the sound source ID (e.g., singer ID, instrument ID, etc.) to be supplied to the acoustic decoder. Here, the user does not necessarily need to know which sound source ID has been used as the new sound source ID. However, when synthesizing speech using the trained model, the new sound source ID is automatically used.

訓練ジョブでは、前処理済み波形セットから一部の短波形を少しずつ取り出し、取り出した短波形を用いて音響モデル（少なくとも音響デコーダ）を訓練する、という単位訓練を繰り返す。単位訓練では、前記新たな音源ＩＤと短波形の音響特徴量とを音響モデル１２０に入力し、それに応じて音響モデル１２０が出力する音響特徴量と入力した音響特徴量の間の差分が小さくなるよう、音響モデルの変数を調整する。変数の調整には、例えば、誤差逆伝搬法を用いる。単位訓練を繰り返すことで、前処理済み波形セットによる訓練が一通り終わったら、音響モデル１２０が生成する音響特徴量の品質を評価して、当該品質が所定の基準に達していなければ、その前処理済み波形セットを用いて、再び音響モデルの訓練を行う。音響モデル１２０が生成する音響特徴量の品質が所定の基準に達していれば、訓練ジョブは完了し、その時点の音響モデル１２０が訓練済み音響モデル１２０となる。 In the training job, a unit training is repeated in which some short waveforms are extracted little by little from the preprocessed waveform set, and the extracted short waveforms are used to train the acoustic model (at least the acoustic decoder). In the unit training, the new sound source ID and the acoustic features of the short waveforms are input to the acoustic model 120, and the variables of the acoustic model are adjusted accordingly so that the difference between the acoustic features output by the acoustic model 120 and the input acoustic features becomes small. For example, the backpropagation method is used to adjust the variables. By repeating the unit training, once the training using the preprocessed waveform set is completed, the quality of the acoustic features generated by the acoustic model 120 is evaluated, and if the quality does not reach a predetermined standard, the preprocessed waveform set is used to train the acoustic model again. If the quality of the acoustic features generated by the acoustic model 120 reaches a predetermined standard, the training job is completed, and the acoustic model 120 at that point becomes the trained acoustic model 120.
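
The repeated unit training and the quality check can be sketched as follows. This is a toy under stated assumptions, not the acoustic model 120 itself. The "model" is a single weight plus one bias per sound source ID, the acoustic features are plain numbers, the gradient is written out by hand in place of backpropagation, and the quality criterion is a mean squared error threshold. All function and parameter names are hypothetical.

```python
import random


def unit_training(params, source_id, batch, lr=0.1):
    """One unit-training step: gradient descent on the mean squared difference
    between the model output and the input feature."""
    w = params["w"]
    b = params["bias"][source_id]
    grad_w = grad_b = 0.0
    for x in batch:
        err = (w * x + b) - x                 # output feature minus input feature
        grad_w += 2 * err * x / len(batch)
        grad_b += 2 * err / len(batch)
    params["w"] = w - lr * grad_w
    params["bias"][source_id] = b - lr * grad_b


def quality_error(params, source_id, features):
    """Mean squared error over the whole preprocessed set (lower is better)."""
    w, b = params["w"], params["bias"][source_id]
    return sum(((w * x + b) - x) ** 2 for x in features) / len(features)


def run_training_job(features, source_id, batch_size=4, criterion=1e-4,
                     max_passes=1000, seed=0):
    rng = random.Random(seed)
    params = {"w": 0.0, "bias": {source_id: 0.5}}   # stand-in for the model variables
    for passes in range(1, max_passes + 1):
        order = list(features)
        rng.shuffle(order)
        # Take short pieces a few at a time and run unit training on each.
        for i in range(0, len(order), batch_size):
            unit_training(params, source_id, order[i:i + batch_size])
        # After one full pass, evaluate quality; repeat the pass if it is not met.
        if quality_error(params, source_id, features) <= criterion:
            return params, passes
    raise RuntimeError("quality criterion not reached")


features = [i / 10 for i in range(-10, 11)]
params, passes = run_training_job(features, source_id="new-source")
```

The outer loop corresponds to training again with the same preprocessed waveform set when the quality has not reached the predetermined standard.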

Ｓ４１３で訓練ジョブが完了することで、訓練済み音響モデル１２０が確立される（ステップＳ４１４）。この訓練済み音響モデル１２０を「第１音響モデル」という場合がある。サーバ１００は、通信端末２００に、訓練済み音響モデル１２０が確立されたことを通知する（ステップＳ４１５）。上記のＳ４０３～Ｓ４１５のステップが、音響モデル１２０の訓練ジョブである。 When the training job is completed in S413, the trained acoustic model 120 is established (step S414). This trained acoustic model 120 may be referred to as the "first acoustic model." The server 100 notifies the communication terminal 200 that the trained acoustic model 120 has been established (step S415). The above steps S403 to S415 are the training job for the acoustic model 120.

Ｓ４１５の通知の後に、第１ユーザからの指示に応じて、通信端末２００が、所望の楽曲の楽譜データを含む音声合成の指示をサーバ１００に送信する（ステップＳ４０５）。それに応じて、サーバ１００は、音声合成プログラムを実行して、その楽譜データに基づいて、Ｓ４１４で完成した訓練済み音響モデル１２０を用いた音声合成を実行する（ステップＳ４１６）。Ｓ４１６で生成された合成音波形１３０を通信端末２００に送信する（ステップＳ４１７）。この音声合成では、前記新たな音源ＩＤが用いられる。 After the notification in S415, in response to an instruction from the first user, the communication terminal 200 transmits a voice synthesis instruction including the score data of the desired piece of music to the server 100 (step S405). In response, the server 100 executes a voice synthesis program and performs voice synthesis using the trained acoustic model 120 completed in S414 based on the score data (step S416). The synthetic sound waveform 130 generated in S416 is transmitted to the communication terminal 200 (step S417). In this voice synthesis, the new sound source ID is used.

Ｓ４１６及びＳ４１７を併せて、訓練ジョブによって訓練された訓練済み音響モデル１２０（音声合成機能）を、通信端末２００（第１デバイス）ないし第１ユーザに提供する、ということができる。ステップＳ４１６の音声合成プログラムの実行を、サーバ１００の代わりに、通信端末２００で行ってもよい。その場合、サーバ１００は、当該訓練済み音響モデル１２０を通信端末２００に送信し、通信端末２００は、受け取った訓練済み音響モデル１２０を用いて、前記新たな音源ＩＤで、所望の楽曲の楽譜データに基づく音声合成処理を実行し、合成音波形１３０を取得する。 By combining S416 and S417, it can be said that the trained acoustic model 120 (voice synthesis function) trained by the training job is provided to the communication terminal 200 (first device) or the first user. The execution of the voice synthesis program in step S416 may be performed by the communication terminal 200 instead of the server 100. In this case, the server 100 transmits the trained acoustic model 120 to the communication terminal 200, and the communication terminal 200 uses the received trained acoustic model 120 to execute voice synthesis processing based on the sheet music data of the desired song with the new sound source ID, and obtains the synthetic voice waveform 130.

本実施形態では、Ｓ４０２で訓練ジョブの実行を要求する前に、Ｓ４０１で訓練用音波形をアップロードしたが、この構成に限定されない。例えば、訓練用音波形のアップロードが、Ｓ４０４で訓練ジョブの実行を指示した後に行われてもよい。この場合、Ｓ４０３において、通信端末２００に記憶された複数の音波形（未アップロードの音波形を含む）から、波形セット１４９として一以上の音波形が選択され、訓練ジョブの実行指示に応じて、選択された音波形のうちの未アップロードの音波形が、アップロードされてもよい。 In this embodiment, the training sound waveforms are uploaded in S401 before execution of the training job is requested in S402, but the embodiment is not limited to this configuration. For example, the training sound waveforms may be uploaded after execution of the training job is instructed in S404. In this case, in S403, one or more sound waveforms are selected as the waveform set 149 from a plurality of sound waveforms stored in the communication terminal 200 (including sound waveforms that have not yet been uploaded), and, in response to the instruction to execute the training job, those of the selected sound waveforms that have not yet been uploaded may be uploaded.
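
The deferred-upload variation amounts to uploading only the selected waveforms that the server does not yet hold. A minimal Python sketch, with the hypothetical helper name `waveforms_to_upload` and waveform names standing in for the actual files:

```python
def waveforms_to_upload(selected, already_uploaded):
    """Return the selected waveform names not yet on the server, in selection order."""
    uploaded = set(already_uploaded)
    return [name for name in selected if name not in uploaded]


# The user selected A, B and C on the terminal; only B is already on the server.
pending = waveforms_to_upload(["A", "B", "C"], ["B", "Z"])  # ["A", "C"]
```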

［１－５．ＧＵＩ１４０］
ここで、Ｓ４１２で提供されるＧＵＩの一例について説明する。図５は、本発明の一実施形態における音響モデルの訓練方法における第１ＧＵＩの一例を示す図である。図５に示すＧＵＩ１４０は、通信端末２００のユーザインタフェースに含まれる表示器に表示される。図５に示すように、ＧＵＩ１４０には、訓練用音波形の候補として、音波形Ａ、音波形Ｂ、・・・、音波形Ｚ（例えば、特定のフォルダにアップロード済みの音波形）が表示される。それぞれの音波形の隣には、チェックボックス１４１、１４２、・・・、１４３が表示されている。上記のように訓練用音波形の候補として表示された音波形Ａ、音波形Ｂ、・・・、音波形Ｚは、例えば、同一人による歌声に係る音波形であり、それぞれ楽曲や歌い方が異なっていてもよい。音波形は、同一の楽器の種々の演奏音であってもよい。 [1-5. GUI 140]
An example of the GUI provided in S412 will now be described. FIG. 5 is a diagram showing an example of a first GUI in the acoustic model training method according to an embodiment of the present invention. The GUI 140 shown in FIG. 5 is displayed on a display included in the user interface of the communication terminal 200. As shown in FIG. 5, the GUI 140 displays sound waveform A, sound waveform B, ..., sound waveform Z (for example, sound waveforms already uploaded to a specific folder) as candidates for training sound waveforms. Check boxes 141, 142, ..., 143 are displayed next to the respective sound waveforms. The sound waveforms A, B, ..., Z displayed as candidates are, for example, sound waveforms of the singing voice of the same person, and may differ from one another in song or singing style. The sound waveforms may also be various performance sounds of the same instrument.

上記の構成を換言すると、Ｓ４１２において、サーバ１００は、予め保存された複数の音波形（及び保存される予定の音波形）から、音響モデル１２０に対する訓練ジョブを実行させるための一以上の音波形を、波形セット１４９として第１ユーザに選択させるＧＵＩを、通信端末２００に提供する。 In other words, in S412, the server 100 provides the communication terminal 200 with a GUI that allows the first user to select one or more sound waveforms as a waveform set 149 from a plurality of pre-stored sound waveforms (and sound waveforms to be stored) for executing a training job for the acoustic model 120.

上記Ｓ４０３において、通信端末２００の第１ユーザによって、図５に示すチェックボックス１４１、１４２、・・・、１４３がチェックされることで、訓練用音波形が選択される。図５では、訓練用音波形として、チェックボックス１４１、１４２がチェックされ、音波形Ａ及び音波形Ｂが波形セット１４９として選択された例を示す。波形セット１４９として選択する波形は１つでも複数でもよい。 In S403 above, the first user of the communication terminal 200 checks the check boxes 141, 142, ..., 143 shown in Figure 5 to select training sound waveforms. Figure 5 shows an example in which check boxes 141 and 142 are checked as training sound waveforms, and sound waveform A and sound waveform B are selected as waveform set 149. One or more waveforms may be selected as waveform set 149.

上記Ｓ４０４において、チェックボックス１４１、１４２が選択された状態で、実行ボタン１４４が押されたのに応じて、通信端末２００は、Ｓ４０４の訓練ジョブの指示を実行する。当該訓練ジョブの指示に応じて、サーバ１００は、音波形Ａ及び音波形Ｂからなる波形セット１４９を用いた音響モデル１２０の訓練を開始する。実行ボタン１４４が押されるとは、実行ボタン１４４がクリック又はタップされることを含む。 In the above S404, when the execute button 144 is pressed while the check boxes 141 and 142 are selected, the communication terminal 200 executes the instruction for the training job in S404. In response to the instruction for the training job, the server 100 starts training the acoustic model 120 using the waveform set 149 consisting of sound waveform A and sound waveform B. Pressing the execute button 144 includes clicking or tapping the execute button 144.

以上のように、本実施形態に係る音響モデル訓練システム１０は、予めストレージ１１０に保存された複数の音波形（及び保存される予定の音波形）から一以上の音波形を選択して、選択された音波形を訓練用音波形として音響モデル１２０に対する訓練ジョブを実行する。上記の構成によって、通信端末２００の第１ユーザは、未訓練の又は訓練済の音響モデル１２０を訓練することで、所望の音響モデル１２０を得る。なお、音波形のサーバ１００へのアップロードは、波形セット１４９の選択や訓練ジョブの実行指示より後でもよい。つまり、訓練ジョブに使用する音波形は、訓練ジョブが開始されるより前の任意の時点で、通信端末２００からサーバ１００にアップロードされてもよい。また、音響デコーダが訓練済み音響モデルの補助訓練ならば、従来の音響モデル１２０に比べて、短時間かつ低コストで、訓練済み音響モデル１２０を得られる。 As described above, the acoustic model training system 10 according to this embodiment selects one or more sound waveforms from a plurality of sound waveforms stored in advance in the storage 110 (and sound waveforms that are to be stored), and executes a training job for the acoustic model 120 using the selected sound waveforms as training sound waveforms. With this configuration, the first user of the communication terminal 200 obtains a desired acoustic model 120 by training an untrained or already trained acoustic model 120. Note that the sound waveforms may be uploaded to the server 100 after the waveform set 149 is selected or after execution of the training job is instructed. In other words, the sound waveforms used in the training job may be uploaded from the communication terminal 200 to the server 100 at any time before the training job is started. Furthermore, when the training job is supplementary training of an acoustic model whose acoustic decoder has already been trained, the trained acoustic model 120 can be obtained in less time and at lower cost than with a conventional acoustic model 120.

［２．第２実施形態］
図６及び図７を用いて、第２実施形態に係る音響モデル訓練システム１０Ａについて説明する。音響モデル訓練システム１０Ａの全体構成及びサーバに関するブロック図は第１実施形態に係る音響モデル訓練システム１０と同じなので、説明を省略する。以下の説明において、第１実施形態と同じ構成については説明を省略し、主に第１実施形態と相違する点について説明する。以下の説明において、第１実施形態と同様の構成について説明をする場合、図１～図５を参照し、これらの図に示された符号の後にアルファベット“Ａ”を付して説明する。 [2. Second embodiment]
An acoustic model training system 10A according to the second embodiment will be described with reference to Figures 6 and 7. The overall configuration of the acoustic model training system 10A and the block diagram relating to the server are the same as those of the acoustic model training system 10 according to the first embodiment, and therefore their description will be omitted. In the following description, description of the same configuration as in the first embodiment will be omitted, and differences from the first embodiment will be mainly described. In the following description, when describing the same configuration as in the first embodiment, reference will be made to Figures 1 to 5, and the alphabet "A" will be added after the reference numerals shown in these figures.

［２－１．音声合成方法］
図６は、本発明の一実施形態における音響モデルの訓練方法及び音声合成方法を示すシーケンス図である。図６に示す音響モデルの訓練方法では、ユーザの指示で訓練ジョブの実行が開始されてから訓練済み音響モデルが完成するまでの間に、その訓練ジョブの進行状態を示す情報を、第３者に公開する構成について説明する。図６のステップＳ６０１以前のステップは、図４のＳ４０１～Ｓ４０３と同様なので、説明を省略する。図６のＳ６０１は図４のＳ４０４と同じである。以下の説明において、通信端末３００Ａを使用する、上記第３者に該当するユーザを「第２ユーザ」という場合がある。 [2-1. Voice synthesis method]
Fig. 6 is a sequence diagram showing a method for training an acoustic model and a method for synthesizing speech in an embodiment of the present invention. In the method for training an acoustic model shown in Fig. 6, a configuration is described in which information indicating the progress of a training job is made available to a third party from the time the execution of the training job is started at the instruction of a user until the trained acoustic model is completed. The steps before step S601 in Fig. 6 are the same as S401 to S403 in Fig. 4, and therefore will not be described. S601 in Fig. 6 is the same as S404 in Fig. 4. In the following description, a user who uses communication terminal 300A and corresponds to the third party may be referred to as a "second user".

Ｓ６０１における通信端末２００Ａからの第１ユーザによる実行指示に基づいて、サーバ１００Ａは、新たな音源ＩＤと選択された波形セット１４９Ａを用いて、ベース音響モデルの訓練ジョブの実行を開始する（ステップＳ６１１）。訓練ジョブの完了時には、その成果として、この波形セット１４９Ａで訓練された訓練済み音響モデル１２０Ａが得られる。Ｓ６１１において、訓練ジョブが開始されると、サーバ１００Ａは、通信端末２００Ａに対して訓練ジョブを開始したことを通知し、訓練ジョブの状態を示す状態情報を第３者への公開すること、つまり第３者による閲覧を許すことについて、その可否を通信端末２００Ａに問い合わせる（ステップＳ６１２）。通信端末２００Ａは、Ｓ６１２の問い合わせに対して、第１ユーザが訓練ジョブの状態を示す状態情報を公開する旨の公開指示を行なえば、その公開指示をサーバ１００Ａに送信する（ステップＳ６０２）。第１ユーザが公開指示を行わなければ、通信端末２００Ａは、公開指示を送信しない。この状態情報は、その公開指示の有無に関係なく通信端末２００Ａに送信され、その表示器に表示されて、第１ユーザにより閲覧される。 Based on the execution instruction from the first user from the communication terminal 200A in S601, the server 100A starts the execution of the training job of the base acoustic model using the new sound source ID and the selected waveform set 149A (step S611). At the completion of the training job, the trained acoustic model 120A trained with this waveform set 149A is obtained as a result. When the training job is started in S611, the server 100A notifies the communication terminal 200A that the training job has been started, and inquires the communication terminal 200A about whether or not to disclose the status information indicating the status of the training job to a third party, that is, whether or not to allow a third party to view the status information (step S612). If the first user issues a disclosure instruction to disclose the status information indicating the status of the training job in response to the inquiry in S612, the communication terminal 200A transmits the disclosure instruction to the server 100A (step S602). If the first user does not issue a disclosure instruction, the communication terminal 200A does not transmit the disclosure instruction. This status information is sent to the communication terminal 200A regardless of whether or not a disclosure instruction has been issued, and is displayed on its display for viewing by the first user.

Ｓ６０２において、上記のように第１ユーザによる公開指示に基づいて、サーバ１００Ａは、Ｓ６１１で実行開始された第１ユーザの訓練ジョブの状態を示す状態情報を、通信端末３００Ａに対して公開する（ステップＳ６１３）。これにより、第３者は、通信端末３００Ａの表示器に表示されたその状態情報を閲覧できる。 In S602, based on the disclosure instruction from the first user as described above, the server 100A discloses to the communication terminal 300A the status information indicating the status of the training job of the first user that was started in S611 (step S613). This allows a third party to view the status information displayed on the display of the communication terminal 300A.

なお、第１ユーザが、訓練ジョブの状態を示す状態情報を公開することに予め同意して公開指示が行われている場合は、Ｓ６１２、Ｓ６０２のステップを省略できる。つまり、その予め行われた第１ユーザの公開指示に基づいて、第１ユーザの訓練ジョブの状態を示す状態情報が第２ユーザに公開されてもよい。 Note that if the first user has agreed in advance to making the status information indicating the status of the training job public and has issued a disclosure instruction, steps S612 and S602 can be omitted. In other words, the status information indicating the status of the first user's training job may be made public to the second user based on the disclosure instruction issued in advance by the first user.
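
The disclosure rule in steps S612, S602, and S613 can be summarized in a small Python sketch. The class `JobStatus` and the function `status_for` are hypothetical. The sketch assumes, as the text states, that the job owner always receives the status information, and that other users receive it only when a disclosure instruction or prior consent exists.

```python
from dataclasses import dataclass


@dataclass
class JobStatus:
    owner: str
    progress_percent: float
    disclosed: bool = False  # set by a disclosure instruction (S602) or by prior consent


def status_for(job, viewer):
    """Return the status information a viewer may see, or None if not disclosed."""
    if viewer == job.owner or job.disclosed:
        return {"owner": job.owner, "progress_percent": job.progress_percent}
    return None


job_status = JobStatus(owner="U1", progress_percent=40.0)
hidden = status_for(job_status, "U2")    # None: no disclosure instruction yet
job_status.disclosed = True              # the first user issues the disclosure instruction
visible = status_for(job_status, "U2")   # now visible to the second user
```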

Ｓ６２２以降のＳ６１５～Ｓ６１８のステップは、図４のＳ４１４～Ｓ４１７のステップと同様なので、説明を省略する。 Steps S615 to S618 after S622 are similar to steps S414 to S417 in Figure 4, so their explanation will be omitted.

図６では、訓練ジョブを実行する指示を出した通信端末２００Ａとは異なる通信端末３００Ａが、試聴要求を実行する主体である構成を例示したが、この構成に限定されない。例えば、訓練ジョブの実行を指示した通信端末２００Ａ（第１ユーザ）が、自ら訓練ジョブの進行度を確認するために、試聴要求を実行してもよい。例えば、通信端末２００Ａが試聴要求をすることで、進行度が１００％に達していなくても、第１ユーザが試聴用の合成音波形に満足するタイミングで訓練ジョブを終了できる。 FIG. 6 illustrates a configuration in which the preview request is made by the communication terminal 300A, which is different from the communication terminal 200A that issued the instruction to execute the training job, but the embodiment is not limited to this configuration. For example, the communication terminal 200A (first user) that instructed execution of the training job may itself make a preview request in order to check the progress of the training job. When the communication terminal 200A makes a preview request in this way, the training job can be ended at the point when the first user is satisfied with the synthesized sound waveform for preview, even if the progress has not reached 100%.

［２－２．ＧＵＩ１５０Ａ］
ここで、Ｓ６１３で提供されるＧＵＩの一例について説明する。図７は、本発明の一実施形態における音響モデルの情報公開及び試聴要求に係るＧＵＩの一例を示す図である。図７に示すＧＵＩ１５０Ａは、通信端末２００Ａ、３００Ａの表示器に表示される。 [2-2. GUI 150A]
An example of the GUI provided in S613 will now be described. FIG. 7 is a diagram showing an example of a GUI for disclosing acoustic model information and requesting a preview according to an embodiment of the present invention. The GUI 150A shown in FIG. 7 is displayed on the displays of the communication terminals 200A and 300A.

図７に示すように、ＧＵＩ１５０Ａには、状態情報に応じた進行度を示す項目１５１Ａ及び詳細情報を示す項目１５２Ａと、試聴を要求する試聴ボタン１５７Ａとが表示されている。本実施形態では、進行度を示す項目１５１Ａは、音響モデル１２０Ａの訓練ジョブの進行度を示している。ただし、当該項目１５１Ａは、例えば完了予想を１００％とする経過時間、及び音響モデル１２０Ａの変数の変化の程度など、完成度以外の項目であってもよい。 As shown in FIG. 7, the GUI 150A displays an item 151A indicating the progress according to the status information, an item 152A indicating detailed information, and a preview button 157A for requesting a preview. In this embodiment, the item 151A indicates the progress of the training job for the acoustic model 120A. However, the item 151A may indicate something other than the degree of completion, such as the elapsed time expressed relative to the predicted completion time taken as 100%, or the degree of change in the variables of the acoustic model 120A.

項目１５１Ａは、訓練ジョブの進行度をパーセント表示するプログレスバーである。項目１５１Ａにおいて、進行度が示す現在の状態は、訓練ジョブの開始時に見積もられた総訓練量に対する現在の訓練量であってもよく、訓練ジョブの実行中における音響モデル１２０Ａの変数の変化の様子から見積もられた総訓練量に対する現在の訓練量であってもよい。つまり、訓練ジョブの状態は時間経過に応じて変化し、サーバ１００Ａは、当該訓練ジョブの状態の経時変化を示す進行度を、項目１５１Ａとして通信端末に提供して表示する。訓練ジョブの状態は時間経過に応じて変化するため、サーバ１００Ａは、訓練ジョブの状態を示す状態情報を、その情報が変化したときに、或いは、一定時間ごとに、繰り返し更新し、通信端末２００Ａ、３００Ａに対して繰り返し提供する。 Item 151A is a progress bar that displays the progress of the training job as a percentage. In item 151A, the current state indicated by the progress may be the current training amount relative to the total training amount estimated at the start of the training job, or the current training amount relative to the total training amount estimated from the change in the variables of acoustic model 120A during the execution of the training job. In other words, the state of the training job changes over time, and server 100A provides and displays the progress indicating the change over time in the state of the training job as item 151A to the communication terminal. Since the state of the training job changes over time, server 100A repeatedly updates the state information indicating the state of the training job when the information changes or at regular intervals, and repeatedly provides it to communication terminals 200A and 300A.
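
A progress value of the kind shown in item 151A can be computed as the ratio of the training done so far to an estimated total. The Python sketch below is an assumption about one reasonable way to do this. The function name `progress_percent` and the unit of "training amount" are hypothetical, and the total may be re-estimated while the job runs.

```python
def progress_percent(done_units, estimated_total_units):
    """Progress as a percentage of the estimated total training amount, capped at 100."""
    if estimated_total_units <= 0:
        raise ValueError("estimated_total_units must be positive")
    return min(100.0, 100.0 * done_units / estimated_total_units)


at_start_estimate = progress_percent(30, 120)   # total estimated when the job started
revised_estimate = progress_percent(30, 60)     # total re-estimated during the job
```

Capping at 100 keeps the progress bar valid when the job runs longer than the estimate.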

本実施形態では、訓練ジョブの状態を示す状態情報が、通信端末２００Ａ、３００Ａに対して繰り返しリアルタイムに提供する構成を例示したが、この構成に限定されない。例えば、当該状態情報は、通信端末２００Ａ、３００Ａの各々に対して１回だけ提供可能な構成であってもよい。又は、当該状態情報は、通信端末３００Ａを用いた第２ユーザによる公開要求に基づいて、当該公開要求のタイミングの上記状態情報が通信端末３００Ａ（第２デバイス）に表示されてもよい。 This embodiment illustrates a configuration in which the status information indicating the status of the training job is repeatedly provided to the communication terminals 200A and 300A in real time, but the embodiment is not limited to this configuration. For example, the status information may be provided only once to each of the communication terminals 200A and 300A. Alternatively, in response to a disclosure request made by the second user using the communication terminal 300A, the status information as of the time of that request may be displayed on the communication terminal 300A (second device).

図７では、進行度を示す項目１５１Ａとして、プログレスバーが表示された構成を例示したが、この構成に限定されない。例えば、進行度を数値でパーセント表示をしてもよい。 In FIG. 7, a progress bar is shown as an example of item 151A indicating the degree of progress, but the present invention is not limited to this configuration. For example, the degree of progress may be displayed as a numerical percentage.

項目１５２Ａは、訓練ジョブの詳細を示す情報である。図７では、項目１５２Ａの詳細情報の一例として、音響モデル名称１５３Ａ、訓練用音波形１５４Ａ、完了予想１５５Ａ、及び訓練実行者１５６Ａが表示されている。音響モデル名称１５３Ａは、第１ユーザが設定した名称である。例えば、「音声Ｘ→Ｙ」は、Ｘ（一人若しくは複数人の歌手Ｘ、または１つ若しくは複数の楽器Ｘ）の音声を合成する訓練前の音響モデル１２０Ａ（ベース音響モデル）を、実行中の訓練ジョブによって、Ｙ（新たな歌手Ｙまたは楽器Ｙ）の音声を合成する訓練済み音響モデル１２０Ａに変化させることを意味する。訓練用音波形１５４Ａは、実行中の訓練ジョブにおいて、音響モデル１２０Ａの訓練に使用される音波形を示す。図７の例は、音響モデル１２０Ａのために音波形Ｂが使用されることを意味する。完了予想１５５Ａは、実行中の訓練ジョブの進行度が１００％に達すると予想される日時を示す。訓練実行者１５６Ａは、実行中の訓練ジョブを実行したユーザ名を示す。当該ユーザ名は、アカウント名であってもよく、ニックネームであってもよい。図７では、訓練実行者１５６Ａは「Ｕ１」である。Ｕ１はＹに係る歌手又は演奏者と同一でもよく、異なってもよい。 Item 152A is information showing details of the training job. In FIG. 7, an acoustic model name 153A, a training sound waveform 154A, a completion forecast 155A, and a training executor 156A are displayed as examples of detailed information of item 152A. The acoustic model name 153A is a name set by the first user. For example, "voice X→Y" means that the pre-training acoustic model 120A (base acoustic model) that synthesizes the voice of X (one or more singers X, or one or more instruments X) is changed to a trained acoustic model 120A that synthesizes the voice of Y (a new singer Y or instrument Y) by the training job being executed. The training sound waveform 154A indicates a sound waveform used for training the acoustic model 120A in the training job being executed. The example in FIG. 7 means that the sound waveform B is used for the acoustic model 120A. The completion forecast 155A indicates the date and time when the progress of the training job being executed is expected to reach 100%. Training executor 156A indicates the name of the user who executed the ongoing training job. The user name may be an account name or a nickname. In FIG. 7, training executor 156A is "U1." U1 may be the same as or different from the singer or performer associated with Y.

試聴ボタン１５７Ａは、後述する試聴要求を実行するボタンである。例えば、図６において、Ｓ６１３における情報公開の後に、第２ユーザが試聴ボタン１５７Ａを押すことによって、通信端末３００Ａがサーバ１００Ａに対して合成音声の試聴を要求する（ステップＳ６２１）。Ｓ６２１において試聴要求が実行されると、サーバ１００Ａは、当該試聴要求が実行された時点における進行度の音響モデル１２０Ａを用いた試聴用の音声合成を、前記新たな音源ＩＤを用いて実行し、試聴用の合成音波形を提供する（ステップＳ６１４）。当該試聴用の合成音波形の提供によって、通信端末３００Ａは、上記の時点における音響モデル１２０Ａによって生成された合成音声を試聴できる（ステップＳ６２２）。当然ながら、この試聴は、通信端末２００Ａでも行える。 Preview button 157A is a button for executing a preview request, which will be described later. For example, in FIG. 6, after the information is disclosed in S613, the second user presses preview button 157A, causing communication terminal 300A to request server 100A to preview the synthetic voice (step S621). When the preview request is executed in S621, server 100A executes preview speech synthesis using acoustic model 120A at the progress level at the time the preview request was executed, using the new sound source ID, and provides a synthetic voice waveform for preview (step S614). By providing the synthetic voice waveform for preview, communication terminal 300A can preview the synthetic voice generated by acoustic model 120A at the above time (step S622). Naturally, this preview can also be performed by communication terminal 200A.

訓練ジョブは、ある一群の処理（バッチ）を単位として、バッチ単位でまとめて実行される。上記の試聴要求が実行された時点で、音響モデル１２０Ａが１つのバッチ処理の最中である場合、直前のバッチ処理で得られた音響モデル１２０Ａで生成した試聴用の合成音波形を提供してもよいし、その時点以後で、実行中のバッチ処理が完了したタイミングで、得られた音響モデル１２０Ａで生成した試聴用の合成音波形の提供を行ってもよい。つまり、サーバ１００Ａは、通信端末２００Ａ、３００Ａからの試聴要求に基づいて、当該試聴要求のタイミングに応じた音響モデル１２０Ａによる試聴用の合成音波形を、第１および第２ユーザに提供する。 Training jobs are executed in batches, with each batch being a group of processes. If acoustic model 120A is in the middle of a batch process at the time the preview request is executed, the synthetic sound waveform for preview generated by acoustic model 120A obtained in the immediately preceding batch process may be provided, or after that point, when the ongoing batch process is completed, the synthetic sound waveform for preview generated by the obtained acoustic model 120A may be provided. In other words, based on the preview request from communication terminals 200A and 300A, server 100A provides the first and second users with a synthetic sound waveform for preview generated by acoustic model 120A according to the timing of the preview request.
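
The first option described above, answering a preview request with the model obtained from the most recently completed batch, can be sketched in Python by keeping a snapshot that is updated only at batch boundaries. The class `TrainingJob` and its methods are hypothetical, and the "model" is a plain dictionary of variables.

```python
import copy


class TrainingJob:
    """Keeps a snapshot of the model variables as of the last completed batch."""

    def __init__(self, params):
        self.params = params                    # live variables, may be mid-update
        self.snapshot = copy.deepcopy(params)   # consistent copy for previews
        self.completed_batches = 0

    def run_batch(self, update):
        update(self.params)                     # variables may be inconsistent during this call
        self.snapshot = copy.deepcopy(self.params)
        self.completed_batches += 1

    def model_for_preview(self):
        # A preview request that arrives mid-batch gets the previous batch's model.
        return copy.deepcopy(self.snapshot)


training_job = TrainingJob({"w": 0.0})
training_job.run_batch(lambda p: p.update(w=p["w"] + 1.0))
training_job.params["w"] = 99.0                 # simulate a batch that is still in progress
preview_model = training_job.model_for_preview()
```

The alternative described in the text, waiting until the running batch completes, would instead delay the response until the next snapshot is taken.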

以上のように、本実施形態に係る音響モデル訓練システム１０Ａによると、通信端末３００Ａの第２ユーザは、訓練ジョブによって音響モデル１２０Ａが訓練され、確立されていく過程を閲覧できる。又は、通信端末２００Ａの第１ユーザは、上記のように、進行度が１００％に達していなくても、満足するタイミングで訓練ジョブを終了できる。 As described above, according to the acoustic model training system 10A of this embodiment, the second user of the communication terminal 300A can view the process in which the acoustic model 120A is trained and established by the training job. Alternatively, the first user of the communication terminal 200A can end the training job at a time when he or she is satisfied, even if the progress has not reached 100%, as described above.

［３．第３実施形態］
図８及び図９を用いて、第３実施形態に係る音響モデル訓練システム１０Ｂについて説明する。音響モデル訓練システム１０Ｂの全体構成及びサーバに関するブロック図は第１実施形態に係る音響モデル訓練システム１０と同じなので、説明を省略する。以下の説明において、第１実施形態と同じ構成については説明を省略し、主に第１実施形態と相違する点について説明する。以下の説明において、第１実施形態と同様の構成について説明をする場合、図１～図５を参照し、これらの図に示された符号の後にアルファベット“Ｂ”を付して説明する。 [3. Third embodiment]
An acoustic model training system 10B according to the third embodiment will be described with reference to Figures 8 and 9. The overall configuration of the acoustic model training system 10B and the block diagram relating to the server are the same as those of the acoustic model training system 10 according to the first embodiment, and therefore their description will be omitted. In the following description, description of the same configuration as in the first embodiment will be omitted, and differences from the first embodiment will be mainly described. In the following description, when describing the same configuration as in the first embodiment, reference will be made to Figures 1 to 5, and the alphabet "B" will be added after the reference numerals shown in these figures.

［３－１．音声合成方法］
図８は、本発明の一実施形態における音響モデルの訓練方法及び音声合成方法を示すシーケンス図である。図８に示す音響モデルの訓練方法では、第１訓練ジョブ及び第２訓練ジョブが並行して実行されており、各々の訓練ジョブに関する状態情報を第３者に対して選択的に公開する構成について説明する。図８のステップＳ８０１以前のステップは、図４のＳ４０１～Ｓ４０３と同様なので、説明を省略する。図８のＳ８０１は図４のＳ４０４と同じである。 [3-1. Voice synthesis method]
Fig. 8 is a sequence diagram showing an acoustic model training method and a speech synthesis method in one embodiment of the present invention. In the acoustic model training method shown in Fig. 8, a first training job and a second training job are executed in parallel, and a configuration in which state information regarding each training job is selectively made public to a third party will be described. Steps before step S801 in Fig. 8 are the same as S401 to S403 in Fig. 4, and therefore description thereof will be omitted. S801 in Fig. 8 is the same as S404 in Fig. 4.

Ｓ８０１における通信端末２００Ｂからの第１実行指示に基づいて、サーバ１００Ｂは、新たな音源ＩＤと第１ユーザの選択した第１波形セットを用いて、第１ベース音響モデルの第１訓練ジョブを実行する（ステップＳ８１１）。Ｓ８１１において、第１訓練ジョブが開始されると、サーバ１００Ｂは、通信端末２００Ｂに対して第１訓練ジョブを開始したことを通知し、第１訓練ジョブに関する第１状態情報を第３者に公開することについて、可否を通信端末２００Ｂに問い合わせる（ステップＳ８１２）。本実施形態において、上記の「第３者」は第２ユーザに該当する。通信端末２００Ｂは、Ｓ８１２の問い合わせに対して、第１状態情報を公開する旨の公開指示をサーバ１００Ｂに送信する（ステップＳ８０２）。 Based on the first execution instruction from the communication terminal 200B in S801, the server 100B executes a first training job of the first base acoustic model using the new sound source ID and the first waveform set selected by the first user (step S811). When the first training job is started in S811, the server 100B notifies the communication terminal 200B that the first training job has been started, and inquires of the communication terminal 200B as to whether or not the first status information related to the first training job should be made public to a third party (step S812). In this embodiment, the above "third party" corresponds to the second user. In response to the inquiry in S812, the communication terminal 200B transmits a disclosure instruction to the server 100B to disclose the first status information (step S802).

Ｓ８０２において、上記のように第１ユーザによる第１公開指示に基づいて、サーバ１００Ｂは、Ｓ８１１で実行された第１訓練ジョブに関する第１状態情報を、通信端末３００Ｂ（第２ユーザ）に対して公開する（ステップＳ８１３）。第１ユーザが第１公開指示をしなかった場合は、サーバ１００Ｂは、第２ユーザに第１状態情報を公開しない。 In S802, based on the first disclosure instruction by the first user as described above, the server 100B discloses the first status information regarding the first training job executed in S811 to the communication terminal 300B (second user) (step S813). If the first user does not issue the first disclosure instruction, the server 100B does not disclose the first status information to the second user.

続いて、Ｓ８０３における通信端末２００Ｂからの第２実行指示に基づいて、サーバ１００Ｂは、新たな音源ＩＤと第１ユーザが選択した第２波形セットを用いて、第２ベース音響モデルの第２訓練ジョブを実行する（ステップＳ８１４）。Ｓ８１１、Ｓ８１４によって、第１訓練ジョブ及び第２訓練ジョブが並行して実行される。第１ベース音響モデルと第２ベース音響モデルとは相互に独立であり、両者の用いる音源ＩＤ間には何の関連性もない。例えば、ｎ個の訓練ジョブを並行処理する場合は、ｎ個の仮想マシンを起動することによって実現される。第２訓練ジョブに用いられる第２波形セットは第１訓練ジョブに用いられる第１波形セットと異なるが、第２訓練ジョブの訓練プログラムは第１訓練ジョブの訓練プログラムと同じである。第１訓練ジョブの完了時には、その成果として、第１波形セットで訓練された第１訓練済み音響モデルが得られ、また、第２訓練ジョブの完了時には、その成果として、第２波形セットで訓練された第２訓練済み音響モデルが得られる。 Next, based on the second execution instruction from the communication terminal 200B in S803, the server 100B executes a second training job of the second base acoustic model using a new sound source ID and the second waveform set selected by the first user (step S814). The first training job and the second training job are executed in parallel by S811 and S814. The first base acoustic model and the second base acoustic model are independent of each other, and there is no correlation between the sound source IDs used by the two. For example, when n training jobs are processed in parallel, this is realized by starting n virtual machines. The second waveform set used in the second training job is different from the first waveform set used in the first training job, but the training program of the second training job is the same as the training program of the first training job. When the first training job is completed, the first trained acoustic model trained with the first waveform set is obtained as a result, and when the second training job is completed, the second trained acoustic model trained with the second waveform set is obtained as a result.
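
Running several independent training jobs with the same training program can be sketched with Python's standard `concurrent.futures` module. The publication describes starting n virtual machines. The threads below are only a local stand-in for that, and `train`, `run_jobs_in_parallel`, and the use of `uuid` to generate new sound source IDs are assumptions for illustration.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def train(job):
    """Placeholder for the shared training program; returns a toy 'trained model'."""
    return {"source_id": job["source_id"], "trained_on": tuple(job["waveform_set"])}


def run_jobs_in_parallel(waveform_sets):
    # Each job gets its own new, unrelated sound source ID.
    jobs = [{"source_id": uuid.uuid4().hex, "waveform_set": ws} for ws in waveform_sets]
    # One worker per job, mirroring "n jobs on n virtual machines".
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(train, jobs))


# First and second waveform sets selected by the first user.
models = run_jobs_in_parallel([["A", "B"], ["C"]])
```

`pool.map` returns results in the order of the submitted jobs, so `models[0]` corresponds to the first training job and `models[1]` to the second. The function expects at least one waveform set.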

第２訓練ジョブを実行する方法は第１訓練ジョブを実行する方法と同様である。第２訓練ジョブでは、第１ユーザが、予め保存された複数の音波形（及び保存される予定の音波形）から選択した一以上の音波形である第２波形セットが使用される。 The method for executing the second training job is similar to the method for executing the first training job. In the second training job, a second waveform set is used, which is one or more sound waveforms selected by the first user from a plurality of pre-stored sound waveforms (and sound waveforms to be stored).

Ｓ８１４において、第２訓練ジョブが開始されると、サーバ１００Ｂは、通信端末２００Ｂに対して第２訓練ジョブを開始したことを通知し、第２訓練ジョブに関する第２状態情報の公開可否を通信端末２００Ｂに問い合わせる（ステップＳ８１５）。通信端末２００Ｂは、この問い合わせに対して、第２訓練ジョブに関する第２状態情報を公開する旨の第２公開指示をサーバ１００Ｂに送信する（ステップＳ８０４）。第２公開指示を受信したサーバ１００Ｂは、Ｓ８１４で実行された第２訓練ジョブに関する第２状態情報を、通信端末３００Ｂ（第２ユーザ）に対して公開する（ステップＳ８１６）。第１ユーザが第２公開指示をしなかった場合は、サーバ１００Ｂは、第２ユーザに第２状態情報を公開しない。 When the second training job is started in S814, the server 100B notifies the communication terminal 200B that the second training job has been started, and inquires of the communication terminal 200B whether or not to make the second status information related to the second training job public (step S815). In response to this inquiry, the communication terminal 200B transmits a second disclosure instruction to the server 100B to make the second status information related to the second training job public (step S804). The server 100B that has received the second disclosure instruction makes the second status information related to the second training job executed in S814 public to the communication terminal 300B (second user) (step S816). If the first user has not issued a second disclosure instruction, the server 100B does not make the second status information public to the second user.

なお、第１ユーザが、第１訓練ジョブ又は第２訓練ジョブに関する状態情報を公開することに予め同意して公開指示が行われている場合は、Ｓ８１２、Ｓ８０２、Ｓ８１５、Ｓ８０４のステップを省略できる。つまり、その予め行われた第１ユーザの公開指示に基づいて、第１訓練ジョブ又は第２訓練ジョブに関する状態情報が第２ユーザに公開されてもよい。 Note that if the first user has agreed in advance to making the status information regarding the first training job or the second training job public and has issued a disclosure instruction, steps S812, S802, S815, and S804 can be omitted. In other words, the status information regarding the first training job or the second training job may be made public to the second user based on the disclosure instruction issued in advance by the first user.

Ｓ８１６以降のＳ８３１～Ｓ８２１のステップは、基本的には、図６のＳ６２１～Ｓ６１８のステップと同様であるが、第１訓練ジョブと第２訓練ジョブの各々について、個別に実行される。 Steps S831 to S821 after S816 are basically the same as steps S621 to S618 in FIG. 6, but are executed separately for each of the first training job and the second training job.

［３－２．ＧＵＩ１６０Ｂ］ [3-2. GUI 160B]
ここで、Ｓ８１５で第１ユーザに対して提供されるＧＵＩの一例について説明する。図９は、本発明の一実施形態における音響モデルの訓練時に公開情報を設定するときの公開設定用ＧＵＩの一例を示す図である。図９に示すＧＵＩ１６０Ｂは、第１ユーザの通信端末２００Ｂ表示器に表示される。 Here, an example of a GUI provided to the first user in S815 will be described. Fig. 9 is a diagram showing an example of a public setting GUI when setting public information during training of an acoustic model in an embodiment of the present invention. The GUI 160B shown in Fig. 9 is displayed on the display of the communication terminal 200B of the first user.

図９に示すように、ＧＵＩ１６０Ｂは、訓練ジョブの状態情報を公開する際に、どのような情報を公開するか設定する画面である。本実施形態では、公開設定項目１６１Ｂには、第１訓練ジョブの項目１６２Ｂ及び第２訓練ジョブの項目１６７Ｂがある。第１訓練ジョブ１６２Ｂには、詳細設定の一例として、音響モデル名称１６３Ｂ、訓練用音波形１６４Ｂ、完了予想１６５Ｂ、及び訓練実行者１６６Ｂの項目が表示されている。第２訓練ジョブ１６７Ｂには、詳細設定の一例として、音響モデル名称１６８Ｂ、訓練用音波形１６９Ｂ、完了予想１７０Ｂ、及び訓練実行者１７１Ｂの項目が表示されている。上記の各項目は図７に示す各項目と同じなので、説明を省略する。 As shown in FIG. 9, GUI 160B is a screen for setting what information to make public when making the status information of a training job public. In this embodiment, the public setting item 161B includes a first training job item 162B and a second training job item 167B. In the first training job 162B, items such as acoustic model name 163B, training sound waveform 164B, completion forecast 165B, and training executor 166B are displayed as examples of detailed settings. In the second training job 167B, items such as acoustic model name 168B, training sound waveform 169B, completion forecast 170B, and training executor 171B are displayed as examples of detailed settings. The above items are the same as the items shown in FIG. 7, so explanations will be omitted.

図９のＧＵＩ１６０Ｂにおいて、ユーザによって選択された項目は『黒塗りの四角形（■）』で表示されており、ユーザ選択されていない項目は『白抜きの四角形（□）』で表示されている。第１ユーザによって第１訓練ジョブ１６２Ｂの項目が選択されると、第１訓練ジョブに係る詳細項目は、全て自動的に選択される。この場合、第１訓練ジョブに係る全ての項目が公開対象となる。第２訓練ジョブ１６７Ｂの項目が非選択の場合、第１ユーザは、第２訓練ジョブに係る詳細項目を個別に選択できる。図９の場合、音響モデル名称１６８Ｂ及び訓練用音波形１６９Ｂの項目のみが選択されている。この場合、第２訓練ジョブについて、選択された詳細項目のみが公開対象となる。第１通信端末は、第１訓練ジョブの第１状態情報のうち、第１ユーザにより公開対象として選択された範囲の情報について、サーバ１００Ｂに第１公開指示を送信し（Ｓ８０２およびＳ８０４）、第２訓練ジョブの第２状態情報のうち、第１ユーザにより公開対象として選択された範囲の情報について、第２公開指示を送信する（Ｓ８０４）。つまり、サーバ１００Ｂは、第１ユーザによる公開指示に基づいて、第１状態情報及び第２状態情報の少なくとも一方を、個別にかつ選択的に、第２ユーザに公開する（通信端末３００Ｂに提供する）。第１訓練ジョブ及び第２訓練ジョブの複数の項目のうち、公開指示を受け取らなかった項目については、対応する状態情報を第２ユーザに公開しない。 In GUI 160B of FIG. 9, items selected by the user are displayed as a filled square (■), and items not selected by the user are displayed as an open square (□). When the first training job item 162B is selected by the first user, all detailed items related to the first training job are automatically selected. In this case, all items related to the first training job are made public. If the second training job item 167B is not selected, the first user can individually select detailed items related to the second training job. In the case of FIG. 9, only the acoustic model name 168B and training sound waveform 169B items are selected. In this case, only the selected detailed items are made public for the second training job. The first communication terminal transmits a first disclosure instruction to the server 100B for the range of the first status information of the first training job that the first user has selected for disclosure (S802 and S804), and transmits a second disclosure instruction for the range of the second status information of the second training job that the first user has selected for disclosure (S804). In other words, the server 100B individually and selectively discloses at least one of the first status information and the second status information to the second user (provides it to the communication terminal 300B) based on the disclosure instruction from the first user. For items of the first training job and the second training job for which a disclosure instruction has not been received, the corresponding status information is not disclosed to the second user.
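
The selection rule of GUI 160B can be sketched as a small filter. This is an illustrative sketch, not the implementation of the specification: the field names in `DETAIL_ITEMS` and the functions `items_to_publish` and `disclose` are hypothetical stand-ins for the acoustic model name, training sound waveform, completion forecast, and training executor items.

```python
# Hypothetical keys standing in for the four detailed items of a training job.
DETAIL_ITEMS = ("model_name", "training_waveforms", "completion_forecast", "executor")


def items_to_publish(job_selected: bool, selected_details: set) -> set:
    """Selecting the job item selects every detailed item;
    otherwise only the individually selected detailed items count."""
    if job_selected:
        return set(DETAIL_ITEMS)
    return set(selected_details) & set(DETAIL_ITEMS)


def disclose(status: dict, published: set) -> dict:
    """Return only the status fields covered by a disclosure instruction."""
    return {key: value for key, value in status.items() if key in published}


second_status = {
    "model_name": "model-2",
    "training_waveforms": ["c.wav"],
    "completion_forecast": "2h",
    "executor": "user-1",
}
# Mirrors FIG. 9: the job item itself is unselected, two detailed items are selected.
published = items_to_publish(False, {"model_name", "training_waveforms"})
visible = disclose(second_status, published)
```

Items without a disclosure instruction are simply absent from `visible`, which matches the rule that they are not disclosed to the second user.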

なお、Ｓ８１２においても上記と同様のＧＵＩが提供されるが、そのＧＵＩでは、第１訓練ジョブ１６２Ｂに関係する項目のみが表示される。 Note that in S812, a GUI similar to the above is provided, but in that GUI, only items related to the first training job 162B are displayed.

公開ボタン１７２Ｂは、訓練中の音響モデルに関する情報公開を指示するボタンである。図８のＳ８０４において、第１ユーザが公開ボタン１７２Ｂを押すことによって、第１訓練ジョブ及び第２訓練ジョブの状態情報のうち、ユーザによって選択された公開対象項目の公開指示が、通信端末２００Ｂからサーバ１００Ｂに送信され、その公開対象項目の状態情報が図７と同様の形式で第３者に公開される（ステップＳ８１６）。 The publish button 172B is a button that instructs the disclosure of information about the acoustic model being trained. In S804 of FIG. 8, when the first user presses the publish button 172B, an instruction to publish items to be published selected by the user from among the status information of the first training job and the second training job is sent from the communication terminal 200B to the server 100B, and the status information of the items to be published is published to third parties in a format similar to that of FIG. 7 (step S816).

以上のように、本実施形態に係る音響モデル訓練システム１０Ｂによると、第１ユーザは、自身が起動した複数の訓練ジョブを、第３者に対して個別に公開できる。また、第１ユーザは、訓練ジョブの詳細項目ごとに、公開する項目と公開しない項目とを自由に設定できる。 As described above, according to the acoustic model training system 10B of this embodiment, the first user can individually make multiple training jobs that the first user has started public to third parties. In addition, the first user can freely set which items to make public and which items not to make public for each detailed item of the training job.

［４．第４実施形態］ [4. Fourth embodiment]
図１０を用いて、第４実施形態に係る音響モデル訓練システム１０Ｃについて説明する。音響モデル訓練システム１０Ｃの全体構成及びサーバに関するブロック図は第１実施形態に係る音響モデル訓練システム１０と同じなので、説明を省略する。以下の説明において、第１実施形態と同じ構成については説明を省略し、主に第１実施形態と相違する点について説明する。以下の説明において、第１実施形態と同様の構成について説明をする場合、図１～図５を参照し、これらの図に示された符号の後にアルファベット“Ｃ”を付して説明する。 An acoustic model training system 10C according to the fourth embodiment will be described with reference to FIG. 10. The overall configuration of the acoustic model training system 10C and the block diagram relating to the server are the same as those of the acoustic model training system 10 according to the first embodiment, and therefore their description will be omitted. In the following description, description of the same configuration as in the first embodiment will be omitted, and differences from the first embodiment will be mainly described. In the following description, when describing the same configuration as in the first embodiment, reference will be made to FIGS. 1 to 5, and the alphabet "C" will be added after the reference numerals shown in these figures.

［４－１．音声合成方法］ [4-1. Voice synthesis method]
図１０は、本発明の一実施形態における音響モデルの訓練方法を示すフローチャートである。図１０に示す音響モデルの訓練方法では、ユーザにより課金に対する支払いが実行されたことを条件に、そのユーザが実行指示した訓練ジョブを実行する。図１０では、図４のＳ４０４の訓練ジョブ指示からＳ４１３の訓練ジョブ実行までの間に行われる動作について説明する。図１０のステップＳ１００１、Ｓ１００４は、それぞれ図４のＳ４０４、Ｓ４１３と同じである。 Fig. 10 is a flowchart showing a method for training an acoustic model in one embodiment of the present invention. In the method for training an acoustic model shown in Fig. 10, a training job instructed by a user is executed on the condition that the user has paid the charge. Fig. 10 explains the operations performed from the instruction of a training job in S404 in Fig. 4 to the execution of the training job in S413. Steps S1001 and S1004 in Fig. 10 are the same as S404 and S413 in Fig. 4, respectively.

図１０に示すように、Ｓ１００１で通信端末２００Ｃによって、訓練ジョブの実行指示（第１実行指示）がサーバ１００Ｃに送信される。続いて、その実行指示を受け取ったサーバ１００Ｃによって、訓練ジョブの実行を指示した第１ユーザに対する課金が実行され、通信端末２００Ｃに課金に係る情報が通知される（ステップＳ１００２）。当該通知の後に、サーバ１００Ｃによって、通信端末２００Ｃがサーバ１００Ｃの運営者に対してその課金の支払いを実行したか否かの判断が行われる（ステップＳ１００３）。通信端末２００Ｃがその支払いを実行すると（Ｓ１００３の「Ｙｅｓ」）、サーバ１００Ｃによって、選択された波形セットを用いて、その課金の範囲で、ベース音響モデルに対し、その実行指示された訓練ジョブが実行される（ステップＳ１００４）。一方、通信端末２００Ｃがその支払いを実行しないと（Ｓ１００３の「Ｎｏ」）、サーバ１００Ｃによる訓練ジョブは実行されず、通信端末２００Ｃに対してエラー（訓練ジョブの不実行）が通知される（ステップＳ１００５）。サーバ１００Ｃは、Ｓ１００２の課金処理を、サーバ１００Ｃの制御部が単位時間の訓練ジョブを行う（Ｓ１００４）ごとに実行し、第１ユーザからの支払いを得れば（Ｓ１００３）、訓練中の音響モデルに対して、次の単位時間の訓練ジョブを実行（Ｓ１００４）してもよい。 As shown in FIG. 10, in S1001, communication terminal 200C transmits an instruction to execute a training job (first execution instruction) to server 100C. Next, server 100C, which has received the execution instruction, charges the first user who instructed the execution of the training job, and notifies communication terminal 200C of information related to the charge (step S1002). After the notification, server 100C determines whether communication terminal 200C has paid the charge to the operator of server 100C (step S1003). When communication terminal 200C has made the payment ("Yes" in S1003), server 100C executes the instructed training job on the base acoustic model using the selected waveform set, within the scope of the charge (step S1004). On the other hand, if the communication terminal 200C does not make the payment ("No" in S1003), the training job is not executed by the server 100C, and an error (non-execution of the training job) is notified to the communication terminal 200C (step S1005). The server 100C executes the billing process of S1002 each time the control unit of the server 100C executes a training job for a unit time (S1004), and when payment is received from the first user (S1003), the server 100C may execute the training job for the next unit time for the acoustic model being trained (S1004).
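
The per-unit-time variant of the flow in FIG. 10 can be sketched as a loop that charges, checks payment, and only then trains one unit. This is a minimal sketch under stated assumptions: `run_paid_training` and the `pay` callback are hypothetical, and the sketch assumes that training simply stops and the S1005 error is reported when a payment is not made, which the text states explicitly only for the initial payment.

```python
from typing import Callable


def run_paid_training(total_units: int, pay: Callable[[int], bool]) -> dict:
    """Charge before each unit of training (S1002), check the payment (S1003),
    then train one unit (S1004) or report an error (S1005)."""
    completed = 0
    for unit in range(total_units):
        if not pay(unit):
            # Assumption: an unpaid unit ends the job with the S1005 error.
            return {"completed_units": completed, "error": "training job not executed"}
        completed += 1  # stands in for one unit time of training
    return {"completed_units": completed, "error": None}


# The hypothetical user pays for the first three units only.
result = run_paid_training(5, lambda unit: unit < 3)
```

Here `result["completed_units"]` is 3, so the amount of training performed tracks the amount paid.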

以上のように、本実施形態に係る音響モデル訓練システム１０Ｃによると、第１ユーザは、支払った分に見合う訓練ジョブを、サーバ１００Ｃに実行させることができる。 As described above, according to the acoustic model training system 10C of this embodiment, the first user can have the server 100C execute a training job that corresponds to the amount paid.

［５．第５実施形態］ [5. Fifth embodiment]
図１１～図１４を用いて、第５実施形態に係る音響モデル訓練システム１０Ｄについて説明する。音響モデル訓練システム１０Ｄの全体構成及びサーバに関するブロック図は第１実施形態に係る音響モデル訓練システム１０と同じなので、説明を省略する。以下の説明において、第１実施形態と同じ構成については説明を省略し、主に第１実施形態と相違する点について説明する。以下の説明において、第１実施形態と同様の構成について説明をする場合、図１～図５を参照し、これらの図に示された符号の後にアルファベット“Ｄ”を付して説明する。 An acoustic model training system 10D according to the fifth embodiment will be described with reference to Figures 11 to 14. The overall configuration of the acoustic model training system 10D and the block diagram relating to the server are the same as those of the acoustic model training system 10 according to the first embodiment, and therefore their description will be omitted. In the following description, description of the same configuration as in the first embodiment will be omitted, and differences from the first embodiment will be mainly described. In the following description, when describing the same configuration as in the first embodiment, reference will be made to Figures 1 to 5, and the alphabet "D" will be added after the reference numerals shown in these figures.

［５－１．音声合成方法］ [5-1. Voice synthesis method]
図１１は、本発明の一実施形態における音響モデルの訓練に用いる音波形の収録方法を示すシーケンス図である。図１１に示す収録方法では、例えばカラオケボックスなどの録音用空間で訓練用音波形の録音及びサーバへのアップロードを実行する構成について説明する。録音用空間は実空間である。以下の説明において、録音用空間としてレンタル空間を例示する。 Fig. 11 is a sequence diagram showing a method for recording a sound waveform used for training an acoustic model in one embodiment of the present invention. In the recording method shown in Fig. 11, a configuration is described in which a training sound waveform is recorded in a recording space such as a karaoke box and uploaded to a server. The recording space is a real space. In the following description, a rental space is exemplified as the recording space.

図１１に示すカラオケサーバ５００Ｄは、例えば、カラオケボックス及びカラオケブース等の貸出を統括するサーバ又はコンピュータである。カラオケサーバ５００Ｄは、例えば一店舗に備えられたカラオケボックス及びカラオケブースなどの複数のレンタル空間の何れか１つのレンタル空間を特定する空間ＩＤ、及び各レンタル空間が利用可能か否かを示す利用可能性を管理する。レンタル空間は、カラオケボックスなどの完全に閉じられた空間でもよいし、カラオケブースなどのように、一部が外部に開放された空間でもよい。各レンタル空間には、録音機能とカラオケサーバ５００Ｄとの通信機能とを備えたカラオケ機器が設置されている。カラオケサーバ５００Ｄは、ネットワーク４００Ｄに接続可能であり、ネットワーク４００Ｄを介してサーバ１００Ｄと通信できる。本実施形態において、サーバ１００Ｄは、カラオケサーバ５００Ｄに対するレンタル空間の利用予約業務を代行する。ただし、詳細は後述するが、この構成に限定されない。 The karaoke server 500D shown in FIG. 11 is, for example, a server or computer that manages the rental of karaoke boxes, karaoke booths, and the like. The karaoke server 500D manages a space ID that identifies one of multiple rental spaces, such as karaoke boxes and karaoke booths provided in one store, and availability information that indicates whether each rental space is available. The rental space may be a completely closed space, such as a karaoke box, or a space that is partially open to the outside, such as a karaoke booth. Each rental space is equipped with karaoke equipment that has a recording function and a function for communicating with the karaoke server 500D. The karaoke server 500D can be connected to the network 400D and can communicate with the server 100D via the network 400D. In this embodiment, the server 100D handles rental space reservations on behalf of the karaoke server 500D. However, as described in detail later, the configuration is not limited to this.

まず、通信端末２００Ｄは、サーバ１００Ｄが提供する音響モデル訓練サービスに対してログインをする（ステップＳ１１０１）。Ｓ１１０１において、通信端末２００Ｄは、サーバ１００Ｄに対して、当該サービスを利用する第１ユーザが入力したアカウント情報（例えば、ユーザＩＤとパスワード）を送信する。サーバ１００Ｄは、通信端末２００Ｄから受信したアカウント情報に基づいてユーザ認証を行い、第１ユーザのそのユーザＩＤのアカウントへのログインを承認する（ステップＳ１１１１）。ユーザ認証は、サーバ１００Ｄではなく、外部の認証サーバで行ってもよい。 First, the communication terminal 200D logs in to the acoustic model training service provided by the server 100D (step S1101). In S1101, the communication terminal 200D transmits to the server 100D account information (e.g., a user ID and a password) entered by a first user who uses the service. The server 100D performs user authentication based on the account information received from the communication terminal 200D, and approves the first user's login to the account with that user ID (step S1111). User authentication may be performed by an external authentication server instead of the server 100D.

通信端末２００Ｄは、Ｓ１１１１でログインしたユーザＩＤにおいて、当該サービスの利用を含む所望の日時における所望の空間ＩＤのレンタル空間の予約を要求する（ステップＳ１１０２）。サーバ１００Ｄは、Ｓ１１０２の予約要求を受けると、カラオケサーバ５００Ｄに対して当該日時における当該空間ＩＤのレンタル空間の利用状況又は空き状況を確認する（ステップＳ１１１２）。カラオケサーバ５００Ｄは、当該レンタル空間が利用可能であれば、予約を行い（ステップＳ１１２１）、当該日時における当該空間ＩＤのレンタル空間の予約が完了した旨の予約完了情報をサーバ１００Ｄに送信する。前記予約要求で、第１ユーザが前払いを指定している場合は、ステップＳ１１２１で、レンタル料と当該サービスの利用料の課金を行う。当該サービスの利用料は、レンタル空間の利用後に実行される、そこでの収録波形を用いた基本的な訓練ジョブの対価である。通信端末２００Ｄは、レンタル空間の予約要求をカラオケサーバに対して行い、その予約要求に応じて予約を行ったカラオケサーバ５００Ｄからサーバ１００Ｄに、その予約に係るユーザＩＤと空間ＩＤを含む予約完了情報を送信してもよい。 The communication terminal 200D requests the reservation of a rental space of a desired space ID at a desired date and time including the use of the service, using the user ID logged in at S1111 (step S1102). When the server 100D receives the reservation request of S1102, it checks the usage status or availability of the rental space of the space ID at the date and time with the karaoke server 500D (step S1112). If the rental space is available, the karaoke server 500D makes a reservation (step S1121) and transmits reservation completion information to the server 100D indicating that the reservation of the rental space of the space ID at the date and time has been completed. If the first user has specified prepayment in the reservation request, the rental fee and the service fee are charged in step S1121. The service fee is the price for a basic training job using the waveform recorded there, which is executed after the rental space is used. The communication terminal 200D may make a reservation request for a rental space to the karaoke server, and in response to the reservation request, the karaoke server 500D that made the reservation may transmit reservation completion information including the user ID and space ID related to the reservation to the server 100D.

サーバ１００Ｄは、カラオケサーバ５００Ｄから予約完了情報を受信すると（ステップＳ１１１３）、当該予約完了情報に係る空間ＩＤと第１ユーザのユーザＩＤとをリンクさせる（ステップＳ１１１４）。そして、予約が完了したことを通信端末２００Ｄに通知する（ステップＳ１１１５）。予約完了通知は、カラオケサーバ５００Ｄから通信端末２００Ｄに送られてもよい。 When the server 100D receives reservation completion information from the karaoke server 500D (step S1113), it links the space ID related to the reservation completion information with the user ID of the first user (step S1114). Then, it notifies the communication terminal 200D that the reservation has been completed (step S1115). The reservation completion notification may be sent from the karaoke server 500D to the communication terminal 200D.

通信端末２００Ｄが予約完了通知を受けると、通信端末２００Ｄは第１ユーザに対して、予約が完了したこと、並びに、予約されたレンタル空間及び日時を特定する情報を表示する。上記のレンタル空間を特定する情報は、例えば空間ＩＤで特定されるカラオケボックスの部屋番号である。第１ユーザが、予約した日時に、予約されたレンタル空間に移動し、レンタル空間に備えられたカラオケ機器を操作して所望の楽曲を選択することで、その楽曲の伴奏が当該レンタル空間で再生される。第１ユーザはカラオケ機器を用いて録音開始指示及び録音終了指示を実行する。これらの指示に伴い、カラオケサーバ５００Ｄでは、第１ユーザの歌声又は楽器の演奏音を録音する（ステップＳ１１２２）。 When the communication terminal 200D receives the reservation completion notification, the communication terminal 200D displays to the first user that the reservation has been completed, as well as information identifying the reserved rental space and the date and time. The information identifying the rental space is, for example, the room number of the karaoke box identified by the space ID. When the first user goes to the reserved rental space at the reserved date and time and operates the karaoke equipment provided there to select a desired song, the accompaniment to the song is played in the rental space. The first user issues a recording start instruction and a recording end instruction using the karaoke equipment. In response to these instructions, the karaoke server 500D records the first user's singing voice or the sound of an instrument being played (step S1122).

レンタル空間の利用時間が終了したとき（録音完了）、カラオケサーバ５００Ｄ（レンタル業者）は、レンタル空間と訓練ジョブの利用料が先払いされていなければ、その利用料を第１ユーザに課金し、第１ユーザは、カラオケサーバ５００Ｄの端末にて、その利用料金を支払う。レンタル料金とセットなので、訓練ジョブの利用料は、その分だけＳ１００２での課金よりディスカウントしてもよい。第１ユーザは、録音完了した音波形（波形データ）から、サーバ１００Ｄにアップロードする音波形を選択し、さらに、訓練ジョブの利用料が支払われた場合、アップロードする音波形の中からその訓練ジョブに使用する波形セットを選択する。カラオケサーバ５００Ｄは、選択された音波形及び録音が行われた空間ＩＤをサーバ１００Ｄの第１ユーザのユーザＩＤで特定される、第１ユーザの記憶領域にアップロードする（ステップＳ１１２３）。 When the usage time of the rental space ends (recording is completed), the karaoke server 500D (rental company) charges the first user the usage fee for the rental space and the training job if it has not been paid in advance, and the first user pays the fee at the terminal of the karaoke server 500D. Because it is bundled with the rental fee, the training job fee may be discounted accordingly relative to the charge in S1002. The first user selects, from the recorded sound waveforms (waveform data), the sound waveforms to be uploaded to the server 100D and, if the training job fee has been paid, further selects the waveform set to be used for the training job from among the sound waveforms to be uploaded. The karaoke server 500D uploads the selected sound waveforms and the space ID of the space where the recording was performed to the storage area of the first user on the server 100D, which is identified by the user ID of the first user (step S1123).

サーバ１００Ｄは、アップロードされた音波形及び空間ＩＤを、第１ユーザの記憶領域に互いにリンクさせて記憶する（ステップＳ１１１６）。サーバ１００Ｄにアップロードされ、記憶される音波形は１つであってもよく、複数であってもよい。 The server 100D stores the uploaded sound waveform and space ID in the memory area of the first user by linking them to each other (step S1116). The sound waveforms uploaded and stored in the server 100D may be one or multiple.

Ｓ１１１４で、空間ＩＤと第１ユーザのユーザＩＤとがリンクし、Ｓ１１１６で、アップロードされた音波形と空間ＩＤとがリンクする。したがって、サーバ１００Ｄは、図１２に示すように、第１ユーザのユーザＩＤ１８０Ｄ、空間ＩＤ１８１Ｄ、及びアップロードされた音波形１８２Ｄをリンクして記憶する。図１２は、本発明の一実施形態において、サーバによって管理されるデータの例である。ユーザＩＤ１８０Ｄは、図１１のＳ１１１１でログインしたアカウントのユーザＩＤであり、後述する図１３の各データは、ユーザＩＤに対応した記憶領域に記憶される。空間ＩＤ１８１Ｄは、図１１のＳ１１２２で録音が行われた空間の空間ＩＤである。音波形１８２Ｄは、図１１のＳ１１２２で録音され、Ｓ１１２３でサーバ１００Ｄに送信された音波形である。 In S1114, the space ID and the user ID of the first user are linked, and in S1116, the uploaded sound waveform and the space ID are linked. Therefore, as shown in FIG. 12, the server 100D links and stores the user ID 180D of the first user, the space ID 181D, and the uploaded sound waveform 182D. FIG. 12 is an example of data managed by a server in one embodiment of the present invention. The user ID 180D is the user ID of the account logged in in S1111 of FIG. 11, and each data in FIG. 13 described later is stored in a storage area corresponding to the user ID. The space ID 181D is the space ID of the space where the recording was made in S1122 of FIG. 11. The sound waveform 182D is the sound waveform recorded in S1122 of FIG. 11 and sent to the server 100D in S1123.
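
The linkage of FIG. 12 can be sketched as a per-user storage area that holds reserved space IDs and the waveforms recorded in each space. This is only an illustrative data layout: the class names `UserStorage` and `Server` and their methods are hypothetical, and the specification does not prescribe how the links are stored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserStorage:
    user_id: str
    space_ids: set = field(default_factory=set)
    waveforms_by_space: dict = field(default_factory=dict)


class Server:
    def __init__(self) -> None:
        self.storage = {}  # user ID -> UserStorage

    def link_reservation(self, user_id: str, space_id: str) -> None:
        """S1114: link the reserved space ID with the user ID."""
        area = self.storage.setdefault(user_id, UserStorage(user_id))
        area.space_ids.add(space_id)

    def store_upload(self, user_id: str, space_id: str, waveform: str) -> None:
        """S1116: store the uploaded waveform linked to its space ID."""
        area = self.storage[user_id]
        if space_id not in area.space_ids:
            raise KeyError(f"space {space_id} is not linked to user {user_id}")
        area.waveforms_by_space.setdefault(space_id, []).append(waveform)

    def owner_of(self, waveform: str) -> Optional[str]:
        """S1117: identify the user ID from the storage area holding the waveform."""
        for area in self.storage.values():
            for waveforms in area.waveforms_by_space.values():
                if waveform in waveforms:
                    return area.user_id
        return None


server = Server()
server.link_reservation("user-1", "room-12")
server.store_upload("user-1", "room-12", "take1.wav")
owner = server.owner_of("take1.wav")
```

Because each waveform lives inside one user's storage area, the uploader can be recovered from the storage location alone, as in S1117.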

サーバ１００Ｄは、Ｓ１１２３で音波形がアップロードされた記憶領域から、当該音波形をアップロードした第１ユーザのユーザＩＤを特定する（ステップＳ１１１７）。その後、第１ユーザからの指示に基づいて、サーバ１００Ｄは、新たな音源ＩＤとアップロードされた音波形を用いて、ベース音響モデルの訓練ジョブを実行する（ステップＳ１１１８）。 The server 100D identifies the user ID of the first user who uploaded the sound waveform from the storage area where the sound waveform was uploaded in S1123 (step S1117). After that, based on an instruction from the first user, the server 100D executes a training job for the base acoustic model using the new sound source ID and the uploaded sound waveform (step S1118).

ここで、Ｓ１１２３でカラオケサーバ５００Ｄからサーバ１００Ｄにアップロードされるデータについて図１３を用いて説明する。図１１の説明では、Ｓ１１２３で第１ユーザの歌声又は演奏音を示す音波形だけがサーバ１００Ｄにアップロードされる構成を例示したが、この構成に限定されない。例えば、歌声の場合、図１３に示すように、カラオケ機器によってレンタル空間に供給される楽曲のガイドメロディを構成する音を示す音高データ５０３Ｄ及び楽曲の歌詞を示すテキストデータ５０２Ｄが、当該音波形５０１Ｄとともにサーバ１００Ｄにアップロードされてもよい。演奏音の場合は、テキストデータ５０２Ｄはアップロードされない。 The data uploaded from the karaoke server 500D to the server 100D in S1123 will now be described with reference to FIG. 13. In the description of FIG. 11, a configuration has been exemplified in which only the sound waveform representing the first user's singing voice or performance sound is uploaded to the server 100D in S1123, but this configuration is not limiting. For example, in the case of singing voice, as shown in FIG. 13, pitch data 503D indicating the sounds constituting the guide melody of the song supplied to the rental space by the karaoke device and text data 502D indicating the lyrics of the song may be uploaded to the server 100D together with the sound waveform 501D. In the case of performance sound, the text data 502D is not uploaded.
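
The upload contents of FIG. 13 can be sketched as a record whose lyrics field is filled only for singing. This is a hypothetical layout: `UploadPayload` and `build_payload` are illustrative names, and the byte and list encodings of the waveform and pitch data are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UploadPayload:
    waveform: bytes                    # sound waveform 501D
    pitch_data: Optional[list] = None  # pitch data 503D (guide melody notes)
    lyrics_text: Optional[str] = None  # text data 502D (lyrics)


def build_payload(waveform: bytes, kind: str,
                  pitch_data: Optional[list] = None,
                  lyrics_text: Optional[str] = None) -> UploadPayload:
    """Build the upload record; lyrics are dropped for performance sound."""
    if kind not in ("singing", "performance"):
        raise ValueError(f"unknown recording kind: {kind}")
    if kind == "performance":
        lyrics_text = None  # text data is not uploaded for performance sound
    return UploadPayload(waveform, pitch_data, lyrics_text)


sung = build_payload(b"\x00\x01", "singing", [60, 62, 64], "la la la")
played = build_payload(b"\x00\x01", "performance", [60, 62, 64], "la la la")
```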

カラオケサーバ５００Ｄが、Ｓ１１２２で録音されたデータを、Ｓ１１２３でサーバ１００Ｄにアップロードするステップについて図１４を用いて説明する。図１１の説明では、Ｓ１１２２で録音された音波形が、特段のステップを経ることなく、Ｓ１１２３でサーバ１００Ｄにアップロードされる構成を例示したが、この構成に限定されない。例えば、図１４に示すように、録音された音波形に係る音声データを再生したうえで、第１ユーザが、その音波形のアップロードの要否を判断してもよい。図１４の例では、カラオケ機器又は通信端末２００Ｄを使用して、第１ユーザに対して、録音された音波形の再生要否、当該音波形のアップロード要否、再録音の要否、及び動作終了の要否を問い合わせる。これらの４つの問い合わせは、一つのＧＵＩで順番に表示されてもよく、再生ボタン、アップロードボタン、再録音ボタン、及び終了ボタンとして並列にＧＵＩ上に表示されてもよい。 The step in which the karaoke server 500D uploads the data recorded in S1122 to the server 100D in S1123 will be described with reference to FIG. 14. In the description of FIG. 11, a configuration in which the sound waveform recorded in S1122 is uploaded to the server 100D in S1123 without any intervening step has been exemplified, but the configuration is not limited to this. For example, as shown in FIG. 14, the first user may decide whether or not to upload the sound waveform after the audio data of the recorded sound waveform has been played back. In the example of FIG. 14, the karaoke equipment or the communication terminal 200D is used to ask the first user whether to play back the recorded sound waveform, whether to upload the sound waveform, whether to re-record, and whether to end the operation. These four inquiries may be displayed in sequence on one GUI, or may be displayed side by side on the GUI as a play button, an upload button, a re-record button, and an end button.

図１１のＳ１１２２で音声データの録音が完了した後に、図１４に示すように、カラオケサーバ５００Ｄは、第１ユーザによる再生指示の有無を判断する（ステップＳ１４０１）。Ｓ１４０１で再生指示があった場合（Ｓ１４０１の「Ｙｅｓ」）、カラオケサーバ５００Ｄは、カラオケ機器を使用して、図１１のＳ１１２２で録音された音声データを、録音が行われたレンタル空間で再生する（ステップＳ１４０２）。当該再生の際に、当該音声データのみを再生してもよく、当該音声データをガイドメロディとともに再生してもよい。Ｓ１４０２で再生が行われた後、再びＳ１４０１のステップに戻る。Ｓ１４０１で再生指示がない場合（Ｓ１４０１の「Ｎｏ」）、Ｓ１４０２の再生を実行せずに次のステップに進む。 After the recording of the audio data is completed in S1122 of FIG. 11, as shown in FIG. 14, the karaoke server 500D judges whether or not a playback instruction has been issued by the first user (step S1401). If a playback instruction has been issued in S1401 ("Yes" in S1401), the karaoke server 500D uses the karaoke equipment to play back the audio data recorded in S1122 of FIG. 11 in the rental space where the recording was made (step S1402). During the playback, the audio data alone may be played back, or the audio data may be played back together with a guide melody. After playback is performed in S1402, the process returns to step S1401. If no playback instruction has been issued in S1401 ("No" in S1401), the process proceeds to the next step without executing playback in S1402.

続いて、図１１のＳ１１２２で録音された音声データについて、アップロードの要否が判断される（ステップＳ１４０３）。例えば、カラオケサーバ５００Ｄは、第１ユーザに対して、録音された音声データをアップロードするか否かを選択するＧＵＩを提供し、第１ユーザによる選択に従ってアップロードの要否を判断する。 Next, a determination is made as to whether or not uploading is required for the voice data recorded in S1122 of FIG. 11 (step S1403). For example, the karaoke server 500D provides the first user with a GUI for selecting whether or not to upload the recorded voice data, and determines whether or not uploading is required based on the selection made by the first user.

Ｓ１４０３でアップロードが必要であると判断された場合（Ｓ１４０３の「Ｙｅｓ」）、図１１のＳ１１２３のアップロードが実行され、上記の動作が終了する。一方、Ｓ１４０３でアップロードを実行する指示がない場合（Ｓ１４０３の「Ｎｏ」）、再録音の要否が判断される（ステップＳ１４０４）。例えば、カラオケサーバ５００Ｄは、第１ユーザに対して、再録音を行うか否かを選択するＧＵＩを提供し、第１ユーザによる選択に従って再録音の要否を判断する。 If it is determined in S1403 that uploading is necessary ("Yes" in S1403), the upload in S1123 of FIG. 11 is executed, and the above operation ends. On the other hand, if there is no instruction to execute uploading in S1403 ("No" in S1403), a determination is made as to whether or not re-recording is necessary (step S1404). For example, the karaoke server 500D provides the first user with a GUI to select whether or not to re-record, and determines whether or not re-recording is necessary according to the selection made by the first user.

Ｓ１４０４で再録音が必要であると判断された場合（Ｓ１４０４の「Ｙｅｓ」）、カラオケサーバ５００Ｄは、図１１のＳ１１２２と同様の方法で再録音を行う（ステップＳ１４０５）。Ｓ１４０５の再録音が終了すると、再度、Ｓ１４０１で再生指示の有無が判断される。Ｓ１４０４で再録音を開始する指示がない場合（Ｓ１４０４の「Ｎｏ」）、動作終了の可否が判断される（ステップＳ１４０６）。Ｓ１４０６で動作を終了していいと判断された場合（Ｓ１４０６の「Ｙｅｓ」）、上記の動作が終了する。一方、Ｓ１４０６で動作終了の指示がない場合（Ｓ１４０６の「Ｎｏ」）、Ｓ１４０１のステップに戻る。Ｓ１４０１における再生指示、Ｓ１４０３におけるアップロード実行指示、Ｓ１４０４における再録音の開始指示、及びＳ１４０６の終了指示がない場合、カラオケサーバ５００Ｄは、これらの判断ステップを繰り返し実行する。 If it is determined in S1404 that re-recording is necessary ("Yes" in S1404), the karaoke server 500D re-records in the same manner as S1122 in FIG. 11 (step S1405). When the re-recording in S1405 is finished, the presence or absence of a playback instruction is determined again in S1401. If there is no instruction to start re-recording in S1404 ("No" in S1404), it is determined whether or not to end the operation (step S1406). If it is determined in S1406 that the operation can be ended ("Yes" in S1406), the above operation ends. On the other hand, if there is no instruction to end the operation in S1406 ("No" in S1406), the process returns to step S1401. If there is no instruction to play in S1401, an instruction to perform uploading in S1403, an instruction to start re-recording in S1404, or an instruction to end in S1406, the karaoke server 500D repeatedly executes these determination steps.
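
The decision loop of FIG. 14 can be sketched as follows. This is a simplified illustration, not the flowchart itself: it handles one user instruction per pass instead of checking S1401, S1403, S1404, and S1406 in turn, and the function name `review_recording` and the instruction strings are hypothetical.

```python
def review_recording(instructions) -> list:
    """Walk a simplified FIG. 14 loop over a sequence of user instructions.

    Each instruction is "play", "upload", "rerecord", "end", or None
    (None means no instruction was given on that pass).
    """
    log = []
    for instruction in instructions:
        if instruction == "play":        # S1401 "Yes" -> S1402
            log.append("played")
        elif instruction == "upload":    # S1403 "Yes" -> upload of S1123, then finish
            log.append("uploaded")
            return log
        elif instruction == "rerecord":  # S1404 "Yes" -> S1405, back to S1401
            log.append("rerecorded")
        elif instruction == "end":       # S1406 "Yes" -> finish without uploading
            log.append("ended")
            return log
        # None: no instruction, so the determination steps repeat
    return log


history = review_recording(["play", None, "rerecord", "play", "upload", "play"])
```

Uploading and ending are the only two exits, so any instruction that arrives after an upload, such as the final "play" above, is never processed.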

本実施形態では、サーバ１００Ｄが、カラオケサーバ５００Ｄに対するレンタル空間の利用予約業務を代行する構成を例示したが、この構成に限定されない。例えば、カラオケサーバ５００Ｄが、レンタル空間の利用予約業務を行ってもよい。その場合、サーバ１００Ｄとカラオケサーバ５００Ｄは、第１ユーザの第１アカウント情報を共有する。さらに、サーバ１００Ｄは、カラオケサーバ５００Ｄから受信した空間ＩＤと音波形を、第１ユーザのユーザＩＤ（第１アカウント情報）にリンクして記憶する。その後のステップは、図１１のＳ１１２２以降と同様である。 In this embodiment, the server 100D performs the reservation service for the rental space on behalf of the karaoke server 500D, but the present invention is not limited to this configuration. For example, the karaoke server 500D may perform the reservation service for the rental space. In this case, the server 100D and the karaoke server 500D share the first account information of the first user. Furthermore, the server 100D links the space ID and sound waveform received from the karaoke server 500D to the user ID (first account information) of the first user and stores them. Subsequent steps are the same as those from S1122 onwards in FIG. 11.

図１１のＳ１１２２における録音開始指示及び録音終了指示は、楽曲の開始及び終了によって実行されてもよく、第１ユーザの任意の操作によって実行されてもよい。つまり、第１ユーザの録音指示に基づいて、楽曲の再生期間のうち指定された期間の音声データのみを収録してもよい。録音開始指示及び録音終了指示は、カラオケ機器を用いて実行されてもよく、通信端末２００Ｄを用いて実行されてもよい。つまり、Ｓ１１２２の録音は、楽曲の再生期間の少なくとも一部の期間だけ実行されてもよい。上記の構成を換言すると、図１３に示すように、サーバ１００Ｄは、レンタル空間において提供された、楽曲の第１ユーザが歌唱ないし演奏するパートの音を示す音高データ５０３Ｄや楽曲の歌詞を示すテキストデータ５０２Ｄを、楽曲の再生期間の少なくとも一部の期間における歌唱が収録された音声データである音波形５０１Ｄとともに、カラオケサーバ５００Ｄから受信してもよい。そして、サーバ１００Ｄは、当該歌唱ないし演奏音の音波形５０１Ｄを訓練用音波形として、楽譜データとリンクして記憶する。 The recording start instruction and recording end instruction in S1122 in FIG. 11 may be executed by the start and end of the song, or by any operation of the first user. In other words, only audio data for a specified period of the song's playback period may be recorded based on the recording instruction of the first user. The recording start instruction and recording end instruction may be executed using a karaoke machine or may be executed using the communication terminal 200D. In other words, the recording in S1122 may be executed only for at least a part of the song's playback period. In other words, as shown in FIG. 13, the server 100D may receive from the karaoke server 500D pitch data 503D indicating the sound of the part of the song sung or played by the first user provided in the rental space and text data 502D indicating the lyrics of the song, together with sound waveforms 501D that are audio data recording singing during at least a part of the song's playback period. The server 100D then stores the sound waveforms 501D of the singing or playing sounds as training sound waveforms, linked to the sheet music data.

以上のように、本実施形態に係る音響モデル訓練システム１０Ｄによると、カラオケボックス等を利用して音声データを録音し、サーバ１００Ｄにアップロードできるため、第１ユーザが音声データを録音するための環境を準備する労力を軽減できる。 As described above, according to the acoustic model training system 10D of this embodiment, voice data can be recorded using a karaoke box or the like and uploaded to the server 100D, which reduces the effort required for the first user to prepare an environment for recording voice data.

[6. Sixth embodiment]
An acoustic model training system 10E according to the sixth embodiment will be described with reference to FIG. 15. The overall configuration of the acoustic model training system 10E and the block diagram of the server are the same as those of the acoustic model training system 10 according to the first embodiment, so their description is omitted. In the following, configurations identical to the first embodiment are not described again, and the description focuses on the differences from the first embodiment. Where a configuration similar to the first embodiment is described, reference is made to FIGS. 1 to 5, with the letter "E" appended to the reference numerals shown in those figures.

[6-1. Voice synthesis method]
FIG. 15 is a flowchart showing a method, in one embodiment of the present invention, for recommending songs suitable for training a target acoustic model. The recommendation method shown in FIG. 15 recommends songs to the first user based on all or part of the sound waveforms stored in advance in the server 100E as training sound waveforms, or on a waveform set selected by the user. The communication terminal 100E has received from the first user, in advance, information indicating the range over which the first user expects to use the acoustic model in terms of pitch or acoustic features.

First, the server 100E analyzes the training sound waveforms stored in advance or the selected waveform set (step S1501). The training sound waveforms to be analyzed are not all of the stored training sound waveforms but the subset belonging to a specific sound source (a specific singer or a specific instrument). For example, folders for each singer or each instrument may be provided in the first user's storage area on the server 100E, the training sound waveforms may be stored in the folder of the corresponding singer or instrument, and the analysis may be performed separately on the sound waveforms stored in each folder. The waveform set is a set of sound waveforms of a specific singer or a specific instrument that the first user has selected in order to train the acoustic model of that singer or instrument. The analysis is performed, for example, based on the pitch or acoustic features of the sound waveforms. Furthermore, if the song corresponding to an analyzed sound waveform is known, singing skill or playing skill can be determined with respect to pitch, timbre, dynamics, and the like by comparing the sound waveform with the score data of the singing or performance of that song. Alternatively, the analysis can determine the singing style, playing style, singing range, or playing range.

A singing style is a way of singing, and a playing style is a way of playing. Specifically, singing styles include neutral, vibrato, husky, fry, and growl. Playing styles include neutral, vibrato, pizzicato, spiccato, flageolet, and tremolo for bowed string instruments, and neutral, position, legato, slide, and slap/mute for plucked string instruments. For a clarinet, examples include neutral, staccato, vibrato, and trill. Here, "vibrato", for example, means a singing style or playing style that makes heavy use of vibrato. The pitch, volume, and timbre of singing or playing, and their dynamic behavior, change as a whole depending on the style. In a training job, the server 100E may train the base acoustic model 120E using as input, in addition to a new timbre ID and the waveform set, the singing style or playing style obtained by analyzing that waveform set.

The singing range and playing range of the training sound waveforms are determined from the distribution of pitches across multiple sound waveforms of a specific singer's singing or a specific instrument's playing, and indicate the pitch range covered by the sound waveforms of that singer or instrument.
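As an illustration of deriving a range from a pitch distribution, the following sketch counts frames per pitch across several waveforms of one sound source and reports the lowest and highest pitches that are sufficiently represented. The function names, the use of MIDI note numbers per analysis frame, the `None` marker for unvoiced frames, and the `min_frames` threshold are assumptions for illustration; the source does not specify how the distribution is evaluated.

```python
from collections import Counter


def pitch_histogram(waveform_pitches):
    """Count frames per MIDI note across several waveforms of one sound source."""
    counts = Counter()
    for pitches in waveform_pitches:
        # None marks frames with no detectable pitch (silence or unvoiced sound).
        counts.update(p for p in pitches if p is not None)
    return counts


def estimate_range(counts, min_frames=1):
    """Return (lowest, highest) note with at least `min_frames` frames, or None."""
    supported = [note for note, n in counts.items() if n >= min_frames]
    if not supported:
        return None
    return min(supported), max(supported)
```

Raising `min_frames` makes the estimate ignore pitches that appear only briefly, which may be preferable when isolated pitch-tracking errors are expected.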

If, for the timbre of a specific sound source, the pitch data and acoustic features do not cover the planned range of use, the server 100E determines that the prepared training sound waveforms are insufficient to train the acoustic model. Through the analysis in S1501, the server 100E detects, within the full pitch range over which the timbre of the specific sound source is planned to be used, ranges that have few or no sound waveforms. The server 100E then identifies one or more songs to recommend to the first user in order to fill the ranges with insufficient data (step S1502). The server 100E provides information indicating the songs identified in S1502 to the communication terminal 200E (the first user), and the communication terminal 200E shows the received information on its display.
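Steps S1501 and S1502 can be sketched as a coverage check followed by a song selection. This is only one possible realization under stated assumptions: pitches are MIDI note numbers, `counts` maps each note to the amount of training data available for it, `catalog` is a hypothetical mapping from song title to the set of notes that song contains, and the greedy selection is a technique chosen for this sketch rather than the method described in the source.

```python
def find_sparse_notes(counts, planned_low, planned_high, min_frames):
    """Notes in the planned range of use that have too little training data."""
    return [
        note
        for note in range(planned_low, planned_high + 1)
        if counts.get(note, 0) < min_frames
    ]


def recommend_songs(sparse_notes, catalog, max_songs=3):
    """Greedily pick songs that cover the most still-missing notes."""
    remaining = set(sparse_notes)
    chosen = []
    while remaining and len(chosen) < max_songs:
        candidates = [title for title in catalog if title not in chosen]
        best = max(candidates, key=lambda t: len(remaining & catalog[t]), default=None)
        if best is None or not remaining & catalog[best]:
            break  # no remaining song adds coverage
        chosen.append(best)
        remaining -= catalog[best]
    return chosen


# Example: notes 62 to 65 are under-represented within the planned range 60 to 65.
counts = {60: 50, 61: 40, 62: 2}
catalog = {"A": {60, 61}, "B": {62, 63}, "C": {63, 64, 65}}
sparse = find_sparse_notes(counts, 60, 65, min_frames=10)
songs = recommend_songs(sparse, catalog)
```

In this example `sparse` is `[62, 63, 64, 65]` and `songs` is `["C", "B"]`: song C fills three of the missing notes and song B fills the remaining one, while song A is never recommended because its notes are already covered.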

As described above, with the acoustic model training system 10E of this embodiment, the first user is notified when the sound waveforms prepared as training sound waveforms do not cover the planned range of use, so the first user can prepare training sound waveforms that do cover it.

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from its spirit. The embodiments may be combined with one another as long as no technical contradiction arises.

10: Acoustic model training system, 100: Server, 101: Control unit, 102: RAM, 103: ROM, 104: User interface, 105: Communication interface, 110: Storage, 111: Voice synthesis program, 112: Training job, 113: Score data, 114: Sound waveform, 120: Acoustic model, 130: Synthesized sound waveform, 140: GUI, 141, 142, 143: Checkbox, 144: Execute button, 150A: GUI, 151A, 152A: Items indicating progress, 153A: Acoustic model name, 154A: Training sound waveform, 155A: Estimated completion, 156A: Training executor, 157A: Preview button, 160B: GUI, 161B: Disclosure setting item, 162B: First training job, 163B, 168B: Acoustic model name, 164B, 169B: Training sound waveform, 165B, 170B: Estimated completion, 166B, 171B: Training executor, 167B: Second training job, 172B: Publish button, 180D: Account information, 182D: Sound waveform, 200, 300: Communication terminal, 400: Network, 411: Step, 500D: Karaoke server, 501D: Sound waveform, 502D: Text data, 503D: Pitch data

Claims

An acoustic model training system comprising:
a first device connectable to a network and used by a first user; and
a server connectable to the network,
wherein the first device, under control of the first user,
uploads a plurality of sound waveforms to the server,
selects one or more sound waveforms as a first waveform set from the plurality of sound waveforms already uploaded or to be uploaded, and
transmits to the server a first execution instruction for a first training job for an acoustic model that generates acoustic features, and
wherein the server, based on the first execution instruction from the first device,
commences execution of the first training job using the selected first waveform set, and
provides a trained acoustic model trained by the first training job to the first device.

A method for training an acoustic model, implemented by one or more computers, the method comprising providing a first user with an interface for selecting, from a plurality of pre-stored sound waveforms, one or more sound waveforms to be used in a first training job for an acoustic model that generates acoustic features.

The training method of claim 2, further comprising:
receiving, as a first waveform set, one or more sound waveforms selected by the first user using the interface;
commencing execution of the first training job using the first waveform set based on a first execution instruction by the first user via the interface; and
providing the acoustic model trained by the first training job to the first user as a first acoustic model.

The training method according to claim 3, further comprising providing first status information indicating a status of the first training job to a second user different from the first user based on a first disclosure instruction by the first user.

The training method of claim 4, further comprising:
displaying the first status information on a first device used by the first user; and
displaying the first status information on a second device used by the second user based on the first disclosure instruction.

The training method of claim 4, wherein the state of the first training job changes over time, and
the first status information displayed on a second device used by the second user is updated repeatedly.

The training method according to claim 4, wherein the progress of the state of the first training job is displayed as the first status information.

The training method according to claim 4, wherein, based on a disclosure request by the second user, the first status information as of the time of the disclosure request is displayed on a second device used by the second user.

The training method of claim 3, further comprising:
receiving, as a second waveform set, one or more sound waveforms newly selected by the first user using the interface; and
commencing execution of a second training job using the second waveform set based on a second execution instruction by the first user,
wherein the first training job and the second training job are executed in parallel.

The training method according to claim 9, further comprising providing at least one of first status information regarding the first training job and second status information regarding the second training job to a second device of a second user different from the first user, based on a disclosure instruction by the first user.

The training method of claim 2, further comprising:
billing the first user in response to the first execution instruction by the first user; and
commencing execution of the first training job when payment of the billed amount is confirmed.

The training method according to claim 2, further comprising:
receiving a space ID that identifies a real space; and
linking the space ID with account information of the first user for a service that provides the training method.

The training method according to claim 12, further comprising: charging the first user who has the account information linked to the space ID.

The training method according to claim 12, further comprising:
receiving score data representing the notes constituting a song played back in the real space, together with audio data recording singing or playing during at least part of the playback period of the song; and
storing the audio data, linked to the score data, as one of the pre-stored sound waveforms.

The training method according to claim 14, wherein the audio data is recorded for a specified period of the playback period based on a recording instruction from the first user.

The training method according to claim 14, further comprising:
playing back the audio data in the real space based on a playback instruction from the first user; and
asking the first user whether to save the audio data played back in response to the playback instruction as one of the plurality of pre-stored sound waveforms provided to the first user.

The training method of claim 2, further comprising:
analyzing the pre-stored sound waveforms;
identifying a song to recommend to the first user based on an analysis result obtained by the analysis; and
providing the first user with information indicating the identified song.

The training method according to claim 17, wherein the analysis results indicate at least one of singing style, playing style, singing range, and playing range.

The training method according to claim 17, wherein the analysis results indicate playing skills.