JP7424359B2

JP7424359B2 - Information processing device, singing voice output method, and program

Info

Publication number: JP7424359B2
Application number: JP2021183657A
Authority: JP
Inventors: 大樹倉光; 頌子奈良; 強宮木; 浩雅椎原; 健一山内; 晋山中
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-06-14
Filing date: 2021-11-10
Publication date: 2024-01-30
Anticipated expiration: 2037-06-14
Also published as: JP2022017561A; JP2019003000A; WO2018230670A1; JP6977323B2

Description

The present invention relates to a technique for responding to user input with voice that includes singing.

Techniques for outputting music in response to user instructions are known. For example, Patent Document 1 discloses a technique for changing the mood of a piece of music according to the user's situation and preferences. Patent Document 2 discloses a technique for making distinctive music selections that the user does not tire of, in a device that outputs musical tones according to the state of a moving body.

Patent Document 1: Japanese Patent Application Publication No. 2006-85045
Patent Document 2: Japanese Patent No. 4496993

Neither Patent Document 1 nor Patent Document 2 outputs singing voice in response to interaction with a user.
In contrast, the present invention provides a technique for outputting singing voice in response to interaction with a user.

The present invention provides a method for outputting singing voice, the method comprising: a step of identifying a first partial content from among a plurality of partial contents obtained by decomposing a character string included in content; a step of outputting a singing voice synthesized using a character string included in the first partial content; a step of receiving a user's reaction to the singing voice; and a step of outputting, according to the reaction, a singing voice synthesized using a character string included in a second partial content that follows the first partial content.
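The four steps of this method can be pictured as a simple loop. The following Python sketch is illustrative only and is not taken from the specification: the line-based decomposition, the `synthesize` and `get_reaction` callbacks, and the reaction label "continue" are all assumptions made for the example.

```python
from typing import Callable, List


def split_into_partial_contents(content: str) -> List[str]:
    """Decompose a content string into partial contents (assumed here: non-empty lines)."""
    return [line.strip() for line in content.splitlines() if line.strip()]


def sing_interactively(
    content: str,
    synthesize: Callable[[str], str],
    get_reaction: Callable[[], str],
) -> List[str]:
    """Sing partial contents in order, moving on only when the user's reaction allows it."""
    parts = split_into_partial_contents(content)
    outputs: List[str] = []
    for index, part in enumerate(parts):
        # Output the singing voice synthesized from this partial content.
        outputs.append(synthesize(part))
        if index == len(parts) - 1:
            break  # no following partial content remains
        # Receive the user's reaction; "continue" is a hypothetical label.
        if get_reaction() != "continue":
            break
    return outputs
```

A real system would replace `synthesize` with a singing synthesis engine and `get_reaction` with analysis of the user's voice; here both are plain callables so that the control flow can be exercised in isolation.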

This singing voice output method may include a step of determining, according to the reaction, an element to be used for singing synthesis using the character string included in the second partial content.

The element may include a parameter, melody, or tempo of the singing synthesis, or an arrangement of the accompaniment in the singing voice.

The singing voice may be synthesized using segments recorded in at least one database selected from among a plurality of databases, and this singing voice output method may include a step of selecting, according to the reaction, the database to be used for singing synthesis using the character string included in the second partial content.

The singing voice may be synthesized using segments recorded in a plurality of databases selected from among the plurality of databases. In this case, a plurality of databases are selected in the step of selecting the database, and this singing voice output method may include a step of determining the usage ratio of the selected databases according to the reaction.

This singing voice output method may include a step of replacing part of the character string included in the first partial content with another character string, and in the step of outputting the singing voice, a singing voice synthesized using the character string of the first partial content, part of which has been replaced with the other character string, may be output.

The other character string and the character string to be replaced may have the same number of syllables or the same number of moras.
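To make the mora condition concrete, the sketch below checks whether two kana strings have the same mora count before substitution. It is an assumption-laden illustration, not the specification's procedure: it assumes both strings are already written in kana, and it uses the common rule that small kana such as "ゃ" merge with the preceding character while "っ", "ん", and "ー" each count as one mora. The names `count_moras` and `can_substitute` are hypothetical.

```python
# Small kana that combine with the preceding character and add no mora of their own.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_moras(kana: str) -> int:
    """Count moras in a kana string (assumes the input contains only kana and spaces)."""
    return sum(1 for ch in kana if ch not in SMALL_KANA and not ch.isspace())


def can_substitute(original: str, replacement: str) -> bool:
    """Return True when the replacement has the same mora count as the original."""
    return count_moras(original) == count_moras(replacement)
```

Keeping the mora count equal means the replacement text can be sung to the same notes as the text it replaces, which is presumably why the condition is stated.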

This singing voice output method may include a step of replacing, according to the reaction, part of the second partial content with another character string, and in the step of outputting the singing voice, a singing voice synthesized using the character string of the second partial content, part of which has been replaced with the other character string, may be output.

This singing voice output method may include a step of outputting, between the singing voice of the first partial content and the singing voice of the second partial content, a singing voice synthesized so as to have a time length corresponding to the matter indicated by the character string included in the first partial content.

This singing voice output method may include a step of outputting a singing voice synthesized using a second character string corresponding to the matter indicated by a first character string included in the first partial content, the output taking place after the singing voice of the first partial content has been output, at a timing determined by a time length corresponding to the matter indicated by the first character string.

The present invention also provides an information processing system comprising: a specifying unit that identifies a first partial content from among a plurality of partial contents obtained by decomposing a character string included in content; an output unit that outputs a singing voice synthesized using a character string included in the first partial content; and a receiving unit that receives a user's reaction to the singing voice, wherein the output unit outputs, according to the reaction, a singing voice synthesized using a character string included in a second partial content that follows the first partial content.

According to the present invention, singing voice can be output in response to interaction with a user.

FIG. 1 is a diagram showing an overview of a voice response system 1 according to an embodiment.
FIG. 2 is a diagram illustrating an overview of the functions of the voice response system 1.
FIG. 3 is a diagram illustrating the hardware configuration of the input/output device 10.
FIG. 4 is a diagram illustrating the hardware configuration of the response engine 20 and the singing synthesis engine 30.
FIG. 5 is a diagram illustrating the functional configuration related to the learning function 51.
FIG. 6 is a flowchart showing an overview of operations related to the learning function 51.
FIG. 7 is a sequence chart illustrating operations related to the learning function 51.
FIG. 8 is a diagram illustrating the classification table 5161.
FIG. 9 is a diagram illustrating the functional configuration related to the singing synthesis function 52.
FIG. 10 is a flowchart showing an overview of operations related to the singing synthesis function 52.
FIG. 11 is a sequence chart illustrating operations related to the singing synthesis function 52.
FIG. 12 is a diagram illustrating the functional configuration related to the response function 53.
FIG. 13 is a flowchart illustrating operations related to the response function 53.
FIG. 14 is a diagram showing operation example 1 of the voice response system 1.
FIG. 15 is a diagram showing operation example 2 of the voice response system 1.
FIG. 16 is a diagram showing operation example 3 of the voice response system 1.
FIG. 17 is a diagram showing operation example 4 of the voice response system 1.
FIG. 18 is a diagram showing operation example 5 of the voice response system 1.
FIG. 19 is a diagram showing operation example 6 of the voice response system 1.
FIG. 20 is a diagram showing operation example 7 of the voice response system 1.
FIG. 21 is a diagram showing operation example 8 of the voice response system 1.
FIG. 22 is a diagram showing operation example 9 of the voice response system 1.
FIG. 23 is a diagram showing operation example 10 of the voice response system 1.
FIG. 24 is a diagram showing operation example 11 of the voice response system 1.

1. System Overview
FIG. 1 is a diagram showing an overview of a voice response system 1 according to an embodiment. The voice response system 1 is a system that, when a user provides input (or an instruction) by voice, automatically outputs a response by voice; it is a so-called AI (Artificial Intelligence) voice assistant. Hereinafter, the voice input from the user to the voice response system 1 is referred to as the "input voice", and the voice output from the voice response system 1 in response to the input voice is referred to as the "response voice". In this example in particular, the voice response includes singing. That is, the voice response system 1 is an example of a singing synthesis system. For example, when the user says "Sing something" to the voice response system 1, the voice response system 1 automatically synthesizes a song and outputs the synthesized singing.

The voice response system 1 includes an input/output device 10, a response engine 20, and a singing synthesis engine 30. The input/output device 10 provides a man-machine interface: it receives input voice from the user and outputs a response voice to that input voice. The response engine 20 analyzes the input voice received by the input/output device 10 and generates the response voice. At least part of this response voice is singing voice. The singing synthesis engine 30 synthesizes the singing voice used in the response voice.

FIG. 2 is a diagram illustrating an overview of the functions of the voice response system 1. The voice response system 1 has a learning function 51, a singing synthesis function 52, and a response function 53. The response function 53 analyzes the user's input voice and provides a response voice based on the analysis result; it is provided by the input/output device 10 and the response engine 20. The learning function 51 learns the user's preferences from the user's input voice; it is provided by the singing synthesis engine 30. The singing synthesis function 52 synthesizes the singing voice used in the response voice; it is provided by the singing synthesis engine 30. The learning function 51, the singing synthesis function 52, and the response function 53 are related as follows. The learning function 51 learns the user's preferences using the analysis results obtained by the response function 53. The singing synthesis function 52 synthesizes singing voice based on the learning performed by the learning function 51. The response function 53 responds using the singing voice synthesized by the singing synthesis function 52. Details of each function are described later.

FIG. 3 is a diagram illustrating the hardware configuration of the input/output device 10. The input/output device 10 includes a microphone 101, an input signal processing unit 102, an output signal processing unit 103, a speaker 104, a CPU (Central Processing Unit) 105, a sensor 106, a motor 107, and a network IF 108. The microphone 101 is a device that converts the user's voice into an electrical signal (input sound signal). The input signal processing unit 102 is a device that performs processing such as analog/digital conversion on the input sound signal and outputs data representing the input voice (hereinafter "input voice data"). The output signal processing unit 103 is a device that performs processing such as digital/analog conversion on data representing the response voice (hereinafter "response voice data") and outputs an output sound signal. The speaker 104 is a device that converts the output sound signal into sound (outputs sound based on the output sound signal). The CPU 105 is a device that controls the other elements of the input/output device 10, and reads a program from a memory (not shown) and executes it. The sensor 106 detects the user's position (the direction of the user as seen from the input/output device 10), and is, for example, an infrared sensor or an ultrasonic sensor. The motor 107 changes the orientation of at least one of the microphone 101 and the speaker 104 so that it faces the direction of the user. In one example, the microphone 101 may be a microphone array, and the CPU 105 may detect the direction of the user based on the sound picked up by the microphone array. The network IF 108 is an interface for communicating via a network (for example, the Internet), and includes, for example, an antenna and a chipset for communicating in accordance with a predetermined wireless communication standard (for example, so-called WiFi (registered trademark)).

FIG. 4 is a diagram illustrating the hardware configuration of the response engine 20 and the singing synthesis engine 30. The response engine 20 is a computer device having a CPU 201, a memory 202, a storage 203, and a communication IF 204. The CPU 201 performs various computations according to a program and controls the other elements of the computer device. The memory 202 is a main storage device that functions as a work area when the CPU 201 executes a program, and includes, for example, RAM (Random Access Memory). The storage 203 is a nonvolatile auxiliary storage device that stores various programs and data, and includes, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The communication IF 204 includes a connector and a chipset for communicating in accordance with a predetermined communication standard (for example, Ethernet). In this example, the storage 203 stores a program for causing the computer device to function as the response engine 20 in the voice response system 1 (hereinafter the "response program"). The computer device functions as the response engine 20 when the CPU 201 executes the response program. The response engine 20 is, for example, a so-called AI.

The singing synthesis engine 30 is a computer device having a CPU 301, a memory 302, a storage 303, and a communication IF 304. The details of each element are the same as for the response engine 20. In this example, the storage 303 stores a program for causing the computer device to function as the singing synthesis engine 30 in the voice response system 1 (hereinafter the "singing synthesis program"). The computer device functions as the singing synthesis engine 30 when the CPU 301 executes the singing synthesis program.

In this example, the response engine 20 and the singing synthesis engine 30 are provided on the Internet as a so-called cloud service. The response engine 20 and the singing synthesis engine 30 may instead be services that do not rely on cloud computing. The details and operation of the learning function 51, the singing synthesis function 52, and the response function 53 are described in turn below.

2. Learning Function
2-1. Configuration
FIG. 5 is a diagram illustrating the functional configuration related to the learning function 51. As functional elements related to the learning function 51, the voice response system 1 has a voice analysis unit 511, an emotion estimation unit 512, a music analysis unit 513, a lyrics extraction unit 514, a preference analysis unit 515, a storage unit 516, and a processing unit 510. The input/output device 10 functions as a receiving unit that receives the user's input voice and as an output unit that outputs the response voice.

The voice analysis unit 511 analyzes the input voice. Analysis here means processing that obtains, from the input voice, the information used to generate a response voice. Specifically, it includes: converting the input voice into text (that is, into a character string); determining the user's request from the resulting text; identifying the content providing unit 60 that provides content in response to the user's request; issuing instructions to the identified content providing unit 60; acquiring data from the content providing unit 60; and generating a response using the acquired data. In this example, the content providing unit 60 is a system external to the voice response system 1. The content providing unit 60 is a computer resource that provides at least a service (for example, a music streaming service or internet radio) that outputs data for reproducing content such as music as sound (hereinafter "music data"), and is, for example, a server external to the voice response system 1.

The music analysis unit 513 analyzes the music data output from the content providing unit 60. Analysis of music data means processing that extracts the features of a piece of music. The features of a piece of music include, for example, at least one of its tune, rhythm, chord progression, tempo, and arrangement. Known techniques are used to extract the features.

The lyrics extraction unit 514 extracts lyrics from the music data output from the content providing unit 60. In one example, the music data includes metadata in addition to sound data. The sound data represents the signal waveform of the piece of music, and includes, for example, uncompressed data such as PCM (Pulse Code Modulation) data, or compressed data such as MP3 data. The metadata contains information related to the piece of music, including, for example, attributes such as the title, performer name, composer name, lyricist name, album title, and genre, as well as information such as the lyrics. The lyrics extraction unit 514 extracts the lyrics from the metadata included in the music data. If the music data does not include metadata, the lyrics extraction unit 514 performs speech recognition on the sound data and extracts the lyrics from the text obtained by speech recognition.
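The metadata-first, speech-recognition-fallback behavior described for the lyrics extraction unit 514 can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keys "metadata", "lyrics", and "sound", and the `recognize` callback standing in for a speech recognizer, are hypothetical. The sketch also falls back when metadata exists but carries no lyrics, which is a design choice of the example rather than something the text specifies.

```python
from typing import Any, Callable, Mapping


def extract_lyrics(
    music_data: Mapping[str, Any],
    recognize: Callable[[bytes], str],
) -> str:
    """Return lyrics from metadata when available, otherwise from speech recognition."""
    metadata = music_data.get("metadata")
    # Preferred path: lyrics carried in the metadata of the music data.
    if isinstance(metadata, Mapping) and metadata.get("lyrics"):
        return str(metadata["lyrics"])
    # Fallback path: run speech recognition on the sound data (assumed callback).
    return recognize(music_data["sound"])
```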

The emotion estimation unit 512 estimates the user's emotion. In this example, the emotion estimation unit 512 estimates the user's emotion from the input voice. Known techniques are used to estimate the emotion. In one example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. Alternatively, the emotion estimation unit 512 may estimate the user's emotion based on the input voice converted into text by the voice analysis unit 511, or based on the user's request obtained by the analysis.

The preference analysis unit 515 generates information indicating the user's preferences (hereinafter "preference information") using at least one of the playback history, the analysis result, and the lyrics of the music the user has instructed to be played, and the user's emotion at the time the user instructed that music to be played. The preference analysis unit 515 uses the generated preference information to update the classification table 5161 stored in the storage unit 516. The classification table 5161 is a table (or database) that records the user's preferences. For example, it records, for each user and each emotion, features of music (for example, timbre, tune, rhythm, chord progression, and tempo), attributes of music (performer name, composer name, lyricist name, and genre), and lyrics. The storage unit 516 is an example of a reading unit that reads, from a table in which parameters used for singing synthesis are recorded in association with users, the parameters corresponding to the user who input a trigger. Here, the parameters used for singing synthesis are the data referred to during singing synthesis; in the example of the classification table 5161, the concept covers timbre, tune, rhythm, chord progression, tempo, performer name, composer name, lyricist name, genre, and lyrics.
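One way to picture a table keyed by user and emotion is the small in-memory structure below. It is only a sketch under assumptions: the class name `ClassificationTable`, the method names, and the choice to accumulate every observed value in a list are hypothetical, and the specification does not say how the classification table 5161 is actually stored.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

Key = Tuple[str, str]  # (user identifier, emotion)


class ClassificationTable:
    """In-memory stand-in for a preference table keyed by user and emotion."""

    def __init__(self) -> None:
        self._rows: DefaultDict[Key, DefaultDict[str, List[str]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def update(self, user_id: str, emotion: str, preference_info: Dict[str, str]) -> None:
        """Append each item of preference information (e.g. tempo, genre) to the row."""
        row = self._rows[(user_id, emotion)]
        for name, value in preference_info.items():
            row[name].append(value)

    def read(self, user_id: str, emotion: str) -> Dict[str, List[str]]:
        """Read the recorded parameters for one user and emotion (empty if none)."""
        row = self._rows.get((user_id, emotion), {})
        return {name: list(values) for name, values in row.items()}
```

Reading with `get` rather than indexing avoids creating empty rows for users or emotions that have never been recorded.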

2-2. Operation
FIG. 6 is a flowchart showing an overview of the operation of the voice response system 1 related to the learning function 51. In step S11, the voice response system 1 analyzes the input voice. In step S12, the voice response system 1 performs the processing instructed by the input voice. In step S13, the voice response system 1 determines whether the input voice includes an item to be learned. If it is determined that the input voice includes an item to be learned (S13: YES), the voice response system 1 moves the process to step S14. If it is determined that the input voice does not include an item to be learned (S13: NO), the voice response system 1 moves the process to step S18. In step S14, the voice response system 1 estimates the user's emotion. In step S15, the voice response system 1 analyzes the music that was instructed to be played. In step S16, the voice response system 1 acquires the lyrics of the music that was instructed to be played. In step S17, the voice response system 1 updates the classification table using the information obtained in steps S14 to S16.

The processing from step S18 onward is not directly related to the learning function 51, that is, to updating the classification table, but it is described here because it includes processing that uses the classification table. In step S18, the voice response system 1 generates a response voice to the input voice. At this time, the classification table is referred to as necessary. In step S19, the voice response system 1 outputs the response voice. The operation of the voice response system 1 related to the learning function 51 is described in more detail below.
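The branch at step S13 and the order of steps S11 to S19 can be summarized in code. The sketch below is a deliberately simplified illustration: it works on text rather than audio, treats any input beginning with "play " as an item to be learned, and uses placeholder values for the emotion, the analysis result, and the lyrics. All function and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Analysis:
    text: str
    is_play_request: bool  # stands in for "includes an item to be learned"


def analyze(input_text: str) -> Analysis:
    """S11: analyze the input (assumed rule: a play request starts with 'play ')."""
    return Analysis(text=input_text, is_play_request=input_text.lower().startswith("play "))


def handle_input(input_text: str, table: List[Dict[str, str]]) -> str:
    analysis = analyze(input_text)                      # S11
    # S12: perform the instructed processing (omitted in this sketch).
    if analysis.is_play_request:                        # S13: YES
        emotion = "neutral"                             # S14: placeholder estimate
        song = analysis.text[len("play "):]             # S15: placeholder analysis
        lyrics = ""                                     # S16: placeholder lyrics
        table.append({"emotion": emotion, "song": song, "lyrics": lyrics})  # S17
    response = f"OK ({len(table)} learned entries)"     # S18: may consult the table
    return response                                     # S19: output the response
```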

FIG. 7 is a sequence chart illustrating the operation of the voice response system 1 related to the learning function 51. The user registers with the voice response system 1, for example, when subscribing to the voice response system 1 or when starting it for the first time. User registration includes, for example, setting a user name (or login ID) and a password. At the start of the sequence in FIG. 7, the input/output device 10 has been started and the user's login process has been completed. That is, the voice response system 1 has identified the user who is using the input/output device 10. The input/output device 10 is waiting for voice input (an utterance) from the user. The method by which the voice response system 1 identifies the user is not limited to a login process. For example, the voice response system 1 may identify the user based on the input voice.

In step S101, the input/output device 10 receives the input voice. The input/output device 10 converts the input voice into data and generates voice data. The voice data includes sound data representing the signal waveform of the input voice, and a header. The header contains information indicating the attributes of the input voice. The attributes of the input voice include, for example, an identifier for identifying the input/output device 10, the user identifier of the user who uttered the voice (for example, a user name or login ID), and a timestamp indicating the time at which the voice was uttered. In step S102, the input/output device 10 outputs the voice data representing the input voice to the voice analysis unit 511.
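The voice data described here, sound data plus a header carrying a device identifier, a user identifier, and a timestamp, might be modeled as below. The class and field names are assumptions for illustration; the specification does not define a concrete data layout or serialization.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class VoiceHeader:
    device_id: str       # identifies the input/output device
    user_id: str         # e.g. user name or login ID
    timestamp: datetime  # time at which the voice was uttered


@dataclass(frozen=True)
class VoiceData:
    header: VoiceHeader
    sound: bytes         # signal waveform of the input voice


def make_voice_data(
    device_id: str, user_id: str, sound: bytes, now: Optional[datetime] = None
) -> VoiceData:
    """Bundle captured sound with its attributes; 'now' is injectable for testing."""
    stamp = now if now is not None else datetime.now(timezone.utc)
    return VoiceData(header=VoiceHeader(device_id, user_id, stamp), sound=sound)
```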

In step S103, the voice analysis unit 511 analyzes the input voice using the voice data. In this analysis, the voice analysis unit 511 determines whether the input voice includes an item to be learned. In this example, an item to be learned is an item that specifies a piece of music, specifically an instruction to play a piece of music.

In step S104, the processing unit 510 performs the processing instructed by the input voice. The processing performed by the processing unit 510 is, for example, streaming playback of music. In this case, the content providing unit 60 has a music database in which a plurality of pieces of music data are recorded. The processing unit 510 reads the music data of the instructed piece of music from the music database. The processing unit 510 transmits the read music data to the input/output device 10 from which the input voice was sent. In another example, the processing performed by the processing unit 510 is playback of internet radio. In this case, the content providing unit 60 performs streaming broadcast of radio audio. The processing unit 510 transmits the streaming data received from the content providing unit 60 to the input/output device 10 from which the input voice was sent.

If it is determined in step S103 that the input voice contains an item to be learned, the processing unit 510 further performs processing for updating the classification table (step S105). In this example, the processing for updating the classification table includes a request to the emotion estimation unit 512 for emotion estimation (step S1051), a request to the music analysis unit 513 for music analysis (step S1052), and a request to the lyrics extraction unit 514 for lyrics extraction (step S1053).

When emotion estimation is requested, the emotion estimation unit 512 estimates the user's emotion (step S106) and outputs information indicating the estimated emotion (hereinafter "emotion information") to the requesting processing unit 510 (step S107). In this example, the emotion estimation unit 512 estimates the user's emotion using the input voice. The emotion estimation unit 512 estimates the emotion based on, for example, the input voice converted into text. In one example, keywords indicating emotions are defined in advance, and if the text of the input voice contains such a keyword, the emotion estimation unit 512 determines that the user has that emotion (for example, if the text contains the keyword "Damn it", the user's emotion is determined to be "anger"). In another example, the emotion estimation unit 512 estimates the emotion based on the pitch, volume, or speed of the input voice, or on changes in these over time. In one example, if the average pitch of the input voice is lower than a threshold, the emotion estimation unit 512 determines that the user's emotion is "sad". In another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the (average) pitch of the voice output by the voice response system 1 and the pitch of the user's response to it. Specifically, if the pitch of the user's response is low even though the pitch of the voice output by the voice response system 1 is high, the emotion estimation unit 512 determines that the user's emotion is "sad". In yet another example, the emotion estimation unit 512 may estimate the user's emotion based on the relationship between the pitch at the end of the output voice and the pitch of the user's response to it. Alternatively, the emotion estimation unit 512 may estimate the user's emotion by considering several of these factors in combination.
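As an illustration only, the keyword rule and the average-pitch rule described above might be sketched as follows. The keyword table, the threshold value, and the function name `estimate_emotion` are assumptions made for this sketch; the description does not specify them.

```python
from statistics import mean
from typing import Optional, Sequence

# Hypothetical keyword table. The description gives only one example
# (a swear word mapped to "anger"); a real table would be larger.
EMOTION_KEYWORDS = {"damn": "anger"}

# Hypothetical threshold in Hz. The description states no concrete value.
LOW_PITCH_THRESHOLD_HZ = 120.0


def estimate_emotion(text: str, pitches_hz: Sequence[float]) -> Optional[str]:
    """Return an emotion label, or None when neither rule applies.

    text       -- the input voice converted into text
    pitches_hz -- pitch values measured over the input voice
    """
    lowered = text.lower()
    # Rule 1: a predefined keyword in the text decides the emotion.
    for keyword, emotion in EMOTION_KEYWORDS.items():
        if keyword in lowered:
            return emotion
    # Rule 2: a low average pitch is treated as "sad".
    if pitches_hz and mean(pitches_hz) < LOW_PITCH_THRESHOLD_HZ:
        return "sad"
    return None
```

The keyword rule is checked first here purely as a design choice for the sketch; the description also allows the factors to be weighed in combination.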

In another example, the emotion estimation unit 512 may estimate the user's emotion using input other than voice. Such input is, for example, video of the user's face captured by a camera, the user's body temperature detected by a temperature sensor, or a combination of these. Specifically, the emotion estimation unit 512 determines from the user's facial expression whether the user's emotion is "happy", "anger", or "sad". The emotion estimation unit 512 may also determine the user's emotion based on changes in facial expression in a video of the user's face. Alternatively, the emotion estimation unit 512 may determine "anger" when the user's body temperature is high and "sad" when it is low.

When music analysis is requested, the music analysis unit 513 analyzes the song to be played in response to the user's instruction (step S108) and outputs information indicating the analysis result (hereinafter "music information") to the requesting processing unit 510 (step S109).

When lyrics extraction is requested, the lyrics extraction unit 514 acquires the lyrics of the song to be played in response to the user's instruction (step S110) and outputs information indicating the acquired lyrics (hereinafter "lyrics information") to the requesting processing unit 510 (step S111).

In step S112, the processing unit 510 outputs the set of emotion information, music information, and lyrics information acquired from the emotion estimation unit 512, the music analysis unit 513, and the lyrics extraction unit 514, respectively, to the preference analysis unit 515.

In step S113, the preference analysis unit 515 analyzes multiple sets of information to obtain information indicating the user's preferences. For this analysis, the preference analysis unit 515 records multiple such sets of information over a certain past period (for example, the period from the start of system operation to the present). In one example, the preference analysis unit 515 statistically processes the music information and calculates statistical representative values (for example, the mean, mode, or median). This statistical processing yields, for example, the mean tempo and the modes of timbre, musical mood, rhythm, chord progression, composer name, lyricist name, and performer name. The preference analysis unit 515 also uses a technique such as morphological analysis to decompose the lyrics indicated by the lyrics information into words, identifies the part of speech of each word, creates a histogram of the words of a specific part of speech (for example, nouns), and identifies the words whose frequency of appearance falls within a predetermined range (for example, the top 5%). Furthermore, the preference analysis unit 515 extracts from the lyrics information word groups that contain an identified word and correspond to a predetermined syntactic unit (for example, a sentence, clause, or phrase). For example, if the word "like" appears frequently, word groups containing it, such as "I like you the way you are" and "because I like you so much", are extracted from the lyrics information. These means, modes, and word groups are examples of information (parameters) indicating the user's preferences. Alternatively, the preference analysis unit 515 may analyze the sets of information according to a predetermined algorithm other than simple statistical processing to obtain information indicating the user's preferences. Alternatively, the preference analysis unit 515 may accept feedback from the user and adjust the weights of these parameters according to the feedback. In step S114, the preference analysis unit 515 updates the classification table 5161 using the information obtained in step S113.
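The statistical processing of step S113 can be illustrated with a minimal sketch. It assumes that morphological analysis has already been performed upstream, so the lyrics arrive as a list of nouns and a list of phrases; the function names and the dictionary keys `tempo` and `timbre` are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, mode
from typing import Dict, List


def summarize_songs(songs: List[Dict]) -> Dict:
    """Compute representative values: mean tempo and most common timbre."""
    return {
        "tempo_mean": mean([s["tempo"] for s in songs]),
        "timbre_mode": mode([s["timbre"] for s in songs]),
    }


def frequent_words(nouns: List[str], top_fraction: float = 0.05) -> List[str]:
    """Return the most frequent distinct words (top 5% by default).

    At least one word is always kept so that small inputs still
    yield a result; that floor is an assumption of this sketch.
    """
    counts = Counter(nouns)
    keep = max(1, math.ceil(len(counts) * top_fraction))
    return [word for word, _ in counts.most_common(keep)]


def phrases_containing(phrases: List[str], words: List[str]) -> List[str]:
    """Extract the phrases that contain any of the identified words."""
    return [p for p in phrases if any(w in p for w in words)]
```

For example, given the nouns `["love", "love", "love", "sky", "sea"]`, `frequent_words` keeps only `"love"`, and `phrases_containing` then selects the phrases in which that word occurs.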

FIG. 8 is a diagram illustrating the classification table 5161. The figure shows the classification table 5161 of the user whose user name is "Taro Yamada". In the classification table 5161, the features, attributes, and lyrics of songs are recorded in association with the user's emotions. The classification table 5161 shows, for example, that when the user "Taro Yamada" feels "happy", he prefers songs whose lyrics contain the words "恋" (koi), "愛" (ai), and "love", whose tempo is about 60, which have the chord progression "I→V→VIm→IIIm→IV→I→IV→V", and in which the piano is the main timbre. The preference information recorded in the classification table 5161 accumulates as learning progresses, that is, as the cumulative usage time of the voice response system 1 increases, and comes to reflect the user's preferences more closely. In this way, information reflecting the user's preferences can be obtained automatically.
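One possible in-memory shape for such a table, given purely as an assumption for illustration, is a mapping from user identifier to emotion to a preference record. The field names below are hypothetical; the example values are those described for FIG. 8.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Preference:
    """Preference record for one (user, emotion) pair."""
    lyric_words: List[str] = field(default_factory=list)
    tempo: Optional[float] = None
    chord_progression: List[str] = field(default_factory=list)
    timbre: Optional[str] = None


# user identifier -> emotion -> preference record
ClassificationTable = Dict[str, Dict[str, Preference]]

table: ClassificationTable = {
    "Taro Yamada": {
        "happy": Preference(
            lyric_words=["恋", "愛", "love"],
            tempo=60,
            chord_progression=["I", "V", "VIm", "IIIm", "IV", "I", "IV", "V"],
            timbre="piano",
        )
    }
}


def lookup(table: ClassificationTable, user_id: str,
           emotion: str) -> Optional[Preference]:
    """Return the preference for a user and emotion, or None if absent."""
    return table.get(user_id, {}).get(emotion)
```

A lookup keyed by user identifier and emotion corresponds to how the table is queried later in the singing synthesis sequence.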

Note that the preference analysis unit 515 may set initial values of the classification table 5161 at a predetermined timing, such as at user registration or first login. In this case, the voice response system 1 may have the user select a character that represents the user on the system (for example, a so-called avatar), and may set a classification table 5161 having initial values corresponding to the selected character as the classification table for that user.

The data recorded in the classification table 5161 described in this embodiment is merely an example. For example, the classification table 5161 need not record the user's emotions, as long as at least the lyrics are recorded. Alternatively, the classification table 5161 need not record lyrics, as long as at least the user's emotions and the results of music analysis are recorded.

3. Singing synthesis function
3-1. Configuration
FIG. 9 is a diagram illustrating the functional configuration related to the singing synthesis function 52. As functional elements related to the singing synthesis function 52, the voice response system 1 has the voice analysis unit 511, the emotion estimation unit 512, the storage unit 516, a detection unit 521, a singing generation unit 522, an accompaniment generation unit 523, and a synthesis unit 524. The singing generation unit 522 has a melody generation unit 5221 and a lyrics generation unit 5222. In the following, description of elements shared with the learning function 51 is omitted.

For the singing synthesis function 52, the storage unit 516 stores a segment database 5162. The segment database is a database in which speech segment data used in singing synthesis is recorded. Speech segment data is data representing one or more phonemes. A phoneme corresponds to the smallest unit that distinguishes meaning in a language (for example, a vowel or consonant); it is the smallest phonological unit of a language, defined in view of that language's actual articulation and its phonological system as a whole. A speech segment is a section, corresponding to a desired phoneme or phoneme chain, cut out of an input voice uttered by a specific speaker. The speech segment data in this embodiment is data indicating the frequency spectrum of a speech segment. In the following description, the term "speech segment" covers both a single phoneme (for example, a monophone) and a phoneme chain (for example, a diphone or triphone).

The storage unit 516 may store multiple segment databases 5162. The multiple segment databases 5162 may include, for example, databases recording phonemes pronounced by different singers (or speakers). Alternatively, the multiple segment databases 5162 may include databases recording phonemes pronounced by a single singer (or speaker) in different singing styles or tones of voice.

The singing generation unit 522 generates a singing voice, that is, performs singing synthesis. A singing voice is a voice that utters given lyrics according to a given melody. The melody generation unit 5221 generates the melody used for singing synthesis. The lyrics generation unit 5222 generates the lyrics used for singing synthesis. The melody generation unit 5221 and the lyrics generation unit 5222 may generate the melody and lyrics using the information recorded in the classification table 5161. The singing generation unit 522 generates a singing voice using the melody generated by the melody generation unit 5221 and the lyrics generated by the lyrics generation unit 5222. The accompaniment generation unit 523 generates an accompaniment for the singing voice. The synthesis unit 524 synthesizes a singing voice using the singing voice generated by the singing generation unit 522, the accompaniment generated by the accompaniment generation unit 523, and the speech segments recorded in the segment database 5162.

3-2. Operation
FIG. 10 is a flowchart outlining the operation (singing synthesis method) of the voice response system 1 related to the singing synthesis function 52. In step S21, the voice response system 1 determines whether an event that triggers singing synthesis has occurred. That is, the voice response system 1 detects an event that triggers singing synthesis. Events that trigger singing synthesis include at least one of the following: an event in which the user provides voice input; an event registered in a calendar (for example, an alarm or the user's birthday); an event in which the user enters an instruction for singing synthesis by a means other than voice (for example, by operating a smartphone (not shown) wirelessly connected to the input/output device 10); and a randomly occurring event. If it is determined that an event that triggers singing synthesis has occurred (S21: YES), the voice response system 1 proceeds to step S22. If it is determined that no such event has occurred (S21: NO), the voice response system 1 waits until one occurs.

In step S22, the voice response system 1 reads the singing synthesis parameters. In step S23, the voice response system 1 generates lyrics. In step S24, the voice response system 1 generates a melody. In step S25, the voice response system 1 modifies one of the generated lyrics and melody to match the other. In step S26, the voice response system 1 selects the segment database to be used. In step S27, the voice response system 1 performs singing synthesis using the lyrics, melody, and segment database obtained in steps S23 to S26. In step S28, the voice response system 1 generates an accompaniment. In step S29, the voice response system 1 synthesizes the singing voice and the accompaniment. The processing of steps S23 to S29 is part of the processing of step S18 in the flow of FIG. 6. The operation of the voice response system 1 related to the singing synthesis function 52 is described in more detail below.

FIG. 11 is a sequence chart illustrating the operation of the voice response system 1 related to the singing synthesis function 52. On detecting an event that triggers singing synthesis, the detection unit 521 requests the singing generation unit 522 to perform singing synthesis (step S201). The singing synthesis request includes the user's identifier. When singing synthesis is requested, the singing generation unit 522 queries the storage unit 516 for the user's preferences (step S202). This query includes the user identifier. On receiving the query, the storage unit 516 reads from the classification table 5161 the preference information corresponding to the user identifier in the query, and outputs it to the singing generation unit 522 (step S203). The singing generation unit 522 also queries the emotion estimation unit 512 for the user's emotion (step S204). This query includes the user identifier. On receiving the query, the emotion estimation unit 512 outputs that user's emotion information to the singing generation unit 522 (step S205).

In step S206, the singing generation unit 522 selects the source of the lyrics. The source of the lyrics is determined according to the input voice. Broadly, the source of the lyrics is either the processing unit 510 or the classification table 5161. The singing synthesis request output from the processing unit 510 to the singing generation unit 522 may or may not contain lyrics (or lyric material). Lyric material is a character string that cannot form lyrics on its own but forms lyrics when combined with other lyric material. A singing synthesis request contains lyrics when, for example, a melody is added to the AI's response itself (such as "Tomorrow's weather will be sunny") and the result is output as the response voice. Since the singing synthesis request is generated by the processing unit 510, the source of the lyrics can be said to be the processing unit 510. Furthermore, since the processing unit 510 may acquire content from the content providing unit 60, the source of the lyrics can also be said to be the content providing unit 60. The content providing unit 60 is, for example, a server that provides news or a server that provides weather information. Alternatively, the content providing unit 60 is a server having a database in which the lyrics of existing songs are recorded. Although only one content providing unit 60 is shown in the figure, there may be multiple content providing units 60. If the singing synthesis request contains lyrics, the singing generation unit 522 selects the singing synthesis request as the source of the lyrics. If the singing synthesis request does not contain lyrics (for example, if the instruction given by the input voice does not specify the content of the lyrics, as in "sing something"), the singing generation unit 522 selects the classification table 5161 as the source of the lyrics.
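The selection rule of step S206 reduces to a single branch, which a minimal sketch can make explicit. The function name and the string labels for the two sources are hypothetical.

```python
from typing import Optional


def select_lyrics_source(request_lyrics: Optional[str]) -> str:
    """Choose where the lyrics come from (step S206).

    request_lyrics -- lyrics or lyric material carried by the singing
                      synthesis request, or None/empty if it carries none
    """
    if request_lyrics:
        # The request itself supplies the lyrics, e.g. a weather
        # response that is to be sung.
        return "request"
    # No lyrics were specified, e.g. "sing something".
    return "classification_table"
```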

In step S207, the singing generation unit 522 requests the selected source to provide lyric material. The example shown here is one in which the classification table 5161, that is, the storage unit 516, is selected as the source. In this case, the request includes the user identifier and that user's emotion information. On receiving the request for lyric material, the storage unit 516 extracts from the classification table 5161 the lyric material corresponding to the user identifier and emotion information in the request (step S208). The storage unit 516 outputs the extracted lyric material to the singing generation unit 522 (step S209).

On acquiring the lyric material, the singing generation unit 522 requests the lyrics generation unit 5222 to generate lyrics (step S210). This request includes the lyric material acquired from the source. When lyrics generation is requested, the lyrics generation unit 5222 generates lyrics using the lyric material (step S211). The lyrics generation unit 5222 generates lyrics by, for example, combining multiple pieces of lyric material. Alternatively, each source may store the lyrics of entire songs, in which case the lyrics generation unit 5222 may select, from the lyrics stored in the source, the lyrics of one song to be used for singing synthesis. The lyrics generation unit 5222 outputs the generated lyrics to the singing generation unit 522 (step S212).

In step S213, the singing generation unit 522 requests the melody generation unit 5221 to generate a melody. This request includes the user's preference information and information specifying the number of sounds in the lyrics. The information specifying the number of sounds in the lyrics is the number of characters, moras, or syllables in the generated lyrics. When melody generation is requested, the melody generation unit 5221 generates a melody according to the preference information in the request (step S214). Specifically, this is done, for example, as follows. The melody generation unit 5221 has access to a database of melody materials (hereinafter "melody database"; not shown). A melody material is, for example, a note sequence about two or four measures long, or an information sequence obtained by breaking a note sequence down into musical elements such as rhythm and pitch changes. The melody database is stored, for example, in the storage unit 516. The melody database records the attributes of each melody. These attributes include, for example, music information such as the musical mood or lyrics the melody suits and the composer's name. The melody generation unit 5221 selects, from the materials recorded in the melody database, one or more materials that match the preference information in the request, and combines the selected materials to obtain a melody of the desired length. The melody generation unit 5221 outputs information specifying the generated melody (for example, sequence data such as MIDI) to the singing generation unit 522 (step S215).
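Selecting and combining melody materials (step S214) might look like the following sketch. The `MelodyMaterial` type, its single `mood` attribute, the fallback to all materials when nothing matches, and the simple repeat-and-truncate strategy are all assumptions for illustration; the description leaves the matching and combining method open.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Note = Tuple[int, float]  # (MIDI pitch number, duration in beats)


@dataclass
class MelodyMaterial:
    notes: List[Note]
    mood: str  # hypothetical attribute used for matching preferences


def generate_melody(materials: Sequence[MelodyMaterial],
                    preferred_mood: str,
                    target_notes: int) -> List[Note]:
    """Combine matching materials into a melody of target_notes notes."""
    usable = [m for m in materials if m.notes]
    if not usable:
        raise ValueError("no melody material with notes is available")
    # Prefer materials matching the user's preference; fall back to
    # everything when nothing matches (an assumption of this sketch).
    candidates = [m for m in usable if m.mood == preferred_mood] or usable
    melody: List[Note] = []
    index = 0
    while len(melody) < target_notes:
        melody.extend(candidates[index % len(candidates)].notes)
        index += 1
    # Truncate to the requested length; a real system would more
    # likely cut at a phrase boundary.
    return melody[:target_notes]
```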

In step S216, the singing generation unit 522 requests the melody generation unit 5221 to modify the melody, or the lyrics generation unit 5222 to modify the lyrics. One purpose of this modification is to make the number of sounds in the lyrics (for example, the number of moras) match the number of notes in the melody. For example, if the lyrics have fewer moras than the melody has notes (too few characters), the singing generation unit 522 requests the lyrics generation unit 5222 to increase the number of characters in the lyrics. Alternatively, if the lyrics have more moras than the melody has notes (too many characters), the singing generation unit 522 requests the melody generation unit 5221 to increase the number of notes in the melody. The figure illustrates the case in which the lyrics are modified. In step S217, the lyrics generation unit 5222 modifies the lyrics in response to the modification request. When the melody is modified, the melody generation unit 5221 modifies it by, for example, splitting notes to increase their number. The lyrics generation unit 5222 or the melody generation unit 5221 may make adjustments so that the boundaries between the phrases of the lyrics coincide with the boundaries between the phrases of the melody. The lyrics generation unit 5222 outputs the modified lyrics to the singing generation unit 522 (step S218).
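The melody-side adjustment, splitting notes to increase their number, can be sketched as follows. Choosing the longest note to split, and halving it, are assumptions of this sketch; the description only says that notes are split. The mora count of the lyrics is taken as given.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # (MIDI pitch number, duration in beats)


def split_notes_to_match(melody: List[Note], mora_count: int) -> List[Note]:
    """Split notes until the melody has at least mora_count notes.

    The total duration is unchanged, because each split replaces one
    note with two notes of half its length at the same pitch.
    """
    notes = list(melody)
    if not notes and mora_count > 0:
        raise ValueError("cannot split an empty melody")
    while len(notes) < mora_count:
        # Pick the longest note (the first one on ties).
        i = max(range(len(notes)), key=lambda k: notes[k][1])
        pitch, duration = notes[i]
        notes[i:i + 1] = [(pitch, duration / 2), (pitch, duration / 2)]
    return notes
```

If the melody already has enough notes, it is returned unchanged; the opposite case (too few moras) is handled on the lyrics side in the description.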

On receiving the lyrics, the singing generation unit 522 selects the segment database 5162 to be used for singing synthesis (step S219). The segment database 5162 is selected according to, for example, the attributes of the user associated with the event that triggered singing synthesis. Alternatively, the segment database 5162 may be selected according to the content of the event that triggered singing synthesis. As a further alternative, the segment database 5162 may be selected according to the user's preference information recorded in the classification table 5161. Following the lyrics and melody obtained in the processing so far, the singing generation unit 522 synthesizes the speech segments extracted from the selected segment database 5162 to obtain synthesized singing data (step S220). Note that the classification table 5161 may record information indicating the user's preferences regarding singing techniques such as changes in tone of voice, tame (deliberately delayed timing), shakuri (scooping up to a pitch), and vibrato, and the singing generation unit 522 may refer to this information to synthesize singing that reflects techniques matching the user's preferences. The singing generation unit 522 outputs the generated synthesized singing data to the synthesis unit 524 (step S221).

Furthermore, the singing generation unit 522 requests the accompaniment generation unit 523 to generate an accompaniment (step S222). This request includes information indicating the melody used in the singing synthesis. The accompaniment generation unit 523 generates an accompaniment according to the melody in the request (step S223). A well-known technique is used to add an accompaniment to a melody automatically. If the melody database records data indicating the chord progression of the melody (hereinafter "chord progression data"), the accompaniment generation unit 523 may generate the accompaniment using this chord progression data. Alternatively, if the melody database records chord progression data for an accompaniment to the melody, the accompaniment generation unit 523 may generate the accompaniment using that chord progression data. As a further alternative, the accompaniment generation unit 523 may store multiple pieces of accompaniment audio data in advance and read out the one that matches the chord progression of the melody. The accompaniment generation unit 523 may also refer to the classification table 5161, for example to determine the musical mood of the accompaniment, and generate an accompaniment that matches the user's preferences. The accompaniment generation unit 523 outputs the generated accompaniment data to the synthesis unit 524 (step S224).

合成歌唱及び伴奏のデータを受けると、合成部５２４は、合成歌唱及び伴奏を合成する（ステップＳ２２５）。合成に際しては、演奏の開始位置やテンポを合わせることによって、歌唱と伴奏とが同期するように合成される。こうして伴奏付きの合成歌唱のデータが得られる。合成部５２４は、合成歌唱のデータを出力する。 Upon receiving the synthetic singing and accompaniment data, the synthesizing unit 524 synthesizes the synthetic singing and accompaniment (step S225). When compositing, the singing and accompaniment are synchronized by matching the starting position and tempo of the performance. In this way, data for synthetic singing with accompaniment is obtained. The synthesis unit 524 outputs data of synthesized singing.
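The synchronization in step S225 can be illustrated with a minimal mixing sketch. It assumes, beyond what the text states, that both tracks are mono sample lists already rendered at the same tempo and sample rate, so that only the start position has to be aligned. The function name `mix_tracks` and its gain parameters are hypothetical.

```python
from typing import List


def mix_tracks(vocal: List[float], accompaniment: List[float],
               vocal_start: int = 0, vocal_gain: float = 1.0,
               accompaniment_gain: float = 1.0) -> List[float]:
    """Sum two mono tracks, placing the vocal at sample index `vocal_start`."""
    length = max(len(accompaniment), vocal_start + len(vocal))
    mixed = [0.0] * length
    for i, sample in enumerate(accompaniment):
        mixed[i] += accompaniment_gain * sample
    for i, sample in enumerate(vocal):
        # The offset aligns the first sung note with the chosen start position.
        mixed[vocal_start + i] += vocal_gain * sample
    return mixed


# The vocal enters one sample after the accompaniment starts.
result = mix_tracks([1.0, 1.0], [0.5, 0.5, 0.5, 0.5], vocal_start=1)
```

If the two tracks were generated at different tempos, one of them would have to be time-stretched first; that step is outside this sketch.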

ここでは、最初に歌詞が生成され、その後、歌詞に合わせてメロディを生成する例を説明した。しかし、音声応答システム１は、先にメロディを生成し、その後、メロディに合わせて歌詞を生成してもよい。また、ここでは歌唱と伴奏とが合成された後に出力される例を説明したが、伴奏が生成されず、歌唱のみが出力されてもよい（すなわちアカペラでもよい）。また、ここでは、まず歌唱が合成された後に歌唱に合わせて伴奏が生成される例を説明したが、まず伴奏が生成され、伴奏に合わせて歌唱が合成されてもよい。 Here, we have explained an example in which lyrics are first generated and then a melody is generated to match the lyrics. However, the voice response system 1 may first generate a melody and then generate lyrics in accordance with the melody. Furthermore, although an example has been described in which the singing and accompaniment are combined and then output, the accompaniment may not be generated and only the singing may be output (that is, a cappella may be used). Furthermore, here, an example has been described in which the accompaniment is generated in accordance with the singing after the singing is first synthesized, but the accompaniment may be generated first and then the singing may be synthesized in accordance with the accompaniment.

４．応答機能
図１２は、応答機能５３に係る音声応答システム１の機能構成を例示する図である。応答機能５３に係る機能要素として、音声応答システム１は、音声分析部５１１、感情推定部５１２、及びコンテンツ分解部５３１を有する。以下において、学習機能５１及び歌唱合成機能５２と共通する要素については説明を省略する。コンテンツ分解部５３１は、一のコンテンツを複数の部分コンテンツに分解する。この例においてコンテンツとは、応答音声として出力される情報の内容をいい、具体的には、例えば、楽曲、ニュース、レシピ、又は教材（スポーツ教習、楽器教習、学習ドリル、クイズ）をいう。 4. Response Function FIG. 12 is a diagram illustrating a functional configuration of the voice response system 1 related to the response function 53. As functional elements related to the response function 53, the voice response system 1 includes a voice analysis section 511, an emotion estimation section 512, and a content decomposition section 531. In the following, description of elements common to the learning function 51 and the singing synthesis function 52 will be omitted. The content decomposition unit 531 decomposes one content into a plurality of partial contents. In this example, content refers to the content of information output as a response voice, and specifically refers to, for example, songs, news, recipes, or teaching materials (sports lessons, musical instrument lessons, learning drills, quizzes).

図１３は、応答機能５３に係る音声応答システム１の動作を例示するフローチャートである。ステップＳ３１において、音声分析部５１１は、再生するコンテンツを特定する。再生するコンテンツは、例えばユーザの入力音声に応じて特定される。具体的には、音声分析部５１１が入力音声を解析し、入力音声により再生が指示されたコンテンツを特定する。一例において、「ハンバーグのレシピ教えて」という入力音声が与えられると、音声分析部５１１は、「ハンバーグのレシピ」を提供するよう、処理部５１０に指示する。処理部５１０は、コンテンツ提供部６０にアクセスし、「ハンバーグのレシピ」を説明したテキストデータを取得する。こうして取得されたデータが、再生されるコンテンツとして特定される。処理部５１０は、特定されたコンテンツをコンテンツ分解部５３１に通知する。 FIG. 13 is a flowchart illustrating the operation of the voice response system 1 related to the response function 53. In step S31, the voice analysis unit 511 identifies the content to be played. The content to be played is identified, for example, according to the user's input voice. Specifically, the voice analysis unit 511 analyzes the input voice and identifies the content whose playback the input voice instructs. In one example, when the input voice "Tell me the recipe for hamburger steak" is given, the voice analysis unit 511 instructs the processing unit 510 to provide the "hamburger steak recipe." The processing unit 510 accesses the content providing unit 60 and obtains text data describing the "hamburger steak recipe." The data obtained in this way is identified as the content to be played. The processing unit 510 notifies the content decomposition unit 531 of the identified content.

ステップＳ３２において、コンテンツ分解部５３１は、コンテンツを複数の部分コンテンツに分解する。一例において、「ハンバーグのレシピ」は複数のステップ（材料を切る、材料を混ぜる、成形する、焼く等）から構成されるところ、コンテンツ分解部５３１は、「ハンバーグのレシピ」のテキストを、「材料を切るステップ」、「材料を混ぜるステップ」、「成形するステップ」、及び「焼くステップ」の４つの部分コンテンツに分解する。コンテンツの分解位置は、例えばＡＩにより自動的に判断される。あるいは、コンテンツに区切りを示すマーカーをあらかじめ埋め込んでおき、そのマーカーの位置でコンテンツが分解されてもよい。 In step S32, the content decomposition unit 531 decomposes the content into a plurality of partial contents. In one example, a "hamburger steak recipe" consists of multiple steps (cutting the ingredients, mixing the ingredients, shaping, grilling, and so on), and the content decomposition unit 531 decomposes the text of the "hamburger steak recipe" into four partial contents: a "step of cutting the ingredients," a "step of mixing the ingredients," a "shaping step," and a "grilling step." The positions at which the content is decomposed are determined automatically, for example by AI. Alternatively, markers indicating breaks may be embedded in the content in advance, and the content may be decomposed at the positions of those markers.
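The marker-based variant of step S32 is simple enough to sketch directly. The marker string `<step>` and the function name `decompose` are hypothetical; the specification does not say what form the embedded markers take.

```python
from typing import List

STEP_MARKER = "<step>"  # hypothetical delimiter embedded in the content in advance


def decompose(content: str, marker: str = STEP_MARKER) -> List[str]:
    """Split content into partial contents at each marker, dropping empty parts."""
    return [part.strip() for part in content.split(marker) if part.strip()]


recipe = ("Cut the ingredients.<step>Mix the ingredients."
          "<step>Shape the patties.<step>Grill the patties.")
parts = decompose(recipe)
```

The partial contents keep their original order, which matters because step S33 selects the target partial content according to its position in the original content.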

ステップＳ３３において、コンテンツ分解部５３１は、複数の部分コンテンツのうち対象となる一の部分コンテンツを特定する（特定部の一例）。対象となる部分コンテンツは再生される部分コンテンツであり、元のコンテンツにおけるその部分コンテンツの位置関係に応じて決められる。「ハンバーグのレシピ」の例では、コンテンツ分解部５３１は、まず、「材料を切るステップ」を対象となる部分コンテンツとして特定する。次にステップＳ３３の処理が行われるとき、コンテンツ分解部５３１は、「材料を混ぜるステップ」を対象となる部分コンテンツとして特定する。コンテンツ分解部５３１は、特定した部分コンテンツをコンテンツ修正部５３２に通知する。 In step S33, the content decomposition unit 531 specifies one target partial content among the plurality of partial contents (an example of a specifying unit). The target partial content is the partial content to be reproduced, and is determined according to the positional relationship of the partial content in the original content. In the example of "hamburger recipe", the content decomposition unit 531 first specifies "step of cutting ingredients" as the target partial content. Next, when the process of step S33 is performed, the content decomposition unit 531 specifies the "step of mixing materials" as the target partial content. The content decomposition unit 531 notifies the content modification unit 532 of the identified partial content.

ステップＳ３４において、コンテンツ修正部５３２は、対象となる部分コンテンツを修正する。具体的修正の方法は、コンテンツに応じて定義される。例えば、ニュース、気象情報、及びレシピといったコンテンツに対して、コンテンツ修正部５３２は修正を行わない。例えば、教材又はクイズのコンテンツに対して、コンテンツ修正部５３２は、問題として隠しておきたい部分を他の音（例えばハミング、「ラララ」、ビープ音等）に置換する。このとき、コンテンツ修正部５３２は、置換前の文字列とモーラ数又は音節数が同一の文字列を用いて置換する。コンテンツ修正部５３２は、修正された部分コンテンツを歌唱生成部５２２に出力する。 In step S34, the content modification unit 532 modifies the target partial content. The specific modification method is defined depending on the content. For example, the content modification unit 532 does not modify content such as news, weather information, and recipes. For example, with respect to the content of a teaching material or a quiz, the content modification unit 532 replaces a part that is desired to be hidden as a question with another sound (for example, humming, "la la la", beep sound, etc.). At this time, the content modification unit 532 replaces the character string with a character string that has the same number of moras or syllables as the character string before replacement. Content modification section 532 outputs the modified partial content to song generation section 522.
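The rule that the replacement must have the same number of moras as the hidden string can be sketched for kana input as follows. The sketch assumes the text has already been converted to kana (kanji would first need a reading), and it uses a simplified mora count in which small glide kana merge with the preceding kana while っ, ん, and ー each count as one mora. The names `count_moras` and `mask` are hypothetical.

```python
# Small kana that combine with the preceding kana into a single mora.
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎァィゥェォャュョヮ")


def count_moras(kana: str) -> int:
    """Count moras in a kana string (simplified rule, see the lead-in)."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def mask(kana_text: str, hidden: str, filler: str = "ラ") -> str:
    """Replace the first occurrence of `hidden` with an equal number of filler moras."""
    return kana_text.replace(hidden, filler * count_moras(hidden), 1)


masked = mask("ひみこにサンキューぎのこうてい", "にサンキュー")
```

Because the mora count is preserved, the masked text can be sung to the same melody as the original without shifting any notes.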

ステップＳ３５において、歌唱生成部５２２は、修正された部分コンテンツを歌唱合成する。歌唱生成部５２２により生成された歌唱音声は、最終的に、入出力装置１０から応答音声として出力される。応答音声を出力すると、音声応答システム１はユーザの応答待ち状態となる（ステップＳ３６）。ステップＳ３６において、音声応答システム１は、ユーザの応答を促す歌唱又は音声（例えば「できましたか？」等）を出力してもよい。音声分析部５１１は、ユーザの応答に応じて次の処理を決定する。次の部分コンテンツの再生を促す応答が入力された場合（Ｓ３６：次）、音声分析部５１１は、処理をステップＳ３３に移行する。次の部分コンテンツの再生を促す応答は、例えば、「次のステップへ」、「できた」、「終わった」等の音声である。次の部分コンテンツの再生を促す応答以外の応答が入力された場合（Ｓ３６：終了）、音声分析部５１１は、音声の出力を停止するよう処理部５１０に指示する。 In step S35, the singing generation unit 522 performs singing synthesis on the modified partial content. The singing voice generated by the singing generation unit 522 is ultimately output from the input/output device 10 as a response voice. After outputting the response voice, the voice response system 1 waits for the user's response (step S36). In step S36, the voice response system 1 may output singing or speech that prompts the user to respond (for example, "Are you done?"). The voice analysis unit 511 determines the next process according to the user's response. If a response prompting playback of the next partial content is input (S36: next), the voice analysis unit 511 returns the process to step S33. A response prompting playback of the next partial content is, for example, a voice such as "Go to the next step," "Done," or "Finished." If any other response is input (S36: end), the voice analysis unit 511 instructs the processing unit 510 to stop outputting the voice.

ステップＳ３７において、処理部５１０は、部分コンテンツの合成音声の出力を、少なくとも一時的に停止する。ステップＳ３８において、処理部５１０は、ユーザの入力音声に応じた処理を行う。ステップＳ３８における処理には、例えば、現在のコンテンツの再生中止、ユーザから指示されたキーワード検索、及び別のコンテンツの再生開始が含まれる。例えば、「歌を止めて欲しい」、「もう終わり」、又は「おしまい」等の応答が入力された場合、処理部５１０は、現在のコンテンツの再生を中止する。例えば、「短冊切りってどうやるの？」又は「アーリオオーリオって何？」等、質問型の応答が入力された場合、処理部５１０は、ユーザの質問に回答するための情報をコンテンツ提供部６０から取得する。処理部５１０は、ユーザの質問に対する回答の音声を出力する。この回答は歌唱ではなく、話声であってもよい。「○○の曲かけて」等、別のコンテンツの再生を指示する応答が入力された場合、処理部５１０は、指示されたコンテンツをコンテンツ提供部６０から取得し、再生する。 In step S37, the processing unit 510 stops outputting the synthesized voice of the partial content, at least temporarily. In step S38, the processing unit 510 performs processing according to the user's input voice. The processing in step S38 includes, for example, stopping playback of the current content, a keyword search instructed by the user, and starting playback of another content. For example, if a response such as "Stop singing," "That's enough," or "The end" is input, the processing unit 510 stops playback of the current content. If a question-type response such as "How do I cut into thin rectangular strips (tanzaku-giri)?" or "What is aglio e olio?" is input, the processing unit 510 obtains information for answering the user's question from the content providing unit 60 and outputs a voice answering the question. This answer may be spoken rather than sung. If a response instructing playback of another content, such as "Play a song by XX," is input, the processing unit 510 obtains the instructed content from the content providing unit 60 and plays it.
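The control flow of steps S33 to S38 can be summarized in a small sketch. Everything here is an assumption made for illustration: the keyword lists, the `Action` names, and the substring-based classification stand in for the voice analysis unit 511, whose actual method the specification does not describe. The sketch also stops at the first non-"next" response, whereas the system described above may resume afterwards.

```python
from enum import Enum
from typing import Iterable, List


class Action(Enum):
    NEXT = "next"
    STOP = "stop"
    PLAY_OTHER = "play_other"
    ANSWER_QUESTION = "answer_question"
    OTHER = "other"


NEXT_PHRASES = ("next", "done", "finished")       # hypothetical keyword lists
STOP_PHRASES = ("stop", "that's enough", "the end")


def classify_response(utterance: str) -> Action:
    """Crude keyword classifier standing in for the voice analysis step."""
    text = utterance.lower().strip()
    if any(p in text for p in NEXT_PHRASES):
        return Action.NEXT
    if any(p in text for p in STOP_PHRASES):
        return Action.STOP
    if text.startswith("play"):
        return Action.PLAY_OTHER
    if text.endswith("?"):
        return Action.ANSWER_QUESTION
    return Action.OTHER


def run_session(parts: List[str], responses: Iterable[str]) -> List[str]:
    """Sing each partial content, advancing only on a 'next' response."""
    log: List[str] = []
    replies = iter(responses)
    for part in parts:
        log.append(f"sing: {part}")        # S33 to S35
        reply = next(replies, None)        # S36: wait for the user
        if reply is None:
            break
        action = classify_response(reply)
        if action is not Action.NEXT:
            log.append(f"pause: {action.value}")  # S37; S38 would handle it
            break
    return log


log = run_session(["cut", "mix", "shape"], ["Done", "How do I chop an onion?"])
```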

なおここではコンテンツが複数の部分コンテンツに分解され、部分コンテンツ毎にユーザの反応に応じて次の処理を決定する例を説明した。しかし、応答機能５３が応答音声を出力する方法はこれに限定されない。例えば、コンテンツは部分コンテンツに分解されず、そのまま話声として、又はそのコンテンツを歌詞として用いた歌唱音声として出力されてもよい。音声応答システム１は、ユーザの入力音声に応じて、又は出力されるコンテンツに応じて、部分コンテンツに分解するか、分解せずそのまま出力するか判断してもよい。 Note that an example has been described here in which the content is decomposed into a plurality of partial contents and the next process is determined for each partial content according to the user's reaction. However, the method by which the response function 53 outputs the response voice is not limited to this. For example, the content may be output without being decomposed into partial contents, either as is in a speaking voice or as a singing voice that uses the content as lyrics. The voice response system 1 may decide whether to decompose the content into partial contents or to output it as is, according to the user's input voice or the content to be output.

５．動作例
以下、具体的な動作例をいくつか説明する。各動作例において特に明示はしないが、各動作例は、それぞれ、上記の学習機能、歌唱合成機能、及び応答機能の少なくとも１つ以上に基づくものである。なお以下の動作例はすべて日本語が使用される例を説明するが、使用される言語は日本語に限定されず、どのような言語でもよい。 5. Operation Examples Some specific operation examples will be explained below. Although not explicitly stated in each operation example, each operation example is based on at least one of the above learning function, singing synthesis function, and response function. Note that all the operation examples below are explained using Japanese, but the language used is not limited to Japanese, and any language may be used.

５－１．動作例１
図１４は、音声応答システム１の動作例１を示す図である。この例において、ユーザは「佐藤一太郎（実演者名）の『さくらさくら』（楽曲名）をかけて」という入力音声により、楽曲の再生を要求する。音声応答システム１は、この入力音声に従って楽曲データベースを検索し、要求された楽曲を再生する。このとき、音声応答システム１は、この入力音声を入力したときのユーザの感情及びこの楽曲の解析結果を用いて、分類テーブルを更新する。分類テーブルは、楽曲の再生が要求される度に分類テーブルを更新する。分類テーブルは、ユーザが音声応答システム１に対し楽曲の再生を要求する回数が増えるにつれ（すなわち、音声応答システム１の累積使用時間が増えるにつれ）、よりそのユーザの嗜好を反映したものになっていく。 5-1. Operation example 1
FIG. 14 is a diagram showing operation example 1 of the voice response system 1. In this example, the user requests playback of a song with the input voice "Play 'Sakura Sakura' (song title) by Ichitaro Sato (performer name)." The voice response system 1 searches the music database according to this input voice and plays the requested song. At this time, the voice response system 1 updates the classification table using the user's emotion at the time of the input voice and the analysis result of the song. The voice response system 1 updates the classification table each time playback of a song is requested. As the number of times the user requests the voice response system 1 to play songs increases (that is, as the cumulative usage time of the voice response system 1 increases), the classification table comes to reflect the user's preferences more closely.

５－２．動作例２
図１５は、音声応答システム１の動作例２を示す図である。この例において、ユーザは「何か楽しい曲歌って」という入力音声により、歌唱合成を要求する。音声応答システム１は、この入力音声に従って歌唱合成を行う。歌唱合成に際し、音声応答システム１は、分類テーブルを参照する。分類テーブルに記録されている情報を用いて、歌詞及びメロディを生成する。したがって、ユーザの嗜好を反映した楽曲を自動的に作成することができる。 5-2. Operation example 2
FIG. 15 is a diagram showing a second example of operation of the voice response system 1. In this example, the user requests singing synthesis with the input voice "sing me some fun song." The voice response system 1 performs singing synthesis according to this input voice. When synthesizing a song, the voice response system 1 refers to the classification table. Lyrics and melodies are generated using the information recorded in the classification table. Therefore, it is possible to automatically create music that reflects the user's preferences.

５－３．動作例３
図１６は、音声応答システム１の動作例３を示す図である。この例において、ユーザは「今日の天気は？」という入力音声により、気象情報の提供を要求する。この場合、処理部５１０はこの要求に対する回答として、コンテンツ提供部６０のうち気象情報を提供するサーバにアクセスし、今日の天気を示すテキスト（例えば「今日は一日快晴」）を取得する。処理部５１０は、取得したテキストを含む、歌唱合成の要求を歌唱生成部５２２に出力する。歌唱生成部５２２は、この要求に含まれるテキストを歌詞として用いて、歌唱合成を行う。音声応答システム１は、入力音声に対する回答として「今日は一日快晴」にメロディ及び伴奏を付けた歌唱音声を出力する。 5-3. Operation example 3
FIG. 16 is a diagram showing operation example 3 of the voice response system 1. In this example, the user requests weather information with the input voice "What's the weather like today?" In this case, to answer the request, the processing unit 510 accesses the server in the content providing unit 60 that provides weather information and obtains text indicating today's weather (for example, "Clear skies all day today"). The processing unit 510 outputs a singing synthesis request including the obtained text to the singing generation unit 522. The singing generation unit 522 performs singing synthesis using the text included in this request as lyrics. As its answer to the input voice, the voice response system 1 outputs a singing voice in which "Clear skies all day today" is set to a melody and accompaniment.

５－４．動作例４
図１７は、音声応答システム１の動作例４を示す図である。この例において、図示された応答が開始される前に、ユーザは音声応答システム１を２週間、使用し、恋愛の歌をよく再生していた。そのため、分類テーブルには、そのユーザが恋愛の歌が好きであることを示す情報が記録される。音声応答システム１は、「出会いの場所はどこがいい？」や、「季節はいつがいいかな？」など、歌詞生成のヒントとなる情報を得るためにユーザに質問をする。音声応答システム１は、これらの質問に対するユーザの回答を用いて歌詞を生成する。なおこの例において、使用期間がまだ２週間と短いため、音声応答システム１の分類テーブルは、まだユーザの嗜好を十分に反映できておらず、感情との対応付けも十分ではない。そのため、本当はユーザはバラード調の曲が好みであるにも関わらず、それとは異なるロック調の曲を生成したりする。 5-4. Operation example 4
FIG. 17 is a diagram showing a fourth example of operation of the voice response system 1. In this example, the user had been using voice response system 1 for two weeks and had often played romance songs before the illustrated response was initiated. Therefore, information indicating that the user likes love songs is recorded in the classification table. The voice response system 1 asks the user questions such as "Where is the best place to meet?" and "When is the best season?" in order to obtain information that can be a hint for lyrics generation. The voice response system 1 generates lyrics using the user's answers to these questions. In this example, since the period of use is still short, 2 weeks, the classification table of the voice response system 1 has not yet sufficiently reflected the user's preferences, and the correspondence with emotions has not been sufficiently established. Therefore, even though the user actually likes ballad-style songs, a different rock-style song is generated.

５－５．動作例５
図１８は、音声応答システム１の動作例５を示す図である。この例は、動作例３からさらに音声応答システム１の使用を続け、累積使用期間が１月半となった例を示している。動作例３と比較すると分類テーブルはユーザの嗜好をより反映したものとなっており、合成される歌唱はユーザの嗜好に沿ったものになっている。ユーザは、最初は不完全だった音声応答システム１の反応が徐々に自分の嗜好に合うように変化していく体験をすることができる。 5-5. Operation example 5
FIG. 18 is a diagram showing operation example 5 of the voice response system 1. This example shows an example in which the voice response system 1 has been continued to be used since operation example 3, and the cumulative usage period is one and a half months. Compared to Operation Example 3, the classification table reflects the user's preferences more, and the synthesized singing matches the user's preferences. The user can experience that the response of the voice response system 1, which was initially incomplete, gradually changes to match his or her preference.

５－６．動作例６
図１９は、音声応答システム１の動作例６を示す図である。この例において、ユーザは、「ハンバーグのレシピを教えてくれる？」という入力音声により、「ハンバーグ」の「レシピ」のコンテンツの提供を要求する。音声応答システム１は、「レシピ」というコンテンツが、あるステップが終了してから次のステップに進むべきものである点を踏まえ、コンテンツを部分コンテンツに分解し、ユーザの反応に応じて次の処理を決定する態様で再生することを決定する。 5-6. Operation example 6
FIG. 19 is a diagram showing operation example 6 of the voice response system 1. In this example, the user requests the "recipe" content for "hamburger steak" with the input voice "Can you tell me the recipe for hamburger steak?" Because a "recipe" is content in which one step should be finished before proceeding to the next, the voice response system 1 decides to decompose the content into partial contents and to play them in a mode in which the next process is determined according to the user's reaction.

「ハンバーグ」の「レシピ」はステップ毎に分解され、各ステップの歌唱を出力する度に、音声応答システム１は「できましたか？」、「終わりましたか？」等、ユーザの応答を促す音声を出力する。ユーザが「できたよ」、「次は？」等、次のステップの歌唱を指示する入力音声を発すると、音声応答システム１は、それに応答して次のステップの歌唱を出力する。ユーザが「タマネギのみじん切りってどうやるの？」と質問する入力音声を発すると、音声応答システム１は、それに応答して「タマネギのみじん切り」の歌唱を出力する。「タマネギのみじん切り」の歌唱を終えると、音声応答システム１は、「ハンバーグ」の「レシピ」の続きから歌唱を開始する。 The "recipe" for "hamburger steak" is decomposed step by step, and each time it outputs the singing for a step, the voice response system 1 outputs a voice prompting the user to respond, such as "Are you done?" or "Have you finished?" When the user utters an input voice instructing the singing of the next step, such as "Done" or "What's next?", the voice response system 1 responds by outputting the singing for the next step. When the user utters an input voice asking "How do I finely chop an onion?", the voice response system 1 responds by outputting a song about "finely chopping an onion." After finishing that song, the voice response system 1 resumes singing the "recipe" for "hamburger steak" from where it left off.

音声応答システム１は、第１の部分コンテンツの歌唱音声と、それに続く第２の部分コンテンツの歌唱音声との間に、別のコンテンツの歌唱音声を出力してもよい。音声応答システム１は、例えば、第１の部分コンテンツに含まれる文字列が示す事項に応じた時間長となるよう合成された歌唱音声を、第１の部分コンテンツの歌唱音声と第２の部分コンテンツの歌唱音声との間に出力する。具体的には、第１の部分コンテンツが「ここで材料を２０分、煮込みましょう」というように、待ち時間が２０分発生することを示していた場合、音声応答システム１は、材料を煮込んでいる間に流す２０分の歌唱を合成し、出力する。 The voice response system 1 may output the singing voice of another content between the singing voice of the first partial content and the singing voice of the second partial content that follows. For example, the voice response system 1 outputs, between the singing voice of the first partial content and that of the second partial content, a singing voice synthesized so as to have a time length corresponding to the matter indicated by the character string included in the first partial content. Specifically, if the first partial content indicates that a 20-minute wait will occur, as in "Now simmer the ingredients for 20 minutes," the voice response system 1 synthesizes and outputs a 20-minute song to be played while the ingredients simmer.

また、音声応答システム１は、第１の部分コンテンツに含まれる第１文字列が示す事項に応じた第２文字列を用いて合成された歌唱音声を、第１の部分コンテンツの歌唱音声の出力後、第１文字列が示す事項に応じた時間長に応じたタイミングで出力してもよい。具体的には、第１の部分コンテンツが「ここで材料を２０分、煮込みましょう」というように、待ち時間が２０分発生することを示していた場合、音声応答システム１は、「煮込み終了です」（第２文字列の一例）という歌唱音声を、第１の部分コンテンツを出力してから２０分後に出力してもよい。あるいは、第１の部分コンテンツが「ここで材料を２０分、煮込みましょう」である例において、待ち時間の半分（１０分）経過したときに、「煮込み終了まであと１０分です」などとラップ風に歌唱してもよい。 The voice response system 1 may also output a singing voice synthesized using a second character string that corresponds to the matter indicated by a first character string included in the first partial content, at a timing determined by the time length corresponding to that matter, after the singing voice of the first partial content has been output. Specifically, if the first partial content indicates that a 20-minute wait will occur, as in "Now simmer the ingredients for 20 minutes," the voice response system 1 may output the singing voice "Simmering is finished" (an example of the second character string) 20 minutes after outputting the first partial content. Alternatively, in the same example, when half of the waiting time (10 minutes) has elapsed, the voice response system 1 may sing something like "10 minutes left until simmering is finished" in a rap style.
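The timing behaviour in the two paragraphs above can be sketched as follows: a waiting time is read from the step text, and notices are scheduled at the halfway point and at the end. The regular expression, the function names, and the notice wording are assumptions made for illustration; the specification does not say how the waiting time is extracted.

```python
import re
from typing import List, Optional, Tuple


def wait_minutes(step_text: str) -> Optional[int]:
    """Extract a waiting time such as '20 minutes' or '２０分' from a step."""
    # In Python 3, \d also matches full-width digits, and int() accepts them.
    match = re.search(r"(\d+)\s*(?:minutes?|分)", step_text)
    return int(match.group(1)) if match else None


def schedule_notices(step_text: str) -> List[Tuple[float, str]]:
    """Return (minutes after the step was sung, lyric text) pairs."""
    minutes = wait_minutes(step_text)
    if minutes is None:
        return []
    half = minutes / 2
    return [
        (half, f"{half:g} minutes left until simmering is finished"),
        (float(minutes), "Simmering is finished"),
    ]


notices = schedule_notices("Now simmer the ingredients for 20 minutes")
```

Each scheduled text would then be passed to singing synthesis at the indicated time; the same extracted duration could also set the length of a song played during the wait.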

５－７．動作例７
図２０は、音声応答システム１の動作例７を示す図である。この例において、ユーザは、「世界史の年号の暗記問題出してくれる？」という入力音声により、「世界史」の「暗記問題」のコンテンツの提供を要求する。音声応答システム１は、「暗記問題」というコンテンツが、ユーザの記憶を確認するためのものである点を踏まえ、コンテンツを部分コンテンツに分解し、ユーザの反応に応じて次の処理を決定する態様で再生することを決定する。 5-7. Operation example 7
FIG. 20 is a diagram showing operation example 7 of the voice response system 1. In this example, the user requests "memorization questions" content for "world history" with the input voice "Can you give me memorization questions on dates in world history?" Because "memorization questions" are content for checking the user's memory, the voice response system 1 decides to decompose the content into partial contents and to play them in a mode in which the next process is determined according to the user's reaction.

例えば、音声応答システム１は、「卑弥呼にサンキュー（２３９）魏の皇帝」という年号暗記文を、「卑弥呼に」及び「サンキュー魏の皇帝」という２つの部分コンテンツに分解する。音声応答システム１は、「卑弥呼に」という歌唱を出力するとユーザの反応を待つ。ユーザが何か音声を発すると、音声応答システム１は、ユーザが発した音声が正解であるか判断し、その判断結果に応じた音声を出力する。例えば、ユーザが「サンキュー魏の皇帝」という正解の音声を発した場合、音声応答システム１は、「正解です」等の音声を出力する。あるいは、ユーザが「わかりません」等、正解ではない音声を発した場合、音声応答システム１は、「卑弥呼にサンキュー魏の皇帝」という正解の歌唱を出力する。 For example, the voice response system 1 decomposes the date mnemonic "Himiko ni sankyū (239) Gi no kōtei" (roughly "To Himiko, thank you: the Emperor of Wei," in which "ni sankyū" can be read as the digits 2, 3, 9) into two partial contents, "Himiko ni" and "sankyū Gi no kōtei." After outputting the singing "Himiko ni," the voice response system 1 waits for the user's reaction. When the user utters something, the voice response system 1 determines whether the utterance is the correct answer and outputs a voice according to the result. For example, if the user utters the correct answer "sankyū Gi no kōtei," the voice response system 1 outputs a voice such as "That's correct." Alternatively, if the user utters something that is not the correct answer, such as "I don't know," the voice response system 1 sings the correct answer, "Himiko ni sankyū Gi no kōtei."

５－８．動作例８
図２１は、音声応答システム１の動作例８を示す図である。動作例７と同様、ユーザは、「世界史」の「暗記問題」のコンテンツの提供を要求する。音声応答システム１は、「暗記問題」というコンテンツが、ユーザの記憶を確認するためのものである点を踏まえ、このコンテンツの一部を隠して出力する。隠すべき部分は、例えばコンテンツにおいて定義されていてもよいし、処理部５１０すなわちＡＩが形態素解析等の結果に基づいて判断してもよい。 5-8. Operation example 8
FIG. 21 is a diagram showing operation example 8 of the voice response system 1. As in operation example 7, the user requests "memorization questions" content for "world history." Because "memorization questions" are content for checking the user's memory, the voice response system 1 outputs this content with part of it hidden. The part to be hidden may be defined in the content, for example, or may be determined by the processing unit 510, that is, the AI, based on the result of morphological analysis or the like.

例えば、音声応答システム１は、「卑弥呼にサンキュー（２３９）魏の皇帝」という年号暗記文のうち、「にサンキュー」の部分を隠して歌唱する。具体的には、音声応答システム１は、隠す部分を他の音又は文字列（例えばハミング、「ラララ」、ビープ音等）に置換する。置換に用いられる音又は文字列は、置換前とモーラ数又は音節数が同一である音又は文字列である。一例において、音声応答システム１は、「卑弥呼・ラ・ラ・ラ・ラ・ラ・魏の皇帝」という歌唱を出力する。音声応答システム１は、この歌唱を出力するとユーザの反応を待つ。ユーザが何か音声を発すると、音声応答システム１は、ユーザが発した音声が正解であるか判断し、その判断結果に応じた音声を出力する。例えば、ユーザが「卑弥呼にサンキュー魏の皇帝」という音声を発した場合、音声応答システム１は、「正解です」等の音声を出力する。あるいは、ユーザが「わかりません」という音声を発した場合、音声応答システム１は、「卑弥呼にサンキュー魏の皇帝」という正解の歌唱を出力する。 For example, the voice response system 1 sings the date mnemonic "Himiko ni sankyū (239) Gi no kōtei" with the "ni sankyū" part hidden. Specifically, the voice response system 1 replaces the hidden part with another sound or character string (for example, humming, "la la la," or a beep). The sound or character string used for the replacement has the same number of moras or syllables as the part it replaces. In one example, the voice response system 1 outputs the singing "Himiko, la la la la la, Gi no kōtei." After outputting this singing, the voice response system 1 waits for the user's reaction. When the user utters something, the voice response system 1 determines whether the utterance is the correct answer and outputs a voice according to the result. For example, if the user utters "Himiko ni sankyū Gi no kōtei," the voice response system 1 outputs a voice such as "That's correct." Alternatively, if the user utters "I don't know," the voice response system 1 sings the correct answer, "Himiko ni sankyū Gi no kōtei."

また、音声応答システム１は、第１の部分コンテンツに対するユーザの反応に応じて、それに続く第２の部分コンテンツの一部又は全部を他の文字列に置換してもよい。例えば、問題集やクイズのコンテンツにおいて、第１問（第１の部分コンテンツの一例）に正解した場合と不正解だった場合とで、第２問（第２の部分コンテンツの一例）において他の文字列に置換する文字数を変化させてもよい（例えば、第１問が正解だった場合には第２問はより多くの文字を隠し、第１問が不正解だった場合には第２問はより少ない文字を隠す）。 The voice response system 1 may also replace part or all of the second partial content that follows the first partial content with another character string, according to the user's reaction to the first partial content. For example, in question-book or quiz content, the number of characters replaced with another character string in the second question (an example of the second partial content) may be varied depending on whether the first question (an example of the first partial content) was answered correctly (for example, more characters are hidden in the second question if the first was answered correctly, and fewer if it was answered incorrectly).

５－９．動作例９
図２２は、音声応答システム１の動作例９を示す図である。この例において、ユーザは、「工場における工程の手順書を読み上げてくれる？」という入力音声により、「手順書」のコンテンツの提供を要求する。音声応答システム１は、「手順書」というコンテンツが、ユーザの記憶を確認するためのものである点を踏まえ、コンテンツを部分コンテンツに分解し、ユーザの反応に応じて次の処理を決定する態様で再生することを決定する。 5-9. Operation example 9
FIG. 22 is a diagram showing operation example 9 of the voice response system 1. In this example, the user requests "procedure manual" content with the input voice "Can you read out the procedure manual for the factory process?" Because a "procedure manual" is, in this case, content for checking the user's memory, the voice response system 1 decides to decompose the content into partial contents and to play them in a mode in which the next process is determined according to the user's reaction.

例えば、音声応答システム１は、手順書をランダムな位置で区切り、複数の部分コンテンツに分解する。音声応答システム１は、一の部分コンテンツの歌唱を出力すると、ユーザの反応を待つ。例えば「スイッチＡを押した後、メータＢの値が１０以下となったところでスイッチＢを押す」という手順のコンテンツにつき、音声応答システム１が「スイッチＡを押した後」という部分を歌唱し、ユーザの反応を待つ。ユーザが何か音声を発すると、音声応答システム１は、次の部分コンテンツの歌唱を出力する。あるいはこのとき、ユーザが次の部分コンテンツを正しく言えたか否かに応じて、次の部分コンテンツの歌唱のスピードを変更してもよい。具体的には、ユーザが次の部分コンテンツを正しく言えた場合、音声応答システム１は、次の部分コンテンツの歌唱のスピードを上げる。あるいは、ユーザが次の部分コンテンツを正しく言えなかった場合、音声応答システム１は、次の部分コンテンツの歌唱のスピードを下げる。 For example, the voice response system 1 divides the procedure manual at random positions and decomposes it into a plurality of partial contents. After outputting the singing of one partial content, the voice response system 1 waits for the user's reaction. For example, for the procedure "After pressing switch A, press switch B when the value of meter B falls to 10 or less," the voice response system 1 sings the part "After pressing switch A" and waits for the user's reaction. When the user utters something, the voice response system 1 outputs the singing of the next partial content. At this time, the singing speed of the next partial content may be changed according to whether the user was able to say the next partial content correctly. Specifically, if the user said the next partial content correctly, the voice response system 1 increases the singing speed of the next partial content. If the user could not say it correctly, the voice response system 1 decreases the singing speed of the next partial content.
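The speed adjustment described above can be expressed as a one-line update rule. The multiplicative step of 10% and the tempo limits are assumptions chosen for illustration; the specification only says that the speed is raised or lowered.

```python
def next_tempo(current_bpm: float, answered_correctly: bool,
               step: float = 1.1, low: float = 60.0,
               high: float = 180.0) -> float:
    """Raise the tempo after a correct answer, lower it otherwise.

    The result is clamped to [low, high] so repeated answers cannot push
    the singing speed to an unusable value.
    """
    factor = step if answered_correctly else 1 / step
    return max(low, min(high, current_bpm * factor))


faster = next_tempo(100.0, answered_correctly=True)
slower = next_tempo(100.0, answered_correctly=False)
```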

５－１０．動作例１０
図２３は、音声応答システム１の動作例１０を示す図である。動作例１０は、高齢者の認知症対策の動作例である。この例において、ユーザが高齢者であることはあらかじめユーザ登録等により設定されている。音声応答システム１は、例えばユーザの指示に応じて既存の歌を歌い始める。音声応答システム１は、ランダムな位置、又は所定の位置（例えばサビの手前）において歌唱を一時停止する。その際、「うーん分からない」、「忘れちゃった」等のメッセージを発し、あたかも歌詞を忘れたかのように振る舞う。音声応答システム１は、この状態でユーザの応答を待つ。ユーザが何か音声を発すると、音声応答システム１は、ユーザが発した言葉（の一部）を正解の歌詞として、その言葉の続きから歌唱を出力する。なお、ユーザが何か言葉を発した場合、音声応答システム１は「ありがとう」等の応答を出力してもよい。ユーザの応答待ちの状態で所定時間が経過したときは、音声応答システム１は、「思い出した」等の話声を出力し、一時停止した部分の続きから歌唱を再開してもよい。 5-10. Operation example 10
FIG. 23 is a diagram showing operation example 10 of the voice response system 1. Operation example 10 is an example of use as a dementia countermeasure for elderly people. In this example, the fact that the user is elderly has been set in advance through user registration or the like. The voice response system 1 starts singing an existing song, for example in response to the user's instruction. The voice response system 1 pauses the singing at a random position or at a predetermined position (for example, just before the chorus). When it pauses, it utters a message such as "Hmm, I can't remember" or "I forgot," behaving as if it had forgotten the lyrics. The voice response system 1 waits for the user's response in this state. When the user utters something, the voice response system 1 treats (part of) the words uttered by the user as the correct lyrics and outputs the singing from the continuation of those words. Note that when the user utters some words, the voice response system 1 may output a response such as "Thank you." If a predetermined time elapses while it is waiting for the user's response, the voice response system 1 may output speech such as "Now I remember" and resume singing from where it paused.

５－１１．動作例１１
図２４は、音声応答システム１の動作例１１を示す図である。この例において、ユーザは「何か楽しい曲歌って」という入力音声により、歌唱合成を要求する。音声応答システム１は、この入力音声に従って歌唱合成を行う。歌唱合成の際に用いる素片データベースは、例えばユーザ登録時に選択されたキャラクタに応じて選択される（例えば、男性キャラクタが選択された場合、男性歌手による素片データベースが用いられる）。ユーザは、歌の途中で「女性の声に変えて」等、素片データベースの変更を指示する入力音声を発する。音声応答システム１は、ユーザの入力音声に応じて、歌唱合成に用いる素片データベースを切り替える。素片データベースの切り替えは、音声応答システム１が歌唱音声を出力しているときに行われてもよいし、動作例７～１０のように音声応答システム１がユーザの応答待ちの状態のときに行われてもよい。 5-11. Operation example 11
FIG. 24 is a diagram showing operation example 11 of the voice response system 1. In this example, the user requests singing synthesis with the input voice "Sing me something fun." The voice response system 1 performs singing synthesis according to this input voice. The segment database used for the singing synthesis is selected, for example, according to the character selected at the time of user registration (for example, if a male character was selected, a segment database recorded by a male singer is used). Partway through the song, the user utters an input voice instructing a change of segment database, such as "Change to a female voice." The voice response system 1 switches the segment database used for singing synthesis according to the user's input voice. The segment database may be switched while the voice response system 1 is outputting the singing voice, or while the voice response system 1 is waiting for the user's response as in operation examples 7 to 10.

As already explained, the voice response system 1 may have a plurality of segment databases that record phonemes pronounced by a single singer (or speaker) in different singing styles or voice qualities. In that case, for a given phoneme, the voice response system 1 may combine, that is, add together, segments extracted from the plurality of segment databases at a certain ratio (usage ratio). Furthermore, the voice response system 1 may determine this usage ratio according to the user's reaction. Specifically, when two segment databases, one for a normal voice and one for a sweet voice, are recorded for a certain singer, the system raises the usage ratio of the sweet-voice segment database when the user utters the input voice "in a sweeter voice," and raises it further when the user utters "in a much, much sweeter voice."
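The usage ratio can be illustrated with a short sketch. It assumes, purely for illustration, that a segment is a list of sample values, that the two weights sum to 1 so that the blend is (1 - r) * normal + r * sweet, and that each request raises r by a fixed step; the names `raise_usage_ratio` and `blend_segments` and the step of 0.25 are hypothetical. Actual segment-based synthesis would operate on a richer representation.

```python
from typing import List


def raise_usage_ratio(ratio: float, step: float = 0.25) -> float:
    """Raise the sweet-voice usage ratio by one step, capped at 1.0."""
    return min(1.0, ratio + step)


def blend_segments(normal: List[float], sweet: List[float], ratio: float) -> List[float]:
    """Add two segments of the same phoneme, weighting the sweet one by ratio."""
    if len(normal) != len(sweet):
        raise ValueError("segments must have the same length")
    return [(1.0 - ratio) * n + ratio * s for n, s in zip(normal, sweet)]


ratio = 0.0                       # start with the normal voice only
ratio = raise_usage_ratio(ratio)  # user asks for a sweeter voice
ratio = raise_usage_ratio(ratio)  # user asks for an even sweeter voice
print(ratio, blend_segments([0.0, 1.0], [1.0, 0.0], ratio))
```

Capping the ratio at 1.0 is a design choice of this sketch, so that repeated requests end with the sweet-voice database used exclusively.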

6. Modifications
The present invention is not limited to the embodiment described above, and various modifications are possible. Several modifications are described below. Two or more of the following modifications may be used in combination.

In this document, a singing voice means a voice that includes singing in at least part of it, and it may include a part consisting only of accompaniment without singing or a part consisting only of speech. For example, where content is decomposed into a plurality of partial contents, at least one partial content need not include singing. Singing may also include rap or poetry recitation.

In the embodiment, an example was described in which the learning function 51, the singing synthesis function 52, and the response function 53 are related to one another, but each of these functions may be provided on its own. For example, the classification table obtained by the learning function 51 may be used to learn a user's preferences in, for example, a music distribution system that distributes songs. Alternatively, the singing synthesis function 52 may perform singing synthesis using a classification table entered manually by the user instead of the classification table generated by the learning function 51. In addition, at least some of the functional elements of the voice response system 1 may be omitted. For example, the voice response system 1 need not include the emotion estimation unit 512.

The assignment of functions to the input/output device 10, the response engine 20, and the singing synthesis engine 30 is not limited to that illustrated in the embodiment. For example, the voice analysis unit 511 and the emotion estimation unit 512 may be implemented in the input/output device. The relative arrangement of the input/output device 10, the response engine 20, and the singing synthesis engine 30 is likewise not limited to that illustrated in the embodiment. For example, the singing synthesis engine 30 may be placed between the input/output device 10 and the response engine 20 and may perform singing synthesis on those responses output from the response engine 20 that are determined to require it. Further, the content used in the voice response system 1 is not limited to content provided by the content providing unit 60, that is, content existing on a network or in the cloud. The content used in the voice response system 1 may be stored in a local device, such as the input/output device 10 or a device capable of communicating with the input/output device 10.

The hardware configurations of the input/output device 10, the response engine 20, and the singing synthesis engine 30 are not limited to those illustrated in the embodiment. For example, the input/output device 10 may be a computer device having a touch screen and a display, such as a smartphone or a tablet terminal. Accordingly, user input to the voice response system 1 is not limited to voice and may be made via a touch screen, a keyboard, or a pointing device. The input/output device 10 may also have a human presence sensor. In that case, the voice response system 1 may use the human presence sensor to control its operation depending on whether the user is nearby. For example, when it is determined that the user is not near the input/output device 10, the voice response system 1 may refrain from outputting voice (that is, not reply in the dialogue). However, depending on the content of the voice to be output, the voice response system 1 may output it regardless of whether the user is near the input/output device 10. For example, the voice announcing the remaining waiting time, as described in the latter half of operation example 6, may be output regardless of whether the user is near the input/output device 10. To detect whether the user is near the input/output device 10 even when the user moves very little, a sensor other than the human presence sensor, such as a camera or a temperature sensor, may be used, or a plurality of sensors may be used in combination.
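The presence-based control described above might look like the following sketch. The function names, the dictionary of per-sensor detections, and the rule that any single positive detection counts as the user being nearby are assumptions made for illustration; the embodiment does not specify how multiple sensors are combined.

```python
from typing import Mapping


def user_is_nearby(detections: Mapping[str, bool]) -> bool:
    """Combine several sensors: any positive detection counts as nearby.

    The keys (for example "presence", "camera", "temperature") are
    hypothetical sensor names.
    """
    return any(detections.values())


def should_output(nearby: bool, output_regardless: bool = False) -> bool:
    """Return True if the voice should be output.

    output_regardless marks voices, such as a remaining-wait-time
    announcement, that are output whether or not the user is nearby.
    """
    return nearby or output_regardless


nearby = user_is_nearby({"presence": False, "camera": True})
print(should_output(nearby))
print(should_output(False, output_regardless=True))
```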

The flowcharts and sequence charts illustrated in the embodiment are merely examples, and the operation of the voice response system 1 is not limited to them. In the flowcharts or sequence charts illustrated in the embodiment, the order of processes may be changed, some processes may be omitted, or new processes may be added.

The programs executed in the input/output device 10, the response engine 20, and the singing synthesis engine 30 may be provided stored on a recording medium such as a CD-ROM or a semiconductor memory, or may be provided by download via a network such as the Internet.

1... voice response system, 10... input/output device, 20... response engine, 30... singing synthesis engine, 51... learning function, 52... singing synthesis function, 53... response function, 60... content providing unit, 101... microphone, 102... input signal processing unit, 103... output signal processing unit, 104... speaker, 105... CPU, 106... sensor, 107... motor, 108... network IF, 201... CPU, 202... memory, 203... storage, 204... communication IF, 301... CPU, 302... memory, 303... storage, 304... communication IF, 510... processing unit, 511... voice analysis unit, 512... emotion estimation unit, 513... music analysis unit, 514... lyrics extraction unit, 515... preference analysis unit, 516... storage unit, 521... detection unit, 522... singing generation unit, 523... accompaniment generation unit, 524... synthesis unit, 5221... melody generation unit, 5222... lyrics generation unit, 531... content decomposition unit, 532... content correction unit

Claims

1. An information processing device comprising:
a first acquisition means for acquiring an input voice that indicates a user's request and is input via an input/output device;
a second acquisition means for acquiring, from a server, a text indicating an answer to the request based on the input voice;
a singing synthesis means for synthesizing a singing voice using the text as lyrics and using a melody according to preference information that indicates the user's preferences regarding the characteristics, attributes, and lyrics of songs, the preference information being generated from input voices of the user that were input via the input/output device during a certain past period and that instructed playback of a song from the input/output device; and
an output means for outputting the singing voice to the input/output device,
wherein, when the input voice acquired by the first acquisition means requests singing synthesis, the singing synthesis means synthesizes the singing voice using a melody according to the preference information generated using an emotion estimated from that input voice.

2. The information processing device according to claim 1, wherein the second acquisition means acquires, from the server, a text of non-music information indicating an answer to the request based on the input voice.

3. The information processing device according to claim 1 or 2, wherein the second acquisition means acquires, as the text, a search result obtained using the input voice as a search key.

4. The information processing device according to claim 3, further comprising a decomposition means for decomposing the text into a plurality of partial contents including a first partial content and a second partial content,
wherein the output means outputs the singing voice of the first partial content, waits for a reaction from the user, and then outputs the singing voice of the second partial content.

5. The information processing device according to claim 4, wherein the singing synthesis means synthesizes the singing voice of the first partial content and then synthesizes a singing voice that prompts the user to react to the first partial content, and
the output means outputs the singing voice that prompts the user's reaction after outputting the singing voice of the first partial content.

6. The information processing device according to any one of claims 1 to 5, wherein, when synthesizing the singing voice, the singing synthesis means corrects one of the lyrics and the melody so that it matches the other.

7. The information processing device according to claim 6, wherein the singing synthesis means performs the correction so that the number of sounds in the lyrics matches the number of sounds in the melody.

8. A singing voice output method comprising:
acquiring an input voice that indicates a user's request and is input via an input/output device;
acquiring, from another server, a text indicating an answer to the request based on the input voice;
synthesizing a singing voice using the text as lyrics and using a melody according to preference information that indicates the user's preferences regarding the characteristics, attributes, and lyrics of songs, the preference information being generated from input voices of the user that were input via the input/output device during a certain past period and that instructed playback of a song from the input/output device; and
outputting the singing voice to the input/output device,
wherein, when the acquired input voice requests singing synthesis, the singing voice is synthesized, in the synthesizing, using a melody according to the preference information generated using an emotion estimated from that input voice.

9. A program for causing a computer to execute processing comprising:
acquiring an input voice that indicates a user's request and is input via an input/output device;
acquiring, from another server, a text indicating an answer to the request based on the input voice;
synthesizing a singing voice using the text as lyrics and using a melody according to preference information that indicates the user's preferences regarding the characteristics, attributes, and lyrics of songs, the preference information being generated from input voices of the user that were input via the input/output device during a certain past period and that instructed playback of a song from the input/output device; and
outputting the singing voice to the input/output device,
wherein, when the acquired input voice requests singing synthesis, the singing voice is synthesized, in the synthesizing, using a melody according to the preference information generated using an emotion estimated from that input voice.