JP5954221B2

JP5954221B2 - Sound source identification system and sound source identification method

Info

Publication number: JP5954221B2
Application number: JP2013039583A
Authority: JP
Inventors: 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2013-02-28
Filing date: 2013-02-28
Publication date: 2016-07-20
Anticipated expiration: 2033-02-28
Also published as: JP2014167556A

Description

The present invention relates to a sound source identification system and a sound source identification method for identifying sound source data suited to generating synthesized speech from text data.

Speech synthesizers that read out input text data using well-known speech synthesis techniques are conventionally known (see Patent Document 1).
The speech synthesizer described in Patent Document 1 analyzes the text represented by the input text data and derives attribute information as the analysis result. It then collates that attribute information against the attribute information associated in advance with prosodic parameters, and performs speech synthesis using the prosodic parameters associated with the attribute information whose similarity is equal to or higher than a reference value.

The attribute information described in Patent Document 1 is information representing the structure of a sentence, such as the number of morae, the accent type, and the part of speech.

Patent Document 1: JP 2000-056788 A

A speech synthesizer is expected to give the synthesized speech that reads out a text an expression suited to the content of that text.
However, the speech synthesizer described in Patent Document 1 selects the prosodic data used for speech synthesis according to attribute information that represents sentence structure. As a result, the synthesized speech it produces when reading out a text is not given an expression suited to that text.

In other words, with the conventional technology it is difficult, when outputting synthesized speech that reads out text data, to identify sound source data (voice parameters) capable of giving that synthesized speech an appropriate expression.

An object of the present invention is therefore to identify sound source data (voice parameters) capable of giving an appropriate expression to the synthesized speech produced when text data is read out by speech synthesis.

The sound source identification system of the present invention, made to achieve the above object, includes text acquisition means, text analysis means, sound source analysis means, matching means, and information presentation means.
In this system, the text acquisition means acquires text data representing the character string that constitutes a designated text. The text analysis means analyzes the text represented by the acquired text data and derives a text expression distribution representing the degree of distribution of each type of expression that appears in the text.

The sound source analysis means acquires and analyzes each piece of sound source data from a storage device. Sound source data associates, for each utterance, at least one voice parameter of the sound uttered for a prescribed content sentence (a sentence defined so that several types of expression appear in its content) with expression data representing each expression with which the prescribed content sentence was uttered. For each piece of sound source data, the sound source analysis means derives a sound source expression distribution representing the degree of distribution of each type of expression that appears in the voice represented by the voice parameters contained in that sound source data.

Further, the matching means collates the text expression distribution derived by the text analysis means with each sound source expression distribution derived by the sound source analysis means and derives a correlation value for each pair. The information presentation means presents the sound source data corresponding to the highest of the correlation values derived by the matching means.

The sound source identification system of the present invention can identify the sound source data whose sound source expression distribution has the highest correlation value with the text expression distribution. That sound source data contains voice parameters whose expression distribution most closely matches the expression distribution appearing in the text data.

Therefore, when synthesized speech that reads out text data is output, the system can identify sound source data (voice parameters) capable of giving that synthesized speech an appropriate expression.

In the present invention, "expression" is a concept that includes at least emotion, sentiment, scene, and situation.
The sound source identification system of the present invention may further include content information acquisition means, waveform acquisition means, parameter derivation means, expression data generation means, and sound source data registration means.

In this case, the content information acquisition means acquires a prescribed content sentence representing the character string of a text whose content contains several types of expression. The waveform acquisition means acquires a target waveform, which is a speech waveform uttered for the character string represented by the specific content information, that is, the prescribed content sentence acquired by the content information acquisition means. The parameter derivation means derives voice parameters from the target waveform, and the expression data generation means estimates, from the specific content information, the expression that appears in the target waveform and generates the estimation result as expression data.

The sound source data registration means then generates sound source data by associating the voice parameters derived by the parameter derivation means with the expression data generated by the expression data generation means, and stores the sound source data in the storage device.

Such a sound source identification system can generate sound source data from utterance content information and the speech waveform uttered for that utterance content information.
That is, by deriving voice parameters from the target waveforms obtained when many people utter the character string represented by the utterance content information, the system can derive voice parameters for a wide variety of speakers.

As a result, the types of voice parameters can be diversified, so that synthesized speech reading out text data can draw on a wide range of sound source data and be given a more appropriate expression.

The present invention may also be implemented as a sound source identification method, that is, a method of identifying sound source data.
In this case, the sound source identification method may include: a text acquisition step of causing a computer to acquire text data; a text analysis step of causing the computer to analyze the text represented by the text data and derive a text expression distribution; a sound source analysis step of causing the computer to acquire and analyze each piece of sound source data from a storage device that stores sound source data associating voice parameters with expression data for each utterance, and to derive a sound source expression distribution; a matching step of causing the computer to collate the text expression distribution with each sound source expression distribution and derive a correlation value for each pair; and an information presentation step of causing the computer to present the sound source data corresponding to the highest of the correlation values.

Executing such a sound source identification method provides the same effects as the sound source identification system according to claim 1.

FIG. 1 is a block diagram showing the schematic configuration of the sound source identification system.
FIG. 2 is a flowchart showing the procedure of the sound source data registration process.
FIG. 3 is a flowchart showing the procedure of the sound source identification process.
FIG. 4 is an explanatory diagram outlining the sound source identification process.
FIG. 5 is an explanatory diagram outlining the sound source identification process.

Embodiments of the present invention are described below with reference to the drawings.
<Speech synthesis system>
The speech synthesis system 1 shown in FIG. 1 outputs the content of text data WT designated by a user as synthesized speech with characteristics designated by the user. It includes at least one information processing server 10 and at least one audio output terminal 60.

In the speech synthesis system 1, the information processing server 10 analyzes the text data WT designated by the user of the audio output terminal 60 and, at minimum, extracts and presents, from among multiple pieces of sound source data SD registered in advance, the sound source data SD that matches the user's wishes. The audio output terminal 60 then performs speech synthesis based on that sound source data SD and outputs synthesized speech whose content corresponds to the designated text data WT.

That is, the speech synthesis system 1 functions as the sound source identification system of the present invention.
<Audio output terminal>
The audio output terminal 60 includes a communication unit 61, an information receiving unit 62, a display unit 63, a sound input unit 64, a sound output unit 65, a storage unit 66, and a control unit 70. The audio output terminal 60 in this embodiment may be, for example, a well-known portable terminal or a well-known information processing device such as a personal computer. Portable terminals here include well-known e-book readers and portable information terminals such as mobile phones and tablets.

The communication unit 61 allows the audio output terminal 60 to exchange information with external devices via a communication network. The information receiving unit 62 receives information entered through an input device (not shown). The display unit 63 displays images based on signals from the control unit 70.

The sound input unit 64 is a device, such as a microphone, that converts sound into an electric signal and inputs the signal to the control unit 70. The sound output unit 65 is a well-known device that outputs sound and includes, for example, a PCM sound source and a speaker. The storage unit 66 is a readable and writable non-volatile storage device that stores various processing programs and data.

The control unit 70 is built around a well-known computer having at least a ROM 72, a RAM 74, and a CPU 76.
That is, each audio output terminal 60 acquires from the information processing server 10 the text data WT designated by its user and the sound source data SD suited to that text data WT, and performs speech synthesis. Through that speech synthesis it generates and outputs synthesized speech representing the content of the text data WT.
<Information processing server>
The information processing server 10 includes a communication unit 12, a control unit 20, and a storage unit 30. It is a server that stores at least text data WT, which represents the character strings constituting a text, and sound source data SD, which contains at least voice feature quantities of speech input in advance.

The communication unit 12 allows the information processing server 10 to communicate with external devices via a communication network. The communication network in this embodiment is, for example, a public wireless communication network or a network line.
The control unit 20 is built around a well-known computer having at least a ROM 22, which stores processing programs and data that must be retained even when the power is turned off; a RAM 24, which temporarily stores processing programs and data; and a CPU 26, which executes various processes according to the processing programs stored in the ROM 22 and the RAM 24. The control unit 20 controls the communication unit 12 and the storage unit 30.

The storage unit 30 is a readable and writable non-volatile storage device, for example a hard disk drive or flash memory. The storage unit 30 stores the text data WT and the sound source data SD.

The text data WT here is, for example, a book converted into text data, prepared in advance for each book. The books in question are, for example, novels.
The sound source data SD associates a voice parameter PV_j with tag data (expression data) TG_j for each sound source j.

The voice parameter PV is at least one feature quantity representing the waveform of a sound uttered by a person. These feature quantities are the speech features used in so-called formant synthesis and are prepared for each speaker and each phoneme. The voice parameter PV includes at least the fundamental frequency F0, the mel-frequency cepstrum (MFCC), the phoneme length, and the power of each phoneme in the uttered speech, together with their time differences.

The tag data TG represents the nature of the sound represented by the voice parameter PV. It includes at least speaker feature data, which represents the characteristics of the speaker, and expression data, which represents the speaker's expression when the speech was uttered. The speaker feature data includes, for example, the speaker's sex and age. The expression data represents expression as a concept that includes at least emotion, sentiment, scene, and situation, and may include information needed to estimate the speaker's expression.

The sound source data SD, in which these voice parameters PV and tag data TG are associated, is generated when the control unit 20 executes the sound source data registration process, and is stored in the storage unit 30.
<Sound source data registration process>
When started, the sound source data registration process acquires, as shown in FIG. 2, text data WT representing the character string of a prescribed content sentence, defined in advance as a sentence whose content contains several types of expression (S110). The text data WT acquired in S110 is referred to below as the utterance content text data.

Next, one piece of speech waveform data corresponding to the utterance content text data acquired in S110 is acquired (S120). The speech waveform data represents a speech waveform uttered in advance for the prescribed content sentence represented by the utterance content text data; such waveforms are recorded in advance from a variety of people.

Voice parameters SV are then derived from each piece of speech waveform data acquired in S120 (S130). In S130 of this embodiment, the fundamental frequency, the mel-frequency cepstrum (MFCC), the power, and their time differences are each derived as voice parameters SV. Methods for deriving the fundamental frequency, MFCC, and power are well known, so a detailed description is omitted here. The fundamental frequency can be derived, for example, by autocorrelation along the time axis, by autocorrelation of the frequency spectrum, or by the cepstrum method. The MFCC can be derived by performing frequency analysis (for example, an FFT) on each time analysis window, taking the logarithm of the magnitude at each frequency, and performing frequency analysis again on the result. The power can be derived by integrating the squared amplitude over the time analysis window.
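As an illustration of two of the derivations named for S130, the following is a minimal Python sketch, not the embodiment's implementation. It approximates the power as a discrete sum of squared samples over one analysis window and estimates the fundamental frequency by time-domain autocorrelation. The function names, the pitch search range (50 to 500 Hz), and the test signal are assumptions introduced here.

```python
import math

def frame_power(frame):
    """Sum of squared amplitudes over one analysis window
    (a discrete stand-in for the time integral)."""
    return sum(x * x for x in frame)

def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=500.0):
    """Return sample_rate / lag for the lag with the largest
    time-domain autocorrelation inside the assumed pitch range."""
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

# Usage: one window of a 200 Hz sine sampled at 8 kHz (synthetic test signal).
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
f0 = estimate_f0(frame, sr)
power = frame_power(frame)
```

The unnormalized autocorrelation used here slightly favors shorter lags, which is enough for a clean periodic signal; real speech would normally call for windowing and normalization.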

Next, the sound source data registration process executes an expression data estimation process (S140). This process estimates the expression conveyed by the speech waveform data from the result of analyzing the utterance content text data acquired in S110.

To analyze the utterance content text data, the corresponding text is first subjected to morphological analysis to identify its words, and word expression information is acquired for each word. Word expression information associates each word in advance with the expression that the word conveys, and is stored in advance in a word expression database. Based on the acquired word expression information, the frequency of appearance of each expression is then tallied, and the expression with the highest frequency is taken as the estimate of the expression conveyed by the speech waveform data.
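The tally described for S140 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the word list is taken as already produced by morphological analysis, and the dictionary `WORD_EXPRESSION_DB`, its entries, and the `neutral` fallback are hypothetical stand-ins for the word expression database.

```python
from collections import Counter

# Hypothetical word expression database: word -> expression label.
WORD_EXPRESSION_DB = {
    "laughed": "joy",
    "smiled": "joy",
    "cried": "sadness",
    "shouted": "anger",
}

def estimate_expression(words, db=WORD_EXPRESSION_DB, default="neutral"):
    """Tally the expression labels of the known words and return the
    most frequent one; fall back to `default` if no word is in the db."""
    counts = Counter(db[w] for w in words if w in db)
    if not counts:
        return default
    return counts.most_common(1)[0][0]

# Usage: words as they might come out of a morphological analyzer.
words = ["she", "smiled", "and", "laughed", "then", "cried"]
label = estimate_expression(words)
```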

Next, voice parameter registration is executed (S150): the voice parameters SV derived in S130 are associated with the expression data estimated in S140 for each corresponding piece of speech waveform data, and the resulting sound source data SD is stored in the storage unit 30. In S150 of this embodiment, the data associated with the voice parameters SV includes, in addition to the expression data, the content (type) of the uttered sentence, a speaker ID, and speaker feature data; that is, it is the tag data TG. The speaker ID and speaker feature data may be obtained from the information used to log in to the information processing server 10, the audio output terminal 60, or other terminals.
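One possible in-memory shape for the record registered in S150 is sketched below in Python. The class names, field names, and the list standing in for the storage unit 30 are assumptions made for illustration; the text specifies only which pieces of information are associated with one another.

```python
from dataclasses import dataclass

@dataclass
class TagData:
    speaker_id: str
    speaker_features: dict   # e.g. {"sex": "female", "age": 30}
    expression: str          # expression data estimated in S140
    sentence_type: str       # content (type) of the uttered sentence

@dataclass
class SoundSourceData:
    voice_params: dict       # e.g. {"f0": ..., "power": ...} from S130
    tag: TagData

def register_sound_source(storage, voice_params, tag):
    """Associate voice parameters with tag data and store the record."""
    record = SoundSourceData(voice_params=voice_params, tag=tag)
    storage.append(record)
    return record

# Usage with made-up values; the list stands in for storage unit 30.
storage = []
tag = TagData("spk001", {"sex": "female", "age": 30}, "joy", "prescribed sentence A")
record = register_sound_source(storage, {"f0": 200.0, "power": 200.0}, tag)
```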

The voice parameter registration process then ends.
That is, the voice parameter registration process of this embodiment analyzes one piece of speech waveform data uttered for the sentence represented by the utterance content text data and derives the voice parameters SV. It also analyzes the sentence represented by the utterance content text data and derives expression data representing the expression conveyed by those voice parameters SV.

The voice parameter registration process then generates the sound source data SD by associating the corresponding voice parameters SV and expression data, and stores the sound source data SD in the storage unit 30. The storage unit 30 thus holds sound source data SD created for each voice uttered for the prescribed content sentence.
<Sound source identification process>
Next, the sound source identification process executed by the control unit 20 of the information processing server 10 is described.

When started, the sound source identification process acquires, as shown in FIG. 3, text designation information indicating the text data WT designated at the audio output terminal 60 (S310). It then acquires from the storage unit 30 the text data WT corresponding to that text designation information (hereinafter, "designated text data WT") (S320). As shown in FIG. 4(A), the designated text data WT acquired in S320 is the character string of the text itself, that is, text data.

The designated text data WT acquired in S320 is then subjected to text analysis to generate speaker-text correspondence data, which associates each character i appearing in the text represented by the designated text data WT with utterance texts representing what that character should utter (S330). The characters i here include both speakers and the narrator. For dialogue, the speaker who utters the line in the text is identified as the character i; for narrative passages, the narrator is identified as the character i.

Specifically, in S330 the designated text data WT acquired in S310 is first divided at the punctuation marks and brackets in the text, as shown in FIG. 4(B), into utterance texts, which are the unit sections that make up the text. Morphological analysis and dependency analysis are then performed on each utterance text to identify the character i who should utter that unit section. Finally, each utterance text is associated with its character i to generate speaker-text correspondence data that pairs characters i ("speaker" in the figure) with utterance texts ("text" in the figure), as shown in FIG. 4(C).
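The splitting step of S330 can be illustrated with the simplified Python sketch below. It only separates bracketed dialogue from narration and cuts at sentence-ending marks. It does not attempt the morphological and dependency analysis that assigns a named speaker to each line, and it assumes balanced 「」 brackets. The function name, the choice of delimiters, and the sample sentence are assumptions introduced here.

```python
import re

def split_utterances(text):
    """Split text into (kind, utterance) pairs: spans inside 「...」 are
    labelled dialogue, everything else narration."""
    units = []
    # re.split alternates between text outside and inside the brackets.
    for index, chunk in enumerate(re.split(r"[「」]", text)):
        kind = "dialogue" if index % 2 == 1 else "narration"
        # Cut each chunk at sentence-ending marks, keeping the mark.
        for sentence in re.findall(r"[^。！？]+[。！？]?", chunk):
            sentence = sentence.strip()
            if sentence:
                units.append((kind, sentence))
    return units

# Usage with a made-up sample sentence.
sample = "太郎は笑った。「おはよう。元気？」と花子が言った。"
units = split_utterances(sample)
for kind, sentence in units:
    print(kind, sentence)
```

In a fuller version, each dialogue unit would then be passed to the dependency analysis to resolve which character speaks it, and narration units would be assigned to the narrator.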

Well-known methods may be used for morphological analysis and dependency analysis. For example, "MeCab" may be used for morphological analysis, and "Cabocha" (Taku Kudo and Yuji Matsumoto, "Japanese Dependency Analysis Using Cascaded Chunking", IPSJ Journal, 43(6), 1834-1842 (2001)) may be used for dependency analysis.

Returning to the sound source identification process, the utterance texts associated with each character i are analyzed, based on the speaker-text correspondence data, to identify the expression that appears in each utterance text (S340). The analysis in S340 may be performed by using the word expression information described above to obtain the expression conveyed by each word in the utterance text.

Next, the results of the expression analysis in S340 are tallied for each character i in the text represented by the designated text data WT, and a text expression distribution tpd(i, k) representing the distribution of expressions for each character i is derived (S350). As shown in FIG. 5(A), the text expression distribution tpd(i, k) derived in S350 represents the distribution of the strength of each expression, with each expression that character i should convey taken as an item k.
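The per-character tally of S350 might look like the following Python sketch. The text does not state how the strength of each expression is scaled, so normalizing the counts to relative frequencies is an assumption made here, as are the function name and the sample labels.

```python
from collections import Counter, defaultdict

def text_expression_distribution(labelled_utterances, expressions):
    """Count expression labels per character and normalise them to
    relative frequencies, giving tpd[i] as a list indexed by k."""
    counts = defaultdict(Counter)
    for character, expression in labelled_utterances:
        counts[character][expression] += 1
    tpd = {}
    for character, counter in counts.items():
        total = sum(counter.values())
        tpd[character] = [counter[k] / total for k in expressions]
    return tpd

# Usage: (character i, expression found in S340) pairs, made up here.
expressions = ["joy", "sadness", "anger", "neutral"]
labelled = [
    ("narrator", "neutral"),
    ("Hanako", "joy"),
    ("Hanako", "joy"),
    ("Hanako", "sadness"),
]
tpd = text_expression_distribution(labelled, expressions)
```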

Further, a sound source expression distribution vpd(j, k), representing the distribution of expressions in each piece of sound source data SD, is derived from the sound source data SD stored in the storage unit 30 (S360). Specifically, in S360 the voice parameter sp_n(j) associated with expression data whose expression is neutral is taken as the reference, and the vector from that reference to each voice parameter sp_e(j) associated with expression data for expression k is derived for each sound source j as the sound source expression distribution vpd(j, k), using equation (1) below.

As shown in FIG. 5B, the sound source expression distribution vpd derived for each sound source j by equation (1) represents the distribution of the strength of each expression, where each expression appearing in the specified content sentence is an item k.
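Equation (1) itself is not reproduced in this text, so the following is only one possible reading of the description of S360: the difference vector from the neutral voice parameter to each expressive voice parameter, with its Euclidean length taken as the scalar strength of that expression. The use of the Euclidean length, the flat list representation of a voice parameter, and the function name are all assumptions.

```python
import math


def sound_source_expression_distribution(sp_n, sp_e):
    """Derive a per-expression vector and strength for one sound source j.

    sp_n: neutral voice parameter vector (list of floats).
    sp_e: {expression k: voice parameter vector for that expression}.
    Returns {k: (difference vector, Euclidean length of that vector)}.
    """
    vpd = {}
    for k, params in sp_e.items():
        # Vector from the neutral reference to the parameter for expression k.
        diff = [e - n for e, n in zip(params, sp_n)]
        # Assumption: the vector length serves as the strength of expression k.
        vpd[k] = (diff, math.sqrt(sum(d * d for d in diff)))
    return vpd
```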

Next, the text expression distribution tpd(i, k) for each character i derived in S350 is compared against each of the sound source expression distributions vpd(j, k) derived in S360, and a correlation value cor(i, j) is derived (S370). The correlation value cor(i, j) in S370 is derived according to equations (2) and (3) below.

In equations (2) and (3), TP and VP are the arithmetic means of the text expression distribution tpd and the sound source expression distribution vpd, respectively, taken over the items k of the expressions that appear, and kmax is the number of expressions.
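Equations (2) and (3) are not reproduced in this text. Since they are described in terms of the arithmetic means TP and VP and the number of expressions kmax, a mean-centered (Pearson-style) correlation is a plausible stand-in, sketched below. This choice is an assumption rather than the patent's confirmed formula, and the function name is hypothetical.

```python
import math


def correlation(tpd_i, vpd_j):
    """Mean-centered correlation between two lists indexed by expression k.

    tpd_i: strengths of each expression k for character i.
    vpd_j: strengths of each expression k for sound source j (same order).
    """
    kmax = len(tpd_i)
    tp = sum(tpd_i) / kmax  # arithmetic mean TP of the text distribution
    vp = sum(vpd_j) / kmax  # arithmetic mean VP of the sound source distribution
    cov = sum((t - tp) * (v - vp) for t, v in zip(tpd_i, vpd_j))
    norm = math.sqrt(
        sum((t - tp) ** 2 for t in tpd_i) * sum((v - vp) ** 2 for v in vpd_j)
    )
    # Assumption: a flat distribution has no defined correlation, so return 0.0.
    return cov / norm if norm else 0.0
```

Subtracting the means and dividing by the norms makes the value insensitive to the overall scale of either distribution, which matches the normalization mentioned for this step.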

Further, the sound source data SD for which the correlation value cor(i, j) derived in S370 is largest is presented for each character i (S380). The presentation in S380 may consist of outputting, via the communication unit 12, the sound source data SD with the largest correlation value cor(i, j) for each character i to the display unit 63 of the audio output terminal 60.
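The selection in S380 amounts to an argmax over sound sources for each character. A minimal sketch follows, assuming the correlation values are held in a nested dictionary; the data layout and the function name are hypothetical.

```python
def best_sound_source(cor):
    """Pick the sound source with the largest correlation for each character.

    cor: {character i: {sound source j: cor(i, j)}}.
    Returns {character i: sound source j with the largest cor(i, j)}.
    """
    return {i: max(row, key=row.get) for i, row in cor.items()}
```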

The sound source identification process then ends.
In other words, in the sound source identification process, the text represented by the designated text data WT is analyzed, and a text expression distribution tpd(i, k), representing the degree of distribution of each type of expression appearing in that text, is derived for each character i. Each item of sound source data SD stored in the storage unit 30 is then acquired and analyzed, and for each item of sound source data SD, a sound source expression distribution vpd(j, k) is derived that represents the degree of distribution of each type of expression appearing in the voice represented by the voice parameter PV included in that sound source data SD.

Further, in the sound source identification process, the text expression distribution tpd(i, k) derived for each character i is compared against each of the sound source expression distributions vpd(j, k) to derive the correlation value cor(i, j) between the two, and the sound source data SD with the highest correlation value cor(i, j) is presented.
[Effects of the Embodiment]
As described above, the speech synthesis system 1 can identify and present the sound source data SD corresponding to the sound source expression distribution vpd(j, k) whose correlation value cor(i, j) with the text expression distribution tpd(i, k) is largest. The sound source data SD with the largest correlation value cor(i, j) contains the voice parameter PV whose expression distribution most closely matches the expression distribution appearing in the text data WT.

Therefore, when outputting synthesized speech that reads out the text data WT by speech synthesis, the speech synthesis system 1 can identify sound source data SD capable of giving an appropriate expression to the synthesized speech of that text data WT.

Moreover, the speech synthesis system 1 can generate the sound source data SD on the basis of the utterance content text data and the speech waveforms uttered for that utterance content text data. By deriving the voice parameters PV from the target waveforms obtained by having many people utter the character string represented by the utterance content text data, the speech synthesis system 1 can derive the voice parameters PV of a wide variety of speakers.

As a result, because the speech synthesis system 1 can diversify the types of voice parameter PV, a selection can be made from a wide variety of sound source data SD when outputting synthesized speech that reads out the text data WT, and a more appropriate expression can be given.
[Other Embodiments]
An embodiment of the present invention has been described above; however, the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

That is, a form in which part of the configuration of the above embodiment is omitted, to the extent that the problem can still be solved, is also an embodiment of the present invention. A form configured by appropriately combining the above embodiment and its modifications is also an embodiment of the present invention. Furthermore, any form conceivable without departing from the essence of the invention specified by the wording of the claims is also an embodiment of the present invention.

For example, in S370 of the sound source identification process of the above embodiment, the sound source data SD with the largest correlation value cor(i, j) is presented; however, the sound source data SD presented in S370 is not limited to this. That is, for each character i, the sound source data SD whose correlation values cor(i, j) rank from the largest down to a prescribed number (for example, 5) may be presented.
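This variation can be sketched as a ranked selection for one character. The function name, the dictionary layout, and the default of 5 (taken from the example in the text) are illustrative assumptions.

```python
def top_sound_sources(cor_row, n=5):
    """Return up to n sound sources in descending order of correlation.

    cor_row: {sound source j: cor(i, j)} for one character i.
    """
    return sorted(cor_row, key=cor_row.get, reverse=True)[:n]
```

With n=1 this reduces to presenting only the sound source data with the largest correlation value.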

In presenting the sound source data SD in S370, the items to be output may also be determined according to the gender of character i. That is, it is preferable to present sound source data SD whose gender in the speaker characteristic data is male if character i is male, and sound source data SD whose gender in the speaker characteristic data is female if character i is female.
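The gender-based narrowing can be sketched as a filter applied before the best candidate is chosen. The `source_gender` mapping stands in for the gender field of the speaker characteristic data; its layout, the string labels, and the handling of an undetermined gender are assumptions.

```python
def filter_by_gender(cor_row, source_gender, character_gender):
    """Keep only sound sources whose registered gender matches the character's.

    cor_row: {sound source j: cor(i, j)} for one character i.
    source_gender: {sound source j: "male" or "female"}.
    character_gender: "male", "female", or None when it cannot be determined.
    """
    if character_gender is None:
        # Assumption: when the gender is unknown, nothing is filtered out.
        return dict(cor_row)
    return {
        j: c for j, c in cor_row.items() if source_gender.get(j) == character_gender
    }
```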

When the output is narrowed by gender in this way, the gender of character i may be determined from the gender implied by the proper noun that is the character's name, or from a pronoun referring to character i.
In S360 of the sound source identification process of the above embodiment, the text expression distribution tpd(i, k) and the sound source expression distribution vpd(j, k) are normalized when the correlation value cor is derived; however, each normalization may instead be performed when the text expression distribution tpd(i, k) and the sound source expression distribution vpd(j, k) are themselves derived.
[Correspondence between the Embodiment and the Claims]
Finally, the relationship between the description of the above embodiment and the recitations of the claims is explained.

The function obtained by executing S310 and S320 in the sound source identification process of the above embodiment corresponds to the sentence acquisition means recited in the claims, and S330 to S350 correspond to the sentence analysis means recited in the claims. Further, S360 in the sound source identification process of the above embodiment corresponds to the sound source analysis means recited in the claims, S370 corresponds to the matching means, and S380 corresponds to the information presenting means.

In addition, the function obtained by executing S110 in the sound source registration process of the above embodiment corresponds to the content information acquisition means recited in the claims, and S120 corresponds to the waveform acquisition means. The function obtained by executing S130 corresponds to the parameter deriving means recited in the claims, S140 corresponds to the expression data generating means, and S150 corresponds to the sound source data registration means.

Description of Reference Signs: 1 ... speech synthesis system; 10 ... information processing server; 12 ... communication unit; 20 ... control unit; 22 ... ROM; 24 ... RAM; 26 ... CPU; 30 ... storage unit; 60 ... audio output terminal; 61 ... communication unit; 62 ... information reception unit; 63 ... display unit; 64 ... sound input unit; 65 ... sound output unit; 66 ... storage unit; 70 ... control unit; 72 ... ROM; 74 ... RAM; 76 ... CPU

Claims

A sound source specifying system comprising:
sentence acquisition means for acquiring sentence data representing a character string constituting a designated sentence;
sentence analysis means for analyzing the sentence represented by the sentence data acquired by the sentence acquisition means and deriving a text expression distribution representing the degree of distribution of each type of expression appearing in the sentence;
sound source analysis means for acquiring and analyzing each item of sound source data from a storage device that stores sound source data, the sound source data being data in which at least one voice parameter of a sound uttered for a prescribed content sentence, defined as a sentence whose content exhibits multiple types of expression, is associated for each utterance with expression data representing each expression of the sound uttered for the prescribed content sentence, and for deriving, for each item of sound source data, a sound source expression distribution representing the degree of distribution of each type of expression appearing in the voice represented by the voice parameter included in that sound source data;
matching means for comparing the text expression distribution derived by the sentence analysis means against each sound source expression distribution derived by the sound source analysis means and deriving a correlation value between them; and
information presenting means for presenting the sound source data corresponding to the highest of the correlation values derived by the matching means.

The sound source specifying system according to claim 1, further comprising:
content information acquisition means for acquiring a prescribed content sentence representing a character string constituting a sentence whose content exhibits multiple types of expression;
waveform acquisition means for acquiring a target waveform, which is a speech waveform uttered for the character string represented by specific content information, the specific content information being the prescribed content sentence acquired by the content information acquisition means;
parameter deriving means for deriving the voice parameter from the target waveform acquired by the waveform acquisition means;
expression data generating means for estimating the expression appearing in the target waveform on the basis of the specific content information and generating the estimation result as the expression data; and
sound source data registration means for generating the sound source data by associating the voice parameter derived by the parameter deriving means with the expression data generated by the expression data generating means, and storing the sound source data in the storage device.

A sound source identification method comprising:
a sentence acquisition process of causing a computer to acquire sentence data representing a character string constituting a designated sentence;
a sentence analysis process of causing the computer to analyze the sentence represented by the sentence data acquired in the sentence acquisition process and to derive a text expression distribution representing the degree of distribution of each type of expression appearing in the sentence;
a sound source analysis process of causing the computer to acquire and analyze each item of sound source data from a storage device that stores sound source data, the sound source data being data in which at least one voice parameter of a sound uttered for a prescribed content sentence, defined as a sentence whose content exhibits multiple types of expression, is associated for each utterance with expression data representing each expression of the sound uttered for the prescribed content sentence, and to derive, for each item of sound source data, a sound source expression distribution representing the degree of distribution of each type of expression appearing in the voice represented by the voice parameter included in that sound source data;
a matching process of causing the computer to compare the text expression distribution derived in the sentence analysis process against each sound source expression distribution derived in the sound source analysis process and to derive a correlation value between the two; and
an information presentation process of causing the computer to present the sound source data corresponding to the highest of the correlation values derived in the matching process.