JP2011170622A

JP2011170622A - Content providing system, content providing method, and content providing program

Info

Publication number: JP2011170622A
Application number: JP2010033821A
Authority: JP
Inventors: Seiichi Miki; 清一三木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-02-18
Filing date: 2010-02-18
Publication date: 2011-09-01
Anticipated expiration: 2030-02-18
Also published as: JP5589426B2

Abstract

PROBLEM TO BE SOLVED: To provide appropriate content on the basis of voice features, such as a manner of speaking, that are peculiar to a user.

SOLUTION: A content providing system 100 includes: a content storage unit 126 that stores voice features in association with content to be presented to a user having those voice features; a voice feature detection unit 104 that compares the value of a voice element of the user, computed from input user voice data, with index data for determining a tendency relative to a standard, and thereby detects the voice features peculiar to the user; a content selection unit 106 that selects, from the content stored in the content storage unit 126, the content associated with the voice features detected by the voice feature detection unit 104; and a presentation unit 108 that presents the selected content.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a content providing system, a content providing method, and a content providing program.

Patent Document 1 (Japanese Patent Application Laid-Open No. 2008-170820) describes a content providing system comprising: emotion type content storage means that stores emotion types in association with content; feature amount calculation means that calculates feature amounts of voice input from voice input means; emotion type determination means that determines an emotion type based on the calculated voice feature amounts; content reading means that reads the content stored in association with the determined emotion type; and transmission means that sends the read content to content playback means. The technique described in that document uses prosodic components such as volume and pitch as simple feature amounts that do not depend on the language or the speaker, and treats basic statistics of the feature amounts over a predetermined time (for example, the past one second) as the speaker's current manner of speaking. The degree of each emotion is then obtained from the amount of deviation from the steady state of the manner of speaking (for example, basic statistics over the past five seconds).

Patent Document 1: Japanese Patent Application Laid-Open No. 2008-170820 (JP 2008-170820 A)

The technique described in Patent Document 1 only determines a temporary emotion of the user. However, when the aim is, for example, to present an effective advertisement or the like to the user as content, content related to properties peculiar to the user is considered more likely to attract the user's interest than content related to an emotion that the user experiences temporarily. With a method that presents content related to a temporary emotion, the content changes every time the user's emotion changes, so the user's interest cannot be raised by presenting the same content repeatedly. The conventional technique therefore could not provide appropriate content based on voice features such as a manner of speaking peculiar to the user.

An object of the present invention is to provide a content providing system, a content providing method, and a content providing program that solve the problem described above, namely that appropriate content cannot be provided based on voice features such as a manner of speaking peculiar to the user.

According to the present invention, there is provided a content providing system including:
content storage means for storing a voice feature in association with content to be presented to a user having that voice feature;
voice feature detection means for detecting a voice feature peculiar to a user by comparing the value of a voice element of the user, calculated from input voice data of the user, with index data for determining a tendency relative to a standard;
content selection means for selecting, from the content stored in the content storage means, the content associated with the voice feature detected by the voice feature detection means; and
presentation means for presenting the content selected by the content selection means.

According to the present invention, there is provided a content providing method using a computer system that includes content storage means for storing a voice feature in association with content to be presented to a user having that voice feature, the method including:
a voice feature detection step of detecting a voice feature peculiar to a user by comparing the value of a voice element of the user, calculated from input voice data of the user, with index data for determining a tendency relative to a standard;
a content selection step of selecting, from the content stored in the content storage means, the content associated with the voice feature detected in the voice feature detection step; and
a presentation step of presenting the content selected in the content selection step.

According to the present invention, there is provided a content providing program that causes a computer to function as:
content storage means for storing a voice feature in association with content to be presented to a user having that voice feature;
voice feature detection means for detecting a voice feature peculiar to a user by comparing the value of a voice element of the user, calculated from input voice data of the user, with index data for determining a tendency relative to a standard;
content selection means for selecting, from the content stored in the content storage means, the content associated with the voice feature detected by the voice feature detection means; and
presentation means for presenting the content selected by the content selection means.

Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, an apparatus, a system, a recording medium, a computer program, and the like, are also valid as aspects of the present invention.

According to the present invention, appropriate content can be provided based on voice features such as a manner of speaking peculiar to the user.

FIG. 1 is a block diagram showing an example of the configuration of the content providing system in an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the configuration of the index data storage unit in an embodiment of the present invention.
FIG. 3 is a diagram showing an example of the configuration of the content storage unit in an embodiment of the present invention.
FIG. 4 is a block diagram showing a network structure that includes the content providing system in an embodiment of the present invention.
FIG. 5 is a flowchart showing the processing procedure of the content providing system in an embodiment of the present invention.
FIG. 6 is a block diagram showing another example of the configuration of the content providing system in an embodiment of the present invention.
FIG. 7 is a block diagram showing an example of the configuration of the content providing system in an embodiment of the present invention.
FIG. 8 is a diagram showing an example of the configuration of the voice element value accumulation storage unit in an embodiment of the present invention.
FIG. 9 is a diagram showing another example of the configuration of the index data storage unit in an embodiment of the present invention.
FIG. 10 is a diagram showing another example of the configuration of the index data storage unit in an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings. In all the drawings, like constituent elements are denoted by like reference numerals, and their description is omitted as appropriate.

(First embodiment)
FIG. 1 is a block diagram showing an example of the configuration of the content providing system in the present embodiment.
The content providing system 100 includes a speech recognition unit 102 (speech recognition means), a voice feature detection unit 104 (voice feature detection means), a content selection unit 106 (content selection means), a presentation unit 108 (presentation means), an acoustic model storage unit 120, a language model storage unit 122, an index data storage unit 124 (index data storage means), and a content storage unit 126 (content storage means).

The acoustic model storage unit 120 stores an acoustic model. The language model storage unit 122 stores a language model. Commonly used acoustic models and language models can be employed.

Based on the acoustic model stored in the acoustic model storage unit 120 and the language model stored in the language model storage unit 122, the speech recognition unit 102 calculates, for each word that is a candidate speech recognition result for the input voice data, an acoustic score, a language score, and a total score based on the acoustic score and the language score. The speech recognition unit 102 selects, for example, a word with a high total score as the speech recognition result for the input voice data.
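The candidate scoring described above can be sketched as follows. This is a minimal illustration, not the recognizer of the embodiment: the source only states that the total score is "based on" the acoustic and language scores, so the weighted sum, the `lm_weight` parameter, and the `Candidate` type are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    acoustic_score: float  # score from the acoustic model
    language_score: float  # score from the language model


def total_score(c: Candidate, lm_weight: float = 1.0) -> float:
    # Assumption: the total is a weighted sum of the two scores.
    return c.acoustic_score + lm_weight * c.language_score


def pick_result(candidates, lm_weight: float = 1.0) -> Candidate:
    # Choose the candidate word with the highest total score.
    return max(candidates, key=lambda c: total_score(c, lm_weight))
```

Changing `lm_weight` shifts the balance between the two models, which is one common way such scores are combined; the patent text does not commit to it.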

The speech recognition unit 102 also calculates the values of various voice elements of the user based on the input voice data. A voice element can be any information that indicates how the user speaks, such as the parameters extracted from the voice data for speech recognition or the speech recognition result itself. Examples of voice elements are voice power, speech rate, acoustic score, language score, recognition result confidence, and the rate of fillers (hesitations) in the recognition result. The speech recognition unit 102 can also detect whether the user is male or female. The speech recognition unit 102 can calculate the values of these voice elements from the user's voice data input during a predetermined period.
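A minimal sketch of how a few of these voice elements might be computed is given below. The definitions are assumptions made for illustration only: voice power as mean squared amplitude, speech rate as recognized tokens per second, and filler rate as the fraction of recognized tokens that are fillers. The function name, the dictionary keys, and the `FILLERS` set (which uses the two fillers named later in the description) are hypothetical, and the inputs are assumed to be non-empty.

```python
# Illustrative filler set; the description names these two as examples.
FILLERS = {"えー", "あのー"}


def voice_elements(samples, tokens, duration_sec):
    """Compute three example voice elements from one utterance.

    samples: audio amplitudes, tokens: recognized words,
    duration_sec: length of the utterance in seconds.
    """
    power = sum(s * s for s in samples) / len(samples)        # mean squared amplitude
    rate = len(tokens) / duration_sec                          # tokens per second
    filler_rate = sum(t in FILLERS for t in tokens) / len(tokens)
    return {"voice_power": power, "speech_rate": rate, "filler_rate": filler_rate}
```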

The index data storage unit 124 stores, for the various voice elements, index data for determining a tendency relative to a standard. The index data can be calculated from the voices of a plurality of speakers. In the present embodiment, the index data storage unit 124 stores, as the index data, a standard value for each voice element, that is, a typical value calculated from the voices of a plurality of speakers. FIG. 2 is a diagram showing an example of the configuration of the index data storage unit 124. Here, the index data storage unit 124 stores a male standard value and a female standard value for each voice element.

Returning to FIG. 1, the voice feature detection unit 104 compares, for each voice element, the value of the user's voice element calculated by the speech recognition unit 102 with the standard value of that voice element stored in the index data storage unit 124, and thereby detects the voice features peculiar to the user.

For example, in the present embodiment, when the speech recognition unit 102 detects that the user is female, the voice feature detection unit 104 compares the value of each of the user's voice elements with the "female" standard values shown in FIG. 2. If the user's voice power is greater than the "female" standard value "a2" for "voice power" shown in FIG. 2, the tendency of her voice power is "high voice power". Similarly, if the filler rate in the user's recognition result is less than the "female" standard value "f2" for "filler rate in the recognition result" shown in FIG. 2, the tendency of her filler rate is "low filler rate in the recognition result". Each standard value need not be a single value and can instead be a range of values. For example, the range of typical female voice power can be stored as the "female" standard value "a2" for "voice power". In this way, the user's voice feature can be set to "high voice power" or "low voice power" when the user's voice power is markedly higher or lower than the standard.
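The comparison against a standard value, whether a single value or a range, can be sketched as below. The numbers in `INDEX_DATA` are placeholders standing in for the symbolic entries of FIG. 2 (such as "a2" and "f2"), and the function names and dictionary layout are assumptions for illustration, not values or structures taken from the patent.

```python
def tendency(value, standard):
    """Return 'high', 'low', or 'standard' for one voice element.

    `standard` is either a single number or a (low, high) range.
    """
    low, high = standard if isinstance(standard, tuple) else (standard, standard)
    if value > high:
        return "high"
    if value < low:
        return "low"
    return "standard"


# Placeholder index data shaped like FIG. 2: standard values per gender.
INDEX_DATA = {
    "male": {"voice_power": (55.0, 65.0), "filler_rate": 0.05},
    "female": {"voice_power": (50.0, 60.0), "filler_rate": 0.04},
}


def detect_features(values, gender):
    """Keep only the voice elements that deviate from the standard."""
    standards = INDEX_DATA[gender]
    features = {}
    for element, value in values.items():
        label = tendency(value, standards[element])
        if label != "standard":
            features[element] = label
    return features
```

With a range as the standard, values inside the range produce no feature at all, which matches the idea that only markedly high or low values are treated as peculiar to the user.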

The voice feature detection unit 104 can also compare the value of each of the user's voice elements with the standard value of that voice element stored in the index data storage unit 124 and detect the amount of deviation from the standard value. A large deviation from the standard value means that the user exhibits that voice feature to a high degree.
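One possible way to quantify the deviation is sketched below. The source does not define how the deviation is measured, so both functions are assumptions: `deviation` returns a signed distance from the standard value or range, and `relative_deviation` divides by the size of the standard so that voice elements with different units can be compared.

```python
def deviation(value, standard):
    """Signed distance of `value` from a standard value or (low, high) range."""
    low, high = standard if isinstance(standard, tuple) else (standard, standard)
    if value > high:
        return value - high
    if value < low:
        return value - low
    return 0.0


def relative_deviation(value, standard):
    """Deviation scaled by the magnitude of the standard (assumed normalization)."""
    low, high = standard if isinstance(standard, tuple) else (standard, standard)
    scale = (abs(low) + abs(high)) / 2 or 1.0  # fall back to 1.0 if the standard is 0
    return deviation(value, standard) / scale
```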

The content storage unit 126 stores voice features in association with content to be presented to users having those voice features. The content to be presented can be chosen, based on the voice feature, by anticipating what users with that tendency are likely to be weak at, good at, or interested in.

FIG. 3 is a diagram showing an example of the configuration of the content storage unit 126. The content storage unit 126 includes a content column and a voice feature column.
For example, the content "A (commercial for a speaking class)" is stored in association with the voice features "low voice power", "low acoustic model score", "high filler rate in the recognition result", "high speech rate", "low recognition result confidence", and so on.

A user's voice feature being "high filler rate" means that the user's voice data contains many fillers such as "えー" ("er") and "あのー" ("um"). A user with this voice feature is presumably not a very fluent speaker and is likely to feel self-conscious about it. A commercial (CM) for a speaking class can therefore be associated with this voice feature as the content to present to such a user.

The content "B (commercial for a karaoke shop)" is stored in association with voice features such as "high voice power". A user with high voice power is assumed likely to be good at karaoke. A commercial (CM) for a karaoke shop can therefore be associated with this voice feature as the content to present to such a user.

Returning to FIG. 1, the content selection unit 106 selects, from the content stored in the content storage unit 126, the content associated with the voice feature detected by the voice feature detection unit 104. The presentation unit 108 presents the content selected by the content selection unit 106 to the user.

For example, when the user's voice feature is "high voice power" as described above, the content selection unit 106 selects the content "B (commercial for a karaoke shop)", which is associated with the voice feature "high voice power" in FIG. 3. The presentation unit 108 presents the content "B (commercial for a karaoke shop)" selected by the content selection unit 106 to the user.
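The lookup from detected voice features to content can be sketched with the two rows that the description gives for FIG. 3. The table layout, the feature encoding as `(element, tendency)` pairs, and the rule that a content item matches when any one of its features is detected are assumptions made for this illustration; the source does not state how multiple features per content item are combined.

```python
# Content table modeled on the two example rows described for FIG. 3.
CONTENT_TABLE = {
    "A (commercial for a speaking class)": {
        ("voice_power", "low"),
        ("acoustic_score", "low"),
        ("filler_rate", "high"),
        ("speech_rate", "high"),
        ("recognition_confidence", "low"),
    },
    "B (commercial for a karaoke shop)": {("voice_power", "high")},
}


def select_content(detected):
    """Return the content items that share at least one feature with `detected`.

    detected: set of (element, tendency) pairs for one user.
    """
    return [name for name, feats in CONTENT_TABLE.items() if feats & detected]
```

For a user whose only detected feature is `("voice_power", "high")`, the lookup returns the karaoke commercial, which mirrors the example in the text.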

The voice feature detection unit 104 can also notify the content selection unit 106 of each voice feature of each user together with the amount of deviation from the standard value of the corresponding voice element. For example, when the user has a plurality of voice features, the content selection unit 106 can select, from the content stored in the content storage unit 126, the content associated with the voice feature having the largest deviation from the standard value. Alternatively, the content selection unit 106 can select the content associated with each voice feature in descending order of deviation from the standard value and present the content to the user in that order.
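Ordering content by deviation can be sketched as follows. The feature labels, the `feature_to_content` mapping, and the use of the absolute deviation as the sort key are assumptions for illustration. The first element of the returned list corresponds to the "largest deviation" case, and the full list corresponds to sequential presentation.

```python
def rank_content(detected, feature_to_content):
    """Order content by how strongly the user shows each voice feature.

    detected: dict mapping a feature label to its deviation from the standard.
    feature_to_content: dict mapping a feature label to a content item.
    """
    ordered = sorted(detected, key=lambda f: abs(detected[f]), reverse=True)
    result = []
    for feature in ordered:
        content = feature_to_content.get(feature)
        if content is not None and content not in result:
            result.append(content)  # skip content already queued by a stronger feature
    return result
```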

FIG. 4 is a block diagram showing a network structure that includes the content providing system 100 in the present embodiment.
This network structure includes the content providing system 100 and a user terminal device 200 connected to the content providing system 100 via a network 150. The user terminal device 200 can be, for example, the user's personal computer (PC). The user terminal device 200 can be provided with voice input means such as a microphone and display means such as a display. When voice data is input via the voice input means of the user terminal device 200, the voice data is input to the speech recognition unit 102 (see FIG. 1) of the content providing system 100 via the network 150. When the presentation unit 108 of the content providing system 100 provides content, the content is sent to the user terminal device 200 via the network 150 and displayed on the display means of the user terminal device 200. As long as the voice data can be matched to the user terminal device 200, the voice data of the user of the user terminal device 200 can instead be input to the speech recognition unit 102 of the content providing system 100 via a network other than the network 150, such as a telephone line.

Next, the procedure in the present embodiment from the input of voice data to the provision of content is described. FIG. 5 is a flowchart showing the processing procedure of the content providing system 100 in the present embodiment.

When voice data is input (step S100), the speech recognition unit 102 performs speech recognition processing (step S102). This can be ordinary speech recognition processing.

Next, the voice feature detection unit 104 compares the voice element values obtained by the speech recognition unit 102 with the standard value of each voice element stored in the index data storage unit 124 and detects the voice features peculiar to the user (step S104). The content selection unit 106 selects, from the content storage unit 126, the content associated with the voice feature detected by the voice feature detection unit 104 (step S106). The presentation unit 108 provides the content selected by the content selection unit 106 to the user terminal device 200 via the network 150 (step S108).
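The flow of steps S100 to S108 can be sketched as a simple pipeline. The function `provide_content` and the four callables passed to it are hypothetical stand-ins for the speech recognition unit 102, voice feature detection unit 104, content selection unit 106, and presentation unit 108; the threshold and return values in the usage example are placeholders.

```python
def provide_content(voice_data, recognize, detect, select, present):
    """Run steps S102 to S108 on voice data received in step S100."""
    elements = recognize(voice_data)  # S102: speech recognition, voice element values
    features = detect(elements)       # S104: compare with standard values
    content = select(features)        # S106: look up associated content
    return present(content)           # S108: deliver content to the user terminal


# Usage with trivial stand-in functions (all hypothetical).
shown = provide_content(
    voice_data=b"...",
    recognize=lambda data: {"voice_power": 70.0},
    detect=lambda el: {"voice_power_high"} if el["voice_power"] > 60.0 else set(),
    select=lambda feats: "B" if "voice_power_high" in feats else None,
    present=lambda content: f"presenting {content}",
)
```

Passing the stages in as callables keeps each unit replaceable, which reflects the description's point that the units are functional blocks rather than fixed hardware.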

With the content providing system 100 of the present embodiment, the content provided to a user is determined according to the voice features peculiar to that user, which are detected from the values of the user's various voice elements. For example, content is provided according to user-specific voice features such as frequent fillers like "えー" ("er") and "あのー" ("um"), fast speech, or a clear voice. This raises the likelihood of providing content that attracts the user's interest. Moreover, because such user-specific voice features change little, similar content can be provided repeatedly, which can also raise the user's interest.

In the present embodiment, as shown in FIG. 6, the content providing system 100 can further include a user information acquisition unit 110 that acquires user information. The user information can include, for example, the user's gender and age. When the user information acquisition unit 110 acquires the user's gender as the user information, the voice feature detection unit 104 can detect the user's voice features using either the male or the female standard values in the index data storage unit 124, according to the gender that was input.

The index data storage unit 124 can store not only the gender-specific standard values shown in FIG. 2 but also, for example, standard values for each age group. When the user information acquisition unit 110 acquires the user's age as the user information, the voice feature detection unit 104 can detect the user's voice features using the standard values of the corresponding age group in the index data storage unit 124, according to the age that was input.

(Second embodiment)
FIG. 7 is a block diagram showing an example of the configuration of the content providing system in the present embodiment.
The present embodiment differs from the first embodiment in that the content providing system 100 further includes an average value calculation unit 112 (average value calculation means) and a voice element value accumulation storage unit 128 (voice element value accumulation storage means). The configuration shown here also includes the user information acquisition unit 110.

In the present embodiment, the user information acquisition unit 110 can acquire the user's identification information as the user information. The average value calculation unit 112 identifies the user based on the identification information and accumulates the values of the user's voice elements in the voice element value accumulation storage unit 128 in association with that identification information. The average value calculation unit 112 can, for example, store the voice element values in association with the date and time at which the voice data was input. The average value calculation unit 112 also refers to the voice element value accumulation storage unit 128 and calculates the average of the user's voice element values accumulated at different times. In the present embodiment, the voice feature detection unit 104 can detect the voice features peculiar to the user by comparing the average value of each voice element calculated by the average value calculation unit 112 with the standard value of that voice element stored in the index data storage unit 124.

The voice element value accumulation storage unit 128 accumulates the values of a user's voice elements in association with the user's identification information. FIG. 8 is a diagram showing an example of the configuration of the voice element value accumulation storage unit 128.
Here, the voice element value data of the user whose identification information (user ID) is "0002f" is shown. This user has accessed the content providing system 100 at least three times, for example on "2010/01/03", "2010/01/10", and "2010/01/12", and has input voice data to the speech recognition unit 102. The values of the various voice elements calculated from the voice data of each access are therefore stored in the voice element value accumulation storage unit 128.

In this case, the average value calculation unit 112 calculates, for example for the voice element "voice power", the average of the values "a10", "a11", and "a12" stored in association with "2010/01/03", "2010/01/10", and "2010/01/12", respectively. The voice feature detection unit 104 compares the average value calculated by the average value calculation unit 112 with the standard value of the voice element "voice power" stored in the index data storage unit 124 and detects whether the user's "voice power" is about standard, higher than standard, or lower than standard.
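The accumulation and averaging described for FIG. 8 can be sketched as below. The class `VoiceElementStore` and its methods are hypothetical, and the numeric values used with it are placeholders for the symbolic entries "a10", "a11", and "a12"; an arithmetic mean is assumed because the source only says "average value".

```python
from collections import defaultdict
from statistics import mean


class VoiceElementStore:
    """Per-user history of voice element values, keyed by user ID."""

    def __init__(self):
        # user_id -> list of (date, {element: value}) records
        self._records = defaultdict(list)

    def add(self, user_id, date, values):
        self._records[user_id].append((date, dict(values)))

    def averages(self, user_id):
        """Average each voice element over all sessions stored for the user."""
        sessions = [values for _, values in self._records[user_id]]
        elements = set().union(*sessions)  # every element seen in any session
        return {e: mean(s[e] for s in sessions if e in s) for e in elements}
```

The averages returned here would then be compared with the standard values in place of the single-session values, which is what makes the detected features more stable across sessions.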

The present embodiment provides the same effects as the first embodiment. In addition, because the average of the user's voice element values accumulated at different times is used, the voice features peculiar to the user can be detected stably.

In the example above, the user information acquisition unit 110 acquires the user's identification information, and the average value calculation unit 112 identifies the user based on that identification information. However, the content providing system 100 can also be configured without the user information acquisition unit 110, in which case the average value calculation unit 112 can identify the user by voice authentication based on the user's voice.

Each component of the content providing system 100 illustrated in FIG. 1 represents a functional block, not a hardware unit. Each component of the content providing system 100 is realized by any combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that implements the components shown in the figure, a storage unit such as a hard disk that stores the program, and a network connection interface. Those skilled in the art will understand that there are various modifications to the implementation method and apparatus.

The embodiments of the present invention have been described above with reference to the drawings, but these are examples of the present invention, and various configurations other than the above can also be adopted.

In the first embodiment, an example was shown in which the index data storage unit 124 stores the standard value of each voice element by sex or by age, but the standard values can also be stored for each of various other classifications. For example, focusing on one of the plurality of voice elements, standard values can be stored for each class defined by the value of that voice element.

An example is shown in FIG. 9. Here, for each of large, medium, and small voice power, standard values are stored for the other voice elements, such as speech rate, acoustic score, language score, recognition result reliability, and filler rate in the recognition result. When voice data is input to the voice recognition unit 102 and the voice recognition unit 102 calculates the value of each voice element, the voice feature detection unit 104 determines, based on the user's voice power value, whether the voice power of the user's voice is large, medium, or small. It then uses the standard values of the other voice elements associated with that voice power class to determine whether each of those other voice elements is larger or smaller than standard. For example, if the voice power value of the input voice data is in the range a1 to a2, this user's voice power is classified as large, and the standard value “b3” is used when judging this user's speech rate.
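A minimal Python sketch of this two-stage lookup follows. The class boundaries, the standard values, and the names `POWER_CLASSES`, `STANDARDS_BY_POWER`, `classify_power`, and `compare_with_class_standard` are hypothetical; FIG. 9 uses symbolic entries such as “a1”, “a2”, and “b3” whose actual values are not given here.

```python
# Hypothetical voice power class boundaries, checked from the highest class down.
POWER_CLASSES = [
    (70.0, "large"),           # voice power >= 70.0
    (50.0, "medium"),          # 50.0 <= voice power < 70.0
    (float("-inf"), "small"),  # everything below 50.0
]

# Hypothetical standard values of the other voice elements per power class.
STANDARDS_BY_POWER = {
    "large": {"speech_rate": 7.5, "filler_rate": 0.04},
    "medium": {"speech_rate": 6.5, "filler_rate": 0.05},
    "small": {"speech_rate": 5.5, "filler_rate": 0.07},
}


def classify_power(voice_power: float) -> str:
    """Return the voice power class (large, medium, or small)."""
    for lower_bound, name in POWER_CLASSES:
        if voice_power >= lower_bound:
            return name
    raise ValueError("voice power is not a comparable number")


def compare_with_class_standard(values: dict) -> dict:
    """Judge each other voice element against the standard of the user's power class."""
    standards = STANDARDS_BY_POWER[classify_power(values["voice_power"])]
    result = {}
    for element, standard in standards.items():
        value = values[element]
        if value > standard:
            result[element] = "larger"
        elif value < standard:
            result[element] = "smaller"
        else:
            result[element] = "standard"
    return result
```

With these assumed numbers, the same speech rate of 6.8 is judged “smaller” for a user with large voice power and “larger” for a user with medium voice power, which is the point of conditioning the standard on another voice element.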

The index data storage unit 124 can also store, as index data, index values for judging a tendency relative to the standard, instead of the standard value of each voice element. An example is shown in FIG. 10. Here, for the voice elements voice power and speech rate, index values are stored for judging each as large (high), standard (no feature), or small (low). When voice data is input to the voice recognition unit 102 and the voice recognition unit 102 calculates the value of each voice element, the voice feature detection unit 104 determines, for example, which of the index values large (high), standard (no feature), or small (low) in the index data storage unit 124 the user's voice power value falls under, and thereby detects the voice feature unique to the user.
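As a rough Python sketch, index values of this kind can be modelled as value ranges per voice element. The range boundaries and the names `INDEX_RANGES`, `tendency`, and `detect_voice_features` are assumptions for illustration; the actual index values of FIG. 10 are not reproduced in this text.

```python
NO_FEATURE = "standard (no feature)"

# Hypothetical index values: for each voice element, half-open ranges
# [lower, upper) that map a value to a tendency label.
INDEX_RANGES = {
    "voice_power": [
        ("small (low)", float("-inf"), 50.0),
        (NO_FEATURE, 50.0, 70.0),
        ("large (high)", 70.0, float("inf")),
    ],
    "speech_rate": [
        ("small (low)", float("-inf"), 5.0),
        (NO_FEATURE, 5.0, 8.0),
        ("large (high)", 8.0, float("inf")),
    ],
}


def tendency(element: str, value: float) -> str:
    """Return the label of the range that contains the value."""
    for label, lower, upper in INDEX_RANGES[element]:
        if lower <= value < upper:
            return label
    raise ValueError(f"no range covers {element}={value}")


def detect_voice_features(values: dict) -> dict:
    """Keep only the voice elements whose tendency differs from the standard."""
    features = {}
    for element, value in values.items():
        label = tendency(element, value)
        if label != NO_FEATURE:
            features[element] = label
    return features
```

Storing ranges rather than a single standard value lets the "no feature" band be set independently for each voice element.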

The index data storage unit 124 can also store, as index data, data indicating the variation or degree of dispersion of the values of each voice element across the voices of a plurality of speakers. Based on the value of each voice element of the input voice data, the voice feature detection unit 104 can detect the user's voice feature from the direction in which, and the degree to which, the user's voice element deviates from, for example, the average of the values of that voice element across the plurality of speakers.
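One common way to express "direction and degree of deviation" is a standard score (deviation from the mean in units of standard deviation). The sketch below uses that measure as an assumption; the source does not name a specific statistic, and the sample values, the threshold, and the names `POPULATION`, `deviation`, and `describe` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical voice power values collected from a plurality of speakers.
POPULATION = {"voice_power": [50.0, 55.0, 60.0, 65.0, 70.0]}


def deviation(element: str, value: float) -> float:
    """Standard score of a value relative to the stored speaker population."""
    samples = POPULATION[element]
    return (value - mean(samples)) / pstdev(samples)


def describe(element: str, value: float, threshold: float = 1.0):
    """Return (direction, distinctive) for a user's voice element value.

    direction is "above", "below", or "at" the population mean;
    distinctive is True when the deviation is at least `threshold`
    standard deviations (an assumed cut-off).
    """
    z = deviation(element, value)
    direction = "above" if z > 0 else "below" if z < 0 else "at"
    return direction, abs(z) >= threshold
```

Scaling by the dispersion means a difference of a few units counts as distinctive only for voice elements whose values are tightly clustered across speakers.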

The voice feature detection unit 104 can also feed back the value of each voice element, calculated from the user's voice data input via the voice recognition unit 102, to the index data storage unit 124, and thereby keep updating the index data in the index data storage unit 124. This increases the number of speakers on which the index data in the index data storage unit 124 is based, so that highly accurate index data can be built up.
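The feedback update can be sketched in Python as an incremental mean and variance, here using Welford's online algorithm. This is one possible realization chosen for illustration, not a method stated in the source; the class name `RunningIndex` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningIndex:
    """Index data for one voice element, updated one speaker value at a time."""

    count: int = 0     # number of values incorporated so far
    mean: float = 0.0  # running mean, usable as the standard value
    m2: float = 0.0    # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold a newly calculated voice element value into the index data."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of all values seen so far."""
        return self.m2 / self.count if self.count else 0.0


# Usage: each new voice element value refines the stored standard.
voice_power_index = RunningIndex()
for sample in (50.0, 60.0, 70.0):
    voice_power_index.update(sample)
```

An incremental update avoids keeping every past value in storage while still letting the standard value and dispersion track the growing speaker population.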

In the above embodiments, examples were shown in which content is simply provided to the user terminal device 200 of the user who input the voice data. However, when a voice feature of the user is detected, the system can also be configured to match the user with an advertiser who is looking for users having that voice feature. For example, each content item in the content storage unit 126 can also be stored in association with a setting indicating whether notification to the advertiser is required. In this case, if the advertiser has registered a voice feature such as “voice power is large” and notification is set as required, the system can be set so that, when the user's voice power is detected to be large, the user's information is transferred to the advertiser after the user's consent has been obtained.
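The matching rule can be sketched in Python as follows. The record layout, the feature labels, and the names `ContentEntry`, `CONTENT_STORE`, and `advertisers_to_notify` are hypothetical; the source states only that a notification setting is stored with each content item and that user consent is obtained before any transfer.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ContentEntry:
    content_id: str
    voice_feature: str       # voice feature the advertiser registered
    notify_advertiser: bool  # whether notification to the advertiser is required
    advertiser: str


# Hypothetical contents of the content storage unit.
CONTENT_STORE = [
    ContentEntry("ad-001", "voice_power:large", True, "advertiser-A"),
    ContentEntry("ad-002", "speech_rate:high", False, "advertiser-B"),
]


def advertisers_to_notify(detected: Set[str], user_consented: bool) -> List[str]:
    """List advertisers to whom the user's information may be transferred.

    Nothing is returned unless the user has consented, and only entries
    that require notification and match a detected feature are included.
    """
    if not user_consented:
        return []
    return [
        entry.advertiser
        for entry in CONTENT_STORE
        if entry.notify_advertiser and entry.voice_feature in detected
    ]
```

Checking consent first keeps the transfer step strictly opt-in, in line with the text's condition that the user's approval precedes any transfer.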

さらに、たとえば「ある声優との類似度」等を音声要素とすることもできる。この場合、コンテンツ提供システム１００は、当該声優の声の特徴との類似度を判断して、類似度が高ければ、その声優に関するコンテンツが提示されるように設定しておくことができる。 Furthermore, for example, “similarity with a certain voice actor” or the like can be used as a voice element. In this case, the content providing system 100 can determine the degree of similarity with the voice feature of the voice actor and set the content related to the voice actor to be presented if the similarity is high.
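A similarity-based voice element of this kind might be sketched in Python as below. Cosine similarity over a small feature vector is an assumed measure, not one named in the source, and the profile vector, the threshold, and the names `ACTOR_PROFILES`, `ACTOR_CONTENT`, `cosine_similarity`, and `content_for_similar_actor` are hypothetical.

```python
import math
from typing import List, Optional

# Hypothetical voice feature vectors and associated content for voice actors.
ACTOR_PROFILES = {"actor-X": [0.9, 0.2, 0.4]}
ACTOR_CONTENT = {"actor-X": "content about actor-X"}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def content_for_similar_actor(user_vec: List[float], threshold: float = 0.95) -> Optional[str]:
    """Return content for the most similar voice actor if similarity is high enough."""
    best = max(ACTOR_PROFILES, key=lambda name: cosine_similarity(user_vec, ACTOR_PROFILES[name]))
    if cosine_similarity(user_vec, ACTOR_PROFILES[best]) >= threshold:
        return ACTOR_CONTENT[best]
    return None
```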

DESCRIPTION OF SYMBOLS
100 Content providing system
102 Voice recognition unit
104 Voice feature detection unit
106 Content selection unit
108 Presentation unit
110 User information acquisition unit
112 Average value calculation unit
120 Acoustic model storage unit
122 Language model storage unit
124 Index data storage unit
126 Content storage unit
128 Voice element value storage unit
150 Network
200 User terminal device

Claims

A content providing system comprising:
content storage means for storing a voice feature in association with content to be presented to a user having that voice feature;
voice feature detecting means for detecting a voice feature unique to the user by comparing a value of a voice element of the user, calculated based on input voice data of the user, with index data for judging a tendency relative to a standard;
content selecting means for selecting, from among the content stored in the content storage means, the content associated with the voice feature detected by the voice feature detecting means; and
presenting means for presenting the content selected by the content selecting means.

The content providing system according to claim 1,
wherein the index data for the value of the voice element is calculated based on voices of a plurality of speakers.

The content providing system according to claim 1 or 2,
further comprising index data storage means for storing the index data for the value of the voice element.

The content providing system according to any one of claims 1 to 3, further comprising:
voice element value storage means for storing the value of the voice element of the user in association with identification information of the user; and
average value calculating means for identifying the user, storing the value of the voice element of the user in the voice element value storage means in association with the identification information of the user, and calculating an average value of the values of the voice element of the user stored at different times,
wherein the voice feature detecting means detects the voice feature unique to the user by comparing the average value of the voice element calculated by the average value calculating means with a standard value of the voice element.

A content providing method using a computer system including content storage means for storing a voice feature in association with content to be presented to a user having that voice feature, the method comprising:
a voice feature detection step of detecting a voice feature unique to the user by comparing a value of a voice element of the user, calculated based on input voice data of the user, with index data for judging a tendency relative to a standard;
a content selection step of selecting, from among the content stored in the content storage means, the content associated with the voice feature detected in the voice feature detection step; and
a presentation step of presenting the content selected in the content selection step.

A content providing program for causing a computer to function as:
content storage means for storing a voice feature in association with content to be presented to a user having that voice feature;
voice feature detecting means for detecting a voice feature unique to the user by comparing a value of a voice element of the user, calculated based on input voice data of the user, with index data for judging a tendency relative to a standard;
content selecting means for selecting, from among the content stored in the content storage means, the content associated with the voice feature detected by the voice feature detecting means; and
presenting means for presenting the content selected by the content selecting means.