JP2014206773A

JP2014206773A - Communication service providing device, communication service providing method and program

Info

Publication number: JP2014206773A
Application number: JP2013082179A
Authority: JP
Inventors: Akihiro Kobayashi; Keiichiro Hoashi
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2013-04-10
Filing date: 2013-04-10
Publication date: 2014-10-30
Anticipated expiration: 2033-04-10
Also published as: JP6087704B2

Abstract

PROBLEM TO BE SOLVED: To estimate, from multimedia content including utterances, a scene suitable for the next utterance in communication. SOLUTION: A communication service providing device 100 includes: an utterance scene extraction unit 110 that divides multimedia content including utterances into utterance scenes, one utterance per scene, and extracts the utterance text; an utterance learning unit 120 that learns an utterance model from the sequence of extracted utterance texts, taking the sequence of utterance texts in a certain section of the content as a state sequence, the utterance scene following that section as an action node, and the map from the state sequence to the action node as the utterance model; an utterance model storage unit 130 that stores the learned utterance model; a communication history storage unit 140 that stores the utterance history of the communication; and a next utterance scene estimation unit 150 that estimates an utterance scene suitable for the next utterance in the communication on the basis of the utterance history of the communication and the utterance model.

Description

The present invention relates to a communication service providing apparatus, a communication service providing method, and a program for estimating a scene suitable for communication from multimedia content including utterances.

In general, video content and comic content are made up of utterances by the characters that appear in them. A user can therefore borrow a character's utterance from such content and use it as his or her own utterance in communication, so communication and this kind of content have a high affinity. On Internet bulletin board services and the like, there are already cases in which users use a scene from such content in communication, in the form of an image or text, without the permission of the content holder.

However, finding a scene that matches the development of the communication among the countless scenes of an enormous body of content, and then using it in communication, has required great effort from the user. In particular, a page of comic content contains multiple frames and multiple utterances, so it has been difficult for a user to cut out one scene (one frame) and use it as his or her own utterance in communication.

To address this, the technique described in Patent Document 1 applies image processing to a manga page to recognize the frame borders, divides the page into a plurality of frames, and treats each frame as one content item, which makes it possible to use the data frame by frame. The same document also makes frames (content items) easier to search by extracting the text in speech balloons.

Meanwhile, Google (registered trademark) image search is a well-known system for searching content (see, for example, Non-Patent Document 1). Google image search builds an index that associates the text surrounding an image in the content (headings, image titles, explanatory text, and so on) with the image URL, which makes it possible to search for images from text. A technique has also been proposed that improves the search function by extracting features of the images themselves and associating images with one another (see, for example, Non-Patent Document 2).

Patent Document 1: JP 2011-238043 A

Non-Patent Document 1: https://www.google.co.jp/imghp [retrieved April 4, 2013]
Non-Patent Document 2: Yushi Jing and Shumeet Baluja, "PageRank for Product Image Search", WWW 2008 / Refereed Track: Rich Media, 2008. [retrieved April 4, 2013]

However, when the techniques described above are used to bring a scene of video content or comic content into communication, the user must run a search using text contained in the scene and then pick an appropriate scene from the large number of candidates the search returns. Smooth communication is therefore difficult, and the real-time nature of the communication is lost. On the small screen of a mobile phone in particular, selecting an appropriate scene from a large number of search results is difficult, and smooth communication is close to impossible.

In view of the above problems, an object of the present invention is to provide a communication service providing apparatus, a communication service providing method, and a program that estimate, from multimedia content including utterances, a scene suitable for the next utterance in communication.

To solve the above problems, the present invention proposes the following. For ease of understanding, reference numerals corresponding to the embodiments of the present invention are used in the description, but the invention is not limited to them.

(1) The present invention proposes a communication service providing apparatus that provides, as content to be used in communication, a scene suitable for the communication from multimedia content including utterances, the apparatus comprising: utterance scene extraction means for dividing the multimedia content including utterances into utterance scenes, one utterance per scene, and extracting utterance text; utterance learning means for treating the series of utterance texts extracted by the utterance scene extraction means as a communication sequence and learning, as an utterance model, a map from a state sequence to an action node, where the state sequence is the series of utterance texts in a certain section of the content and the action node is the utterance scene that follows the section; utterance model storage means for storing the utterance model learned by the utterance learning means; communication history storage means for storing the utterance history of the communication; and next utterance scene estimation means for estimating an utterance scene suitable for the next utterance in the communication on the basis of the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means.

(2) The present invention proposes the communication service providing apparatus of (1), further comprising: tag assigning means for assigning, to each utterance scene extracted by the utterance scene extraction means, attribute information of that utterance scene as a tag; and tag storage means for storing, in association with each utterance scene extracted by the utterance scene extraction means, the tag assigned to that utterance scene by the tag assigning means, wherein the utterance learning means learns, as the utterance model, a map from a state tag sequence to an action tag node, where the state tag sequence is the series of tags assigned by the tag assigning means to the series of utterance scenes in a certain section of the content and the action tag node is the tag assigned to the utterance scene that follows the section, the apparatus further comprising: next scene tag estimation means for estimating the tag to be assigned to a scene suitable for the next utterance of the communication on the basis of the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means; and same-tag scene search means for searching for an utterance scene suitable for the next utterance on the basis of the tag estimated by the next scene tag estimation means and the tags stored in the tag storage means.

(3) The present invention proposes the communication service providing apparatus of (2), wherein the attribute information includes at least the utterance text of the utterance scene, the emotion of a character appearing in the utterance scene, and the constituent elements of the utterance scene.

(4) The present invention proposes the communication service providing apparatus of any of (1) to (3), wherein the utterance scene extraction means extracts spoken-language expressions and sound effects as the utterance text.

(5) The present invention proposes the communication service providing apparatus of any of (1) to (4), wherein the utterance learning means learns an utterance model for each content item by treating the series of utterance texts extracted by the utterance scene extraction means as a communication sequence, with the series of utterance texts in a certain section of the content as a state sequence, the utterance scene that follows the section as an action node, and the map from the state sequence to the action node as the utterance model, and the utterance model storage means stores, for each content item, the utterance model generated by the utterance learning means, the apparatus further comprising: usage history storage means for storing, for each user, a history of the multimedia content the user has used; content candidate extraction means for extracting content candidates, for the user engaged in the communication, from the history stored in the usage history storage means; and utterance model selection means for selecting, from the plurality of utterance models stored in the utterance model storage means, the utterance models stored in association with the content candidates extracted by the content candidate extraction means, wherein the next utterance scene estimation means estimates the utterance scene suitable for the next utterance of the communication from among the utterance models selected by the utterance model selection means, on the basis of the utterance history of the communication stored in the communication history storage means.

(6) The present invention proposes the communication service providing apparatus of (5), further comprising: content history vector generation means for generating, for each user and on the basis of the history stored in the usage history storage means, a content history vector whose basis is the individual multimedia content items and whose coefficients are the numbers of times each item has been used; and content-similar user extraction means for extracting, as content-similar users, users whose distance from the user engaged in the communication is small, on the basis of the content history vectors generated by the content history vector generation means, wherein the content candidate extraction means extracts content candidates from the history stored in the usage history storage means on the basis of the content-similar users obtained by the content-similar user extraction means.

(7) The present invention proposes the communication service providing apparatus of (5) or (6), further comprising: utterance history storage means for storing an utterance history for each user; utterance history vector generation means for generating, for each user and on the basis of the utterance history stored in the utterance history storage means, an utterance history vector whose basis is the individual words and whose coefficients are the frequencies with which each word appears; and utterance-similar user extraction means for extracting, as utterance-similar users, users whose distance from the user engaged in the communication is small, on the basis of the utterance history vectors generated by the utterance history vector generation means, wherein the content candidate extraction means extracts content candidates from the history stored in the usage history storage means on the basis of the utterance-similar users obtained by the utterance-similar user extraction means.

(8) The present invention proposes the communication service providing apparatus of any of (1) to (7), further comprising narrowing means for, when the next utterance scene estimation means has estimated a plurality of utterance scenes suitable for the next utterance in the communication, performing an image search over the plurality of utterance scenes on the basis of text received from the user engaged in the communication and narrowing down the candidate utterance scenes suitable for the next utterance.

(9) The present invention proposes the communication providing apparatus of any of (1) to (8), further comprising authentication means for authenticating whether the user engaged in the communication holds the rights to the utterance scene that the user has selected from among the utterance scenes estimated by the next utterance scene estimation means to be suitable for the next utterance in the communication, wherein, when the authentication means succeeds in the authentication, the utterance scene selected by the user is transmitted to the other users engaged in the communication.

(10) The present invention proposes a communication service providing method in a communication service providing apparatus that provides, as content to be used in communication, a scene suitable for the communication from multimedia content including utterances, the communication service providing apparatus comprising utterance scene extraction means, utterance learning means, utterance model storage means, communication history storage means for storing the utterance history of the communication, and next utterance scene estimation means, the method comprising: a first step in which the utterance scene extraction means divides the multimedia content including utterances into utterance scenes, one utterance per scene, and extracts utterance text; a second step in which the utterance learning means treats the series of utterance texts extracted in the first step as a communication sequence and learns, as an utterance model, a map from a state sequence to an action node, where the state sequence is the series of utterance texts in a certain section of the content and the action node is the utterance scene that follows the section; a third step in which the utterance model storage means stores the utterance model learned in the second step; and a fourth step in which the next utterance scene estimation means estimates an utterance scene suitable for the next utterance in the communication on the basis of the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means.

(11) The present invention proposes a program for causing a computer to execute a communication service providing method in a communication service providing apparatus that provides, as content to be used in communication, a scene suitable for the communication from multimedia content including utterances, the communication service providing apparatus comprising utterance scene extraction means, utterance learning means, utterance model storage means, communication history storage means for storing the utterance history of the communication, and next utterance scene estimation means, the program causing the computer to execute: a first step in which the utterance scene extraction means divides the multimedia content including utterances into utterance scenes, one utterance per scene, and extracts utterance text; a second step in which the utterance learning means treats the series of utterance texts extracted in the first step as a communication sequence and learns, as an utterance model, a map from a state sequence to an action node, where the state sequence is the series of utterance texts in a certain section of the content and the action node is the utterance scene that follows the section; a third step in which the utterance model storage means stores the utterance model learned in the second step; and a fourth step in which the next utterance scene estimation means estimates an utterance scene suitable for the next utterance in the communication on the basis of the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means.
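The history vectors and user distance named in (6) and (7) above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and Euclidean distance is chosen here because the text says only that the distance should be "small" and does not name a metric.

```python
import math
from collections import Counter


def history_vector(items):
    """Count occurrences: content IDs for (6), words for (7).

    The keys act as the basis and the counts as the coefficients.
    """
    return Counter(items)


def distance(u, v):
    """Euclidean distance between two sparse count vectors (assumed metric)."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def similar_users(target, vectors, k=1):
    """Return the k users whose vectors are closest to the target user's."""
    ranked = sorted(
        (distance(vectors[target], vec), user)
        for user, vec in vectors.items()
        if user != target
    )
    return [user for _, user in ranked[:k]]


# Hypothetical usage histories: each list holds the content a user has used.
vectors = {
    "u1": history_vector(["manga_x", "manga_x", "anime_y"]),
    "u2": history_vector(["manga_x", "anime_y"]),
    "u3": history_vector(["drama_z", "drama_z", "drama_z"]),
}
print(similar_users("u1", vectors))  # ['u2']
```

Content candidates would then be drawn from the usage histories of the users returned by `similar_users`; how many similar users to keep (`k`) is likewise an assumption.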

According to the present invention, a scene suitable for the next utterance in communication can be estimated from multimedia content including utterances.

A diagram showing the configuration of the communication service providing apparatus according to the first embodiment of the present invention.
A diagram showing an example of authentication of an utterance scene by the authentication unit according to the first embodiment of the present invention.
A diagram showing the utterance model creation processing flow according to the first embodiment of the present invention.
A diagram showing the next utterance scene estimation processing flow according to the first embodiment of the present invention.
A diagram showing the configuration of the communication service providing apparatus according to the second embodiment of the present invention.
A diagram showing the configuration of the communication service providing apparatus according to the third embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. The constituent elements of the embodiments can be replaced with existing constituent elements as appropriate, and many variations are possible, including combinations with other existing constituent elements. The description of the embodiments therefore does not limit the content of the invention set out in the claims.

<First Embodiment>
<Configuration of the communication service providing apparatus>
FIG. 1 is a diagram showing the configuration of the communication service providing apparatus 100 according to the first embodiment of the present invention. The communication service providing apparatus 100 according to the present embodiment estimates, from among the scenes of multimedia content including utterances (hereinafter simply "content"), a scene suitable for the next utterance in communication, on the basis of an utterance model created from the utterances of each scene of the content.

Here, communication means communication conducted over a network, for example on LINE (registered trademark), Twitter (registered trademark), or Facebook (registered trademark). In the present embodiment the communication service providing apparatus 100 is a device independent of the user terminals and of the server that provides the communication service, but it may instead be realized by giving its functions to a user terminal or to the server.

The communication service providing apparatus 100 can provide the scenes it estimates to be suitable for the next utterance to the user engaged in the communication, so the user can easily use scenes from content in communication. Because the scenes presented to the user have been narrowed down in advance, choosing a scene is easy, and smooth communication becomes possible as a result.

As shown in FIG. 1, the communication service providing apparatus 100 according to the present embodiment comprises an utterance scene extraction unit 110, an utterance learning unit 120, an utterance model storage unit 130, a communication history storage unit 140, a next utterance scene estimation unit 150, a narrowing unit 160, and an authentication unit 170.

The utterance scene extraction unit 110 divides content including utterances into utterance scenes, one utterance per scene, and extracts the utterance text from each utterance scene. Here, content including utterances means content made up of images that include utterances, for example video content or comic content. Video content means movies, animation, dramas, and the like.

Specifically, when the content is video content, the utterance scene extraction unit 110 cuts out, for each utterance, the image at the time of the utterance as one utterance scene and converts the corresponding utterance into text. When the content is comic content, it uses the technique described in Patent Document 1 to cut out each frame as one utterance scene and extracts the utterance text from that scene.

In existing dialogue systems such as the one typified by Non-Patent Document 2, spoken-language phenomena such as hesitations and self-corrections, and sound effects such as onomatopoeia and fillers, were treated as meaningless and excluded from the utterance text. In the present invention, however, spoken-language expressions and sound effects are also included in the utterance text, because they can convey emotion and the situation of the utterance scene and are therefore meaningful when estimating an utterance scene suitable for the communication.

The content that the utterance scene extraction unit 110 divides into utterance scenes and extracts utterance text from is obtained when the utterance model used for estimating scenes suitable for the next utterance is created. It is either content acquired over a network from a content server, arbitrarily or under predetermined conditions, or content input by an administrator or the like of the communication service providing apparatus 100.

The utterance learning unit 120 arranges the utterance texts extracted by the utterance scene extraction unit 110 in the time-series order of the utterance scenes they came from and treats the result as a sequence of utterances. It then learns an utterance model in which the series of utterance texts extracted from the utterance scenes in a certain section of the content is a state sequence, the utterance scene that follows the section is an action node, and the utterance model is the map from the state sequence to the action node.

For example, if the utterance scenes of a content item are A, B, C, and D in time-series order and their utterance texts are a, b, c, and d respectively, the utterance model can be expressed as a → b → c → D.
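The mapping from a state sequence to an action node can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention: the names `UtteranceScene` and `learn_utterance_model`, the fixed `window` length, and the plain dictionary are all assumptions, since the text does not specify how the map is represented or learned.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UtteranceScene:
    scene_id: str  # identifies the cut-out frame or video image, e.g. "D"
    text: str      # utterance text extracted from the scene, e.g. "d"


def learn_utterance_model(scenes, window=3):
    """Map each run of `window` utterance texts to the scenes that follow it.

    `scenes` must be in time-series order. The section length `window`
    is an assumed parameter.
    """
    model = defaultdict(list)
    for i in range(len(scenes) - window):
        state = tuple(s.text for s in scenes[i:i + window])  # state sequence
        action = scenes[i + window]                          # action node
        model[state].append(action.scene_id)
    return dict(model)


scenes = [
    UtteranceScene("A", "a"),
    UtteranceScene("B", "b"),
    UtteranceScene("C", "c"),
    UtteranceScene("D", "d"),
]
model = learn_utterance_model(scenes)
print(model)  # {('a', 'b', 'c'): ['D']}
```

Each key is a list-valued entry because the same run of utterance texts may be followed by different scenes in different places; a learned model could equally store counts or probabilities.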

The utterance model storage unit 130 stores the utterance model generated by the utterance learning unit 120. One utterance model may be created per content item, or a single utterance model may be created from several content items that users use frequently or from any several content items. The utterance model may be created periodically, or in response to an instruction from an administrator or the like of the communication service providing apparatus 100.

The communication history storage unit 140 stores the utterance history of the one or more users engaged in communication. Specifically, it stores, for each communication, user identification information identifying the user who made an utterance in association with the text of that utterance. The utterance history storage unit 330 may acquire each utterance from the server on which the communication takes place and store it whenever a user makes an utterance, or, when a user requests an utterance scene via that server, it may acquire and store the utterance history the server has accumulated. In the latter case, past utterance histories of the same one or more users may also be acquired.
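The per-communication record described above (user identification information paired with utterance text) can be pictured with the following Python sketch. The class name, method names, and in-memory dictionary are hypothetical; the text does not describe a storage format.

```python
from collections import defaultdict


class CommunicationHistoryStore:
    """In-memory stand-in for a per-communication utterance history."""

    def __init__(self):
        # communication id -> [(user_id, utterance_text), ...] in order
        self._log = defaultdict(list)

    def append(self, communication_id, user_id, text):
        """Record one utterance together with the user who made it."""
        self._log[communication_id].append((user_id, text))

    def recent_texts(self, communication_id, n=1):
        """Return the texts of the last n utterances of a communication."""
        if n <= 0:
            return []
        return [text for _, text in self._log[communication_id][-n:]]


store = CommunicationHistoryStore()
store.append("room-1", "alice", "good morning")
store.append("room-1", "bob", "did you eat breakfast")
print(store.recent_texts("room-1", n=1))  # ['did you eat breakfast']
```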

The next utterance scene estimation unit 150 estimates an utterance scene suitable for the next utterance in the communication on the basis of the communication history storage unit 140, which stores the utterance history of the communication so far, and the utterance model storage unit 130, which stores the utterance model. It performs this estimation in response to receiving a next utterance scene request. The request may be issued by the user on his or her own initiative, or automatically when one user finishes an utterance.

One method of estimating a scene suitable for the next utterance is, for example, to estimate the utterance scene that best fits the immediately preceding utterance by matching words between the utterance history so far and the utterance texts of the utterance model, and then to take the scene that follows it as the scene suitable for the next utterance. Specifically, words are extracted from the utterance text of each scene and of the scenes before and after it, and the scene following a scene with many words matching the immediately preceding utterance text is estimated as the scene suitable for the next utterance. The immediately preceding utterance is not limited to one; the two or three most recent utterances may be used.
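The word-matching method described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: whitespace tokenization stands in for proper word extraction (Japanese text would need a morphological analyzer), and the weighting of a scene's own words at twice the weight of its neighbours' words is a hypothetical choice made only to break ties. The names `tokenize` and `estimate_next_scene` are introduced here for illustration.

```python
def tokenize(text):
    # Assumption: whitespace tokenization; real Japanese text would need
    # a morphological analyzer to split words.
    return set(text.lower().split())


def estimate_next_scene(scene_texts, recent_utterances):
    """Return the index of the scene following the best-matching scene.

    scene_texts: utterance texts of the scenes, in content order.
    recent_utterances: the one to three most recent utterances.
    Returns None when no scene shares a word with the recent utterances.
    """
    query = set()
    for utterance in recent_utterances:
        query |= tokenize(utterance)

    best_index, best_score = None, 0
    # The last scene has no following scene, so it is never a match candidate.
    for i in range(len(scene_texts) - 1):
        own = tokenize(scene_texts[i])
        neighbours = set()
        if i > 0:
            neighbours |= tokenize(scene_texts[i - 1])
        neighbours |= tokenize(scene_texts[i + 1])
        # Hypothetical weighting: own words count double, neighbour words once.
        score = 2 * len(query & own) + len(query & (neighbours - own))
        if score > best_score:
            best_index, best_score = i, score

    return None if best_index is None else best_index + 1
```

For instance, with scenes "are you hungry now" followed by "yes let us eat ramen", a recent utterance "are you hungry" matches the first of these, so the second would be proposed as the next utterance scene.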

Because the next utterance scene estimation unit 150 uses both the utterance model and the utterance history so far, it can estimate different utterance scenes suited to the flow of the communication when that flow differs, even if the immediately preceding utterance is the same. Furthermore, when the utterance learning unit 120 learns the utterance model from the exchanges of utterances between characters in the content using the technique proposed in Non-Patent Document 2, the next utterance scene estimation unit 150 can estimate an utterance scene suitable for the next utterance even for a communication consisting of multiple turns.

By providing the utterance scene estimated by the next utterance scene estimation unit 150 to the users engaged in the communication, a user can easily use an utterance scene that fits the next utterance, which enables smooth communication using utterance scenes.

When the next utterance scene estimation unit 150 estimates a plurality of utterance scenes, the narrowing-down unit 160 performs an image search based on text received from the user and narrows down the utterance scenes to be provided to the user as the next utterance scene. This reduces the number of utterance scenes presented to the user and enables smoother communication. For the image search, the technique described in Non-Patent Document 1 can be used, for example.

The utterance scenes to be provided may be narrowed down by having the user enter a keyword or the like for the next utterance before making it, or the candidate utterance scenes may be narrowed down each time the user enters a character of the next utterance.
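The character-by-character narrowing can be illustrated with the sketch below. It is only a stand-in: a plain substring filter over the utterance text replaces the image search of Non-Patent Document 1, and the function name `narrow_candidates` and the `(scene_id, utterance_text)` pair layout are assumptions made for this example.

```python
def narrow_candidates(candidates, typed_text):
    """Keep the candidate scenes whose utterance text contains the typed text.

    candidates: list of (scene_id, utterance_text) pairs.
    typed_text: what the user has entered so far (keyword or partial utterance).
    Assumption: a case-insensitive substring test stands in for the image
    search performed by the narrowing-down unit.
    """
    typed = typed_text.strip().lower()
    if not typed:
        # Nothing typed yet: every estimated scene remains a candidate.
        return list(candidates)
    return [(sid, text) for sid, text in candidates if typed in text.lower()]
```

Calling the function again after each keystroke, with the longer input, gives the incremental narrowing described above.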

The authentication unit 170 verifies whether the user holds the right to the utterance scene that the user selected from among the utterance scenes estimated by the next utterance scene estimation unit 150. If the verification succeeds, the authentication unit 170 transmits the selected utterance scene to the other users in the communication. If it fails, the authentication unit 170 notifies the user, for example, that the user does not hold the right, or prompts the user to purchase it.

In the present invention, rights can be defined not only for an entire content item but also for a part of the content subdivided by utterance scene, chapter, or the like. This makes it possible to acquire rights for only the necessary portion and to respond flexibly to user needs.
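Rights defined at the level of content, chapter, or utterance scene could be checked as in the following sketch. The data layout is an assumption for illustration: each right is stored as a tuple prefix of `(content_id, chapter_id, scene_id)`, so that a right to a whole content item or chapter covers every scene inside it. The names `Scene` and `has_right` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    content_id: str
    chapter_id: str
    scene_id: str


def has_right(owned_rights, scene):
    """Return True if any owned right covers the scene.

    owned_rights: set of tuples, each one of
        (content_id,)                       right to the whole content
        (content_id, chapter_id)            right to one chapter
        (content_id, chapter_id, scene_id)  right to one utterance scene
    """
    key = (scene.content_id, scene.chapter_id, scene.scene_id)
    # A right at a coarser granularity is a prefix of the scene's full key.
    return any(key[:n] in owned_rights for n in (1, 2, 3))
```

Storing rights as key prefixes keeps the check independent of how finely a given content holder chooses to sell rights.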

An example of utterance scene authentication by the authentication unit 170 will be described with reference to FIG. 2. In this description, the service server that provides the communication service is assumed to have the functions of the communication service providing apparatus 100.

(a) is the authentication method used when an utterance scene provided by the service server, or a chapter or content item containing that utterance scene, is used as the next utterance scene.

First, the service server accepts from the speaking user a selection, from among the utterance scenes estimated by the next utterance scene estimation unit 150, of the utterance scene to be used as the next utterance scene. Next, the service server sends to the content holder the content identification information of the selected utterance scene, or of the chapter or content item containing it, together with the user identification information of at least one of the speaking user and the receiving user, and the like. The content holder then, based on the information received from the service server, transmits to the service server the right information for the selected utterance scene, or for the chapter or content item containing it.

The service server then determines, based on the right information received from the content holder, whether the speaking user holds the right to the selected utterance scene. If so, the service server transmits the utterance scene to the receiving user. If not, the service server charges the speaking user the fee required to purchase the right and, once the speaking user has paid, transmits the utterance scene to the receiving user. The service server also transmits to the content holder the right information for the utterance scene for which the fee was paid.
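The decision made by the service server in method (a) can be summarized by the following sketch. It is a simplification under assumptions: the right information returned by the content holder is modelled as a plain set of scene identifiers, `charge_user` is a hypothetical callback that returns True once payment succeeds, and adding the scene to the set stands in for reporting the newly purchased right to the content holder.

```python
def deliver_scene(scene_id, user_rights, charge_user):
    """Decide whether the selected scene may be sent to the receiving user.

    scene_id: identifier of the scene selected by the speaking user.
    user_rights: set of scene ids the speaking user holds rights to
        (assumed form of the right information from the content holder).
    charge_user: hypothetical callback; returns True if the fee was paid.
    Returns (delivered, newly_purchased).
    """
    if scene_id in user_rights:
        # The speaking user already holds the right: send immediately.
        return True, False
    if charge_user(scene_id):
        # Payment succeeded: record the right (stands in for notifying the
        # content holder) and send the scene.
        user_rights.add(scene_id)
        return True, True
    # No right and no payment: the scene is not sent.
    return False, False
```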

(b) is the authentication method used when the speaking user's terminal holds the right to the utterance scene, or to a chapter or content item containing it.

First, if the speaking user's terminal holds the right to the utterance scene that the speaking user selected from among those estimated by the next utterance scene estimation unit 150, the terminal transmits the DRM information it holds, the user identification information of at least one of the speaking user and the receiving user, and the like to the service server. Next, the service server transmits the received DRM information, the identification information of the speaking user and the receiving user, and the like to the content holder. The content holder then, based on the information received from the service server, transmits to the service server the right information for the selected utterance scene, or for the chapter or content item containing it.

Next, the service server determines, based on the right information received from the content holder, whether the speaking user holds the right needed to send the selected utterance scene to the receiving user. If so, the service server transmits the utterance scene to the receiving user. If not, the service server charges the speaking user the fee required to purchase the right and, once the speaking user has paid, transmits the utterance scene to the receiving user. The service server also transmits to the content holder the right information for the utterance scene for which the fee was paid.

The communication service providing apparatus 100 need not include the authentication unit 170; the right to the utterance scene that the speaking user selected from among those estimated by the next utterance scene estimation unit 150 may instead be authenticated using an existing system. In that case, once the existing system has authenticated the right to the selected utterance scene, the content is transmitted from the content holder to the service server, and the service server transmits the received content to the receiving user.

<Communication service processing flow>
The communication service process according to the first embodiment of the present invention consists of an utterance model creation process and a next utterance scene estimation process. FIG. 3 is a diagram showing the flow of the utterance model creation process according to the first embodiment of the present invention.

First, in step S1, the utterance scene extraction unit 110 divides the content into utterance scenes, one utterance per scene.

Next, in step S2, the utterance scene extraction unit 110 extracts the utterance text from an utterance scene obtained in step S1.

Next, in step S3, the utterance scene extraction unit 110 determines whether utterance text has been extracted from all the utterance scenes obtained in step S1. If it has (YES), the process proceeds to step S4; if not (NO), the process returns to step S2.

Next, in step S4, an utterance model is learned from the utterance scenes obtained in step S1 and the utterance texts extracted in step S2.

Next, in step S5, the utterance model learned in step S4 is stored in the utterance model storage unit 130.
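Steps S4 and S5 can be pictured with the sketch below, which assumes that steps S1 to S3 have already produced the list of utterance texts in content order. It is a deliberately simple lookup table, not the learning technique of Non-Patent Document 2: each window of consecutive utterance texts is a state sequence, and the index of the scene that follows it is recorded as an action node. The function name and the `window` parameter are assumptions.

```python
from collections import defaultdict


def learn_utterance_model(utterance_texts, window=2):
    """Build a map from state sequences to action nodes.

    utterance_texts: utterance texts of the scenes, in content order
        (the output of steps S1 to S3).
    window: number of consecutive utterances forming one state sequence
        (hypothetical parameter).
    Returns a dict: tuple of `window` texts -> list of next-scene indices.
    """
    model = defaultdict(list)
    for i in range(window, len(utterance_texts)):
        state = tuple(utterance_texts[i - window:i])
        # Action node: the scene that follows this state sequence.
        model[state].append(i)
    return dict(model)
```

The returned dictionary is what would be kept in the utterance model storage unit in this simplified picture; the same state sequence may lead to several next scenes when it occurs more than once in the content.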

FIG. 4 is a diagram showing the flow of the next utterance scene estimation process according to the first embodiment of the present invention.

First, in step S11, the next utterance scene estimation unit 150 acquires the utterance history of the communication from the communication history storage unit 140.

Next, in step S12, the next utterance scene estimation unit 150 acquires the utterance model from the utterance model storage unit 130.

Next, in step S13, the next utterance scene estimation unit 150 estimates an utterance scene suitable for the next utterance in the communication, based on the utterance history acquired in step S11 and the utterance model acquired in step S12.

As described above, according to the present embodiment, a scene suitable for the next utterance in a communication can be estimated based on an utterance model learned from multimedia content including utterances and on the utterance history of the communication. As a result, by providing the utterance scene estimated to be suitable for the next utterance to the users engaged in the communication, a user can easily use an utterance scene that fits the next utterance, which enables smooth communication using utterance scenes.

<Second Embodiment>
A second embodiment of the present invention will be described with reference to FIG. 5. The communication service providing apparatus of the present embodiment estimates the next utterance scene in a communication based on attribute information of utterance scenes. Components bearing the same reference numerals as in the first embodiment have the same functions, and their detailed description is omitted.

<Configuration of communication service providing apparatus>
FIG. 5 is a diagram showing the configuration of a communication service providing apparatus 200 according to the second embodiment of the present invention. As shown in FIG. 5, in the present embodiment the communication service providing apparatus 200 comprises an utterance scene extraction unit 110, a tag assignment unit 210, a tag storage unit 220, an utterance learning unit 121, an utterance model storage unit 130, a communication history storage unit 140, a next utterance scene tag estimation unit 230, and a same-tag scene search unit 240.

The tag assignment unit 210 assigns, as a tag, the attribute information of each utterance scene to each utterance scene extracted by the utterance scene extraction unit 110. The attribute information of each utterance scene includes at least the utterance text of the scene, the emotions of the characters appearing in the scene, and the constituent elements of the scene. The constituent elements of an utterance scene are the stage of the story (for example, one of introduction, development, turn, and conclusion), the characters appearing, the screen composition such as the position and size of the characters, and the background such as a school or a seaside. The attribute information may be acquired automatically, for example by image analysis of the utterance scene, or acquired manually from the utterance scene.

The tag storage unit 220 stores the tag assigned to each utterance scene by the tag assignment unit 210 in association with the corresponding utterance scene extracted by the utterance scene extraction unit 110.

The utterance learning unit 121 takes, as a state tag sequence, the time series of tags assigned by the tag assignment unit 210 that corresponds to the time series of utterance scenes in a certain section of the content, takes the tag assigned to the utterance scene following that section as an action tag node, and learns a map from state tag sequences to action tag nodes as the utterance model.

The next utterance scene tag estimation unit 230 estimates the tag assigned to a scene suitable for the next utterance in the communication, based on the utterance history so far stored in the communication history storage unit 140 and the utterance model stored in the utterance model storage unit 130. The next utterance scene tag estimation unit 230 performs this estimation in response to receiving an utterance scene request. The utterance scene request may be issued voluntarily by the user, or issued automatically when one user finishes an utterance.

The same-tag scene search unit 240 searches the tags stored in the tag storage unit 220 for tags that match the tag estimated by the next utterance scene tag estimation unit 230. The same-tag scene search unit 240 then takes the utterance scenes to which the retrieved tags are assigned as utterance scenes suitable for the next utterance. In this way, an utterance scene suitable for the next utterance in a communication can be estimated from the tags carrying the attribute information of utterance scenes.
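The same-tag search can be sketched as follows. The tag layout is an assumption for illustration only: each tag is reduced to a tuple of (emotion, story stage, background), whereas the attribute information described above also includes the utterance text and further screen composition details. The function name `search_same_tag_scenes` is hypothetical.

```python
def search_same_tag_scenes(tag_store, estimated_tag):
    """Return the ids of all scenes whose stored tag equals the estimated tag.

    tag_store: dict mapping scene id -> tag, standing in for the tag
        storage unit. Each tag is assumed to be a tuple such as
        (emotion, story_stage, background).
    estimated_tag: the tag estimated for the next utterance scene.
    """
    return sorted(sid for sid, tag in tag_store.items() if tag == estimated_tag)
```

Because several scenes can share one tag, the search naturally returns more than one candidate, which is how scenes that only roughly fit the flow of the communication also come to be offered.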

As described above, according to the present embodiment, an utterance scene suitable for the next utterance in a communication is estimated from the attribute information of utterance scenes. Because attribute information is an abstraction of an utterance scene, not only the utterance scene that best fits the flow of the communication but also utterance scenes that roughly fit it can be estimated as suitable for the next utterance. As a result, utterance scenes the user did not expect are also provided, which widens the range of utterance scenes available for the communication.

<Third Embodiment>
A third embodiment of the present invention will be described with reference to FIG. 6. The communication service providing apparatus of the present embodiment selects, from among a plurality of utterance models, an utterance model generated from specific content, and estimates the next utterance scene from the selected utterance model. Components bearing the same reference numerals as in the first embodiment have the same functions, and their detailed description is omitted.

<Configuration of communication service providing apparatus>
FIG. 6 is a diagram showing the configuration of a communication service providing apparatus 300 according to the third embodiment of the present invention. As shown in FIG. 6, in the present embodiment the communication service providing apparatus 300 comprises an utterance scene extraction unit 110, an utterance learning unit 122, an utterance model storage unit 132, a usage history storage unit 310, a content history vector generation unit 320, an utterance history storage unit 330, an utterance history vector generation unit 340, a similar user extraction unit 350, a content candidate extraction unit 360, an utterance model selection unit 370, and a next utterance scene estimation unit 152.

For each content item, the utterance learning unit 122 treats the utterance texts extracted by the utterance scene extraction unit 110, arranged in the chronological order of their source utterance scenes, as a sequence of utterances. It takes the series of utterance texts extracted from the utterance scenes in a certain section of the content as a state sequence and the utterance scene following that section as an action node, and learns a map from state sequences to action nodes as the utterance model.

The utterance model storage unit 132 stores, per content item, the utterance model generated for that content item by the utterance learning unit 122. The utterance model may be created periodically, or in response to an instruction from, for example, an administrator of the communication service providing apparatus 300.

The usage history storage unit 310 stores, for each user, the history of content the user has used. For example, the usage history storage unit 310 stores content identification information and usage counts in association with user identification information.

Based on the history stored in the usage history storage unit 310, the content history vector generation unit 320 generates, for each user, a content history vector whose basis is the set of content items and whose coefficients are the usage counts of those content items. The content history vector can be expressed by equation (1).

The utterance history storage unit 330 stores an utterance history for each user. Specifically, the utterance history storage unit 330 stores, in association with user identification information, the utterances the user made in past communications.

Based on the utterance history stored in the utterance history storage unit 330, the utterance history vector generation unit 340 generates, for each user, an utterance history vector whose basis is the set of words and whose coefficients are the occurrence frequencies of those words. The utterance history vector can be expressed by equation (2).
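The two history vectors can be represented as sparse count vectors, as in the sketch below. Equations (1) and (2) are not reproduced in this text, so the representation is an assumption: a `collections.Counter` maps each basis element (a content id or a word) to its coefficient (a usage count or an occurrence frequency). Whitespace tokenization is likewise an assumption, and both function names are hypothetical.

```python
from collections import Counter


def content_history_vector(usage_log):
    """Build a content history vector for one user.

    usage_log: iterable of content ids, one entry per use.
    Returns a Counter: content id -> number of uses.
    """
    return Counter(usage_log)


def utterance_history_vector(utterances):
    """Build an utterance history vector for one user.

    utterances: iterable of past utterance texts.
    Returns a Counter: word -> occurrence count.
    Assumption: whitespace tokenization; Japanese text would need a
    morphological analyzer.
    """
    counts = Counter()
    for utterance in utterances:
        counts.update(utterance.lower().split())
    return counts
```

A `Counter` returns 0 for any basis element the user never touched, which matches the idea of a vector with zero coefficients on unused content items or unseen words.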

Based on the content history vectors generated by the content history vector generation unit 320, the similar user extraction unit 350 extracts, as a similar user, a user whose distance from the user engaged in the communication is small. Specifically, a similarity measure to each other user is calculated by equation (3), and the user for whom this measure, treated as a distance, is smallest is taken as the content-similar user.

The similar user extraction unit 350 also extracts, based on the utterance history vectors generated by the utterance history vector generation unit 340, a user whose distance from the user engaged in the communication is small as a similar user. Specifically, as in the case of extracting a similar user based on the content history vectors, a similarity measure to each other user is calculated by equation (3), and the user for whom this measure, treated as a distance, is smallest is taken as the utterance-similar user.
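Extraction of the closest user can be sketched as below. Equation (3) is not reproduced in this text, so Euclidean distance between the sparse history vectors is used here purely as an assumed stand-in; the actual measure may differ. The function names `distance` and `most_similar_user` are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two sparse vectors given as dicts.

    Assumption: this stands in for equation (3), which is not shown here.
    """
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0) - v.get(k, 0)) ** 2 for k in keys))


def most_similar_user(target_id, vectors):
    """Return the id of the user closest to the target user, or None.

    vectors: dict mapping user id -> history vector (dict of coefficients),
        either content history vectors or utterance history vectors.
    """
    others = [uid for uid in vectors if uid != target_id]
    if not others:
        return None
    target = vectors[target_id]
    return min(others, key=lambda uid: distance(target, vectors[uid]))
```

The same function serves for both kinds of similar user: passing content history vectors yields the content-similar user, and passing utterance history vectors yields the utterance-similar user.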

The content candidate extraction unit 360 extracts content candidates for the user engaged in the communication from the history stored in the usage history storage unit 310. Specifically, based on that history, the content candidate extraction unit 360 extracts the content that the user engaged in the communication uses frequently.

The content candidate extraction unit 360 also extracts content candidates from the history stored in the usage history storage unit 310 based on the content-similar user obtained by the similar user extraction unit 350. Specifically, based on that history, the content candidate extraction unit 360 extracts the content that the content-similar user uses frequently.

Furthermore, the content candidate extraction unit 360 extracts content candidates from the history stored in the usage history storage unit 310 based on the utterance-similar user obtained by the similar user extraction unit 350. Specifically, based on that history, the content candidate extraction unit 360 extracts the content that the utterance-similar user uses frequently.

The utterance model selection unit 370 selects, from the plurality of utterance models stored in the utterance model storage unit 132, the utterance models stored in association with the content candidates extracted by the content candidate extraction unit 360.
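Candidate extraction followed by model selection can be sketched as follows. The cut-off `top_k` is an assumed parameter, since the text only says that frequently used content is extracted, and the per-content models are represented by an ordinary dictionary keyed by content id. Both function names are hypothetical.

```python
def extract_content_candidates(usage_counts, top_k=2):
    """Return the ids of the most frequently used content items.

    usage_counts: dict mapping content id -> usage count for one user
        (the communicating user or a similar user).
    top_k: how many candidates to keep (assumed parameter).
    Ties are broken by content id so the result is deterministic.
    """
    ranked = sorted(usage_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [content_id for content_id, _ in ranked[:top_k]]


def select_models(models_by_content, candidates):
    """Pick the stored utterance models that belong to the candidates.

    models_by_content: dict mapping content id -> utterance model,
        standing in for the per-content utterance model storage.
    Candidates without a stored model are skipped.
    """
    return {cid: models_by_content[cid] for cid in candidates if cid in models_by_content}
```

Running the first function on the usage counts of the communicating user, the content-similar user, and the utterance-similar user in turn gives the three kinds of candidates described above.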

The next utterance scene estimation unit 152 estimates an utterance scene suitable for the next utterance in the communication, based on the utterance history so far stored in the utterance history storage unit 330 and on the utterance model selected by the utterance model selection unit 370 from the utterance model storage unit 132. The next utterance scene estimation unit 152 performs this estimation in response to receiving an utterance scene request. The utterance scene request may be issued voluntarily by the user, or issued automatically when one user finishes an utterance. The method of estimating a scene suitable for the next utterance is the same as in the first embodiment.

As described above, the present embodiment relies on the observation that content a user uses often is content the user likes, and is therefore likely to be used in communication. By using an utterance model learned from content the user uses often to estimate the next utterance scene, utterance scenes contained in that content can be presented to the user as suitable for the next utterance, which improves the accuracy of estimating utterance scenes suitable for the next utterance.

In addition, by using an utterance model learned from content frequently used by a similar user, that is, a user whose content usage or utterances resemble those of the user, utterance scenes contained in content that the user has not used or uses only rarely, but that is estimated to suit the user's taste, can be presented as suitable for the next utterance. Utterance scenes the user did not expect are therefore also provided, which widens the user's range of choices.

The communication service providing apparatus of the present invention can be realized by recording the processing of the apparatus as a program on a computer-readable recording medium and having a device read and execute the program recorded on the medium. The computer system referred to here includes an OS and hardware such as peripheral devices.

When a WWW (World Wide Web) system is used, the "computer system" also includes a homepage providing environment (or display environment). The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line.

The program may also implement only part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

The embodiments of the present invention have been described above in detail with reference to the drawings. However, the specific configuration is not limited to these embodiments and also includes designs and the like within a scope that does not depart from the gist of the invention.

DESCRIPTION OF SYMBOLS
100 Communication service providing apparatus
110 Utterance scene extraction unit
120 Utterance learning unit
130 Utterance model storage unit
140 Communication history storage unit
150 Next utterance scene estimation unit
160 Narrowing unit
170 Authentication unit

Claims

A communication service providing apparatus that provides, as content used for communication, a scene suitable for the communication from multimedia content including utterances, the apparatus comprising:
utterance scene extraction means for segmenting the multimedia content including utterances into utterance scenes, each corresponding to one utterance, and extracting an utterance text from each utterance scene;
utterance learning means for treating the sequence of utterance texts extracted by the utterance scene extraction means as a communication sequence and learning, as an utterance model, a map from a state sequence to an action node, where the state sequence is the sequence of utterance texts in a certain section of the content and the action node is the next utterance scene for that section;
utterance model storage means for storing the utterance model learned by the utterance learning means;
communication history storage means for storing an utterance history of the communication; and
next utterance scene estimation means for estimating an utterance scene suitable for the next utterance in the communication, based on the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means.
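As an illustration only, the learning of a map from a state sequence to an action node, and its use for estimating the next utterance scene, might be sketched as follows. The function names, the fixed window length, the exact-match lookup, and the sample scene identifiers are all assumptions made for this sketch; the claim does not specify how the map is represented or learned.

```python
from collections import Counter, defaultdict


def learn_utterance_model(scenes, window=2):
    """Learn a map from state sequences to action nodes.

    scenes: list of (scene_id, utterance_text) pairs in content order.
    A state sequence is the tuple of `window` consecutive utterance texts;
    the action node is the id of the scene that follows them.
    """
    model = defaultdict(Counter)
    for i in range(window, len(scenes)):
        state = tuple(text for _, text in scenes[i - window:i])
        model[state][scenes[i][0]] += 1
    return model


def estimate_next_scenes(model, history, window=2):
    """Return candidate scene ids for the latest utterances in the history."""
    state = tuple(history[-window:])
    counts = model.get(state, Counter())
    return [scene_id for scene_id, _ in counts.most_common()]


# Hypothetical content: six utterance scenes in their original order.
scenes = [
    ("s1", "hello"), ("s2", "how are you"), ("s3", "fine"),
    ("s4", "hello"), ("s5", "how are you"), ("s6", "great"),
]
model = learn_utterance_model(scenes)
candidates = estimate_next_scenes(model, ["hi", "hello", "how are you"])
```

Exact matching on utterance texts is the simplest possible lookup; a practical system would more likely compare utterances by similarity rather than identity.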

The communication service providing apparatus according to claim 1, further comprising:
tag assignment means for assigning, as a tag, attribute information of each utterance scene to each utterance scene extracted by the utterance scene extraction means;
tag storage means for storing the tag assigned to each utterance scene by the tag assignment means in association with that utterance scene;
next scene tag estimation means for estimating a tag assigned to a scene suitable for the next utterance of the communication, based on the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means; and
tag scene search means for searching for an utterance scene suitable for the next utterance, based on the tag estimated by the next scene tag estimation means and the tags stored in the tag storage means,
wherein the utterance learning means learns, as the utterance model, a map from a state tag sequence to an action tag node, where the state tag sequence is the sequence of tags assigned by the tag assignment means to the sequence of utterance scenes in a certain section of the content and the action tag node is the tag assigned to the next utterance scene for that section.
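The tag-based variant can be pictured with a minimal sketch, assuming one representative tag per scene for learning and a set of tags per scene in the tag store. The function names, the window length, and the sample tags are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def learn_tag_model(tag_sequence, window=2):
    """Map each state tag sequence to counts of the action tag that follows."""
    model = defaultdict(Counter)
    for i in range(window, len(tag_sequence)):
        state = tuple(tag_sequence[i - window:i])
        model[state][tag_sequence[i]] += 1
    return model


def estimate_next_tag(model, tag_history, window=2):
    """Return the most frequent action tag for the latest tags, or None."""
    counts = model.get(tuple(tag_history[-window:]))
    return counts.most_common(1)[0][0] if counts else None


def search_scenes_by_tag(tag_store, tag):
    """Return ids of stored scenes whose tag set contains the given tag."""
    return [scene_id for scene_id, tags in tag_store.items() if tag in tags]


# Hypothetical tags assigned to consecutive scenes of one content.
tags = ["greeting", "question", "joy", "greeting", "question", "joy"]
tag_model = learn_tag_model(tags)
next_tag = estimate_next_tag(tag_model, ["greeting", "question"])

# Hypothetical tag storage: scene id -> set of tags.
tag_store = {"s1": {"greeting"}, "s3": {"joy", "laugh"}, "s6": {"joy"}}
matching_scenes = search_scenes_by_tag(tag_store, next_tag)
```

Learning over tags rather than raw texts lets scenes from different content share the same states, at the cost of depending on how the tags are assigned.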

The communication service providing apparatus according to claim 2, wherein the attribute information includes at least an utterance text of the utterance scene, an emotion of a character appearing in the utterance scene, and a component of the utterance scene.

4. The communication service providing apparatus according to claim 1, wherein the utterance scene extracting unit extracts a spoken word and a sound effect as the utterance text.

The communication service providing apparatus according to claim 1, wherein
the utterance learning means learns an utterance model for each content by treating the sequence of utterance texts extracted by the utterance scene extraction means as a communication sequence and learning a map from a state sequence to an action node, where the state sequence is the sequence of utterance texts in a certain section of the content and the action node is the next utterance scene for that section,
the utterance model storage means stores, for each content, the utterance model generated by the utterance learning means,
the apparatus further comprises:
usage history storage means for storing, for each user, a history of the multimedia content used by that user;
content candidate extraction means for extracting content candidates from the history stored in the usage history storage means for the user performing the communication; and
utterance model selection means for selecting, from the plurality of utterance models stored in the utterance model storage means, the utterance model stored in association with each content candidate extracted by the content candidate extraction means, and
the next utterance scene estimation means estimates an utterance scene suitable for the next utterance of the communication from the utterance model selected by the utterance model selection means, based on the utterance history of the communication stored in the communication history storage means.
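The selection of per-content utterance models from a user's usage history might look like the following sketch. It assumes that the most frequently used contents serve as candidates; the claim leaves the extraction criterion open, and the names `extract_content_candidates`, `select_utterance_models`, and `top_n` are hypothetical.

```python
from collections import Counter


def extract_content_candidates(usage_history, user, top_n=2):
    """Return the user's most frequently used content ids (assumed criterion)."""
    counts = Counter(usage_history.get(user, []))
    return [content_id for content_id, _ in counts.most_common(top_n)]


def select_utterance_models(models_by_content, candidates):
    """Keep only the stored models associated with the candidate contents."""
    return {c: models_by_content[c] for c in candidates if c in models_by_content}


# Hypothetical usage history: user -> list of content ids, one per use.
usage_history = {
    "alice": ["manga_a", "anime_b", "manga_a", "film_c", "anime_b", "manga_a"],
}
# Hypothetical model store: content id -> learned utterance model (stubbed here).
models_by_content = {"manga_a": {"model": "A"}, "film_c": {"model": "C"}}

content_candidates = extract_content_candidates(usage_history, "alice")
selected_models = select_utterance_models(models_by_content, content_candidates)
```

Estimation then runs only against the selected models, which keeps the suggested scenes within content the user is likely to recognize.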

Content history vector generation means for generating, for each user and based on the history stored in the usage history storage means, a content history vector whose basis is the individual multimedia contents and whose coefficients are the numbers of times each multimedia content was used;
Content similar user extraction means for extracting, as content similar users, users at a small distance from the user performing the communication, based on the content history vectors generated by the content history vector generation means;
With
6. The content candidate extracting unit extracts a content candidate from a history stored in a usage history storage unit based on the content similar user obtained by the content similar user extracting unit. The communication service providing apparatus described.
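A content history vector and the extraction of content similar users can be illustrated as below. The claim only requires a "small distance"; the Euclidean distance, the parameter `k`, and all names in this sketch are assumptions.

```python
import math
from collections import Counter


def content_history_vector(history, contents):
    """Vector with one coefficient per content: the number of times it was used."""
    counts = Counter(history)
    return [counts[c] for c in contents]


def content_similar_users(usage_history, user, contents, k=1):
    """Return the k users whose content history vectors are closest to the user's."""
    target = content_history_vector(usage_history[user], contents)
    distances = []
    for other, history in usage_history.items():
        if other == user:
            continue
        vector = content_history_vector(history, contents)
        distances.append((math.dist(target, vector), other))
    return [name for _, name in sorted(distances)[:k]]


# Hypothetical contents (the vector basis) and per-user usage histories.
contents = ["a", "b", "c"]
usage = {"u": ["a", "a", "b"], "v": ["a", "a", "b", "b"], "w": ["c", "c", "c"]}
similar = content_similar_users(usage, "u", contents)
```

Content candidates can then be drawn from the histories of the returned users, which is how content the user has never used can still be suggested.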

Utterance history storage means for storing an utterance history for each user;
Utterance history vector generation means for generating, for each user and based on the utterance history stored in the utterance history storage means, an utterance history vector whose basis is the individual words and whose coefficients are the appearance frequencies of those words;
Utterance similar user extraction means for extracting, as utterance similar users, users at a small distance from the user performing the communication, based on the utterance history vectors generated by the utterance history vector generation means;
With
6. The content candidate extracting unit extracts a content candidate from a history stored in the usage history storage unit based on the utterance similar user obtained by the utterance similar user extracting unit. Or the communication service provision apparatus of Claim 6.
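The utterance history vector can be sketched as a sparse word-frequency vector. The whitespace tokenization and the cosine distance used here are assumptions for illustration: Japanese text would require a morphological analyzer, and the claim does not name a particular distance. All function names are hypothetical.

```python
import math
from collections import Counter


def utterance_history_vector(utterances):
    """Sparse vector: word -> appearance frequency over the user's utterances."""
    return Counter(word for u in utterances for word in u.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def utterance_similar_users(utterance_history, user, k=1):
    """Return the k users whose word-frequency vectors are closest to the user's."""
    target = utterance_history_vector(utterance_history[user])
    ranked = sorted(
        (cosine_distance(target, utterance_history_vector(history)), name)
        for name, history in utterance_history.items()
        if name != user
    )
    return [name for _, name in ranked[:k]]


# Hypothetical per-user utterance histories.
utterance_log = {
    "u": ["let us go now", "go go"],
    "v": ["go now"],
    "w": ["sleepy today"],
}
similar_speakers = utterance_similar_users(utterance_log, "u")
```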

The communication service providing apparatus according to claim 1, further comprising narrowing means for, when the next utterance scene estimation means estimates a plurality of utterance scenes suitable for the next utterance in the communication, narrowing down the candidate utterance scenes suitable for the next utterance by performing an image search over the plurality of utterance scenes based on text received from the user performing the communication.
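Narrowing by user text could be approximated as a keyword match against descriptive tags attached to each candidate scene. This is a deliberately simplified stand-in for the image search recited above; the function `narrow_candidates` and the tag data are hypothetical.

```python
def narrow_candidates(candidates, query, scene_tags):
    """Keep candidates that share at least one query word with their tags."""
    words = set(query.lower().split())
    return [s for s in candidates if words & scene_tags.get(s, set())]


# Hypothetical estimated candidates and per-scene descriptive tags.
estimated = ["s3", "s6", "s9"]
scene_tags = {"s3": {"joy", "beach"}, "s6": {"joy", "school"}}
narrowed = narrow_candidates(estimated, "Beach sunset", scene_tags)
```

Scenes without any stored tags, such as `s9` here, are dropped rather than kept, which is one possible policy among several.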

The communication service providing apparatus according to any one of claims 1 to 8, further comprising authentication means for authenticating whether the user performing the communication holds the rights to the utterance scene that the user has selected from among the utterance scenes estimated by the next utterance scene estimation means to be suitable for the next utterance in the communication,
wherein, when the authentication means succeeds in the authentication, the utterance scene selected by the user is transmitted to another user taking part in the communication.
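The rights check before transmission can be pictured as a simple gate. The rights table, the outbox list, and the function name are assumptions for this sketch; the claim does not describe how rights are stored or how scenes are delivered.

```python
def send_selected_scene(user, scene_id, rights, recipients, outbox):
    """Queue the scene for each recipient only if the user holds its rights."""
    if scene_id not in rights.get(user, set()):
        return False  # authentication failed: nothing is transmitted
    for recipient in recipients:
        outbox.append((recipient, scene_id))
    return True


# Hypothetical rights table: user -> set of scene ids the user may use.
rights = {"alice": {"s3"}}
outbox = []
sent_ok = send_selected_scene("alice", "s3", rights, ["bob"], outbox)
sent_denied = send_selected_scene("alice", "s6", rights, ["bob"], outbox)
```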

A communication service providing method performed by a communication service providing apparatus that provides, as content used for communication, a scene suitable for the communication from multimedia content including utterances, wherein
the communication service providing apparatus includes utterance scene extraction means, utterance learning means, utterance model storage means, communication history storage means for storing an utterance history of the communication, and next utterance scene estimation means, the method comprising:
a first step in which the utterance scene extraction means segments the multimedia content including utterances into utterance scenes, each corresponding to one utterance, and extracts an utterance text from each utterance scene;
a second step in which the utterance learning means treats the sequence of utterance texts extracted in the first step as a communication sequence and learns, as an utterance model, a map from a state sequence to an action node, where the state sequence is the sequence of utterance texts in a certain section of the content and the action node is the next utterance scene for that section;
a third step in which the utterance model storage means stores the utterance model learned in the second step; and
a fourth step of estimating an utterance scene suitable for the next utterance in the communication, based on the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means.

A program for causing a computer to execute a communication service providing method performed by a communication service providing apparatus that provides, as content used for communication, a scene suitable for the communication from multimedia content including utterances, wherein
the communication service providing apparatus includes utterance scene extraction means, utterance learning means, utterance model storage means, communication history storage means for storing an utterance history of the communication, and next utterance scene estimation means, the program causing the computer to execute:
a first step in which the utterance scene extraction means segments the multimedia content including utterances into utterance scenes, each corresponding to one utterance, and extracts an utterance text from each utterance scene;
a second step in which the utterance learning means treats the sequence of utterance texts extracted in the first step as a communication sequence and learns, as an utterance model, a map from a state sequence to an action node, where the state sequence is the sequence of utterance texts in a certain section of the content and the action node is the next utterance scene for that section;
a third step in which the utterance model storage means stores the utterance model learned in the second step; and
a fourth step of estimating an utterance scene suitable for the next utterance in the communication, based on the utterance history of the communication stored in the communication history storage means and the utterance model stored in the utterance model storage means.