JP2019211909A - Information presentation system, information presentation method and program

Info

Publication number: JP2019211909A
Application number: JP2018106181A
Authority: JP
Inventors: 亮平波多野; Ryohei Hatano
Original assignee: Toppan Printing Co Ltd
Current assignee: Toppan Inc
Priority date: 2018-06-01
Filing date: 2018-06-01
Publication date: 2019-12-12
Anticipated expiration: 2038-06-01
Also published as: JP7180127B2

Abstract

To provide an information presentation system in which each user can accurately hear the voice information the system provides, in which a change model that modifies utterance data for accurate hearing is set for each user, and in which the change model is easy to maintain because it requires less data than a rule base.

SOLUTION: An information presentation system according to the present invention comprises: a listening orientation estimation unit that estimates, in a dialogue with a user, a listening orientation indicating how easily each user can hear a voice, and that creates and updates a change model that modifies the utterance data supplied to the user by voice according to that user's listening orientation; and a presentation control unit that, in accordance with the change model set for each user, modifies the utterance data, which is a voice answer from a dialogue system, so that it matches each user's listening orientation.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information presentation system, an information presentation method, and a program.

In recent years, improvements in the Internet environment have made social network services (hereinafter, SNS) widespread, allowing multiple users to communicate easily using text and images. Representative SNS applications include LINE (registered trademark), Facebook (registered trademark) Messenger, and Slack (registered trademark). These SNSs support not only one-to-one exchanges between users but also a function for sharing the information sent and received among many users in a given group (dialogue among multiple users) with all users in that group.

Man-machine interactive SNSs include Google Assistant (registered trademark), Amazon Alexa (registered trademark), and LINE Clova (registered trademark).
Each of these applications is installed on personal computers and smart devices, and on smart speakers such as Google Home (registered trademark), Amazon Echo (registered trademark), and Clova WAVE (registered trademark), where speech is synthesized; products that present information primarily by voice are also widely used.

Various techniques have been devised, application by application, for presenting information in an SNS.
For example, one presentation method rewrites the linguistic expression of an information sentence so that the sentence is presented together with a character defined in the system (see, for example, Patent Document 1).

When information is presented to the user by voice, the system synthesizes the speech for the presentation and utters it to the user through a speaker or the like. Some information presentation methods have the application process the utterance so that the user does not perceive mechanical unnaturalness when listening to it (see, for example, Patent Documents 2 and 3).

Patent Document 1: Japanese Patent No. 6161656
Patent Document 2: Japanese Patent No. 5954348
Patent Document 3: Japanese Patent No. 6232892

However, when a dialogue between the user and the system is conducted by voice alone, users differ in hearing ability and in their understanding of words, because user attributes such as age and gender are diverse.
As a result, in a dialogue with the system, the user may be unable to hear the information the system provides by voice, or unable to understand words in the speech, so that the content of the information is not conveyed accurately.

A rule-based method can be used for dialogue between the user and the system, in which dialogue scenarios are assumed in advance to determine how to answer a question from the user.
However, each rule set in the rule base must be maintained by hand so that it accommodates each of many users. In this maintenance, setting rules that include synonymous words and expressions for words that are hard to hear or hard to understand, for each of the diverse users described above, and building them into dialogue scenarios requires an enormous amount of work.

The present invention has been made in view of these circumstances. It provides an information presentation system, an information presentation method, and a program in which each user can accurately hear the information the system provides by voice, a change model that modifies utterance data so that it can be heard accurately is provided for each user, and maintenance of the change model (correction processing that adapts it to the user step by step) is easy because the amount of data is smaller than with a rule base.

To solve the problem described above, the information presentation system of the present invention comprises: a listening orientation estimation unit that, in a dialogue with a user, estimates a listening orientation indicating how easily each user can hear speech, and that generates, predicts, and updates a change model (including the listening orientation estimation model and the listening orientation template model of the embodiment) that modifies utterance data supplied to the user by voice according to the user's listening orientation; and a presentation control unit that, in accordance with the change model set for each user, modifies the utterance data, which is a voice answer from a dialogue system, so that it matches each user's listening orientation.

The information presentation system of the present invention further comprises a dialogue processing unit that determines a response to an utterance from the user based on rules, and that writes a dialogue history, which is the history of the dialogue with each user and is used in estimating the listening orientation, to a dialogue history storage unit for each user.

In the information presentation system of the present invention, the listening orientation estimation unit extracts the user's listening orientation from the user's evaluation of the utterance data in the dialogue, and writes the user's attribute information and the orientation information indicating the user's listening orientation to a user attribute storage unit for each user.

The information presentation system of the present invention further comprises a grouping estimation unit that groups the users into classifications according to each user's attribute information and generates, from the listening orientation common to the users in each classification, a template change model, which is the change model for that classification.

In the information presentation system of the present invention, for a user for whom no change model has been prepared, the listening orientation estimation unit extracts the template change model of the classification corresponding to that user and generates a change model for that user according to the listening orientation extracted in the dialogue.

In the information presentation system of the present invention, the attribute information is set as a combination of demographic data including at least the user's age, gender, and place of residence.

In the information presentation system of the present invention, the change model indicates processing that changes at least: word replacement in the utterance data of the system response determined by the dialogue processing unit; the frequency and speed of the voice used when the utterance data is read aloud; and the phrase breaks.

In the information presentation system of the present invention, the presentation control unit writes the change content, that is, how the utterance data was modified by the change model, to a dialogue action storage unit as a change history, and the listening orientation estimation unit extracts the user's listening orientation from the dialogue history and the change history.

The information presentation method of the present invention includes: a listening orientation estimation step in which a listening orientation estimation unit, in a dialogue with a user, estimates a listening orientation indicating how easily each user can hear speech, and generates and updates a change model that modifies utterance data supplied to the user by voice according to the user's listening orientation; and a presentation control step in which a presentation control unit, in accordance with the change model set for each user, modifies the utterance data, which is a voice answer from a dialogue system obtained through a dialogue processing unit that determines a response to an utterance from the user based on rules, so that it matches each user's listening orientation.

The program of the present invention causes a computer to function as: listening orientation estimation means that, in a dialogue with a user, estimates a listening orientation indicating how easily each user can hear speech, and generates and updates a change model that modifies utterance data supplied to the user by voice according to the user's listening orientation; and presentation control means that, in accordance with the change model set for each user, modifies the utterance data, which is a voice answer from a dialogue system, so that it matches each user's listening orientation.

As described above, the present invention can provide an information presentation system, an information presentation method, and a program in which each user can accurately hear the information the system provides by voice, and a change model that modifies utterance data so that it can be heard accurately is provided for each user. Because the amount of data, starting with the dialogue scenarios that a rule base requires to be built in advance around assumed dialogue content, is smaller than with a rule base, maintenance of the change model (correction processing that adapts it to the user step by step) is easy.
In addition, the grouping estimation unit makes it possible to optimize the presented information for users who have no change model, or for whom too little utterance data has been accumulated, by using a template change model, which is a change model generalized within a group.

FIG. 1 is a block diagram showing a configuration example of an information presentation system in which a user and the system conduct a dialogue, according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the dialogue processing unit 102 in this embodiment.
FIG. 3 is a diagram showing a configuration example of the dialogue history table stored in the dialogue history storage unit 107.
FIG. 4 is a block diagram showing a configuration example of the listening orientation estimation unit 103 in this embodiment.
FIG. 5 is a diagram showing a configuration example of the user attribute table stored in the user attribute storage unit 108.
FIG. 6 is a diagram showing a configuration example of the dialogue action table stored in the dialogue action storage unit 109.
FIG. 7 is a conceptual diagram explaining the word replacement processing in the action shown in FIG. 6.
FIG. 8 is a diagram showing a configuration example of the grouping table stored in the grouping storage unit 110.
FIG. 9 is a flowchart showing an operation example of a dialogue system using the information presentation system of this embodiment.
FIG. 10 is a conceptual diagram showing another configuration example of an information presentation system in which a user and the system conduct a dialogue, according to an embodiment of the present invention.

The present invention relates to a dialogue system in which, for example, when a user asks a question, the system reports an answer to that question as speech synthesized from utterance data, or in which users converse with one another. It also relates to a configuration that modifies the utterance data reported by the system according to the user's hearing ability and word comprehension, so that the user can more easily hear the answer from the dialogue system and understand its content.

To accommodate the user's hearing ability, the system changes, for example, the frequency of the voice, the playback speed, the phrase breaks, and the duration of each break when the utterance data is synthesized and output as speech. Here, a break is the silence inserted between phrases or between words when the utterance data is played back as speech. The break duration is the length of that silence.
To accommodate the user's word comprehension, the system replaces, for example, a technical term with a commonly used synonym (or near-synonym).
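The two kinds of change described here (prosody parameters for hearing ability, word replacement for comprehension) could be held together in one per-user record. The following is a minimal sketch under stated assumptions: the class `ChangeModel`, its field names, the function `apply_change_model`, and the sample values are all hypothetical and do not come from the publication, which does not specify a data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChangeModel:
    """Hypothetical per-user change model; all field names are illustrative."""
    pitch_shift_hz: float = 0.0   # change to the frequency of the synthesized voice
    speed_ratio: float = 1.0      # playback speed, 1.0 meaning unchanged
    pause_ms: int = 0             # silence inserted at each phrase break
    replacements: Dict[str, str] = field(default_factory=dict)  # technical term -> common word


def apply_change_model(phrases: List[str], model: ChangeModel) -> dict:
    """Replace words phrase by phrase and attach the prosody settings
    that a speech synthesizer would consume downstream."""
    changed = []
    for phrase in phrases:
        for term, plain in model.replacements.items():
            phrase = phrase.replace(term, plain)
        changed.append(phrase)
    return {
        "phrases": changed,
        "pitch_shift_hz": model.pitch_shift_hz,
        "speed_ratio": model.speed_ratio,
        "pause_ms": model.pause_ms,
    }


# Illustrative values only.
model = ChangeModel(speed_ratio=0.8, pause_ms=300,
                    replacements={"hypertension": "high blood pressure"})
plan = apply_change_model(["Your record shows hypertension", "please see a doctor"], model)
```

Keeping the text change and the prosody settings in one returned structure is a design choice of this sketch; it lets a later synthesis step apply them together.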

An embodiment of the present invention is described below with reference to the drawings. The following description, which corresponds to FIG. 1, takes a dialogue between a user and the system as an example.
FIG. 1 is a block diagram showing a configuration example of an information presentation system in which a user and the system conduct a dialogue, according to an embodiment of the present invention.
In FIG. 1, the information presentation system 1 includes an information presentation server 10 and a user terminal 11.
The information presentation server 10 and the user terminal 11 exchange data via a network 500, which is an information communication network that includes the Internet.

The information presentation server 10 outputs to the user terminal 11, as voice data, an answer to the user's question or other input supplied via the user terminal 11. Here, voice data is speech digitized in a given audio file format (such as an uncompressed, lossy compressed, or lossless compressed audio format). The information presentation server 10 is, for example, a general-purpose computer or a personal computer.

The user terminal 11 is, for example, a smart speaker that conducts a voice dialogue between the user and the information presentation system 1 and serves as a platform for using a virtual personal assistant (VPA) such as Amazon Alexa (registered trademark), Apple Siri (registered trademark), or Google Assistant (registered trademark). The user terminal 11 may also be a mobile terminal such as a smartphone or tablet, or a personal computer, and the system may be applied to a dialogue application in which requests and responses are made by voice alone, without displaying images (including text) on a display unit.

The information presentation server 10 includes a data input/output unit 101, a dialogue processing unit 102, a listening orientation estimation unit 103, a presentation control unit 104, a speech synthesis unit 105, a grouping estimation unit 106, a dialogue history storage unit 107, a user attribute storage unit 108, a dialogue action storage unit 109, a grouping storage unit 110, and a language knowledge storage unit 111.

The data input/output unit 101 is an external input interface that receives, via the network 500, data including the voice data of speech the user has input to the user terminal 11.
The data input/output unit 101 is also an external output interface that outputs data including voice data, such as an answer to the content of the user's voice data, to the user terminal 11 via the network 500.
The data input/output unit 101 also has a function of acquiring data (including voice data), such as control signals for operating the information presentation server 10, directly from input means including a microphone, a keyboard, and various sensors.

The dialogue processing unit 102 analyzes the voice data from the user and estimates its content. The dialogue processing unit 102 then generates utterance data, in the form of a text sentence, as the answer message (the answer in the dialogue) to the estimated content.

FIG. 2 is a block diagram showing a configuration example of the dialogue processing unit 102 in this embodiment. In FIG. 2, the dialogue processing unit 102 includes an analysis unit 1021, a dialogue management unit 1022, and a generation unit 1023.
The analysis unit 1021 transcribes the voice data supplied from the data input/output unit 101, that is, converts it into text, and writes the text to the dialogue history storage unit 107.

The dialogue processing unit 102 also performs morphological analysis on the converted text sentence, then converts the sentence into numerical data through processing such as keyword extraction from the resulting morphemes and vectorization of the morphemes. This conversion uses natural language processing or machine learning techniques. In this embodiment, techniques such as keyword extraction by the tf-idf (term frequency-inverse document frequency) method and vectorization by word2vec or doc2vec may be used. The dialogue processing unit 102 outputs the extracted keywords or the vectorized numerical data to the dialogue management unit 1022.
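As an illustration of the tf-idf keyword extraction mentioned above, here is a minimal pure-Python sketch. It assumes the morphological analysis has already produced token lists; the function name `tfidf_keywords`, the sample messages, and the particular weighting (raw term frequency times the natural log of inverse document frequency) are assumptions for illustration, since the publication does not fix a formula.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[List[str]], doc_index: int, top_k: int = 2) -> List[str]:
    """Return the top_k tokens of docs[doc_index] ranked by tf-idf.

    tf  = count of the token in the document / document length
    idf = ln(number of documents / number of documents containing the token)
    Ties are broken alphabetically so the result is deterministic.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document

    target = docs[doc_index]
    term_freq = Counter(target)
    scores = {
        token: (count / len(target)) * math.log(n_docs / doc_freq[token])
        for token, count in term_freq.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical, already tokenized user messages.
messages = [
    ["i", "want", "to", "go", "to", "kyoto"],
    ["i", "want", "the", "weather"],
    ["i", "want", "to", "eat"],
]
keywords = tfidf_keywords(messages, 0)
```

Tokens that appear in every message ("i", "want") receive an idf of zero, so they drop out of the ranking, which is the property that makes tf-idf useful for picking out the content words of a request.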

FIG. 3 is a diagram showing a configuration example of the dialogue history table stored in the dialogue history storage unit 107. In FIG. 3, each record of the dialogue history table has the following columns: message time, speaker ID (identification), message body, listening-oriented phrase flag, message ID1, and message ID2.

Here, the message time is the time at which the voice data of a message was input or the utterance data of a message was output. The speaker ID is the identification information of the speaker who produced the voice data or utterance data (the user ID described later, if the speaker is a user). For example, the speaker ID "U_001" means that the speaker is a user and is the user identification information that identifies that user. The speaker ID "C_001" means that the speaker is the information presentation server 10 (the system) and is the system identification information that identifies that system. When the speaker is the information presentation server 10, a different system answers for each type of content in the user's voice data (weather, health, science, and so on), so there are multiple distinct system identifiers.

The message body holds text data: either the text converted from voice data or the utterance data. The listening-oriented phrase flag is assigned to a message body in which the user's voice data is not a question but instead indicates that the user could not understand the speech based on the utterance data from the information presentation server 10. When the listening-oriented phrase flag is "0", that is, when the flag is not set, the corresponding message body is an ordinary conversation phrase.

When the listening-oriented phrase flag is "1", that is, when the flag is set, the corresponding message body is a listening-oriented phrase, indicating that the user could not make out the content of the system's speech (could not hear it or could not understand it).
For example, in FIG. 3, the message body whose listening-oriented phrase flag is "1" is "I couldn't hear that, so please say it again", which indicates that the user could not hear the speech the information presentation server 10 output from the user terminal 11 (because of listening orientation parameters such as frequency, volume, and breaks). A listening-oriented phrase is a word indicating how easy the user found the audio content, which is the response from the information presentation server 10 to the user's request, to hear (corresponding to the positive words and negative words described later), or a synonym (near-synonym) of such a word, and is registered in advance by an expert as a predetermined phrase.

Message ID1 is the identification information that points to the message body in the same record. Message ID2 is the identification information (that is, the message ID1) of the message immediately preceding, in the dialogue between the user and the system, the message whose body is indicated by message ID1.
Therefore, to check the order of messages in a dialogue between the user and the system, one reads message ID2 from the record of the message body of interest and searches for the record whose message ID1 equals that message ID2. This retrieves the body of the immediately preceding message, so the successive message bodies of a dialogue can be followed easily.

For example, for message ID1: M180101003 with speaker ID C_001, message ID2 is M180101001. It is therefore easy to find that the question answered by "How about this weekend?" is "I want to go to ○○" in message ID1: M180101001. Message ID1 and message ID2 are not consecutive numbers because other dialogues may take place in between, so the messages of one dialogue are not always input consecutively.
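The backward lookup through message ID2 can be sketched as a walk over a linked chain of records. The dictionary keys (`message_id1`, `message_id2`, `body`), the function `trace_dialogue`, and the intermediate record M180101002 are hypothetical; only the two message IDs and bodies quoted above come from the text.

```python
from typing import Dict, List, Optional


def trace_dialogue(history: List[Dict[str, Optional[str]]], message_id: str) -> List[Optional[str]]:
    """Follow message ID2 links backwards from message_id and return
    the message bodies of that dialogue in chronological order."""
    by_id = {record["message_id1"]: record for record in history}
    chain = []
    seen = set()
    current = by_id.get(message_id)
    while current is not None and current["message_id1"] not in seen:
        seen.add(current["message_id1"])  # guard against accidental cycles
        chain.append(current["body"])
        current = by_id.get(current["message_id2"])
    return list(reversed(chain))


history = [
    {"message_id1": "M180101001", "message_id2": None,
     "speaker_id": "U_001", "body": "I want to go to ○○"},
    # Hypothetical message from another dialogue, interleaved in time.
    {"message_id1": "M180101002", "message_id2": None,
     "speaker_id": "U_002", "body": "What is the weather tomorrow?"},
    {"message_id1": "M180101003", "message_id2": "M180101001",
     "speaker_id": "C_001", "body": "How about this weekend?"},
]
dialogue = trace_dialogue(history, "M180101003")
```

Because each record stores only its predecessor, interleaved messages from other dialogues (such as the hypothetical M180101002) are skipped without any extra bookkeeping.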

Returning to FIG. 2, the dialogue management unit 1022 uses the message ID1 supplied together with the keywords or numerical data to refer to the dialogue history storage unit 107 and, from the relationship between message ID1 and message ID2, determines the user's state (conversation phrase or listening-oriented phrase). When the message body is a conversation phrase, the dialogue management unit 1022 determines the policy for the system response that the information presentation server 10 gives to the user (for example, a dialogue act type such as genre specification, place specification, weather specification, or transport timetable specification).

That is, the dialogue management unit 1022 may be configured as a database of system-side responses to user requests (questions) made as conversation phrases, or it may be configured around a dialogue model, built with a framework such as machine learning or reinforcement learning, that outputs the content of the response corresponding to a request. The processing of the dialogue management unit 1022 is the same as in commonly known dialogue systems, so a detailed description is omitted.

When the dialogue management unit 1022 detects, from the keywords and numerical data of the message body of each message ID1, whether the message is a conversation phrase or a listening-oriented phrase, it sets the listening-oriented phrase flag in the corresponding record of the dialogue history table in the dialogue history storage unit 107. If it determines that the message body is a conversation phrase, it sets the listening-oriented phrase flag to "0", leaving the flag unraised. If it determines that the message body is a listening-oriented phrase, it sets the listening-oriented phrase flag to "1", raising the flag.
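The flag operation can be illustrated with a simple keyword check. This is only a sketch: the source does not list the cue phrases or state how the classification is performed, so the `LISTENING_CUES` tuple and the `listening_phrase_flag` function are assumptions made for illustration.

```python
# Hypothetical cue phrases that suggest the user did not catch the response.
LISTENING_CUES = ("huh?", "once more", "say that again", "pardon")


def listening_phrase_flag(message_body: str) -> int:
    """Return 1 for a listening-oriented phrase, 0 for a conversation phrase."""
    text = message_body.lower()
    return 1 if any(cue in text for cue in LISTENING_CUES) else 0


print(listening_phrase_flag("Huh? Once more"))
print(listening_phrase_flag("I want to go to ○○"))
```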

Based on the system response guideline output by the dialogue management unit 1022, the generation unit 1023 uses the system response model corresponding to that guideline to generate utterance data, which is a response sentence in text form. That is, the generation unit 1023 inputs the keywords extracted from the message body of the request, the vectorized numerical data, and the like into the system response model, and generates utterance data whose content corresponds to the request. The system response model is generated by known supervised machine learning so that it yields response utterance data corresponding to the content of the request.

Instead of using a system response model obtained by machine learning, the generation unit 1023 may complete a sentence by obtaining the necessary information through an external API (Application Programming Interface) and inserting it into a sentence template prepared in advance.
For example, when the request concerns a train time, the generation unit 1023 searches for the train time with an external timetable-search API, using the departure station, the arrival station, and a given time, and inserts the search result at the predetermined position in the template to generate the response utterance data.
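The template approach can be sketched as below. The source names no specific API, so `lookup_timetable` is a hypothetical stand-in with made-up data, and the template wording, station names, and times are likewise assumptions.

```python
from string import Template
from typing import Optional

# Hypothetical sentence template with slots for the retrieved information.
RESPONSE_TEMPLATE = Template(
    "The next train from $departure to $arrival leaves at $time."
)


def lookup_timetable(departure: str, arrival: str, after: str) -> Optional[str]:
    """Stand-in for an external timetable-search API (made-up data)."""
    timetable = {("Tokyo", "Ueno"): ["09:05", "09:20", "09:35"]}
    for departure_time in timetable.get((departure, arrival), []):
        # Zero-padded HH:MM strings compare correctly as plain strings.
        if departure_time >= after:
            return departure_time
    return None


def generate_utterance(departure: str, arrival: str, after: str) -> str:
    """Fill the template with the search result to form the response text."""
    time = lookup_timetable(departure, arrival, after)
    if time is None:
        return "No train was found."
    return RESPONSE_TEMPLATE.substitute(
        departure=departure, arrival=arrival, time=time
    )


print(generate_utterance("Tokyo", "Ueno", "09:10"))
```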

In the present embodiment, either approach may be used: a system response model based on machine learning, or a sentence template filled in with information from an external API.
The present embodiment may also be configured with a database in which requests and their corresponding responses are written in advance. In that case, the dialogue management unit 1022 extracts the response corresponding to a request from the database, so the generation unit 1023 is not needed.
The generation unit 1023 outputs the generated utterance data, together with at least the listening-oriented phrase flag data, to the listening orientation estimation unit 103.

Returning to FIG. 1, the listening orientation estimation unit 103 estimates a user's listening orientation from the user's attribute information (user attribute information), the dialogue history, and the like, and estimates the presentation guideline to apply when the utterance data is supplied to the user as speech.
FIG. 4 is a block diagram showing a configuration example of the listening orientation estimation unit 103 in the present embodiment. As shown in FIG. 4, the listening orientation estimation unit 103 includes a model construction unit 1031 and a listening orientation management unit 1032.
The model construction unit 1031 generates a listening orientation estimation model, which estimates the user's listening orientation and determines the guideline for changing the characteristics of the utterance data and the voice at presentation time (for example, the adjustment amounts of listening-orientation parameters such as volume, reading speed, and phrase breaks). In the present embodiment, a listening-orientation parameter may be referred to simply as a parameter.

Given the user's attribute information and the text of the utterance data as input, the listening orientation estimation model outputs, as its estimation result, the listening-orientation parameters to be changed together with their change amounts, or the other words to substitute.
Like the listening orientation estimation model, the listening orientation template model takes the attribute information of a group and the text of the utterance data as input, and outputs, as its estimation result, the listening-orientation parameters to be changed together with their change amounts, or the other words to substitute.

Here, the model construction unit 1031 generates (constructs) mathematical formulas or rules for estimating the listening orientation as the listening orientation estimation model, and updates them successively. In the present embodiment, the listening orientation consists of parameters corresponding to the user's hearing ability: the voice frequency that is easy for the user to hear when listening to speech synthesized from the utterance data, the playback speed, the phrase breaks in the reproduced speech, the duration of those breaks, and so on.

When generating the listening orientation estimation model, the model construction unit 1031 uses algorithms such as machine learning, reinforcement learning, and neural networks. As described later, it optimizes the estimation algorithm that estimates the listening-orientation parameters and their change amounts in the listening orientation estimation model, based on the dialogue content and the history of listening-orientation parameter changes stored in the dialogue history storage unit 107 and the dialogue action storage unit 109.

That is, as the formulas and rules needed for estimation, the model construction unit 1031 prepares basis functions for parameter estimation, which estimate candidate replacement words and the listening-orientation parameters for the whole sentence of the utterance data, such as frequency, reading speed, and phrase breaks. It then constructs or updates the listening orientation estimation model (or the listening orientation template model described later), using as teacher data the dialogue content and the history of listening-orientation parameter changes stored in the dialogue history storage unit 107 and the dialogue action storage unit 109.

The listening orientation also covers the user's knowledge, that is, whether the user can understand the meaning of a word. Replacing (changing) a word with another, commonly used synonym (or near-synonym) that the user can understand is therefore also included as one of the parameters.
The listening orientation further relates to the utterance frequency described above. When a word in the utterance data contains a fricative or a plosive, the resulting speech includes high-frequency components even if the reading frequency is lowered. Replacing such a word with a synonym (or near-synonym) that contains neither fricatives nor plosives is therefore also included as one of the parameters.

The listening orientation estimation model described above may, for example, be configured as a database (database configuration). In this configuration, the following are held as replacement lists in association with each user's user attribute information: the adjustment amounts of parameters such as the easy-to-hear voice frequency, the playback speed, the frequency of phrase breaks, and the duration of those breaks; commonly used, understandable synonyms that replace technical or difficult words so that the user can understand them; and synonyms (or near-synonyms) free of fricatives and plosives that replace words containing a fricative or a plosive.

The listening orientation estimation model may instead be configured as a machine learning model trained on teacher data (machine learning model configuration). In this configuration, the model construction unit 1031 performs machine learning using, as teacher data, the data accumulated in the dialogue history storage unit 107, the user attribute storage unit 108, the dialogue action storage unit 109, and the grouping storage unit 110, and thereby generates a listening orientation estimation model that estimates how each listening-orientation parameter should be changed for each user.

Using the listening orientation estimation model that the model construction unit 1031 generated (derived) for each user, the listening orientation management unit 1032 outputs to the presentation control unit 104 the content of the utterance data change processing for that user. Here, the change processing means adjusting the voice frequency, the playback speed, the frequency of phrase breaks, the duration of those breaks, and so on, as described above, and replacing words with ones whose pronunciation is easier to hear.

In some cases, no listening orientation estimation model has yet been generated for the user who input the voice data of a request. In that case, the listening orientation management unit 1032 uses a listening orientation template model prepared in advance as a template, and outputs to the presentation control unit 104 the content of the utterance data change processing for that user.
The listening orientation management unit 1032 may also be configured to use the listening orientation template model of a group of similar users, identified with the grouping information described later, and to output to the presentation control unit 104 the content of the utterance data change processing for a user who has no listening orientation estimation model.
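The fallback order described above (per-user model, then the template model of a similar group, then a generic template) can be sketched as follows. The function name, the dictionary-based storage, and the placeholder model values are assumptions for illustration; the source does not prescribe how models are stored or looked up.

```python
from typing import Any, Dict


def select_model(
    user_id: str,
    user_models: Dict[str, Any],
    group_of_user: Dict[str, str],
    template_models: Dict[str, Any],
    default_model: Any,
) -> Any:
    """Pick the model used to decide how utterance data is changed."""
    # 1. Prefer the model built for this specific user.
    if user_id in user_models:
        return user_models[user_id]
    # 2. Otherwise use the template model of the user's group, if known.
    group_id = group_of_user.get(user_id)
    if group_id in template_models:
        return template_models[group_id]
    # 3. Fall back to a template prepared in advance.
    return default_model


# Hypothetical stores; strings stand in for actual model objects.
user_models = {"U_001": "model_U_001"}
group_of_user = {"U_002": "G_001"}
template_models = {"G_001": "template_G_001"}

for uid in ("U_001", "U_002", "U_003"):
    print(uid, select_model(uid, user_models, group_of_user, template_models, "default"))
```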

For the word replacements included in the utterance data changes supplied from the listening orientation estimation unit 103, the presentation control unit 104 uses the replacement table stored in the language knowledge storage unit 111 to extract the alternative word that replaces each target word. The replacement table gives the correspondence between a word and the synonymous word that replaces it. For example, as already described, the presentation control unit 104 refers to the replacement table to replace words that contain fricatives or plosives with words that do not, such as 今度 (kondo, "this coming") for 今週 (konshū, "this week") and 土曜日或いは日曜日 ("Saturday or Sunday") for 週末 (shūmatsu, "weekend").
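A minimal sketch of the table lookup, using the two word pairs given in the text. Plain substring replacement is an assumption made here for brevity; the source does not say how target words are located in the utterance data, and a real system would probably operate on morphologically segmented text.

```python
from typing import Dict

# Replacement pairs taken from the examples in the description.
REPLACEMENT_TABLE: Dict[str, str] = {
    "今週": "今度",
    "週末": "土曜日或いは日曜日",
}


def replace_words(utterance: str, table: Dict[str, str]) -> str:
    """Replace each target word with its entry from the replacement table."""
    # Longer targets first, so a short target never splits a longer one.
    for word in sorted(table, key=len, reverse=True):
        utterance = utterance.replace(word, table[word])
    return utterance


print(replace_words("今週の土曜日", REPLACEMENT_TABLE))
```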

The presentation control unit 104 then outputs the utterance data in which the words have been replaced to the speech synthesis unit 105.
The listening orientation estimation unit 103 may also be configured to replace not only words but also sentences (phrases) that contain a plosive or a fricative with synonymous (or near-synonymous) sentences that contain neither.

Here, the language knowledge storage unit 111 accumulates, as listening-orientation parameters based on linguistic knowledge about ease of hearing, commonly used and easily understood words that are synonyms of hard-to-understand words, and words free of fricatives and plosives that are synonyms of words containing a fricative or a plosive (these may be phrases, as described above).

For example, these combinations of a given word and the word that replaces it are written to the language knowledge storage unit 111, in advance or as later additions, using sources such as the paraphrasing know-how that medical staff and caregivers use to make themselves understood when talking with older people (the elderly), corpora (linguistic resources in which text and utterances are collected on a large scale and organized as a database), dictionaries of synonyms (near-synonyms), and thesauri (broader and narrower concepts of words).

The speech synthesis unit 105 synthesizes speech from the text of the utterance data supplied from the presentation control unit 104, in accordance with the parameters to be changed and their adjustment amounts, and thereby generates the voice content that serves as the response to the user. In doing so, the speech synthesis unit 105 changes, for example, the listening-orientation parameters corresponding to the user's hearing ability and their adjustment amounts: the voice frequency that is easy for the user to hear, the reading speed at playback, the phrase breaks, the duration of those breaks, and so on.
The speech synthesis unit 105 then outputs the voice content generated by speech synthesis to the user terminal 11 via the data input/output unit 101.

The grouping estimation unit 106 searches the group attribute table for the group that corresponds to the attribute data of a user for whom new history has been generated.
The grouping estimation unit 106 then adds that user to the group table of the retrieved group in the grouping storage unit 110 and stores it.

FIG. 5 shows a configuration example of the user attribute table stored in the user attribute storage unit 108. In FIG. 5, each record of the user attribute table has fields for user attribute items such as user ID, age, gender, volume, reading speed, breaks, and installation environment. The user ID is identification information that identifies each user who uses the information presentation system 1 through a user terminal 11. The age is the age of the user identified by the corresponding user ID. The gender indicates whether the user identified by the corresponding user ID is male or female.

The volume indicates the volume level (large, medium, small) at which the user identified by the corresponding user ID finds speech easy to hear (is able to hear it). The reading speed indicates the speech rate level (fast, normal, slow) at which that user finds an utterance easy to follow. The breaks field indicates how many breaks between spoken phrases (many, normal, few) that user finds easy to follow.
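One way to turn these three-level attributes into concrete synthesis settings is sketched below. The numeric values, the dictionary keys, and the `synthesis_settings` function are all assumptions chosen for illustration; the source specifies only the levels, not how they map to synthesizer controls.

```python
from typing import Dict, Union

# Hypothetical mappings from attribute levels to synthesizer settings.
VOLUME_GAIN_DB = {"large": 6.0, "medium": 0.0, "small": -6.0}
SPEED_RATE = {"fast": 1.2, "normal": 1.0, "slow": 0.8}
# Insert a pause after every N phrases: more breaks means a smaller N.
PAUSE_EVERY_N_PHRASES = {"many": 1, "normal": 2, "few": 4}


def synthesis_settings(user: Dict[str, str]) -> Dict[str, Union[float, int]]:
    """Translate a user attribute record into synthesis settings."""
    return {
        "gain_db": VOLUME_GAIN_DB[user["volume"]],
        "rate": SPEED_RATE[user["reading_speed"]],
        "pause_every": PAUSE_EVERY_N_PHRASES[user["breaks"]],
    }


# Hypothetical record for an older user.
user = {"volume": "large", "reading_speed": "slow", "breaks": "many"}
print(synthesis_settings(user))
```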

The installation environment describes the place where the user terminal 11 is installed, that is, the user's listening environment: whether the room is large and sound propagates easily, whether the room is small and prone to reverberation, and whether other sounds are unlikely or likely to intrude. In addition, the user attribute storage unit 108 stores, for each user, the listening orientation estimation model in association with the user ID.

FIG. 6 shows a configuration example of the dialogue action table stored in the dialogue action storage unit 109. In FIG. 6, each record of the dialogue action table has fields for items such as time, user ID, action type, action ID, implementation content, and message ID. The time is the time at which a process (action) that changes the utterance data in some way was performed. The user ID is identification information that identifies each user who uses the information presentation system 1 through a user terminal 11. The action type indicates whether the action was a system-led action (active), initiated by the system, or a user-led action (passive), performed in response to a request from the user.

The action ID is identification information that identifies the kind of change made by a system-led or user-led action. In FIG. 6, for example, action ID: A001 denotes "word replacement", action ID: A003 denotes "change of reading speed", and action ID: A004 denotes "change of breaks". The implementation content is the change actually applied to the utterance data as the action. In FIG. 6, as examples for action ID: A001, "今週→今度" indicates that the word 今週 ("this week") was replaced with the similar word 今度 ("this coming"), and "週末→土曜、日曜" indicates that the word 週末 ("weekend") was replaced with the similar words (words with a similar meaning) 土曜、日曜 ("Saturday, Sunday").

Here, the "shu" sound in 週 (shū, "week") is a fricative and contains high-frequency components, so for users who have difficulty hearing high frequencies the word needs to be replaced with a similar word that contains no fricative. Words containing plosives also include high frequencies, so for such users they likewise need to be replaced with similar words that contain no plosive.

As an example for action ID: A003, "speed: −" indicates that the reading speed of the utterance was lowered. As an example for action ID: A004, "break points: +" indicates that pauses of a predetermined duration are placed between phrases of the utterance data more often, that is, the frequency with which one phrase is read out and then followed by a predetermined pause before the next phrase is increased.
The message ID is identification information pointing to the message on which the action in the same record was performed, and is the same identification information as message ID1 in FIG. 3.
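The "−" and "+" notation of the implementation content can be read as a one-step move along an ordered list of levels. The sketch below applies that reading to a user's stored levels; this interpretation, the ASCII "+"/"-" notation, the level names, and the function names are assumptions, since the source only records what was changed and does not define how stored attributes are updated.

```python
from typing import Dict, List

SPEED_LEVELS: List[str] = ["slow", "normal", "fast"]
BREAK_LEVELS: List[str] = ["few", "normal", "many"]


def step(levels: List[str], current: str, direction: str) -> str:
    """Move one level up ("+") or down ("-"), clamped to the valid range."""
    index = levels.index(current) + (1 if direction == "+" else -1)
    return levels[max(0, min(index, len(levels) - 1))]


def apply_action(attrs: Dict[str, str], action_id: str, content: str) -> Dict[str, str]:
    """Return a copy of the attributes with one logged action applied."""
    updated = dict(attrs)
    direction = content.strip()[-1]  # last character: "+" or "-"
    if action_id == "A003":    # change of reading speed
        updated["reading_speed"] = step(SPEED_LEVELS, attrs["reading_speed"], direction)
    elif action_id == "A004":  # change of breaks
        updated["breaks"] = step(BREAK_LEVELS, attrs["breaks"], direction)
    return updated


attrs = {"reading_speed": "normal", "breaks": "normal"}
attrs = apply_action(attrs, "A003", "speed: -")
attrs = apply_action(attrs, "A004", "break points: +")
print(attrs)
```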

FIG. 7 is a conceptual diagram explaining the word replacement processing among the actions shown in FIG. 6. In FIG. 7, the speech uttered by the users, who are the speakers, and by the user terminal 11 is visualized as text for the purpose of explanation.
FIG. 7(a) shows word replacement as a system-led action. When the user 301 makes the request (question) "What about ○○?" of speech balloon 351 to the user terminal 11 by voice, the information presentation server 10 answers with the utterance data "今週の土曜日…" ("This week's Saturday...") of speech balloon 451 as the response. At this time, the listening orientation management unit 1032 refers to the listening orientation estimation model corresponding to the user 301, but the model does not specify any change processing of the utterance data for the user 301, so the utterance data supplied by the dialogue processing unit 102 is used as the response (answer) as it is.

In contrast, when the user 302 makes the request (question) "What about ○○?" of speech balloon 351 to the user terminal 11 by voice, the information presentation server 10 changes the utterance data "今週の土曜日…" ("This week's Saturday...") of speech balloon 451 to "今度の土曜日…" ("This coming Saturday...") of speech balloon 452 and then answers with it as the response. At this time, the listening orientation management unit 1032 refers to the listening orientation estimation model corresponding to the user 302. Because that model specifies change processing of the utterance data for the user 302, the utterance data supplied by the dialogue processing unit 102 is changed in accordance with the listening orientation estimation model.
That is, according to the attribute information, the user 302 (for example, in their 70s) is older than the user 301 (for example, in their 20s) and finds lower-frequency speech easier to hear, so the listening orientation estimation model specifies replacement of words that contain plosives or fricatives.

FIG. 7(b) shows word replacement as a user-led action. Although not illustrated, when the user 303 (for example, in their 70s) makes the request (question) "What about ○○?" to the user terminal 11 by voice, the information presentation server 10 uses the utterance data "今週の土曜日…" ("This week's Saturday...") of speech balloon 453, supplied by the dialogue processing unit 102, as the response (answer) as it is. In reaction to the spoken "今週の土曜日…", however, the user 303 inputs the listening-oriented phrase request "Huh? / Once more" of speech balloon 353.

The listening orientation management unit 1032 therefore refers to the listening orientation estimation model corresponding to the user 303 and, in response to the listening-oriented phrase, applies change processing for the user 303 to the utterance data supplied by the dialogue processing unit 102. As a result, the information presentation server 10 outputs to the user terminal 11 again the voice data in which "今週の土曜日…" ("This week's Saturday...") of speech balloon 453 has been changed to "今度の土曜日…" ("This coming Saturday...") of speech balloon 454.

FIG. 8 shows a configuration example of the grouping tables stored in the grouping storage unit 110. FIG. 8(a) shows a configuration example of the group attribute information table, which holds the attribute information of each group. In FIG. 8(a), each record of the group attribute information table has, as an example, fields for items such as group ID, age range, gender, and place of residence. The group ID is identification information that identifies each group. The age range is the range of ages of the users who make up the group.

For example, group ID: G_001 denotes a set of users whose age falls at least within the range of 60 to 75. Similarly, group ID: G_002 denotes a set of users whose age falls at least within the range of 10 to 20. The gender indicates whether the people who make up the group identified by the corresponding group ID are male or female. The place of residence indicates the region in which the users who make up that group live.
For each group identified by a group ID, the listening orientation template model corresponding to the above attributes of the users in that group is written to and stored in the grouping storage unit 110 in advance.
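The group search performed by the grouping estimation unit 106 can be sketched with the two age ranges given above. The `Group` class, the `find_group` function, and the rule that an unset gender matches any user are assumptions for illustration; the actual matching criteria and any further attributes (such as place of residence) are not detailed in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Group:
    group_id: str
    min_age: int
    max_age: int
    gender: Optional[str] = None  # None: gender is not a grouping attribute


# Age ranges taken from the examples in the description.
GROUPS: List[Group] = [
    Group("G_001", 60, 75),
    Group("G_002", 10, 20),
]


def find_group(age: int, gender: str, groups: List[Group] = GROUPS) -> Optional[str]:
    """Return the ID of the first group whose attributes match the user."""
    for group in groups:
        age_matches = group.min_age <= age <= group.max_age
        gender_matches = group.gender in (None, gender)
        if age_matches and gender_matches:
            return group.group_id
    return None


print(find_group(70, "female"))
print(find_group(40, "male"))
```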

FIG. 8(b) shows a configuration example of the grouping table, in which the users belonging to each group ID are assigned. Each record has fields for group ID, user ID, age, gender, volume, reading speed, breaks, and installation environment. The group ID is identification information that identifies each group. The user ID is identification information for a user classified into the group indicated by the group ID in the same record, and is the same identification information as the user ID in the user attribute table of FIG. 5.

The age is the age of the user identified by the corresponding user ID. The gender indicates whether the user identified by the corresponding user ID is male or female. When gender is not among the attributes used for grouping, the group contains both male and female users.

The installation environment describes the place where the user terminal 11 is installed, that is, the user's listening environment: whether the room is large and sound propagates easily, whether the room is small and prone to reverberation, and whether other sounds are unlikely or likely to intrude.

上述したように、グループの各々は、グルーピングテーブルにおいて規定されているグループの属性（グループ属性）と同様の属性を有するユーザが分類されている。
そして、上述したグルーピングにおける属性の種類は、人間の音声の聞き取り易さに詳しい学者や医者、あるいは介護施設の職員（看護師や介護士など）の聴取志向に詳しい識者の提示する属性の種類を用いても良い。
また、グルーピングにおける属性の種類は、複数のユーザの属性を特徴量としてクラスタリングなどの処理を行い、最も明確にユーザ動詞を分類できる特徴量の属性の種類を抽出する処理により設定しても良い。 As described above, in each group, users having the same attributes as the group attributes (group attributes) defined in the grouping table are classified.
The attribute types used in the grouping described above may be those proposed by experts familiar with listening orientation, such as scholars and doctors who are knowledgeable about how easily human speech is heard, or staff of nursing care facilities (nurses, caregivers, and the like).
Alternatively, the attribute types used in the grouping may be set by performing processing such as clustering with the attributes of a plurality of users as feature values, and extracting the attribute types of the feature values that classify the users from one another most clearly.

上述したいずれの処理により、グルーピングに用いる属性の種類を抽出したとしても、上記識者の治験に対応して、聴取志向テンプレートモデルの聴取志向における音量、読み上げ速度及び区切りなどのパラメータの変更の要否、変更する際のそれぞれのパラメータの調整量を設定しても良い。
本実施形態における情報提示サーバ１０の利用を開始した直後のユーザに対し、情報提示サーバ１０が上述した聴取志向のパラメータの変更の要否や、変更する際のパラメータの調整量のデータを、音声に対する聴取志向に対するユーザの対応から十分に抽出できていない。 Whichever of the above processes is used to extract the attribute types used for grouping, whether parameters such as volume, reading speed, and breaks need to be changed in the listening orientation of the listening orientation template model, and the adjustment amount of each parameter when it is changed, may be set in accordance with the knowledge of the above experts.
For a user who has only just started using the information presentation server 10 of the present embodiment, the information presentation server 10 has not yet been able to extract, from the user's reactions to the voice, sufficient data on whether the listening orientation parameters described above need to be changed and on the adjustment amounts to apply when they are changed.

このため、聴取志向推定部１０３は、聴取志向のパラメータのデータが十分に抽出できていないユーザに対し、このユーザの属性に近いグループを上記グルーピングテーブルにおいて検索し、検索して得られたグループの聴取志向テンプレートモデルを用いて、聴取志向のパラメータの要否あるいはパラメータの変更量を推定する。
そして、聴取志向推定部１０３は、ユーザの属性に用いた聴取志向テンプレートモデルを元に、聴取志向における各パラメータの変更の要否及び変更の際の調整量のデータを、ユーザからの音声に対する変更の要求から取得して、ユーザの各々の聴取志向推定モデルとする処理を行う。このとき、聴取志向推定部１０３は、すでに述べたように、聴取志向テンプレートモデルに対して、機械学習による最適化の処理を行うことで聴取志向推定モデルを生成しても良い。 For this reason, for a user from whom sufficient listening orientation parameter data has not yet been extracted, the listening orientation estimation unit 103 searches the grouping table for a group whose attributes are close to those of the user, and uses the listening orientation template model of the retrieved group to estimate whether the listening orientation parameters need to be changed and by how much.
Then, starting from the listening orientation template model selected for the user's attributes, the listening orientation estimation unit 103 acquires, from the user's requests to change the voice, data on whether each listening orientation parameter needs to be changed and on the adjustment amount to apply, and turns the template into a listening orientation estimation model for each user. At this time, as already described, the listening orientation estimation unit 103 may generate the listening orientation estimation model by applying optimization by machine learning to the listening orientation template model.
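The group lookup for a new user can be sketched as a nearest-neighbour search over the grouping table. This is only an illustration under stated assumptions: the text does not specify how "close" attributes are measured, so the `Group` fields, the `attribute_distance` weighting, and the sample values below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    group_id: str
    age: float              # representative age of the group (assumed field)
    gender: Optional[str]   # None when gender is not a grouping attribute
    environment: str        # e.g. "small_room", "large_room" (assumed labels)


def attribute_distance(user: dict, group: Group) -> float:
    """Hypothetical distance: smaller means the user is closer to the group."""
    distance = abs(user["age"] - group.age) / 10.0
    if group.gender is not None and group.gender != user["gender"]:
        distance += 1.0
    if group.environment != user["environment"]:
        distance += 1.0
    return distance


def nearest_group(user: dict, groups: list) -> Group:
    """Return the group whose attributes are closest to the user's."""
    return min(groups, key=lambda g: attribute_distance(user, g))


groups = [
    Group("G001", 75, None, "small_room"),
    Group("G002", 30, None, "large_room"),
]
new_user = {"age": 72, "gender": "female", "environment": "small_room"}
chosen = nearest_group(new_user, groups)
```

The template model of `chosen` would then serve as the starting point until enough of the user's own change requests have been collected.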

図９は、本実施形態の情報提示システムを用いた対話システムの動作例を示すフローチャートである。この図９のフローチャートの動作は、例えば、情報提示システム１における情報提示サーバ１０に対してアクセスし、ユーザがスマートスピーカなどのユーザ端末１１から音声によるリクエストを音声により情報提示サーバ１０送信して、情報提示サーバ１０との対話を行う際に開始される。以下の図９のフローチャートの動作説明は、グループ毎の聴取志向に対応した聴取志向テンプレートモデルの各々が、聴取志向推定部１０３において、すでに説明したように生成されて、グルーピング記憶部１１０に蓄積されている状態において行う。 FIG. 9 is a flowchart showing an operation example of a dialogue system using the information presentation system of the present embodiment. The operation of the flowchart in FIG. 9 starts, for example, when a user accesses the information presentation server 10 of the information presentation system 1 and transmits a spoken request from a user terminal 11 such as a smart speaker to the information presentation server 10 in order to conduct a dialogue with it. The following description of the flowchart in FIG. 9 assumes that the listening orientation template models corresponding to the listening orientation of each group have already been generated by the listening orientation estimation unit 103 as described above and are stored in the grouping storage unit 110.

ステップＳ１０１：
データ入出力部１０１は、いずれかのユーザ端末１１から音声データが供給されたか否かの判定を行う。そして、データ入出力部１０１は、いずれかのユーザ端末１１から音声データが供給された場合、処理をステップＳ２へ進める。一方、データ入出力部１０１は、いずれのユーザ端末１１からも音声データが供給されない場合、ステップＳ１０１の処理を繰り返す。 Step S101:
The data input/output unit 101 determines whether voice data has been supplied from any of the user terminals 11. When voice data has been supplied from any of the user terminals 11, the data input/output unit 101 advances the process to step S102. On the other hand, when no voice data has been supplied from any of the user terminals 11, the data input/output unit 101 repeats the process of step S101.

このとき、例えば、ユーザがユーザ端末１１に対して音声により、コンサート等が行われる日などの予定を問い合わせるリクエストを入力する。そして、ユーザ端末１１は、音声データとこの音声を入力したユーザのユーザＩＤとの各々を、情報提示サーバ１０にアクセスして送信する。この場合、データ入出力部１０１は、いずれかのユーザ端末１１から音声データが供給されたことを検出し、処理をステップＳ１０２へ進める。
そして、ステップＳ１０２に進める際、データ入出力部１０１は、入力した音声データを対話処理部１０２に対して出力する。また、データ入出力部１０１は、入力したユーザＩＤを聴取志向推定部１０３に対して出力する。 At this time, for example, the user speaks to the user terminal 11 a request asking about a schedule, such as the date on which a concert will be held. The user terminal 11 then accesses the information presentation server 10 and transmits both the voice data and the user ID of the user who spoke. In this case, the data input/output unit 101 detects that voice data has been supplied from one of the user terminals 11 and advances the process to step S102.
When proceeding to step S102, the data input/output unit 101 outputs the received voice data to the dialogue processing unit 102 and outputs the received user ID to the listening orientation estimation unit 103.

ステップＳ１０２：
聴取志向推定部１０３は、データ入出力部１０１からユーザＩＤが供給された場合、このユーザＩＤの示すユーザに対話の履歴があるか否かの判定を行う。すなわち、聴取志向推定部１０３は、ユーザ属性記憶部１０８を参照して、このユーザＩＤに対応して聴取志向推定モデルが記憶されているか否かの判定を行う。すなわち、ユーザに対話の履歴が無ければ、聴取志向テンプレートモデルから聴取志向推定モデルが生成されていない。
このとき、聴取志向推定部１０３は、ユーザ属性記憶部１０８にユーザに対応する聴取志向推定モデルが記憶されている場合、処理をステップＳ１０３へ進める。一方、聴取志向推定部１０３は、ユーザ属性記憶部１０８にユーザに対応する聴取志向推定モデルが記憶されていない場合、処理をステップＳ１０４へ進める。 Step S102:
When a user ID is supplied from the data input/output unit 101, the listening orientation estimation unit 103 determines whether the user indicated by this user ID has a dialogue history. Specifically, the listening orientation estimation unit 103 refers to the user attribute storage unit 108 and determines whether a listening orientation estimation model is stored for this user ID. If the user has no dialogue history, no listening orientation estimation model has yet been generated from a listening orientation template model.
At this time, when a listening orientation estimation model corresponding to the user is stored in the user attribute storage unit 108, the listening orientation estimation unit 103 advances the process to step S103. On the other hand, when no listening orientation estimation model corresponding to the user is stored in the user attribute storage unit 108, the listening orientation estimation unit 103 advances the process to step S104.

ステップＳ１０３：
聴取志向推定部１０３は、ユーザ属性記憶部１０８からユーザＩＤに対応する聴取志向推定モデルを読み出す。 Step S103:
The listening orientation estimation unit 103 reads the listening orientation estimation model corresponding to the user ID from the user attribute storage unit 108.

ステップＳ１０４：
聴取志向推定部１０３は、ユーザ属性記憶部１０８を参照し、ユーザＩＤに対応したユーザの属性情報を読み出す。
そして、聴取志向推定部１０３は、読み出した属性情報に近い属性情報を有するグループをグルーピング記憶部１１０のグループ属性情報テーブルから検索し、検索して得られたグループの聴取志向テンプレートモデルを読み出す。
また、グルーピング推定部１０６は、グルーピング記憶部１１０において、上記ユーザを検索したグループのグループテーブルに追加して書き込んで記憶させる。 Step S104:
The listening orientation estimation unit 103 refers to the user attribute storage unit 108 and reads the user attribute information corresponding to the user ID.
Then, the listening orientation estimation unit 103 searches the group attribute information table of the grouping storage unit 110 for a group having attribute information close to the read attribute information, and reads the listening orientation template model of the group obtained by the search.
In addition, the grouping estimation unit 106 adds the user to the group table of the retrieved group in the grouping storage unit 110 and stores it.
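The branch in steps S102 to S104 can be sketched as a lookup with a fallback. The dictionaries standing in for the user attribute storage unit 108 and the grouping storage unit 110, the function name `select_model`, and the `find_group` callback are hypothetical; only the control flow follows the description.

```python
def select_model(user_id, user_models, user_attributes,
                 group_templates, group_members, find_group):
    """Return (model, source) for a user, following steps S102 to S104."""
    model = user_models.get(user_id)              # S102: is a model stored?
    if model is not None:
        return model, "estimation_model"          # S103: use the user's own model
    group_id = find_group(user_attributes[user_id])          # S104: closest group
    group_members.setdefault(group_id, set()).add(user_id)   # register user in group
    return group_templates[group_id], "template_model"


# Hypothetical stand-ins for the storage units.
user_models = {"U001": {"speed_ratio": 0.8}}
user_attributes = {"U001": {"age": 70}, "U002": {"age": 68}}
group_templates = {"G001": {"speed_ratio": 0.9}}
group_members = {}

known = select_model("U001", user_models, user_attributes,
                     group_templates, group_members, lambda attrs: "G001")
fresh = select_model("U002", user_models, user_attributes,
                     group_templates, group_members, lambda attrs: "G001")
```

A user with a stored model never touches the group table, while a user without history is both served by the template and added to the group.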

ステップＳ１０５：
対話処理部１０２は、音声データをテキストデータに変換し、形態素解析を行って、得られた単語あるいは文節から、この音声データが会話フレーズであるか、あるいは聴取志向フレーズであるかの判定を行う。音声データが聴取志向フレーズであるということは、ユーザが発話データ（レスポンス）の音声の最適化（自身の聴取志向に合わせる変更）を要求していることを意味している。 Step S105:
The dialogue processing unit 102 converts the voice data into text data, performs morphological analysis, and determines from the resulting words or phrases whether the voice data is a conversation phrase or a listening-oriented phrase. Voice data that is a listening-oriented phrase means that the user is requesting optimization of the voice of the utterance data (response), that is, a change to match the user's own listening orientation.

したがって、対話処理部１０２は、このステップＳ１０５において、ユーザが発話データの音声の最適化を要求しているか否かの判定を行っている。
そして、対話処理部１０２は、ユーザが発話データの音声の最適化を要求していない場合、処理をステップＳ１０６へ進める。一方、対話処理部１０２は、ユーザが発話データの音声の最適化を要求している場合、処理をステップＳ１０７へ進める。
このとき、対話処理部１０２は、対話履歴記憶部１０７における対話履歴テーブルに対し、入力された音声データのテキストデータ、聴取志向フレーズの場合に聴取志向フレーズのフラグ、メッセージＩＤの各々の書き込みを行う。 Therefore, the dialog processing unit 102 determines whether or not the user requests optimization of the speech data voice in step S105.
Then, when the user does not request the optimization of the speech data, the dialogue processing unit 102 advances the process to step S106. On the other hand, the dialogue processing unit 102 advances the process to step S107 when the user requests optimization of the speech data.
At this time, the dialogue processing unit 102 writes, to the dialogue history table in the dialogue history storage unit 107, the text data of the input voice data, the listening-oriented phrase flag (when the input is a listening-oriented phrase), and the message ID.
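Step S105 can be sketched as a classifier that also records the request in the dialogue history. The description relies on speech recognition and morphological analysis; the sketch below substitutes plain substring matching on already-transcribed text, and the cue list, field names, and message IDs are hypothetical.

```python
# Hypothetical cues that mark a request to change how the response is spoken.
LISTENING_CUES = ("louder", "slower", "say that again", "can't hear")


def is_listening_phrase(text: str) -> bool:
    """Stand-in for morphological analysis: simple substring matching."""
    lowered = text.lower()
    return any(cue in lowered for cue in LISTENING_CUES)


def record_request(history: list, message_id: str, speaker_id: str, text: str) -> str:
    """Write the request to the history table and return the next step."""
    flag = 1 if is_listening_phrase(text) else 0
    history.append({
        "message_id": message_id,
        "speaker_id": speaker_id,
        "text": text,
        "listening_flag": flag,   # 1 = listening-oriented phrase
    })
    return "S107" if flag else "S106"


history = []
step_a = record_request(history, "M001", "U001", "When is the concert?")
step_b = record_request(history, "M003", "U001", "Please speak slower.")
```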

ステップＳ１０６：
入力された音声データが会話フレーズであるため、対話処理部１０２は、このリクエストの音声データに対応した発話データの生成を、音声データのテキスト文を形態素解析した単語の各々を用いて行う。
そして、聴取志向推定部１０３は、聴取志向推定モデルあるいは聴取志向テンプレートモデルにより、システム主導の発話データに対する変更処理の推定、ずなわち、ユーザの聴取志向のパラメータのなかから変更対象のパラメータと、変更量（あるいは単語の置き換え）を推定する。
また、聴取志向推定部１０３は、聴取志向のパラメータのなかから選択した変更対象のパラメータと、このパラメータの変更量（あるいは置き換える単語）とを、提示制御部１０４に対して出力する。 Step S106:
Since the input voice data is a conversation phrase, the dialogue processing unit 102 generates utterance data corresponding to the voice data of the request using each word obtained by morphological analysis of the text sentence of the voice data.
Then, using the listening orientation estimation model or the listening orientation template model, the listening orientation estimation unit 103 estimates the change processing to apply to the system-led utterance data, that is, it estimates which of the user's listening orientation parameters is to be changed and the amount of change (or the word replacement).
The listening orientation estimation unit 103 also outputs to the presentation control unit 104 the parameters to be changed selected from the listening orientation parameters and the amount of change (or replacement word) of the parameters.

ステップＳ１０７：
入力された音声データが聴取志向フレーズであるため、この時点においては、このフローチャートにおける前回の会話フレーズのループにおいて、リクエストに対するレスポンスとしての会話フレーズはすでに得られている。
このため、聴取志向推定部１０３は、聴取志向推定モデルあるいは聴取志向テンプレートモデルにより、聴取志向のパラメータのなかから変更対象のパラメータと、このパラメータの変更量を調整して、提示制御部１０４に対して出力する。 Step S107:
Since the input voice data is a listening-oriented phrase, at this point, the conversation phrase as a response to the request has already been obtained in the previous conversation phrase loop in this flowchart.
For this reason, using the listening orientation estimation model or the listening orientation template model, the listening orientation estimation unit 103 selects the parameter to be changed from among the listening orientation parameters, adjusts the amount of change of that parameter, and outputs both to the presentation control unit 104.

このとき、聴取志向推定部１０３は、対話行動記憶部１０９の対話行動テーブルに対して、単語の置き換えを行った処理を書き込んで記憶させる。このとき、聴取志向推定部１０３は、アクションタイプとしてシステム主導で行ったか、あるいはユーザ主導で行ったかのいずれかを記載する。また、聴取志向推定部１０３は、予め行動の各々に付されているアクションＩＤを記載し、アクションＩＤに対応した実施内容を記載する（記載例としては図６の対話行動テーブルを参照）。実施内容が単語の置き換え（アクションＩＤ：Ａ００１）の場合、提示制御部１０４がどの単語をどのような単語に置き換えたかを、対話行動テーブルの実施内容の欄に記載する。 At this time, the listening orientation estimation unit 103 writes and stores the word replacement process in the dialog action table of the dialog action storage unit 109. At this time, the listening orientation estimation unit 103 describes whether the action type is system-driven or user-driven. In addition, the listening orientation estimation unit 103 describes the action ID given to each action in advance, and describes the implementation content corresponding to the action ID (see the interactive action table in FIG. 6 for a description example). When the implementation content is word replacement (action ID: A001), which word is replaced by what word by the presentation control unit 104 is described in the implementation content column of the dialogue action table.

ここで、例えば、変更対象のパラメータが音量である場合、予め通常の音量からの変更量と規定されている大きさに対して、より大きい音量を変更量とする（変更量の調整）。また、変更対象のパラメータが読み上げ速度である場合、予め通常の読み上げ速度からの変更量と規定されている遅い速度に対して、より遅い速度を変更量とする。また、変更対象のパラメータが区切りである場合、予め通常の区切りの頻度からの変更量と規定されている区切りの頻度に対して、より多くの区切りの頻度を変更量とする。
また、このパラメータの各々は、一括して変更量を変更してもよいし、フローチャートのループが繰り返される毎に、変更する順番を決めておいて、変更量の調整を行っても良い。 Here, for example, when the parameter to be changed is a volume, a larger volume is set as a change amount (adjustment of the change amount) with respect to a size that is previously defined as a change amount from the normal volume. In addition, when the parameter to be changed is the reading speed, the slower speed is set as the changing amount with respect to the slow speed defined as the changing amount from the normal reading speed in advance. In addition, when the parameter to be changed is a delimiter, a larger delimiter frequency is set as the change amount than the delimiter frequency defined in advance as the amount of change from the normal delimiter frequency.
In addition, the amount of change for each of these parameters may be changed collectively, or the amount of change may be adjusted by determining the order of change each time the flowchart loop is repeated.
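The adjustment in step S107 can be sketched as follows. The step sizes (3 dB, 0.1 of the reading speed, one extra break), the lower bound on speed, and the fixed order of parameters are assumptions made for illustration; the description only states that each change is made stronger and that parameters may be changed together or one per loop.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Adjustment:
    volume_db: float = 0.0        # offset from the normal volume
    speed_ratio: float = 1.0      # 1.0 = normal reading speed
    breaks_per_sentence: int = 0  # extra breaks inserted per sentence


# Hypothetical order in which parameters are tried, one per loop.
ORDER = ("volume_db", "speed_ratio", "breaks_per_sentence")


def strengthen(adj: Adjustment, name: str) -> Adjustment:
    """Make one parameter change stronger: louder, slower, or more breaks."""
    if name == "volume_db":
        return replace(adj, volume_db=adj.volume_db + 3.0)
    if name == "speed_ratio":
        return replace(adj, speed_ratio=max(0.5, round(adj.speed_ratio - 0.1, 2)))
    if name == "breaks_per_sentence":
        return replace(adj, breaks_per_sentence=adj.breaks_per_sentence + 1)
    raise ValueError(name)


def adjust(adj: Adjustment, loop_index: int, collectively: bool = False) -> Adjustment:
    """Change all parameters at once, or one per flowchart loop in ORDER."""
    names = ORDER if collectively else (ORDER[loop_index % len(ORDER)],)
    for name in names:
        adj = strengthen(adj, name)
    return adj
```

Calling `adjust(Adjustment(), 0)` raises only the volume, while `adjust(Adjustment(), 0, collectively=True)` strengthens all three parameters in one pass.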

ステップＳ１０８：
提示制御部１０４は、聴取志向推定部１０３から供給される聴取志向における単語の置き換え処理の要求に対応し、発話データのテキストデータに含まれる擦過音及び破裂音を有する単語の各々を抽出する。そして、提示制御部１０４は、抽出した単語の各々に対応した置き換える単語を、言語知識記憶部１１１の置き換えテーブルを参照して、それぞれ抽出する。
そして、提示制御部１０４は、聞き取りやすい単語への置き換えを終了した発話データを、聴取志向のパラメータとそのパラメータの変更量との各々を、音声合成部１０５に対して出力する。 Step S108:
In response to the request for listening-oriented word replacement supplied from the listening orientation estimation unit 103, the presentation control unit 104 extracts each word containing a fricative or a plosive from the text data of the utterance data. The presentation control unit 104 then looks up a replacement word for each extracted word in the replacement table of the language knowledge storage unit 111.
The presentation control unit 104 then outputs, to the speech synthesis unit 105, the utterance data in which the replacement with easy-to-hear words has been completed, together with the listening orientation parameters and the amount of change of each parameter.
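The table-driven replacement in step S108 can be sketched as below. The table entries are arbitrary English placeholders and are not claimed to be easier to hear; the real replacement table of the language knowledge storage unit 111 is not given in the text, and the whitespace tokenization is a simplification of the morphological analysis the description assumes.

```python
# Hypothetical replacement table: hard-to-hear word -> replacement word.
REPLACEMENT_TABLE = {"commence": "begin", "approximately": "around"}


def replace_words(text: str, table: dict = REPLACEMENT_TABLE):
    """Return the rewritten text and the list of (old, new) replacements."""
    replaced = []
    out = []
    for token in text.split():
        core = token.strip(".,!?")          # ignore trailing punctuation
        new = table.get(core.lower())
        if core and new is not None:
            replaced.append((core, new))
            token = token.replace(core, new)
        out.append(token)
    return " ".join(out), replaced


rewritten, log = replace_words("The concert will commence at approximately noon.")
```

The returned `log` corresponds to what the dialogue action table records for a word replacement (action ID A001): which word was replaced with which.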

音声合成部１０５は、提示制御部１０４から供給される発話データを、変更するパラメータと、このパラメータの調整量に対応して、発話データのテキストデータを音声合成により、ユーザに対するレスポンスとしての音声コンテンツを生成する。
そして、音声合成部１０５は、音声合成により生成した音声コンテンツを、データ入出力部１０１を介して、ユーザ端末１１に対して出力する。 The speech synthesis unit 105 synthesizes speech from the text data of the utterance data supplied from the presentation control unit 104, applying the parameters to be changed and their adjustment amounts, and thereby generates voice content as the response to the user.
The speech synthesis unit 105 then outputs the voice content generated by speech synthesis to the user terminal 11 via the data input/output unit 101.

ステップＳ１０９：
聴取志向推定部１０３は、対話行動記憶部１０９の対話行動テーブルにおけるメッセージＩＤを参照し、このメッセージＩＤに連続するメッセージＩＤを対話履歴記憶部１０７の対話履歴テーブルから抽出する。
そして、聴取志向推定部１０３は、抽出したメッセージに対応するメッセージ本文の聴取志向フレーズフラグが「０」である場合に、聴取志向のパラメータの変更あるいは単語の置き換えが成功したと判定する。一方、聴取志向推定部１０３は、抽出したメッセージに対応するメッセージ本文の聴取志向フレーズフラグが「１」である場合に、聴取志向のパラメータの変更あるいは単語の置き換えが、聞き取り易さを向上させるために不十分であると判定する。 Step S109:
The listening orientation estimation unit 103 refers to the message ID in the dialogue behavior table of the dialogue behavior storage unit 109 and extracts a message ID continuous to the message ID from the dialogue history table of the dialogue history storage unit 107.
When the listening-oriented phrase flag of the message body corresponding to the extracted message is “0”, the listening orientation estimation unit 103 determines that the change of the listening orientation parameter or the word replacement has succeeded. On the other hand, when the listening-oriented phrase flag of the message body corresponding to the extracted message is “1”, the listening orientation estimation unit 103 determines that the change of the listening orientation parameter or the word replacement was insufficient to improve ease of hearing.
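The success check in step S109 can be sketched by looking at the message that follows the changed response in the dialogue history. The record layout and the message IDs are hypothetical, and returning `None` when no follow-up exists yet is an added convention, not something stated in the text.

```python
def change_succeeded(history: list, action_message_id: str):
    """Judge a change by the listening flag of the next message in the history.

    Returns True (success), False (insufficient), or None when the user
    has not yet said anything after the changed response.
    """
    ids = [m["message_id"] for m in history]
    index = ids.index(action_message_id)
    if index + 1 >= len(history):
        return None
    return history[index + 1]["listening_flag"] == 0


history = [
    {"message_id": "M001", "listening_flag": 0},  # user request
    {"message_id": "M002", "listening_flag": 0},  # system response (changed)
    {"message_id": "M003", "listening_flag": 1},  # user asks for a further change
    {"message_id": "M004", "listening_flag": 0},  # system response (changed again)
    {"message_id": "M005", "listening_flag": 0},  # user continues the conversation
]
```

Here the change applied to M002 is judged insufficient because M003 is a listening-oriented phrase, while the change applied to M004 is judged successful.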

聴取志向推定部１０３は、例えば、上述した聴取志向フレーズフラグが「１」であり、かつユーザ主導により変更した聴取志向におけるパラメータと、このパラメータの変更量とにより、ユーザに対応する聴取志向推定モデルを、よりユーザの聴取志向に適合させる修正処理を行う。
また、聴取志向推定部１０３は、グルーピング記憶部１１０のグループテーブルを参照し、グループを構成するユーザの各々に共通する変更された聴取志向におけるパラメータと、パラメータの変更量とを抽出し、聴取志向テンプレートモデルを、よりグループに含まれるユーザの聴取志向に適合させる修正処理を行う。 For example, when the above listening-oriented phrase flag is “1”, the listening orientation estimation unit 103 uses the listening orientation parameter changed at the user's initiative and the amount of change of that parameter to correct the listening orientation estimation model corresponding to the user so that it better fits the user's listening orientation.
The listening orientation estimation unit 103 also refers to the group table in the grouping storage unit 110, extracts the changed listening orientation parameters and amounts of change that are common to the users constituting a group, and corrects the listening orientation template model so that it better fits the listening orientation of the users in the group.

このとき、聴取志向推定部１０３は、例えば、聴取志向フレーズフラグが立っているメッセージＩＤに対応するメッセージ本文の形態素解析を行い、ポジティブワードあるいはネガティブワードを抽出し、ポジティブワードの場合、変更に対する評価値に「１」を加算（評価値をインクリメント）する処理を行い、一方、ネガティブワードの場合、変更に対する評価値から「１」を減算（評価値をディクリメント）する処理を行う。そして、聴取志向推定部１０３は、評価値が所定の閾値を超えた場合、変更した聴取志向のパラメータの変更量（あるいは置き換えた単語）を、聴取志向推定モデル及び聴取志向テンプレートモデルに反映させるように構成しても良い。 At this time, the listening orientation estimation unit 103 may, for example, perform morphological analysis on the message body corresponding to a message ID whose listening-oriented phrase flag is set and extract positive or negative words. For a positive word, it adds “1” to the evaluation value for the change (increments the evaluation value); for a negative word, it subtracts “1” from the evaluation value (decrements the evaluation value). The listening orientation estimation unit 103 may then be configured to reflect the amount of change of the changed listening orientation parameter (or the replaced word) in the listening orientation estimation model and the listening orientation template model when the evaluation value exceeds a predetermined threshold.
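The evaluation-value scheme can be sketched as a counter that is incremented or decremented per message. The word lists are hypothetical, whitespace splitting stands in for morphological analysis, and treating a message that contains both kinds of word as net zero is an assumption the text does not address.

```python
POSITIVE_WORDS = {"clear", "good", "thanks"}    # hypothetical positive words
NEGATIVE_WORDS = {"unclear", "worse", "still"}  # hypothetical negative words


def update_evaluation(value: int, message_text: str) -> int:
    """Add 1 for a positive word, subtract 1 for a negative word."""
    words = {w.strip(".,!?").lower() for w in message_text.split()}
    if words & POSITIVE_WORDS:
        value += 1
    if words & NEGATIVE_WORDS:
        value -= 1
    return value


def should_reflect(value: int, threshold: int) -> bool:
    """Reflect the change in the models only once the value exceeds the threshold."""
    return value > threshold
```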

また、聴取志向推定部１０３は、例えば、ネガティブワードやポジティブワードの抽出を行うのではなく、聴取志向のパラメータを変更して音声コンテンツを出力した後に、「聞き取り易かったですか？「はい」／「いいえ」でお答え下さい」や、「もう少しゆっくり読み上げましょうか？「このまま」／「ゆっくり」でお答え下さい」のテキストデータを、音声合成部１０５により音声合成して確認音声コンテンツに変更する。また、聴取志向推定部１０３は、この確認音声コンテンツをユーザ端末１１に対してデータ入出力部１０１を介して送信する。このアルゴリズムは、ユーザ主導の聴取志向のパラメータの変更に対応している。 Alternatively, instead of extracting negative or positive words, the listening orientation estimation unit 103 may, after changing a listening orientation parameter and outputting the voice content, have the speech synthesis unit 105 synthesize text data such as “Was it easy to hear? Please answer ‘yes’ or ‘no’” or “Shall I read a little more slowly? Please answer ‘as it is’ or ‘slowly’” into confirmation voice content. The listening orientation estimation unit 103 then transmits this confirmation voice content to the user terminal 11 via the data input/output unit 101. This algorithm corresponds to user-led changes of the listening orientation parameters.

そして、聴取志向推定部１０３は、上述した確認音声コンテンツに対するユーザの回答を入力する。このとき、聴取志向推定部１０３は、対話処理部１０２がユーザによる回答の音声データをテキスト変換した回答データを入力する。
そして、聴取志向推定部１０３は、例えば、「聞き取り易かったですか？」の質問に対する回答データが「はい」の場合、聴取志向のパラメータの変更が成功したと判定する。一方、「聞き取り易かったですか？」の質問に対する回答データが「いいえ」の場合、聴取志向のパラメータの変更が成功しなかったと判定する。 The listening orientation estimation unit 103 then receives the user's answer to the confirmation voice content described above. Specifically, it receives answer data obtained by the dialogue processing unit 102 converting the voice data of the user's answer into text.
For example, when the answer data to the question “Was it easy to hear?” is “yes”, the listening orientation estimation unit 103 determines that the change of the listening orientation parameter has succeeded. On the other hand, when the answer data to that question is “no”, it determines that the change of the listening orientation parameter has not succeeded.

これにより、聴取志向推定部１０３は、成功した場合に成功した聴取志向のパラメータの変更処理を、聴取志向推定モデル及び聴取志向テンプレートモデルに反映させる。
一方、聴取志向推定部１０３は、変更が失敗した場合、再度、聴取志向の他のパラメータの変更を行った音声コンテンツを生成して、ユーザに対してレスポンスとして出力する。 As a result, the listening orientation estimation unit 103 reflects the successful listening orientation parameter changing process in the listening orientation estimation model and the listening orientation template model when successful.
On the other hand, if the change has failed, the listening orientation estimation unit 103 again generates voice content in which another listening orientation parameter has been changed, and outputs it to the user as a response.

また、聴取志向推定部１０３は、例えば、「もう少しゆっくり読み上げましょうか？」の質問に対する回答データが「このまま」の場合、聴取志向のパラメータである読み上げ速度の変更が成功したと判定する。一方、「もう少しゆっくり読み上げましょうか？」の質問に対する回答データが「ゆっくり」の場合、聴取志向のパラメータである読み上げ速度の変更量が少ないため成功しなかったと判定する。
これにより、聴取志向推定部１０３は、成功した場合に成功した聴取志向のパラメータである読み上げ速度の変更処理を、聴取志向推定モデル及び聴取志向テンプレートモデルに反映させる。 Similarly, when the answer data to the question “Shall I read a little more slowly?” is “as it is”, the listening orientation estimation unit 103 determines that the change of the reading speed, which is a listening orientation parameter, has succeeded. On the other hand, when the answer data to that question is “slowly”, it determines that the change has not succeeded because the amount of change in the reading speed was too small.
Thus, the listening orientation estimation unit 103 reflects the reading speed change process, which is a successful listening orientation parameter, in the listening orientation estimation model and the listening orientation template model when successful.

一方、聴取志向推定部１０３は、変更が失敗した場合、再度、聴取志向のパラメータである読み上げ速度の変更量を増加させ、すなわちより読み上げ速度を低下させる変更を行った音声コンテンツを生成して、ユーザに対してレスポンスとして出力する。
上述したように、聴取志向のパラメータである周波数、読み上げ速度及び区切りや単語の置き換えなどの変更を行った後に、それぞれの変更が適切であったか否かの質問をユーザに与え、聴取志向のパラメータの変更の成功／不成功の確認を行い、この確認結果を聴取志向推定モデル及び聴取志向テンプレートモデルに反映させる構成としても良い。 On the other hand, when the change is unsuccessful, the listening orientation estimation unit 103 increases the amount of change in the reading speed that is a listening-oriented parameter, that is, generates audio content that has been changed to further reduce the reading speed, Output as a response to the user.
As described above, a configuration may be adopted in which, after a change is made to a listening orientation parameter such as frequency, reading speed, or breaks, or after a word replacement, the user is asked whether that change was appropriate, the success or failure of the change is confirmed, and the result of this confirmation is reflected in the listening orientation estimation model and the listening orientation template model.
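The confirmation loop for the reading speed can be sketched as follows. The answer strings mirror the “as it is” / “slowly” choices of the confirmation question; the step size, the lower bound, and the representation of speed as a ratio are assumptions, and the answers are supplied as a list instead of being obtained from a live dialogue.

```python
def confirm_speed(speed_ratio: float, answers, step: float = 0.1, floor: float = 0.5):
    """Slow down after each "slowly" answer until the user answers "as it is".

    Returns the final speed ratio and whether the change was confirmed.
    """
    for answer in answers:
        if answer == "as it is":
            return speed_ratio, True          # change succeeded: reflect in models
        if answer == "slowly":
            # change was too small: increase the amount of change and retry
            speed_ratio = max(floor, round(speed_ratio - step, 2))
    return speed_ratio, False
```

A confirmed result is the point at which the change would be reflected in the listening orientation estimation model and the listening orientation template model.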

また、聴取志向推定部１０３は、聴取志向フレーズフラグが「１」となる発生頻度をカウントし、同様の聴取志向のパラメータの変更を行う発生頻度のカウント数が所定の設定値を超えた場合に、発生頻度が所定の設定値を超えたパラメータに基づき、このパラメータ及びパラメータの変更量を、聴取志向推定モデル及び聴取志向テンプレートモデルに反映させるように構成しても良い。 In addition, the listening orientation estimation unit 103 counts the occurrence frequency at which the listening orientation phrase flag is “1”, and when the occurrence frequency count for performing the same listening orientation parameter change exceeds a predetermined set value. Based on a parameter whose occurrence frequency exceeds a predetermined set value, the parameter and the amount of parameter change may be reflected in the listening orientation estimation model and the listening orientation template model.
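The frequency-based variant can be sketched with a counter over the parameters named in listening-oriented requests. The parameter labels and the input format (one label per request whose flag is “1”) are hypothetical.

```python
from collections import Counter


def parameters_to_reflect(requested_parameters, set_value: int) -> set:
    """Return the parameters requested more often than the set value."""
    counts = Counter(requested_parameters)
    return {name for name, count in counts.items() if count > set_value}


# One entry per listening-oriented request, naming the parameter it concerned.
requests = ["speed", "speed", "volume", "speed"]
frequent = parameters_to_reflect(requests, set_value=2)
```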

ステップＳ１１０：
提示制御部１０４は、対話履歴記憶部１０７の対話履歴テーブルに対して、発話データのテキストデータを、メッセージ本文に書き込んでメッセージＩＤ１を付与して書き込んで記憶させる。このとき、提示制御部１０４は、話者ＩＤの欄に対して、レスポンスを行うシステムのシステム識別情報を書き込んで記憶させる。
また、提示制御部１０４は、会話フレーズであるため、聴取志向フレーズフラグを「０」とし、かつ接続されるユーザの音声データのメッセージ本文のメッセージＩＤ１をメッセージＩＤ２の欄に書き込んで記憶させる。 Step S110:
The presentation control unit 104 writes the text data of the utterance data in the dialog history table of the dialog history storage unit 107 in the message body, adds the message ID1, and stores it. At this time, the presentation control unit 104 writes and stores the system identification information of the system that performs the response in the column of the speaker ID.
In addition, because the utterance is a conversation phrase, the presentation control unit 104 sets the listening-oriented phrase flag to “0”, and writes the message ID1 of the message body of the user's voice data to which this response is linked into the message ID2 column.
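The record written in step S110 can be sketched as a row that links the response to the request it answers. The column names follow the description (message ID1, message ID2, speaker ID, listening-oriented phrase flag), but the ID format and the system identifier `"SYS"` are hypothetical.

```python
def append_response(history: list, text: str, system_id: str, request_message_id: str) -> dict:
    """Append the system's response and link it to the user's request."""
    record = {
        "message_id_1": f"M{len(history) + 1:03d}",  # ID of this response
        "speaker_id": system_id,                     # the responding system
        "text": text,
        "listening_flag": 0,                         # a response is a conversation phrase
        "message_id_2": request_message_id,          # ID1 of the linked user message
    }
    history.append(record)
    return record


history = [{"message_id_1": "M001", "speaker_id": "U001",
            "text": "When is the concert?", "listening_flag": 0,
            "message_id_2": None}]
response = append_response(history, "The concert is on Saturday.", "SYS", "M001")
```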

上述した構成及び動作により、本実施形態によれば、ユーザのリクエストに対して、レスポンスを行う情報提示サーバ１０が音声コンテンツにより提供する情報を、ユーザの各々が正確に聞き取ることができるように聴取志向の推定を、ユーザ毎の聴取志向推定モデルまたはグループ毎の聴取志向テンプレートモデルを用いて行うため、従来のようにルールベースで各ユーザあるいは各グループに対して聴取志向の推定を行う構成に比較してデータ量を少なくすることができ、かつデータ量が少ないために聴取志向推定モデル及び聴取志向テンプレートモデルの各々のメンテンス（ユーザに順次対応させていく修正処理）を容易に行うことができる。 With the configuration and operation described above, according to the present embodiment, the listening orientation is estimated using a listening orientation estimation model for each user or a listening orientation template model for each group, so that each user can accurately hear the information that the information presentation server 10 provides as voice content in response to the user's request. Compared with a conventional configuration that estimates the listening orientation of each user or each group on a rule basis, the amount of data can therefore be reduced, and because the amount of data is small, maintenance of the listening orientation estimation model and the listening orientation template model (correction processing that successively adapts them to the users) can be performed easily.

また、本実施形態によれば、ユーザの各々の属性情報に対応した聴取志向推定モデルにより、ユーザの聴取志向における聞き取り易さを向上するパラメータの種類と、これらパラメータの変更量（調整量）とが求められ、ユーザのリクエストに対するレスポンスである発話データにおける擦過音あるいは破裂音を含む単語を抽出し、発話データの文脈に対応して同義語（あるいは類義語、類語）である擦過音及び破裂音を含まない他の単語に置き換えるため、発話データを音声合成した音声コンテンツを、ユーザが聞き取り易い音声とすることができる。 Further, according to the present embodiment, the listening orientation estimation model corresponding to the attribute information of each user yields the types of parameters that improve ease of hearing for the user's listening orientation and the amounts of change (adjustment amounts) of those parameters. Words containing a fricative or a plosive are extracted from the utterance data that is the response to the user's request and are replaced, in accordance with the context of the utterance data, with synonyms (or near-synonyms) that contain neither fricatives nor plosives. The voice content obtained by synthesizing the utterance data can therefore be made easy for the user to hear.

また、本実施形態によれば、ユーザの各々の属性情報に対応した聴取志向推定モデルにより、ユーザの聴取志向における聞き取り易さを向上するパラメータの種類と、これらパラメータの変更量（調整量）とが求められ、ユーザのリクエストに対するレスポンスである発話データを音声合成する際、発話される音声の周波数、読み上げ速度、区切りなどの変更を行うため、音声合成された発話データである音声コンテンツを、ユーザが聞き取り易い状態の音声とすることができる。 In addition, according to the present embodiment, by the listening orientation estimation model corresponding to each attribute information of the user, the types of parameters that improve the ease of listening in the listening orientation of the user, and the amount of change (adjustment amount) of these parameters, When speech synthesis is performed on speech data that is a response to a user request, the speech content that is speech synthesized speech is changed to the user in order to change the frequency of speech to be spoken, the reading speed, and the segmentation. Can be easily heard.

また、本実施形態によれば、対話の履歴が無いユーザに対して、このユーザと属性情報が類似している他のユーザにより構成されるグループに対応して生成された聴取志向テンプレートモデルを用い、上述した発話データにおける擦過音あるいは破裂音を含む単語を、発話データの文脈に対応して同義語である擦過音及び破裂音を含まない他の単語に置き換えるため、発話データを音声合成した音声コンテンツを、ユーザが聞き取り易い音声とするため、履歴の無いユーザに対しても、レスポンスの音声コンテンツの聞き取り易さを向上させることができる。 Further, according to the present embodiment, for a user who has no dialogue history, the listening orientation template model generated for a group of other users whose attribute information is similar to that of the user is used to replace words containing a fricative or a plosive in the utterance data with synonyms, chosen in accordance with the context of the utterance data, that contain neither fricatives nor plosives. Because the voice content synthesized from the utterance data is thereby made easy for the user to hear, ease of hearing of the response voice content can be improved even for a user who has no history.

また、本実施形態によれば、対話の履歴が無いユーザに対して、このユーザと属性情報が類似している他のユーザにより構成されるグループに対応して生成された聴取志向テンプレートモデルを用い、ユーザのリクエストに対するレスポンスである発話データを音声合成する際、発話される音声の周波数、読み上げ速度、区切りなどの変更を行うため、音声合成された発話データである音声コンテンツを、ユーザが聞き取り易い状態の音声とするため、履歴の無いユーザに対しても、レスポンスの音声コンテンツの聞き取り易さを向上させることができる。 In addition, according to the present embodiment, for a user who has no conversation history, a listening-oriented template model generated corresponding to a group composed of other users whose attribute information is similar to this user is used. When speech data that is a response to a user's request is synthesized, the frequency of the speech that is spoken, the reading speed, and the segmentation are changed, so that the user can easily hear the audio content that is the synthesized speech data. Since the voice of the state is used, it is possible to improve the ease of listening to the voice content of the response even for a user who has no history.

また、本実施形態によれば、上記聴取志向推定モデル及び聴取志向テンプレートモデルの各々を、対話履歴記憶部１０７及び対話行動記憶部１０９に記憶されている、リクエスト側（ユーザ）とレスポンス側（情報提示サーバ１０）との対話における履歴の各データを用いて順次変更を行うため、ユーザあるいはグループの属性情報に対応した音声の聞き取り易さを向上させていくことができる。 Further, according to the present embodiment, each of the listening orientation estimation model and the listening orientation template model is stored in the dialogue history storage unit 107 and the dialogue action storage unit 109, respectively, on the request side (user) and the response side (information Since the history data in the dialogue with the presentation server 10) is sequentially changed, it is possible to improve the ease of hearing the voice corresponding to the user or group attribute information.

本実施形態においては、レスポンス側をコンピュータの対話システムとして説明したが、リクエスト側とレスポンス側との各々がユーザ（人間）である場合、対話するユーザ間における相互の聞き取り易さを向上するように、ユーザそれぞれに対応した聴取志向推定モデルにより、対話におけるレスポンス側の発話データにおける単語の置き換えの処理、及び音声合成の際の聞き取り易さを向上するパラメータの変更処理を行う構成としても良い。 In this embodiment, the response side has been described as a computer dialogue system. However, when each of the request side and the response side is a user (human), the mutual hearing between the interacting users is improved. In addition, a configuration may be used in which a process for replacing words in the utterance data on the response side in the dialogue and a parameter changing process for improving the ease of listening at the time of speech synthesis are performed by a listening orientation estimation model corresponding to each user.

FIG. 10 is a conceptual diagram showing another configuration example of an information presentation system in which a user and the system carry out a dialogue, according to an embodiment of the present invention.
In the information presentation system 1A, an information presentation server 10A, user terminals 11_1, 11_2, 11_3, and 11_4, and dialogue servers 12_1, 12_2, and 12_3 are connected to one another via a network 500.
Each of the user terminals 11_1 and 11_2 is, for example, the smart speaker already described: the user inputs a request by voice, and the terminal conveys the response from the information presentation server 10A to the user as audio content.

The user terminal 11_3, on the other hand, is a mobile terminal such as a smartphone or tablet computer and has a display screen. The user terminal 11_4 is a personal computer and likewise has a display screen.
For the user terminals 11_3 and 11_4, which have display screens, the information presentation server 10A may be configured to switch the output from audio content to visually perceivable image content (text content, moving images, stamp images, or the like) and output that as the response to the request.
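The output switching described here amounts to a per-terminal choice of modality. A minimal sketch follows, assuming a hypothetical `Terminal` record with a `has_display` flag; the source does not say how the server learns a terminal's capabilities.

```python
from dataclasses import dataclass


@dataclass
class Terminal:
    terminal_id: str
    has_display: bool  # assumed capability flag, e.g. registered per terminal


def render_response(terminal: Terminal, text: str) -> dict:
    """Choose visual output for terminals with a screen, audio otherwise."""
    if terminal.has_display:
        # Smartphone, tablet, or PC: present text or image content.
        return {"modality": "visual", "body": text}
    # Smart speaker: the text goes on to speech synthesis.
    return {"modality": "audio", "body": text}
```

For example, `render_response(Terminal("11_1", False), "Rain is likely")` selects the audio path, while the same call for a terminal with `has_display=True` selects the visual path.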

The information presentation server 10A has the same configuration as the information presentation server 10 of FIG. 1 described above, except that it lacks the dialogue system function of the dialogue processing unit 102.
The dialogue servers 12_1, 12_2, and 12_3 are devices that take the place of the dialogue system function of the dialogue processing unit 102 in the information presentation server 10. They are dialogue systems that, for example, check the weather forecast, check transportation timetables, and check the user's schedule, respectively.

In this configuration, the information presentation server 10A receives response text data as utterance data from each of the dialogue servers 12_1, 12_2, and 12_3 and, as already described, changes it so that it is easier to hear when rendered as audio content.
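This division of labor, in which an external dialogue server produces the response text and the information presentation server adapts it, can be sketched as follows. The function names, the stub `weather_server`, and the replacement table are hypothetical; only the word-replacement step is shown, and the speech synthesis parameters are left out.

```python
import re


def apply_word_replacements(text, replacements):
    """Replace hard-to-hear words with easier alternatives (whole words only)."""
    if not replacements:
        return text
    # Try longer keys first so that overlapping entries resolve predictably.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(0)], text)


def weather_server(request):
    """Stub standing in for a dialogue server; returns response text."""
    return "Precipitation is expected this afternoon"


def present(dialogue_server, request, replacements):
    """Fetch the response text, then adapt it to the user's change model."""
    response_text = dialogue_server(request)
    return apply_word_replacements(response_text, replacements)
```

Calling `present(weather_server, "weather?", {"Precipitation": "Rain", "expected": "likely"})` yields `"Rain is likely this afternoon"`. Because the adaptation step only sees text, any dialogue server that returns text could be plugged in without modification.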

A program for realizing the functions of the information presentation server 10 of FIG. 1 and the information presentation server 10A of FIG. 10 may be recorded on a computer-readable recording medium, and the processing that changes audio content to make it easier for the user to hear may be performed by having a computer system read and execute the program recorded on that medium. The term "computer system" here includes an OS and hardware such as peripheral devices.

The "computer system" also includes a WWW system equipped with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or to a storage device such as a hard disk built into a computer system. It further includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in the transmission medium. The "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes those functions in combination with a program already recorded in the computer system.

In the system and program shown in FIG. 1, the information presentation server 10, which includes the dialogue server 12, and the user terminal 11 exchange data via a network. The configuration is not limited to this, however; where feasible, the functions of the information presentation server 10 including the dialogue server 12 may be installed in the user terminal 11 so that no network is required.
Furthermore, rather than providing the information presentation server 10 and the dialogue server 12 as independent devices, the functions of both may be realized by a single server.

Description of Symbols
1, 1A … Information presentation system
10, 10A … Information presentation server
11, 11_1, 11_2, 11_3, 11_4 … User terminal
12_1, 12_2, 12_3 … Dialogue server
101 … Data input/output unit
102 … Dialogue processing unit
103 … Listening orientation estimation unit
104 … Presentation control unit
105 … Speech synthesis unit
106 … Grouping estimation unit
107 … Dialogue history storage unit
108 … User attribute storage unit
109 … Dialogue action storage unit
110 … Grouping storage unit
111 … Language knowledge storage unit
500 … Network
1021 … Analysis unit
1022 … Dialogue management unit
1023 … Generation unit
1031 … Model construction unit
1032 … Listening orientation management unit

Claims

An information presentation system comprising:
a listening orientation estimation unit that, in dialogue with users, estimates a listening orientation indicating how easily each user can hear speech, and generates and updates a change model that changes utterance data supplied to the user by voice in accordance with that user's listening orientation; and
a presentation control unit that, in accordance with the change model set for each user, changes the utterance data, which is a voice answer from a dialogue system, so that it matches the listening orientation of each user.

The information presentation system according to claim 1, further comprising a dialogue processing unit that writes and stores, for each user in a dialogue history storage unit, a dialogue history that is the history of the dialogue with each user and is used when estimating the listening orientation, and that determines a response to an utterance from the user based on rules.

The information presentation system according to claim 1, wherein the listening orientation estimation unit extracts the user's listening orientation from an evaluation of the utterance data in the dialogue with the user, and writes and stores, for each user in a user attribute storage unit, the attribute information of the user and orientation information indicating the user's listening orientation.

The information presentation system according to claim 3, further comprising a grouping estimation unit that performs grouping to classify the users according to the attribute information of each user, and that generates, for each classification, a template change model, which is a change model corresponding to the listening orientation common to the users included in that classification.

The information presentation system according to claim 4, wherein, for a user for whom no change model has been prepared, the listening orientation estimation unit extracts the template change model of the classification corresponding to that user and generates the change model for that user in accordance with the listening orientation extracted in the dialogue.

The information presentation system according to any one of claims 3 to 5, wherein the attribute information is set as a combination of at least demographic data including the age, sex, and place of residence of the user.

The information presentation system as described, wherein the change model specifies processing that changes at least the replacement of words in the utterance data determined via the dialogue processing unit, the frequency and speed of the voice when the utterance data is read aloud, and the phrase breaks.

The information presentation system according to any one of claims 1 to 7, wherein the presentation control unit writes and stores, as a change history in a dialogue action storage unit, the change content, which is the content of the changes made to the utterance data by the change model, and
the listening orientation estimation unit extracts the listening orientation of the user from the dialogue history and the change history.

An information presentation method comprising:
a listening orientation estimation step in which a listening orientation estimation unit, in dialogue with users, estimates a listening orientation indicating how easily each user can hear speech, and generates and updates a change model that changes utterance data supplied to the user by voice in accordance with that user's listening orientation; and
a presentation control step in which a presentation control unit, in accordance with the change model set for each user, changes the utterance data, which is a voice answer from a dialogue system via a dialogue processing unit that determines a response to an utterance from the user based on rules, so that it matches the listening orientation of each user.

A program that causes a computer to function as:
listening orientation estimation means that, in dialogue with users, estimates a listening orientation indicating how easily each user can hear speech, and generates and updates a change model that changes utterance data supplied to the user by voice in accordance with that user's listening orientation; and
presentation control means that, in accordance with the change model set for each user, changes the utterance data, which is a voice answer from a dialogue system, so that it matches the listening orientation of each user.