JP6433765B2

JP6433765B2 - Spoken dialogue system and spoken dialogue method

Info

Publication number: JP6433765B2
Application number: JP2014233815A
Authority: JP
Inventors: 秀治倉本
Original assignee: Samsung Electronics Co Ltd
Current assignee: Samsung Electronics Co Ltd
Priority date: 2014-11-18
Filing date: 2014-11-18
Publication date: 2018-12-05
Anticipated expiration: 2034-11-18
Also published as: JP2016099381A

Description

The present invention relates to a spoken dialogue system, and more particularly to a spoken dialogue system that converses by voice with a person about topics in a specific domain.

In recent years, speech recognition functions have been built into various systems, including personal digital assistants (PDAs) such as smartphones and tablet terminals, and car navigation systems. In a system equipped with a speech recognition function, the user gives spoken instructions to the system, and the system understands those instructions and performs the intended operation. If the system also has a speech synthesis function, the user can converse with the system by voice.

Conventional spoken dialogue systems include one that resolves communication gaps in real time to sustain natural communication (see, for example, Patent Document 1), and one that recognizes speech in which the speaker asks the system to repeat or confirm something and then answers or acts accordingly (see, for example, Patent Document 2). As a dialogue system that generates an appropriate response sentence to a user's speech, there is also one that records, for each keyword subject to recognition, the paraphrases to use when the keyword is included in a response sentence, response types representing the kinds of response sentence, and the conditions under which each paraphrase and response type is selected. Based on those conditions, it determines the paraphrase and the response sentence template for a recognized keyword, and generates the response sentence by inserting the paraphrase into the determined template (see, for example, Patent Document 3).

Patent Document 1: JP 2012-181697 A
Patent Document 2: JP 2010-197858 A
Patent Document 3: JP 2008-39928 A

In recent years, speech recognition functions have also begun to appear in AV systems such as television devices. However, current AV systems are merely command-based speech recognition systems that understand spoken commands such as “Volume up” and “Turn off” and perform the corresponding operation. Consequently, even if a user wants to learn something about content while viewing it, the user cannot obtain the necessary information through spoken dialogue with the system. Moreover, the information the user wants is not limited to simple information that can be answered in 5W1H terms; it may also be complex information such as opinions and impressions about the content.

In view of the above problems, an object of the present invention is to provide a spoken dialogue system capable of conversing by voice with a person about topics in a specific domain.

A spoken dialogue system according to one aspect of the present invention converses by voice with a person about topics in a specific domain, and comprises: a speech recognition unit that recognizes a speaker's utterance and generates an utterance sentence; an intention understanding unit that analyzes the utterance sentence to understand what information in the domain the speaker wants to know; a dialogue management unit that, depending on the content of the information the speaker wants to know, searches either a first information source holding various information in the domain or a second information source holding other people's impressions in the domain, and acquires the target information; a response sentence generation unit that uses the target information to generate a response sentence to the speaker's utterance; and a speech synthesis unit that synthesizes speech from the response sentence to generate a speech signal.

A spoken dialogue method according to another aspect of the present invention is a method of conversing by voice with a person about topics in a specific domain, in which: a speech recognition unit recognizes a speaker's utterance and generates an utterance sentence; an intention understanding unit analyzes the utterance sentence to understand what information in the domain the speaker wants to know; a dialogue management unit, depending on the content of the information the speaker wants to know, searches either a first information source holding various information in the domain or a second information source holding other people's impressions in the domain, and acquires the target information; a response sentence generation unit uses the target information to generate a response sentence to the speaker's utterance; and a speech synthesis unit synthesizes speech from the response sentence to generate a speech signal.

With these systems and methods, when a user (speaker) says something to the spoken dialogue system, the system understands what information in the specific domain the speaker wants to know. Depending on the content of that information, it searches either the first information source, which holds various information in the domain, or the second information source, which holds other people's impressions in the domain, to acquire the target information, and a response sentence using that information is returned to the user as speech. The user can thus obtain, through dialogue with the system, not only simple information that can be answered in 5W1H terms but also complex information such as opinions.

The first information source may be a website on the Internet.

Because information on Internet websites is frequently added and updated, using a website on the Internet as the first information source lets the spoken dialogue system always acquire the latest information.

The spoken dialogue system may include an impression database as the second information source, and the dialogue management unit may acquire user reviews from a website on the Internet regularly or irregularly, extract sentences containing evaluation expressions from those user reviews, and register them in the impression database.

In this way, user reviews published on Internet websites are not registered as-is; instead, sentences containing evaluation expressions, which are useful as impressions, are extracted and registered in the impression database. In addition, user reviews newly posted on Internet websites can be imported regularly or irregularly to keep the impression database up to date.

According to the present invention, a spoken dialogue system that can converse by voice with a person about topics in a specific domain is realized. Useful information that the user wants can thus be provided to the user through the simple interaction of dialogue.

FIG. 1 is a block diagram of a spoken dialogue system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the spoken dialogue system.
FIG. 3 is an external view of a television device that is an example of the spoken dialogue system.

Embodiments for carrying out the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a spoken dialogue system according to an embodiment of the present invention. The spoken dialogue system 10 according to the present embodiment includes a speech recognition unit 11, an intention understanding unit 12, a dialogue management unit 13, a response sentence generation unit 14, and a speech synthesis unit 15, and converses by voice with a speaker about topics in a specific domain. A domain is the dialogue content, field, or genre handled by the spoken dialogue system 10, that is, a specific topic area such as movies, music, cooking, or weather. For example, if the domain is movies, the speaker can converse with the spoken dialogue system 10 about movies.

The following description covers the case where the domain is movies, but the domains to which the present invention applies are not limited to movies. Nor is the spoken dialogue system 10 limited to a single domain; it can support a plurality of domains.

The speech recognition unit 11 recognizes the speaker's utterance and generates an utterance sentence. The speaker's utterance is input to the speech recognition unit 11 as a speech signal from a microphone (not shown). Known techniques can be used for the speech recognition function of the speech recognition unit 11. The generated utterance sentences vary widely, for example “Who is the director?”, “When was it released in theaters?”, “Who appears in it?”, and “Tell me what you think of this movie”.

The intention understanding unit 12 analyzes the utterance sentence generated by the speech recognition unit 11 and understands what information in the domain the speaker wants to know, that is, the speaker's intention. The purpose of an utterance sentence is often ambiguous, as in “Who is the director?”. In addition, the same question can be phrased in many ways; to ask about the cast, for example, a speaker may say “Who is the cast?” or “Who is in it?” as well as “Who appears in it?”. By referring to the intention understanding model 121 and the problem solving knowledge 122, the intention understanding unit 12 can correctly understand the speaker's intention behind such ambiguous or variously phrased utterance sentences.

The intention understanding model 121 is a collection of example utterance sentences that gathers various phrasings for each intention. It is a database of which intentions the words and expression patterns of an utterance sentence tend to indicate. By referring to the intention understanding model 121, the intention understanding unit 12 can correctly understand the speaker's intention regardless of surface wording, even when various expressions are used in the utterance sentence.

The problem solving knowledge 122 stores the correspondence between various utterance expressions (problems) and their solutions. Problems can be classified into two types according to the content of the information the speaker wants to know. One type is a simple problem, such as asking for the director's name, the release date, or the cast. The other is a complex problem, such as asking for impressions of a movie. The solution for a simple problem is “search the website”, whereas the solution for a complex problem is “search the impression database”. By referring to the problem solving knowledge 122, the intention understanding unit 12 can thus identify the problem and its solution.
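The interplay of the intention understanding model 121 and the problem solving knowledge 122 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the intent names, the English example phrasings, and the word-overlap (Jaccard) matching are stand-ins, since the specification does not state how the model is implemented.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the intention understanding model 121:
# a few example phrasings collected per intention.
INTENT_EXAMPLES = {
    "director": ["who is the director", "who directed this"],
    "release_date": ["when was it released", "when did it open in theaters"],
    "cast": ["who appears in it", "who is in it", "who is the cast"],
    "impression": ["tell me what you think of this movie", "how is this movie"],
}

# Hypothetical stand-in for the problem solving knowledge 122:
# each problem (intention) maps to its solution.
SOLUTIONS = {
    "director": "search_website",
    "release_date": "search_website",
    "cast": "search_website",
    "impression": "search_impression_database",
}


@dataclass
class IntentResult:
    intent: str
    solution: str


def _tokens(text):
    """Lower-case the text, drop question marks, and return the set of words."""
    return set(text.lower().replace("?", "").split())


def understand(utterance):
    """Return the best-matching intent and its solution, or None if nothing overlaps."""
    words = _tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            example_words = _tokens(example)
            # Jaccard similarity between the utterance and the example phrasing.
            score = len(words & example_words) / len(words | example_words)
            if score > best_score:
                best_intent, best_score = intent, score
    if best_intent is None:
        return None
    return IntentResult(best_intent, SOLUTIONS[best_intent])
```

A practical system would likely use a trained classifier rather than word overlap; the sketch only shows how differently worded utterances can be reduced to one intention and then to one of the two solutions.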

The dialogue management unit 13 receives the intention understanding result from the intention understanding unit 12 and, depending on the content of the information the speaker wants to know, searches one of two information sources to acquire the target information. One information source is a website on the Internet. For example, DBpedia (http://ja.dbpedia.org/) can be used as this website. DBpedia is a data set that structures the information in Wikipedia (http://ja.wikipedia.org/), and the desired information can be retrieved with SPARQL, the query language for RDF (Resource Description Framework). The other information source is the impression database 131, which stores various impressions of other people in the domain. In the present embodiment, the impression database 131 stores other people's impressions of various movies.
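As a rough illustration of a SPARQL lookup against DBpedia, the sketch below builds a query for a film's director and the corresponding request URL without sending it. The `/sparql` endpoint path, the `dbo:director` property, the `@ja` label match, and the `format` request parameter are all assumptions for illustration; the specification names only the site and the query language, and the actual vocabulary used by the Japanese DBpedia may differ.

```python
from urllib.parse import urlencode

# Assumed endpoint path; the specification gives only http://ja.dbpedia.org/.
ENDPOINT = "http://ja.dbpedia.org/sparql"


def build_director_query(movie_title):
    """Build a SPARQL query asking for the director of the film with this label."""
    # Escape backslashes and double quotes so the title is a valid string literal.
    safe = movie_title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?directorName WHERE {\n"
        f'  ?film rdfs:label "{safe}"@ja ;\n'
        "        dbo:director ?director .\n"  # assumed property name
        "  ?director rdfs:label ?directorName .\n"
        "} LIMIT 1"
    )


def build_request_url(query):
    """Return the GET URL for the query; the 'format' parameter is an assumption."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)
```

Keeping query construction separate from the network call makes it easy to log or test the query the dialogue management unit would issue for a given intention.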

The dialogue management unit 13 determines the system action according to the dialogue scenario 132. The dialogue scenario 132 describes the flow of the dialogue, that is, what exchange should take place when each of various intentions is input. For example, when the utterance intention is “I want to know the director's name” and the solution is “search the website”, the dialogue management unit 13, as its system action, searches a website on the Internet and acquires the director's name. When the utterance intention is “I want to hear impressions of the movie” and the solution is “search the impression database”, the dialogue management unit 13, as its system action, searches the impression database 131 and acquires a suitable impression to offer the speaker.

The dialogue management unit 13 can collect user reviews from websites on the Internet and register them in the impression database 131. For example, Yahoo! Movies (http://movies.yahoo.co.jp/) can be used as such a website. This website publishes reviews of various movies submitted by many people.

For example, the dialogue management unit 13 can collect user reviews from a website and store them in the impression database 131 as follows. First, the dialogue management unit 13 acquires the user reviews of each movie from the Yahoo! Movies site. There are multiple user reviews for each movie. The dialogue management unit 13 then performs morphological analysis on each sentence of the user reviews using a morphological analysis engine, removes unnecessary characters, and uses an opinion extraction tool to extract only the opinion sentences that contain evaluation expressions. The opinion (evaluation expression) extraction tool provided by the National Institute of Information and Communications Technology (http://alaginrc.nict.go.jp/opinion/) can be used for this purpose. With this tool, evaluation expression data is extracted for each user review. An evaluation expression is an expression of a positive or negative opinion, reputation, or recommendation, such as “XX is good” or “XX is boring”. The data set extracted in this way constitutes an impression. In other words, an impression as used in this specification is a set of sentences containing evaluation expressions.
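The collection steps above can be sketched in simplified form. This is not the NICT opinion extraction tool or a real morphological analyzer: the small English lexicon `EVALUATION_WORDS`, the regular-expression sentence splitter, and the character filter are hypothetical stand-ins chosen only to show the shape of the pipeline (clean, split into sentences, keep sentences with evaluation expressions).

```python
import re

# Hypothetical evaluation-word lexicon standing in for the opinion extraction tool.
EVALUATION_WORDS = {"good", "great", "boring", "moving", "disappointing", "recommend"}


def clean(text):
    """Remove characters other than word characters, whitespace, and basic punctuation."""
    return re.sub(r"[^\w\s.,!?']", "", text)


def split_sentences(review):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]


def extract_impression(review):
    """Return only the sentences that contain at least one evaluation word."""
    sentences = split_sentences(clean(review))
    return [
        s for s in sentences
        if EVALUATION_WORDS & set(re.findall(r"[a-z']+", s.lower()))
    ]
```

For a review such as "I saw it on Sunday. The story is boring! ★★ But the music was great.", the sketch drops the star symbols and the first sentence and keeps the two sentences that carry an evaluation.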

The dialogue management unit 13 further extracts feature values from the extracted impressions. Specifically, for each impression it calculates the adjective TF (Term Frequency) and the noun TF as measures of how adjectives and nouns are used. For nouns, it also calculates the TF-IDF (Term Frequency - Inverse Document Frequency). The dialogue management unit 13 then registers the adjective TF, noun TF, and noun TF-IDF in the impression database 131 together with the extracted impressions. These values serve as indices when searching the impression database 131 for a suitable impression.
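The feature values can be illustrated with a short sketch. The specification does not give the formulas, so the common definitions are assumed here: TF is a word's count divided by the number of words of the same part of speech in the impression, and IDF is the natural logarithm of the number of impressions divided by the number of impressions containing the word. The input format, a list of `(word, part_of_speech)` pairs with tags `"NOUN"` and `"ADJ"`, is also hypothetical and stands for the output of morphological analysis.

```python
import math
from collections import Counter


def term_frequency(tagged_tokens, pos):
    """Relative frequency of each word with the given part-of-speech tag."""
    words = [word for word, tag in tagged_tokens if tag == pos]
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def noun_tf_idf(impressions):
    """Noun TF-IDF for each impression, treating each impression as one document."""
    n_docs = len(impressions)
    tfs = [term_frequency(doc, "NOUN") for doc in impressions]
    # Document frequency: the number of impressions in which each noun occurs.
    df = Counter(word for tf in tfs for word in tf)
    return [
        {word: value * math.log(n_docs / df[word]) for word, value in tf.items()}
        for tf in tfs
    ]
```

Under these definitions a noun that occurs in every impression gets a TF-IDF of zero, so nouns that distinguish one impression from the others stand out in the index.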

Because user reviews on the website are added frequently and new movies are released one after another, it is desirable for the dialogue management unit 13 to update the impression database 131 regularly or irregularly. For example, the impression database 131 may be updated at a fixed time every day, or each time an instruction is received from the user.

The response sentence generation unit 14 generates a response sentence that conveys the result of the system action of the dialogue management unit 13 to the speaker. Specifically, the response sentence generation unit 14 uses the target information that the dialogue management unit 13 acquired from the information sources to generate a response sentence to the speaker's utterance. For example, when the speaker's utterance is “When was it released in theaters?” and the dialogue management unit 13 has acquired “January 20, 2014” as the target information, the response sentence generation unit 14 generates a response sentence such as “This movie was released in theaters on January 20, 2014”.
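One simple way to realize such response generation is template filling, sketched below. The specification does not say that templates are used, so the approach, the intent names, the template wording, and the fallback message are all assumptions for illustration.

```python
# Hypothetical response templates keyed by intent name.
RESPONSE_TEMPLATES = {
    "release_date": "This movie was released in theaters on {value}.",
    "director": "This movie was directed by {value}.",
    "impression": "Here is what one viewer said: {value}",
}

# Hypothetical fallback used when no template or no information is available.
FALLBACK = "Sorry, I could not find that information."


def generate_response(intent, value):
    """Fill the template for the intent with the acquired target information."""
    template = RESPONSE_TEMPLATES.get(intent)
    if template is None or not value:
        return FALLBACK
    return template.format(value=value)
```

Calling `generate_response("release_date", "January 20, 2014")` yields the kind of sentence given in the example above.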

The speech synthesis unit 15 synthesizes speech from the response sentence generated by the response sentence generation unit 14 and generates a speech signal. The speech signal is output to a speaker (not shown) and conveyed to the speaker as sound. Known techniques can be used for the speech synthesis function of the speech synthesis unit 15.

Next, the processing flow of the spoken dialogue system 10 is described. FIG. 2 is a flowchart showing the operation of the spoken dialogue system 10.

First, the speech recognition unit 11 recognizes the speaker's utterance and generates an utterance sentence (S1). The intention understanding unit 12 then analyzes the generated utterance sentence to understand the speaker's intention (S2).

If the speaker's intention is a 5W1H question and the information the speaker wants to know is simple information (YES in S3), the dialogue management unit 13 searches a website on the Internet (for example, Yahoo! Movies) and acquires the target information (S4). If, on the other hand, the information the speaker wants to know is complex information such as impressions (NO in S3), the dialogue management unit 13 searches the impression database 131 and acquires a suitable impression (S5).

Once the dialogue management unit 13 has acquired the target information, the response sentence generation unit 14 uses it to generate a response sentence to the speaker's utterance (S6). The speech synthesis unit 15 then synthesizes speech from the generated response sentence to generate a speech signal, and the speech is output from the speaker (S7).
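The flow of FIG. 2 can be summarized as a single function that wires the units together. This is a structural sketch only: each unit is passed in as a callable, and the function name, the parameter names, and the `(intent, is_simple)` return shape of the understanding step are assumptions for illustration, not interfaces defined in the specification.

```python
from typing import Callable, Tuple


def run_dialogue_turn(
    audio: bytes,
    recognize: Callable[[bytes], str],
    understand: Callable[[str], Tuple[str, bool]],
    search_website: Callable[[str], str],
    search_impressions: Callable[[str], str],
    generate: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Process one utterance from input audio to the synthesized reply."""
    utterance = recognize(audio)                # S1: speech recognition
    intent, is_simple = understand(utterance)   # S2: intention understanding
    if is_simple:                               # S3: simple (5W1H) question?
        info = search_website(intent)           # YES branch: search the website
    else:
        info = search_impressions(intent)       # S5: search the impression database
    response = generate(intent, info)           # S6: response sentence generation
    return synthesize(response)                 # S7: speech synthesis
```

Passing the units in as callables mirrors the later remark that the units may run on one server device or be distributed across several, since each callable can wrap either a local function or a remote call.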

As described above, according to the present embodiment, the system can converse by voice with a person about topics in a specific domain. Useful information that the user wants can thus be provided through the simple interaction of dialogue, which improves convenience for the user.

The intention understanding model 121, the problem solving knowledge 122, the impression database 131, and the dialogue scenario 132 may be placed in a storage device (not shown) inside the spoken dialogue system 10 or in an external storage device (not shown). They may be placed in the same storage device or distributed across a plurality of different storage devices. Likewise, the speech recognition unit 11, the intention understanding unit 12, the dialogue management unit 13, the response sentence generation unit 14, and the speech synthesis unit 15 may be implemented on the same server device, or their functions may be distributed across a plurality of server devices. Distributed knowledge and functions can communicate with one another over a network.

<Example>
Next, an example of the spoken dialogue system 10 is described. FIG. 3 shows the external appearance of a television device that is an example of the spoken dialogue system 10. The spoken dialogue system 10 according to the present embodiment can be installed in, for example, a television device 100 that includes a display panel 101, a microphone 102, and a speaker 103.

The television device 100 includes a television tuner (not shown) and can receive terrestrial digital broadcasts 104 and BS (Broadcasting Satellite) / CS (Communication Satellite) digital broadcasts 105. The terrestrial digital broadcasts 104 and the BS/CS digital broadcasts 105 are input to the television device 100 through an antenna terminal (not shown).

The television device 100 can also be connected to the Internet 106 by wire or wirelessly, and can receive content such as VOD (Video On Demand) from a content server 107 on the Internet 106. In addition, the television device 100 can be connected to a recorder 108 such as a DVD (Digital Versatile Disc) / BD (Blu-ray Disc) recorder, and can play back content recorded on a hard disk device (not shown) built into the recorder 108 or on an optical disc such as a DVD or BD. The television device 100 may itself have a recorder function.

Content received by the television device 100 is displayed on the display panel 101, and the user 200 (speaker) can ask the television device 100 various questions about the content while viewing it. The television device 100 picks up the voice of the user 200 with the microphone 102, understands what the user 200 said, generates a suitable answer to the question, and outputs the answer as speech from the speaker 103. In this way, the user 200 can converse with the television device 100 by voice.

Although an embodiment of the present invention has been described above, the present invention is not limited to the configuration of that embodiment, and various modifications are possible. For example, besides the television device 100, the spoken dialogue system 10 according to the present embodiment can also be installed in a smartphone or PC that has a content playback function.

The configuration shown in the above embodiment is merely one embodiment of the present invention and is not intended to limit the present invention to that configuration.

DESCRIPTION OF SYMBOLS
10 Spoken dialogue system
11 Speech recognition unit
12 Intention understanding unit
13 Dialogue management unit
14 Response sentence generation unit
15 Speech synthesis unit
131 Impression database
200 User (speaker)

Claims

1. A spoken dialogue system that converses by voice with a person about topics in a specific domain, comprising:
a speech recognition unit that recognizes a speaker's utterance and generates an utterance sentence;
an intention understanding unit that analyzes the utterance sentence and understands what information in the domain the speaker wants to know;
a dialogue management unit that, depending on the content of the information the speaker wants to know, searches either a first information source holding various information in the domain or a second information source holding other people's impressions in the domain, and acquires the target information;
a response sentence generation unit that generates a response sentence to the speaker's utterance using the target information; and
a speech synthesis unit that synthesizes speech from the response sentence to generate a speech signal.

2. The spoken dialogue system according to claim 1, wherein the first information source is a website on the Internet.

3. The spoken dialogue system according to claim 1 or 2, further comprising an impression database as the second information source,
wherein the dialogue management unit acquires user reviews from a website on the Internet regularly or irregularly, extracts sentences containing evaluation expressions from the user reviews, and registers the sentences in the impression database.

4. A spoken dialogue method for conversing by voice with a person about topics in a specific domain, wherein:
a speech recognition unit recognizes a speaker's utterance and generates an utterance sentence;
an intention understanding unit analyzes the utterance sentence and understands what information in the domain the speaker wants to know;
a dialogue management unit, depending on the content of the information the speaker wants to know, searches either a first information source holding various information in the domain or a second information source holding other people's impressions in the domain, and acquires the target information;
a response sentence generation unit generates a response sentence to the speaker's utterance using the target information; and
a speech synthesis unit synthesizes speech from the response sentence to generate a speech signal.

5. The spoken dialogue method according to claim 4, wherein the first information source is a website on the Internet.

6. The spoken dialogue method according to claim 4 or 5, wherein the second information source is an impression database, and
the dialogue management unit acquires user reviews from a website on the Internet regularly or irregularly, extracts sentences containing evaluation expressions from the user reviews, and registers the sentences in the impression database.