JPWO2008126355A1

JPWO2008126355A1 - Keyword extractor

Info

Publication number: JPWO2008126355A1
Application number: JP2009508884A
Authority: JP
Inventors: 遠藤　充; 充遠藤; 麻紀山田; 森井　景子; 景子森井; 小沼　知浩; 知浩小沼; 野村　和也; 和也野村
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2007-03-29
Filing date: 2008-03-14
Publication date: 2010-07-22
Anticipated expiration: 2028-03-14
Also published as: EP2045798A4; CN101542592A; EP2045798B1; US20090150155A1; EP2045798A1; US8370145B2; WO2008126355A1; JP4838351B2

Abstract

An object of the present invention is to extract keywords from a conversation without predicting and preparing those keywords in advance. The keyword extraction device of the present invention includes: a voice input unit 101 that inputs the uttered voice of each speaker; an utterance section determination unit 102 that determines an utterance section for each speaker in the input uttered voice; a speech recognition unit 103 that recognizes the uttered voice of each determined utterance section for each speaker; an interrupt detection unit 104 that, based on the responses of other speakers to each speaker's uttered voice, detects a feature of the utterance response that suggests the presence of a keyword, namely an interrupt in which a preceding utterance and a subsequent utterance overlap; a keyword extraction unit 105 that extracts a keyword from the utterance in the utterance section specified based on the interrupt; a keyword search unit 106 that performs a keyword search using that keyword; and a display unit 107 that displays the keyword search result.

Description

The present invention relates to a keyword extraction device, and more particularly to a keyword extraction device that extracts keywords contained in a conversation.

A conventional keyword extraction device holds, in advance, correspondence data that associates keywords such as "microwave oven" with action information such as access to a URL. Based on this correspondence data, the device detects a keyword in a conversation and executes processing according to the action information associated with that keyword. Information has been presented through speech recognition in this way (see, for example, Patent Document 1).

Patent Document 1: JP 2005-215726 A (see paragraphs 0021 to 0036 and FIGS. 2 and 3)

However, the device described in Patent Document 1 is difficult to use because the correspondence data must be prepared separately for each anticipated scene.
The present invention has been made to address this situation, and its object is to provide a keyword extraction device that can extract keywords from a conversation without predicting and preparing those keywords in advance.

To solve the above conventional problem, the present invention includes: a voice input unit that inputs the uttered voice of each speaker; an utterance section determination unit that determines an utterance section for each speaker in the input uttered voice; a speech recognition unit that recognizes the uttered voice of each determined utterance section for each speaker; an utterance response feature extraction unit that, based on the responses of other speakers to each speaker's uttered voice, extracts a feature of the utterance response that suggests the presence of a keyword; and a keyword extraction unit that extracts the keyword from the uttered voice of the utterance section specified based on the extracted feature of the utterance response.

According to the present invention, keywords can be extracted from a conversation without predicting and preparing those keywords in advance.

FIG. 1 is a block diagram showing a configuration example of an entire system including the keyword extraction device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing examples of utterance sections in Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the operation of the keyword extraction device of FIG. 1.
FIG. 4 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 2 of the present invention.
FIG. 5 is a diagram showing an example of a pitch pattern in Embodiment 2 of the present invention.
FIG. 6 is a flowchart showing the operation of the keyword extraction device of FIG. 4.
FIG. 7 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 3 of the present invention.
FIG. 8 is a flowchart showing the operation of the keyword extraction device of FIG. 7.
FIG. 9 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 4 of the present invention.
FIG. 10 is a diagram showing examples of utterance sections, utterance contents, and facial expression recognition results in Embodiment 4 of the present invention.
FIG. 11 is a flowchart showing the operation of the keyword extraction device of FIG. 9.
FIG. 12 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 5 of the present invention.
FIG. 13 is a flowchart showing the operation of the keyword extraction device of FIG. 12.

Explanation of symbols

100, 100A, 100B, 100C, 100D: Keyword extraction device
101: Voice input unit
102: Utterance section determination unit
103: Speech recognition unit
104: Interrupt detection unit
105, 105A, 105B, 105C, 105D: Keyword extraction unit
106: Keyword search unit
107: Display unit
201: Pitch determination unit
202: Pitch pattern determination unit
301: Function phrase extraction unit
302: Function phrase storage unit
401: Video input unit
402: Facial expression recognition unit
501: Excitement reaction detection unit

Embodiments 1 to 5 of the present invention are described below with reference to the drawings. The embodiments assume, as an example, a scene in which two speakers A and B are conversing using information terminals such as mobile phones.
(Embodiment 1)
FIG. 1 is a block diagram showing a configuration example of an entire system including the keyword extraction device according to Embodiment 1 of the present invention.
In FIG. 1, the keyword extraction device 100 is the information terminal of a speaker A and is configured to connect to a network 400 such as the Internet. The information terminal 200 of another speaker B and a search server 300 are also connected to the network 400. The keyword extraction device 100 and the information terminal 200 are information terminals such as mobile phones, notebook personal computers, or personal digital assistants. The search server 300 is a server equipped with a known search engine.

The keyword extraction device 100 includes a voice input unit 101, an utterance section determination unit 102, a speech recognition unit 103, an interrupt detection unit 104, a keyword extraction unit 105, a keyword search unit 106, and a display unit 107.
The voice input unit 101 inputs the voice of a speaker (hereinafter referred to as uttered voice). The voice input unit 101 is, for example, a microphone or a communication interface to the network 400.

The utterance section determination unit 102 determines an utterance section for each speaker in the input uttered voice. An utterance section is the section from when a speaker starts speaking until that speaker stops.
For example, when the conversation between speaker A and speaker B is as shown in FIG. 2(a) or FIG. 2(b), the utterance section determination unit 102 determines the section from the start time ts1 to the end time te1 of speaker A's speech, that is, ts1 to te1, as utterance section 1 of speaker A. Likewise, it determines the section from the start time ts2 to the end time te2 of speaker B's speech, that is, ts2 to te2, as utterance section 2 of speaker B.

Returning to FIG. 1, the speech recognition unit 103 recognizes the uttered voice of each determined utterance section for each speaker. Specifically, the speech recognition unit 103 converts the conversational speech of all speakers into text using a known speech recognition technique. It also associates each speaker's conversational speech with its start time (start point) and end time (end point).

For the determined utterance sections, the interrupt detection unit 104 (utterance response feature extraction unit) detects, based on each speaker's uttered voice, a feature of the utterance, namely an interrupt in which a preceding utterance and a subsequent utterance overlap. For example, when the conversation between speaker A and speaker B is the one shown in FIG. 2(b), the subsequent utterance of speaker B starts at ts2, in the middle of the preceding utterance of speaker A, so the interrupt detection unit 104 detects an interrupt. The detection method is as follows.
The interrupt detection unit 104 first measures the interval between the end time of the immediately preceding utterance and the start time of the subsequent utterance (hereinafter referred to as the utterance interval). In the cases of FIGS. 2(a) and 2(b), it calculates the utterance interval as utterance interval = ts2 - te1. Next, the interrupt detection unit 104 determines whether the calculated utterance interval is negative (see FIG. 2(b)). If the utterance interval is negative, the interrupt detection unit 104 detects that an interrupt has occurred.
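The interval test described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the interrupt detection unit 104: the `Utterance` class, the function names, and the sample times are assumptions introduced here, with `start` and `end` standing in for ts and te.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One utterance section (hypothetical container, not from the patent)."""
    speaker: str
    start: float  # start time ts, in seconds
    end: float    # end time te, in seconds


def utterance_interval(preceding: Utterance, subsequent: Utterance) -> float:
    """Return ts2 - te1: start of the subsequent utterance minus end of the preceding one."""
    return subsequent.start - preceding.end


def is_interrupt(preceding: Utterance, subsequent: Utterance) -> bool:
    """An interrupt is assumed whenever the utterance interval is negative."""
    return utterance_interval(preceding, subsequent) < 0


# Case like FIG. 2(a): speaker B starts after speaker A has finished.
a_no_overlap = Utterance("A", start=0.0, end=2.0)
b_no_overlap = Utterance("B", start=2.5, end=4.0)

# Case like FIG. 2(b): speaker B starts while speaker A is still speaking.
a_overlap = Utterance("A", start=0.0, end=2.0)
b_overlap = Utterance("B", start=1.5, end=4.0)

print(is_interrupt(a_no_overlap, b_no_overlap))  # no overlap
print(is_interrupt(a_overlap, b_overlap))        # overlap
```

Keeping the interval as a signed number, rather than a boolean overlap flag, means the same value could also be compared against a positive threshold, for example when looking for long pauses.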

Based on the extracted feature of the utterance, namely the interrupt in which the preceding utterance and the subsequent utterance overlap, the keyword extraction unit 105 extracts, from the uttered voice recognized by the speech recognition unit 103, the word that is the topic of the conversation (hereinafter referred to as the keyword). Specifically, the keyword extraction unit 105 acquires the recognized conversational speech from the speech recognition unit 103. This conversational speech is associated with the start time and end time of each speaker's utterance. The keyword extraction unit 105 also acquires, from the interrupt detection unit 104, the utterance section in which the interrupt was detected (for example, utterance section 2 of speaker B in FIG. 2(b)) and the utterance section that was interrupted (for example, utterance section 1 of speaker A in FIG. 2(b)). These utterance sections are matched to the recognized speech by their start and end times.

When extracting the keyword, the keyword extraction unit 105 extracts, for example, the last constituent (for example, a noun) of the interrupted preceding utterance as the keyword. Here, the end of the preceding utterance means the part of the utterance section that precedes the interrupt (for example, ts1 to ts2 in FIG. 2(b), where ts2 is the time of the interrupt).
Specifically, the keyword extraction unit 105 first selects, from the acquired utterance sections of the speakers (for example, utterance sections 1 and 2 in FIG. 2(b)), the utterance section with the earlier start time (for example, utterance section 1 in FIG. 2(b)). Next, within the selected utterance section, it detects the constituent (for example, a noun) immediately before the start time of the other acquired utterance section (that is, the interrupt time, for example ts2 in FIG. 2(b)). It then extracts the detected constituent as the keyword.
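The selection of the last noun before the interrupt time can be illustrated as follows. This is a minimal sketch under stated assumptions: the recognizer is assumed to return time-stamped tokens with part-of-speech tags, and the `Token` class, the `"noun"` tag, and the sample timings are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """A recognized word with timing (hypothetical recognizer output)."""
    text: str
    pos: str    # part-of-speech tag; the tag set here is an assumption
    end: float  # time at which the word finished being spoken, in seconds


def keyword_before_interrupt(tokens: List[Token], interrupt_time: float) -> Optional[str]:
    """Return the last noun of the preceding utterance completed before the interrupt.

    `tokens` must be in spoken order. Returns None when no noun precedes the interrupt.
    """
    for token in reversed(tokens):
        if token.pos == "noun" and token.end <= interrupt_time:
            return token.text
    return None


# Preceding utterance of speaker A, interrupted by speaker B at ts2 = 1.5 s.
preceding = [
    Token("this time", "adverb", 0.6),
    Token("New Tokyo Tower", "noun", 1.4),
    Token("is", "verb", 1.7),
]

print(keyword_before_interrupt(preceding, interrupt_time=1.5))
```

Scanning from the end of the token list keeps the search short, since the keyword is expected to sit immediately before the interrupt.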

The keyword search unit 106 performs a keyword search using the extracted keyword. Specifically, the keyword search unit 106 first connects to the search server 300 via the network 400. The search server 300 receives the keyword search request from the keyword search unit 106 and returns the search result via the network 400 to the keyword search unit 106 of the keyword extraction device 100. The keyword search unit 106 thereby receives the keyword search result from the search server 300.

The display unit 107 displays the result obtained by the keyword search unit 106, that is, the search result from the search server 300. The display unit 107 is a display device such as a display or a display panel.

In the present embodiment, the utterance section determination unit 102, the speech recognition unit 103, the interrupt detection unit 104, the keyword extraction unit 105, and the keyword search unit 106 are implemented by a processing device such as a CPU. The keyword extraction device 100 is otherwise assumed to have a known configuration including a storage device such as a memory (not shown).

Next, the operation of the keyword extraction device 100 is described with reference to FIG. 3. The description assumes, as an example, that two speakers A and B are conversing using the keyword extraction device 100 and the information terminal 200.
First, the keyword extraction device 100 (utterance section determination unit 102) determines an utterance section for each speaker in the uttered voice input from the voice input unit 101 and the information terminal 200 (step S101). In this determination, the utterance section determination unit 102 checks whether the level of each speaker's uttered voice is equal to or greater than a threshold, and determines each section in which it is equal to or greater than the threshold to be an utterance section.
For example, when the conversation between speaker A and speaker B is as shown in FIG. 2(a) or FIG. 2(b), the utterance section determination unit 102 determines the section from the start time ts1 to the end time te1 of speaker A's speech, that is, ts1 to te1, as utterance section 1 of speaker A. Likewise, it determines the section from the start time ts2 to the end time te2 of speaker B's speech, that is, ts2 to te2, as utterance section 2 of speaker B.
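The threshold-based determination of step S101 could look roughly like the sketch below. It assumes the input has already been reduced to one loudness value per fixed-length frame for a single speaker; the function name, the frame length, and the sample values are illustrative assumptions, and a practical detector would likely add smoothing and a minimum section length.

```python
from typing import List, Tuple


def speech_sections(levels: List[float], threshold: float,
                    frame_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs of frames whose level is >= threshold."""
    sections = []
    start_frame = None  # index of the first frame of the current run, if any
    for i, level in enumerate(levels):
        if level >= threshold and start_frame is None:
            start_frame = i  # a new utterance section begins
        elif level < threshold and start_frame is not None:
            sections.append((start_frame * frame_sec, i * frame_sec))
            start_frame = None  # the section has ended
    if start_frame is not None:
        # The signal ended while the speaker was still talking.
        sections.append((start_frame * frame_sec, len(levels) * frame_sec))
    return sections


# One loudness value per 1-second frame for a single speaker (made-up data).
levels = [0.0, 0.5, 0.6, 0.1, 0.7, 0.8]
print(speech_sections(levels, threshold=0.4, frame_sec=1.0))
```

Running the same function once per speaker channel yields the per-speaker sections (ts1 to te1, ts2 to te2) that the later steps compare.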

Next, the keyword extraction device 100 (speech recognition unit 103) recognizes the uttered voice of each determined utterance section for each speaker (step S102). This recognition is performed, for example, by feature analysis over frequency bands. In performing this recognition, the speech recognition unit 103 also converts the conversational speech of all speakers into text using a known speech recognition technique.

Next, the keyword extraction device 100 (interrupt detection unit 104) detects an interrupt from the determined utterance sections (step S103). Specifically, the interrupt detection unit 104 calculates the utterance interval, that is, the start time of the subsequent utterance minus the end time of the immediately preceding utterance (for example, ts2 - te1 in FIGS. 2(a) and 2(b)). If the resulting utterance interval (for example, ts2 - te1 in FIG. 2(b)) is negative, the interrupt detection unit 104 determines that the subsequent utterance interrupted the preceding one.

Next, the keyword extraction device 100 (keyword extraction unit 105) extracts and determines the keyword in the conversational speech (as recognized in step S102) in which the interrupt was detected (step S104). Specifically, the keyword extraction unit 105 extracts the noun in the preceding utterance that immediately precedes the subsequent utterance, and determines that noun to be the keyword of the utterance.
For example, suppose speaker A starts saying "This time, New Tokyo Tower is..." at time ts1 in FIG. 2(b), and speaker B starts saying "Oh, where is that going to be built?" at time ts2 in FIG. 2(b). The keyword extraction unit 105 then determines the noun "New Tokyo Tower", spoken by speaker A immediately before ts2, to be the keyword. The keyword extraction unit 105 can thus determine "New Tokyo Tower" to be the topic word of the conversation without looking it up in a database of keywords registered in advance.

When the utterance interval is positive (see FIG. 2(a)), the keyword extraction unit 105 determines that the utterance contains no keyword and does not extract one.

Next, the keyword extraction device 100 (keyword search unit 106) executes a keyword search for the determined keyword (step S105). Specifically, the keyword search unit 106 first sends the keyword search request to the search server 300 via the network 400. The search server 300 performs the requested keyword search and sends the search result to the keyword search unit 106. The keyword search unit 106 then receives the search result sent from the search server 300.

Next, the keyword search unit 106 displays the received search result on the display unit 107 (step S106). The speaker can thereby obtain information (the search result) about the keyword in the conversation (for example, New Tokyo Tower).

Instead of the interrupt detection unit 104, a silence detection unit may be operated that detects a silence, that is, an utterance interval equal to or greater than a preset threshold (for example, 3 seconds). This is also useful for extracting a feature of the utterance response that suggests the presence of a keyword.
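The interrupt test and the silence test both operate on the same signed utterance interval, so they can be sketched together. The function below is an illustration only: the name `classify_response`, the returned labels, and the default threshold of 3 seconds (taken from the example value in the text) are assumptions, not part of the described device.

```python
def classify_response(preceding_end: float, subsequent_start: float,
                      silence_threshold_sec: float = 3.0) -> str:
    """Label the transition between two utterances by its utterance interval.

    The interval is the start of the subsequent utterance minus the end of
    the preceding one (ts2 - te1).
    """
    interval = subsequent_start - preceding_end
    if interval < 0:
        return "interrupt"  # the utterances overlap
    if interval >= silence_threshold_sec:
        return "silence"    # the pause is at least the threshold
    return "normal"         # ordinary turn-taking


print(classify_response(preceding_end=2.0, subsequent_start=1.5))
print(classify_response(preceding_end=2.0, subsequent_start=5.0))
print(classify_response(preceding_end=2.0, subsequent_start=2.5))
```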

As described above, according to the present embodiment, the keyword extraction device 100 detects an interrupt as a feature of the utterance response that suggests the presence of a keyword, and extracts the keyword from the conversation. The keyword extraction device 100 can therefore extract keywords from a conversation based on whether a speaker interrupts, without predicting the keywords in advance and registering them in a database or the like.

Embodiment 1 has been described for the case where the keyword extraction device 100 executes steps S101 to S106 of FIG. 3 sequentially, but the invention is not limited to this. For example, the keyword extraction device 100 may execute the steps of FIG. 3 in a different order, or may execute them in parallel.

(Embodiment 2)
The keyword extraction device of Embodiment 2 extracts keywords from a conversation based on a pitch (height of the voice) pattern, which is a feature of the utterance response.
FIG. 4 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 2 of the present invention. In Embodiment 2, parts identical to those of Embodiment 1 are given the same reference numerals and terms as in Embodiment 1, and redundant description is omitted.
In FIG. 4, the keyword extraction device 100A has a pitch determination unit 201 and a pitch pattern determination unit 202 in place of the interrupt detection unit 104 of Embodiment 1 in FIG. 1. The keyword extraction device 100A also differs from Embodiment 1 in that it has a keyword extraction unit 105A in place of the keyword extraction unit 105 of Embodiment 1 in FIG. 1. The pitch determination unit 201, the pitch pattern determination unit 202, and the keyword extraction unit 105A are implemented by a processing device such as a CPU. The configuration of the rest of the system, including the information terminal 200, is the same as in FIG. 1.

For the utterance sections determined by the utterance section determination unit 102, the pitch determination unit 201 and the pitch pattern determination unit 202 (together also referred to as the utterance response feature extraction unit) extract a pitch pattern, which is a feature of the utterance, based on each speaker's uttered voice. Specifically, the pitch determination unit 201 determines the pitch of the uttered voice. The pitch determination unit 201 of the present embodiment divides the uttered voice into frames of, for example, 10 ms and determines the pitch of each frame.

Based on the determined pitch, the pitch pattern determination unit 202 identifies a pitch pattern (a feature of the utterance) in which the end of the preceding utterance has a falling pitch (see tc1 to te1 in FIG. 5) and the subsequent utterance immediately following it has a rising pitch (see tc2 to te2 in FIG. 5). FIG. 5 shows an example of this determination. The horizontal axis of FIG. 5 represents time, and the vertical axis represents frequency.
In FIG. 5, the utterance section ts1 to te1 contains the preceding utterance "New Tokyo Tower is", and the utterance section ts2 to te2 contains the subsequent utterance "Is that ...?". A falling pitch is determined at the end of the preceding utterance "New Tokyo Tower is", and a rising pitch is determined for the subsequent utterance "Is that ...?". The pitch pattern determination unit 202 arrives at these determinations as follows.

In the utterance section ts1 to te1 of "New Tokyo Tower is" in FIG. 5, the frequency f at the end of the utterance section is lower than the frequency f at its midpoint tc1, so the pitch pattern determination unit 202 determines a falling pitch. In the utterance section ts2 to te2 of "Is that ...?" in FIG. 5, the frequency f at the end of the utterance section is higher than the frequency f at its midpoint tc2, so the pitch pattern determination unit 202 determines a rising pitch.
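The midpoint-versus-end comparison can be sketched as follows. This is a simplified illustration that assumes one fundamental-frequency value (in Hz) per frame is already available for the utterance section; the function name, the `"flat"` label for equal values, and the sample contours are assumptions made here.

```python
from typing import List


def pitch_direction(f0: List[float]) -> str:
    """Compare the frequency at the midpoint of a section with the frequency at its end."""
    if len(f0) < 2:
        raise ValueError("need at least two pitch values")
    mid = f0[len(f0) // 2]  # frequency at the midpoint (tc)
    end = f0[-1]            # frequency at the end of the section (te)
    if end > mid:
        return "rising"
    if end < mid:
        return "falling"
    return "flat"


# Made-up contours: a statement-like fall and a question-like rise.
falling_contour = [210.0, 205.0, 200.0, 185.0, 170.0]
rising_contour = [180.0, 178.0, 182.0, 200.0, 230.0]

print(pitch_direction(falling_contour))
print(pitch_direction(rising_contour))
```

Comparing only two points keeps the test cheap, at the cost of ignoring the shape of the contour between them.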

The pitch pattern determination unit 202 of the present embodiment has been described as determining a rising or falling pitch with reference to the frequency at the midpoint of the utterance section, but the invention is not limited to this. For example, the determination may instead use, as the reference, the point a predetermined period (for example, a time T) before the end of the utterance section (for example, te1 or te2 in FIG. 5).

The keyword extraction unit 105A extracts a keyword from the preceding utterance identified by the determined pitch pattern. In doing so, the keyword extraction unit 105A extracts, for example, the last constituent (for example, a noun) of that preceding utterance as the keyword.

Next, the operation of the keyword extraction device 100A is described with reference to FIG. 6. The description assumes, as an example, that speaker A says "This time, New Tokyo Tower is..." using the keyword extraction device 100A, after which speaker B says "Is that ...?" using the information terminal 200. Steps S101 to S102 and S105 to S106 in FIG. 6 are the same as steps S101 to S102 and S105 to S106 in FIG. 3, so their description is abbreviated where appropriate.

First, the keyword extraction device 100A (utterance section determination unit 102) determines an utterance section for each speaker (see utterance section 1 in FIG. 2(a) and utterance section 2 in FIG. 2(b)) in the uttered voice input from the voice input unit 101 and the information terminal 200 (step S101). Next, the keyword extraction device 100A (speech recognition unit 103) recognizes the uttered voice of each determined utterance section for each speaker (step S102).

Next, the keyword extraction device 100A (pitch determination unit 201) determines the pitch of the speech based on, for example, the speech in utterance section 1 of speaker A's preceding utterance (see FIG. 2(a)) and in utterance section 2 of speaker B's following utterance (see FIG. 2(b)) (step S103A).

Next, based on the determined pitch, the keyword extraction device 100A (pitch pattern determination unit 202) determines whether the transition from the preceding utterance to the following utterance exhibits a pitch pattern that goes from a falling pitch to a rising pitch (step S103B). Specifically, the pitch pattern determination unit 202 looks for a pattern in which the end of the preceding utterance has a falling pitch (see the interval tc1 to te1 in FIG. 5) and the following utterance immediately after it has a rising pitch (see the interval tc2 to te2 in FIG. 5).
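The pitch-pattern test of steps S103A and S103B can be sketched as follows. This is only a minimal illustration, assuming each utterance section is already available as a list of per-frame fundamental-frequency values in Hz. The function names `pitch_direction` and `has_fall_rise_pattern` are hypothetical, and the comparison of the midpoint frequency with the final frame is one possible reading of "with reference to the midpoint", not a detail given in the specification.

```python
from typing import Sequence


def pitch_direction(f0: Sequence[float]) -> str:
    """Classify the tail of an utterance section as 'rising', 'falling' or 'flat'.

    Assumption: the frequency at the midpoint of the section is the reference
    and is compared with the frequency of the last frame.
    """
    if len(f0) < 2:
        return "flat"
    mid = f0[len(f0) // 2]
    end = f0[-1]
    if end > mid:
        return "rising"
    if end < mid:
        return "falling"
    return "flat"


def has_fall_rise_pattern(preceding_f0: Sequence[float],
                          following_f0: Sequence[float]) -> bool:
    """True when the preceding utterance ends falling and the following one ends rising."""
    return (pitch_direction(preceding_f0) == "falling"
            and pitch_direction(following_f0) == "rising")
```

A real system would first need a pitch tracker and some smoothing of the contour; both are outside this sketch.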

Next, the keyword extraction device 100A (keyword extraction unit 105A) extracts a keyword from the preceding utterance identified by the determined pitch pattern (for example, "New Tokyo Tower is" (「新東京タワーが」) in FIG. 5), using the speech recognized in step S102 (step S104A). In doing so, the keyword extraction unit 105A extracts, for example, "New Tokyo Tower", the last noun in that preceding utterance, as the keyword.
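Selecting the last noun of the preceding utterance can be illustrated as below. The sketch assumes the recognition result has already been split into tokens with part-of-speech tags; the `Token` layout, the tag name `"noun"`, and the function `last_noun` are hypothetical and stand in for whatever morphological analysis the recognizer provides.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); hypothetical layout


def last_noun(tokens: List[Token]) -> Optional[str]:
    """Return the surface form of the last token tagged as a noun, or None."""
    for surface, pos in reversed(tokens):
        if pos == "noun":
            return surface
    return None


# Hypothetical tagging of the example utterance 「今度、新東京タワーが」
utterance = [("今度", "adverb"), ("新東京タワー", "noun"), ("が", "particle")]
keyword = last_noun(utterance)
```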

Next, the keyword extraction device 100A (keyword search unit 106) runs a keyword search for the keyword determined above on the search server 300 via the network 400 (step S105). The keyword search unit 106 then displays the received search results on the display unit 107 (step S106). The speakers can thereby obtain information (search results) about the word under discussion (for example, New Tokyo Tower).

As described above, according to the present embodiment, the keyword extraction device 100A determines a pitch pattern, which is a feature of an utterance response that suggests the presence of a keyword, and extracts a keyword from the conversation. The keyword extraction device 100A can therefore extract keywords from a conversation based on the presence or absence of the pitch pattern, without anticipating the keywords used in the conversation and registering them in a database or the like in advance.

In Embodiment 2, the keyword extraction device 100A has been described as executing steps S101 to S102, S103A to S103B, S104A, and S105 to S106 in FIG. 6 in sequence, but it is not limited to this. For example, the keyword extraction device 100A may execute the steps of FIG. 6 in a different order, or may execute them in parallel.

(Embodiment 3)
The keyword extraction device of Embodiment 3 extracts keywords from a conversation based on function phrases, which are a feature of an utterance response.
FIG. 7 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 3 of the present invention. In Embodiment 3, parts identical to those of Embodiment 1 are given the same reference numerals and terms as in Embodiment 1, and redundant description is omitted.
In FIG. 7, the keyword extraction device 100B has a function phrase extraction unit 301 (an utterance response feature extraction unit) in place of the interrupt detection unit 104 of Embodiment 1 in FIG. 1, and additionally has a function phrase storage unit 302. The keyword extraction device 100B also differs from Embodiment 1 in having a keyword extraction unit 105B in place of the keyword extraction unit 105 of Embodiment 1 in FIG. 1. The function phrase extraction unit 301 is a processing device such as a CPU, and the function phrase storage unit 302 is a storage device such as a memory. Otherwise, the configuration of the overall system, including the information terminal 200, is the same as in FIG. 1.

The function phrase storage unit 302 stores predetermined function phrases. A function phrase is a word or phrase that indicates the type of a response and is used across conversations regardless of their content. Examples include interrogatives such as "is it?" (「ですか？」); expressions of agreement such as "nice" (「いいね」), "I see" (「なるほど」), and "that's it" (「それだ」); negations such as "no" (「違う」); requests such as "please" (「お願いします」); exclamations such as "oh" (「ああ」); and retorts such as 「なんでやねん」 (roughly "what are you talking about?").

The function phrase extraction unit 301 extracts from the speech the function phrases that characterize it. Specifically, the function phrase extraction unit 301 compares the words contained in the target speech with the function phrases in the function phrase storage unit 302 and extracts the function phrases contained in that speech.

Next, the operation of the keyword extraction device 100B is described with reference to FIG. 8. The description assumes, for example, that speaker A, using the keyword extraction device 100B, says "I hear a New Tokyo Tower is going to be built." (「今度、新東京タワーができるんだって。」), after which speaker B, using the information terminal 200, says "Oh, where is that going to be built?" (「ああ、それってどこにできるんですか？」). Steps S101 to S102 and S105 to S106 in FIG. 8 are the same as steps S101 to S102 and S105 to S106 in FIG. 3, so their description is abbreviated where appropriate.

First, the keyword extraction device 100B (utterance section determination unit 102) determines the utterance section of each speaker (see utterance section 1 in FIG. 2(a) and utterance section 2 in FIG. 2(b)) for the speech input from the voice input unit 101 and the information terminal 200 (step S101). Next, the keyword extraction device 100B (speech recognition unit 103) recognizes the speech in each determined utterance section, speaker by speaker (step S102).

Next, the keyword extraction device 100B (function phrase extraction unit 301) extracts a function phrase, such as one marking a question, from the speech in, for example, utterance section 1 of speaker A's preceding utterance (see FIG. 2(a)) and utterance section 2 of speaker B's following utterance (see FIG. 2(b)) (step S103C). Specifically, the function phrase extraction unit 301 compares the sequence of words contained in the target speech with the function phrases in the function phrase storage unit 302 and extracts the function phrases contained in that speech. In the present embodiment, the function phrase extraction unit 301 extracts the interrogative function phrase "is it?" (「ですか？」) from the utterance "Oh, where is that going to be built?" (「ああ、それってどこにできるんですか？」). The word sequence of the speech may be taken from the speech recognition result described above.
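The comparison against the function phrase storage unit 302 can be sketched as a simple lookup. The table below reuses the example phrases given for the storage unit; the response-type labels, the name `extract_function_phrases`, and the use of substring matching on the recognized text (rather than matching on a tokenized word sequence) are simplifying assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the function phrase storage unit 302:
# phrase -> response type.
FUNCTION_PHRASES: Dict[str, str] = {
    "ですか？": "question",
    "いいね": "agreement",
    "なるほど": "agreement",
    "それだ": "agreement",
    "違う": "negation",
    "お願いします": "request",
    "ああ": "exclamation",
    "なんでやねん": "retort",
}


def extract_function_phrases(text: str) -> List[Tuple[str, str]]:
    """Return (phrase, response type) pairs found in text, ordered by first position."""
    hits = [(text.find(phrase), phrase, kind)
            for phrase, kind in FUNCTION_PHRASES.items()
            if phrase in text]
    return [(phrase, kind) for _, phrase, kind in sorted(hits)]


found = extract_function_phrases("ああ、それってどこにできるんですか？")
```

For the example utterance this finds both the exclamation 「ああ」 and the interrogative 「ですか？」; the embodiment's example uses the interrogative.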

Next, the keyword extraction device 100B (keyword extraction unit 105B) extracts a keyword from the utterance, as recognized in step S102, that immediately precedes the utterance containing the extracted function phrase (step S104B). In doing so, the keyword extraction unit 105B extracts, for example, "New Tokyo Tower", the noun at the end (immediately before the interruption) of the immediately preceding utterance "I hear a New Tokyo Tower is going to be built." (「今度、新東京タワーができるんだって。」), as the keyword.

Next, the keyword extraction device 100B (keyword search unit 106) runs a keyword search for the extracted keyword on the search server 300 via the network 400 (step S105). The keyword search unit 106 then displays the received search results on the display unit 107 (step S106). The speakers can thereby obtain information (search results) about the keyword under discussion in the conversation (for example, New Tokyo Tower).

According to the present embodiment, the keyword extraction unit 105B can also be operated so that, when an interrogative function phrase is extracted from the preceding utterance, the keyword is extracted from the following utterance immediately after it. An example is the case where speaker A asks "What was that again?" (「あれって何だっけ？」) and speaker B answers "You mean New Tokyo Tower, perhaps." (「新東京タワーのことかな。」): the function phrase "what was it?" (「何だっけ？」) is extracted from the preceding utterance, and the keyword "New Tokyo Tower" is extracted from the following utterance. Whether the keyword is taken from the immediately preceding utterance or from the immediately following utterance can be switched as follows. When the utterance contains the demonstrative pronoun 「それ」 ("that", referring to something just mentioned), the keyword is taken from the immediately preceding utterance; when it contains the demonstrative pronoun 「あれ」 ("that", referring to something being recalled), it is taken from the immediately following utterance; in all other cases it is taken from the immediately following utterance. In this case, the feature of the utterance response may also be captured by additionally using, in the same manner as in Embodiment 2, a pitch pattern in which the preceding utterance has a rising pitch and the following utterance has a falling pitch.
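The switching rule for the source utterance can be written down directly. The sketch below assumes the utterance containing the function phrase is available as recognized text; the function name `keyword_source` and the plain substring test for the demonstratives are illustrative assumptions.

```python
def keyword_source(utterance_with_phrase: str) -> str:
    """Decide which neighbouring utterance the keyword should be taken from.

    Returns 'preceding' or 'following', relative to the utterance that
    contains the function phrase.
    """
    if "それ" in utterance_with_phrase:
        # The speaker refers back to something just said by the other party.
        return "preceding"
    if "あれ" in utterance_with_phrase:
        # The speaker is trying to recall something; the answer comes next.
        return "following"
    # All other cases also use the following utterance, as stated in the text.
    return "following"
```

Both the 「あれ」 branch and the default return the same value; they are kept separate only to mirror the three cases listed in the description.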

As described above, according to the present embodiment, the keyword extraction device 100B extracts function phrases (interrogatives and the like) that are used in common regardless of the content (genre) of the conversation, and uses them to extract keywords from the conversation. Because such function phrases occur in any conversation, the keyword extraction device 100B can extract keywords without anticipating the keywords for each genre of conversation and registering them in a database or the like in advance.

In Embodiment 3, the keyword extraction device 100B has been described as executing steps S101 to S102, S103C, S104B, and S105 to S106 in FIG. 8 in sequence, but it is not limited to this. For example, the keyword extraction device 100B may execute the steps of FIG. 8 in a different order, or may execute them in parallel.

(Embodiment 4)
The keyword extraction device of Embodiment 4 extracts keywords from a conversation based on changes in the facial expression of a person listening to the speech.
FIG. 9 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 4 of the present invention. In Embodiment 4, parts identical to those of Embodiment 1 are given the same reference numerals and terms as in Embodiment 1, and redundant description is omitted.

In FIG. 9, the keyword extraction device 100C has a video input unit 401 and a facial expression recognition unit 402 (together also referred to as an utterance response feature extraction unit) in place of the interrupt detection unit 104 of Embodiment 1 in FIG. 1. The keyword extraction device 100C also differs from Embodiment 1 in having a keyword extraction unit 105C in place of the keyword extraction unit 105 of Embodiment 1 in FIG. 1. The video input unit 401 is a camera, and the facial expression recognition unit 402 is a processing device such as a CPU. Otherwise, the configuration of the overall system, including the information terminal 200, is the same as in FIG. 1.

The video input unit 401 inputs image data that includes the user's face. The facial expression recognition unit 402 converts this image data into digital original image data on which facial expression estimation can be performed, extracts the user's face region from the original image data, and extracts from the face region the contour positions of at least one facial organ, such as the eyes or the mouth. The facial expression recognition unit 402 then extracts the upper and lower contours of the facial organs acquired over a plurality of video frames and recognizes the user's facial expression (for example, neutral, surprise, joy, or anger) from how far the contours are open and how they are curved.
In doing so, the facial expression recognition unit 402 associates times within each speaker's utterance section, obtained from the utterance section determination unit 102, with the facial expression recognition results for persons other than that speaker. The facial expression recognition unit 402 further extracts the points at which the expression changes from these recognition results.
For example, in FIG. 10, t10 is the start time of utterance section 1 by speaker A, and t11 and t12 are equally spaced times following t10; t20 is the start time of utterance section 2 by speaker B, and t21 and t22 are equally spaced times following t20. The facial expression recognition unit 402 recognizes and associates the expression of speaker B at each of times t10, t11, and t12 and the expression of speaker A at each of times t20, t21, and t22. In this example, speaker B's expression at time t11 is one of surprise, and at all other times the expression is neutral for both speakers. The facial expression recognition unit 402 therefore extracts time t11 as the point at which the expression changes.

When the facial expression recognition unit 402 recognizes that the expression was neutral at the start of an utterance and changed to another expression partway through it, the keyword extraction unit 105C extracts as the keyword the word uttered at the time corresponding to the change point. The keyword extraction unit 105C may find the word at the corresponding time from the per-word section information in the speech recognition result, or may estimate it from, for example, the number of syllables in the speech. The corresponding time here is the time that matches the end of a word to the appearance of the expression, allowing for the delay (for example, 0.1 seconds) between perceiving the word and the reaction showing on the face.
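Mapping an expression change point back to a word can be sketched as follows, for the case where per-word section information is available from the recognizer. The `WordSpan` layout, the function name `word_at_change_point`, and the choice of the word whose end time lies closest to the change time minus the reaction delay are assumptions made for illustration; the 0.1 second default comes from the example value in the text.

```python
from typing import List, Optional, Tuple

WordSpan = Tuple[str, float, float]  # (word, start time, end time) in seconds


def word_at_change_point(words: List[WordSpan],
                         change_time: float,
                         latency: float = 0.1) -> Optional[str]:
    """Return the word whose end best matches the expression change.

    The listener's reaction is assumed to appear `latency` seconds after the
    word has been heard, so the target time is change_time - latency.
    """
    if not words:
        return None
    target = change_time - latency
    word, _start, _end = min(words, key=lambda w: abs(w[2] - target))
    return word


# Hypothetical timing for 「新東京タワーが今度できる」
spans = [("新東京タワー", 0.0, 0.9), ("今度", 1.0, 1.3), ("できる", 1.3, 1.7)]
keyword = word_at_change_point(spans, change_time=1.0)
```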

Next, the operation of the keyword extraction device 100C is described with reference to FIG. 11. The description assumes, for example, that speaker A, using the keyword extraction device 100C, says "New Tokyo Tower is going to be built" (「新東京タワーが今度できる」), after which speaker B, using the information terminal 200, says "What is that?" (「それって何ですか？」). Steps S101 to S102 and S105 to S106 in FIG. 11 are the same as steps S101 to S102 and S105 to S106 in FIG. 3, so their description is abbreviated where appropriate. Speaker B's audio and video are actually input through the information terminal 200, but for convenience they are described as being input through the voice input unit 101 and the video input unit 401, in the same way as for speaker A.

First, the keyword extraction device 100C (utterance section determination unit 102) determines the utterance section of each speaker (see utterance section 1 and utterance section 2 in FIG. 10) for the speech input from the voice input unit 101 (step S101). Next, the keyword extraction device 100C (speech recognition unit 103) recognizes the speech in each determined utterance section, speaker by speaker (step S102).

Meanwhile, the keyword extraction device 100C (video input unit 401 and facial expression recognition unit 402) recognizes, for example, the expression of speaker B at the times corresponding to the speech in utterance section 1, the preceding utterance by speaker A (see FIG. 10), and the expression of speaker A at the times corresponding to the speech in utterance section 2, the following utterance by speaker B (see FIG. 10). In other words, it recognizes not the expression of the person speaking but the expression of the person listening, that is, the other speaker's facial reaction to the speech (step S103D).

Next, when the recognized expression is neutral at the start of an utterance and is recognized to have changed to another expression partway through it, the keyword extraction device 100C (keyword extraction unit 105C) extracts as the keyword the word uttered at the time corresponding to the change point (step S104C). In the example above, "New Tokyo Tower" is extracted as the word corresponding to the time at which the expression changed from neutral to surprise.

Next, the keyword extraction device 100C (keyword search unit 106) runs a keyword search for the keyword determined above on the search server 300 via the network 400 (step S105). The keyword search unit 106 then displays the received search results on the display unit 107 (step S106). The speakers can thereby obtain information (search results) about the word under discussion (for example, New Tokyo Tower).

As described above, according to the present embodiment, the keyword extraction device 100C extracts keywords from a conversation based on the facial expression recognition results for the other person, who is listening to the speech. The keyword extraction device 100C can therefore extract keywords from a conversation using a feature of the utterance response that appears as a change in facial expression, without anticipating the keywords used in the conversation and registering them in a database or the like in advance.

The same effect can be obtained without the facial expression recognition unit 402 by quantifying, for example, how far the eyes and the mouth are open and detecting a change in expression solely from the magnitude of the change in those values.

In Embodiment 4, the keyword extraction device 100C has been described as executing steps S101 to S102, S103D, S104C, and S105 to S106 in FIG. 11 in sequence, but it is not limited to this. For example, the keyword extraction device 100C may execute the steps of FIG. 11 in a different order, or may execute them in parallel.

(Embodiment 5)
The keyword extraction device of Embodiment 5 extracts keywords from a conversation based on the excitement reaction of a person listening to the speech.
FIG. 12 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 5 of the present invention. In Embodiment 5, parts identical to those of Embodiment 1 are given the same reference numerals and terms as in Embodiment 1, and redundant description is omitted.

In FIG. 12, the keyword extraction device 100D has an excitement reaction detection unit 501 (also referred to as an utterance response feature extraction unit) in place of the interrupt detection unit 104 of Embodiment 1 in FIG. 1. The keyword extraction device 100D also differs from Embodiment 1 in having a keyword extraction unit 105D in place of the keyword extraction unit 105 of Embodiment 1 in FIG. 1. The excitement reaction detection unit 501 is a processing device such as a CPU. Otherwise, the configuration of the overall system, including the information terminal 200, is the same as in FIG. 1.

The excitement reaction detection unit 501 detects excitement reactions from speech and other sounds. Specifically, it detects an excitement reaction by detecting laughter, highly excited speech, clapping, the sound of a hand slapping a knee, and the like. For laughter, clapping, and knee slapping, the excitement reaction detection unit 501 prepares training samples in advance, builds a GMM (Gaussian mixture model) from them, computes the likelihood of the input under the model, and applies a threshold. For highly excited speech, the excitement reaction detection unit 501 normalizes the loudness, the pitch height, and the speaking rate by the speaker's average values, combines the normalized quantities linearly into a single score, and applies a threshold.
In doing so, the excitement reaction detection unit 501 treats an excitement reaction detected near the end of an utterance section determined by the utterance section determination unit 102 as the excitement reaction corresponding to that utterance.
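The score for highly excited speech can be sketched as below. This is an illustration only: the class `SpeakerAverages`, the functions `excitement_score` and `is_excited`, the unit weights, and the threshold of 3.6 are all hypothetical, since the specification states only that the three quantities are normalized by the speaker's averages, combined linearly, and thresholded. The GMM-based detection of laughter and clapping is not covered here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SpeakerAverages:
    """Per-speaker mean values used for normalization (hypothetical units)."""
    volume: float  # e.g. dB
    pitch: float   # e.g. Hz
    rate: float    # e.g. syllables per second


def excitement_score(volume: float, pitch: float, rate: float,
                     avg: SpeakerAverages,
                     weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Linear combination of loudness, pitch and rate, each divided by the speaker's mean."""
    w_volume, w_pitch, w_rate = weights
    return (w_volume * volume / avg.volume
            + w_pitch * pitch / avg.pitch
            + w_rate * rate / avg.rate)


def is_excited(score: float, threshold: float = 3.6) -> bool:
    """Threshold test; 3.6 is an arbitrary value chosen for this sketch."""
    return score > threshold


avg = SpeakerAverages(volume=60.0, pitch=200.0, rate=5.0)
calm = excitement_score(60.0, 200.0, 5.0, avg)     # every ratio is 1.0
lively = excitement_score(90.0, 300.0, 7.5, avg)   # every ratio is 1.5
```

With unit weights, speech at the speaker's own averages scores 3.0, so a threshold somewhat above 3.0 separates ordinary speech from speech that is louder, higher, and faster than usual.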

The keyword extraction unit 105D extracts a keyword from the utterance corresponding to the excitement reaction.

Next, the operation of the keyword extraction device 100D is described with reference to FIG. 13. The description assumes, for example, that speaker A, using the keyword extraction device 100D, says "New Tokyo Tower is going to ..." (「今度、新東京タワーが・・・」), after which speaker B, using the information terminal 200, laughs, saying "Ahaha" (「あはは」). Steps S101 to S102 and S105 to S106 in FIG. 13 are the same as steps S101 to S102 and S105 to S106 in FIG. 3, so their description is abbreviated where appropriate.

First, the keyword extraction device 100D (utterance section determination unit 102) determines the utterance section of each speaker for the speech input from the voice input unit 101 and the information terminal 200 (step S101). Next, the keyword extraction device 100D (speech recognition unit 103) recognizes the speech in each determined utterance section, speaker by speaker (step S102).

Next, the keyword extraction device 100D (excitement reaction detection unit 501) detects whether an excitement reaction is present in the vicinity of, for example, an utterance section spoken by speaker A (step S103E). In the example above, the laughter GMM matches with high likelihood immediately after speaker A's utterance section, so an excitement reaction is detected.

Next, the keyword extraction device 100D (keyword extraction unit 105D) extracts as the keyword a word uttered within the utterance section corresponding to the excitement reaction (for example, "New Tokyo Tower") (step S104D).

Next, the keyword extraction device 100D (keyword search unit 106) runs a keyword search for the keyword determined above on the search server 300 via the network 400 (step S105). The keyword search unit 106 then displays the received search results on the display unit 107 (step S106). The speakers can thereby obtain information (search results) about the word under discussion (for example, New Tokyo Tower).

As described above, according to the present embodiment, the keyword extraction device 100D detects the excitement reaction of a person listening to the speech and extracts keywords from the conversation. The keyword extraction device 100D can therefore extract keywords from a conversation using a feature of the utterance response that appears as excitement, such as laughter or clapping, without anticipating the keywords used in the conversation and registering them in a database or the like in advance.

In Embodiment 5, the keyword extraction device 100D has been described as executing steps S101 to S102, S103E, S104D, and S105 to S106 in FIG. 13 in sequence, but it is not limited to this. For example, the keyword extraction device 100D may execute the steps of FIG. 13 in a different order, or may execute them in parallel.

Embodiments 1 to 3 and 5 have been described for the case where the keyword extraction device (keyword extraction unit) extracts, as the keyword, the noun at the end of the utterance section (immediately before the interrupt), but the invention is not limited to this. For example, the keyword extraction unit may extract, as the keyword, the conceptually lowest noun among the plurality of nouns included in the preceding utterance to be processed. In this case, the keyword extraction device further includes a dictionary information storage unit (not shown) such as a memory, which stores dictionary information that classifies and organizes the relationships between conceptually higher nouns (for example, Italian cuisine) and conceptually lower nouns (for example, pasta). The keyword extraction unit then extracts, as the keyword, the conceptually lowest noun in the dictionary information from among the nouns included in the utterance to be processed. A noun of a lower-level concept is thereby extracted as the keyword.
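The selection of the conceptually lowest noun can be illustrated with a minimal sketch. The `PARENT` table stands in for the dictionary information; its layout, the root entry "cuisine", and the function names are assumptions made for illustration and are not taken from the specification.

```python
# Hypothetical dictionary information: each noun maps to its conceptually higher noun.
PARENT = {
    "pasta": "Italian cuisine",
    "Italian cuisine": "cuisine",  # "cuisine" is an assumed root concept
}


def depth(noun):
    """Count the steps from a noun up to the top of the concept hierarchy."""
    steps = 0
    while noun in PARENT:
        noun = PARENT[noun]
        steps += 1
    return steps


def lowest_concept_noun(nouns):
    """Return the noun lowest in the hierarchy; nouns absent from the dictionary are ignored."""
    known = [n for n in nouns if n in PARENT or n in PARENT.values()]
    if not known:
        return None
    return max(known, key=depth)
```

With the nouns "Italian cuisine" and "pasta" in one utterance, this sketch selects "pasta" because it sits deeper in the assumed hierarchy.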

In Embodiments 1 to 3 and 5, the keyword extraction unit may instead extract, as the keyword, the noun with the highest pitch among the nouns included in the utterance to be processed, or the noun used most often. Alternatively, the keyword extraction unit may extract, as the keyword, the noun for which the combination of parameters such as pitch and number of uses is optimal (that is, matches a predetermined parameter pattern).
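One way to picture the three selection rules is a single scoring function. The specification does not define the "predetermined parameter pattern", so the weighted sum below, the weights, and the name `pick_keyword` are assumptions for illustration only.

```python
from collections import Counter


def pick_keyword(nouns, pitches, w_pitch=0.5, w_count=0.5):
    """Pick a noun by a weighted mix of relative pitch and relative frequency.

    nouns:   nouns in utterance order (repeats allowed)
    pitches: positive mean pitch in Hz for each occurrence, same length as nouns
    """
    if not nouns:
        return None
    counts = Counter(nouns)
    max_pitch = {}
    for noun, pitch in zip(nouns, pitches):
        max_pitch[noun] = max(pitch, max_pitch.get(noun, 0.0))
    top_pitch = max(max_pitch.values())
    top_count = max(counts.values())

    def score(noun):
        # Assumed combination rule: both terms are scaled to the range 0..1.
        return w_pitch * max_pitch[noun] / top_pitch + w_count * counts[noun] / top_count

    return max(counts, key=score)
```

Setting `w_count=0` reduces this to the highest-pitch rule, and setting `w_pitch=0` reduces it to the most-frequent rule.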

Although the present invention has been described in detail and with reference to specific embodiments, it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention.
This application is based on a Japanese patent application filed on March 29, 2007 (Japanese Patent Application No. 2007-088321), the contents of which are incorporated herein by reference.

The keyword extraction device of the present invention is useful for extracting important keywords contained in a conversation. It can be applied to telephones, in-vehicle terminals, televisions, conference systems, call center systems, personal computers, and the like.

However, the apparatus described in Patent Document 1 is difficult to use because the correspondence data must be prepared for each anticipated scene.
The present invention has been made to address this situation, and its object is to provide a keyword extraction device that can extract the keywords in a conversation without predicting and preparing those keywords in advance.

According to the keyword extraction device of the present invention, the keywords in a conversation can be extracted without predicting and preparing those keywords in advance.

Embodiments 1 to 5 of the present invention are described below with reference to the drawings. The embodiments assume, as an example, a scene in which two speakers A and B converse using information terminals such as mobile phones.
(Embodiment 1)
FIG. 1 is a block diagram showing a configuration example of the entire system including the keyword extraction device according to Embodiment 1 of the present invention.
In FIG. 1, the keyword extraction device 100 is the information terminal of a speaker A and is configured to connect to a network 400 such as the Internet. The information terminal 200 of another speaker B and a search server 300 are also connected to the network 400. The keyword extraction device 100 and the information terminal 200 are information terminals such as mobile phones, notebook personal computers, or personal digital assistants. The search server 300 is a server equipped with a known search engine.

The keyword extraction device 100 includes a voice input unit 101, an utterance section determination unit 102, a speech recognition unit 103, an interrupt detection unit 104, a keyword extraction unit 105, a keyword search unit 106, and a display unit 107.
The voice input unit 101 inputs the voice of a speaker (hereinafter referred to as the utterance voice). The voice input unit 101 is, for example, a microphone or a communication interface to the network 400.

The utterance section determination unit 102 determines an utterance section for each speaker in the input utterance voice. An utterance section is the section from when a speaker starts speaking until that speaker finishes.
For example, when the conversation between speaker A and speaker B is as shown in FIG. 2(a) or FIG. 2(b), the utterance section determination unit 102 determines the section from the start time ts1 to the end time te1 of speaker A's speech, that is, ts1-te1, as utterance section 1 of speaker A. Likewise, the utterance section determination unit 102 determines the section from the start time ts2 to the end time te2 of speaker B's speech, that is, ts2-te2, as utterance section 2 of speaker B.

For the determined utterance sections, the interrupt detection unit 104 (utterance response feature extraction unit) detects, based on the utterance voice of each speaker, a feature of the utterance, namely an interrupt in which the preceding utterance and the subsequent utterance overlap. For example, when the conversation between speaker A and speaker B is the one shown in FIG. 2(b), the subsequent utterance of speaker B starts in the middle of the preceding utterance of speaker A, that is, at ts2, so the interrupt detection unit 104 detects an interrupt. The detection method is as follows.
The interrupt detection unit 104 first measures the interval from the end time of the preceding utterance to the start time of the subsequent utterance that follows it (hereinafter referred to as the utterance interval). In the case of FIGS. 2(a) and 2(b), the interrupt detection unit 104 computes the utterance interval as utterance interval = ts2 - te1. Next, the interrupt detection unit 104 determines whether the computed utterance interval is negative (see FIG. 2(b)). If the utterance interval is negative, the interrupt detection unit 104 detects that an interrupt has occurred.
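The interrupt test described above reduces to the sign of ts2 - te1. The following minimal sketch shows that calculation; the `Utterance` type and the function names are hypothetical and are not part of the specification.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # start time in seconds (ts)
    end: float    # end time in seconds (te)


def utterance_interval(preceding, subsequent):
    """Utterance interval = start of the subsequent utterance minus end of the preceding one."""
    return subsequent.start - preceding.end


def is_interrupt(preceding, subsequent):
    """A negative interval means the subsequent utterance began before the preceding one ended."""
    return utterance_interval(preceding, subsequent) < 0
```

A case like FIG. 2(a), where speaker B waits for speaker A to finish, gives a positive interval, while a case like FIG. 2(b) gives a negative one.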

When extracting the keyword, the keyword extraction unit 105 extracts, for example, the last constituent (for example, a noun) of the interrupted preceding utterance as the keyword. Here, the end of the preceding utterance means the part of the utterance section before the interrupt time (for example, time ts2 in FIG. 2(b)), such as ts1-ts2 in FIG. 2(b).
Specifically, the keyword extraction unit 105 first selects, from the acquired utterance sections of the speakers (for example, utterance sections 1 and 2 in FIG. 2(b)), the utterance section with the earlier start time (for example, utterance section 1 in FIG. 2(b)). Next, within the selected utterance section, the keyword extraction unit 105 finds the constituent (for example, a noun) immediately before the start time of the other utterance section, that is, the interrupt time (for example, ts2 in FIG. 2(b)). The keyword extraction unit 105 then extracts that constituent as the keyword.
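The selection of the constituent immediately before the interrupt time can be sketched as follows. The sketch assumes that the speech recognition result supplies a part-of-speech tag and timing for every word; the `Word` type, its fields, and the function name are hypothetical.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    text: str
    pos: str      # part of speech, e.g. "noun" (assumed to come from the recognizer)
    start: float  # seconds
    end: float    # seconds


def keyword_before_interrupt(words, interrupt_time) -> Optional[str]:
    """Return the last noun of the preceding utterance that ended by the interrupt time."""
    candidates = [w for w in words if w.pos == "noun" and w.end <= interrupt_time]
    if not candidates:
        return None
    return max(candidates, key=lambda w: w.end).text
```

Here `words` holds the recognized words of the utterance section with the earlier start time, and `interrupt_time` corresponds to ts2 in FIG. 2(b).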

Next, the operation of the keyword extraction device 100 is described with reference to FIG. 3. The description assumes, as an example, that two speakers A and B are conversing using the keyword extraction device 100 and the information terminal 200.
First, the keyword extraction device 100 (utterance section determination unit 102) determines an utterance section for each speaker in the utterance voices input from the voice input unit 101 and the information terminal 200 (step S101). In this determination, the utterance section determination unit 102 judges whether the loudness of each speaker's utterance voice is equal to or greater than a threshold, and determines a section in which it is equal to or greater than the threshold to be an utterance section.
For example, when the conversation between speaker A and speaker B is as shown in FIG. 2(a) or FIG. 2(b), the utterance section determination unit 102 determines the section from the start time ts1 to the end time te1 of speaker A's speech, that is, ts1-te1, as utterance section 1 of speaker A. Likewise, the utterance section determination unit 102 determines the section from the start time ts2 to the end time te2 of speaker B's speech, that is, ts2-te2, as utterance section 2 of speaker B.
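The threshold-based determination of step S101 can be sketched as a scan over per-frame loudness values for one speaker. The frame length, the loudness measure, and the function name are assumptions for illustration; the specification states only that sections at or above a threshold are treated as utterance sections.

```python
def utterance_sections(levels, threshold, frame_sec=0.01):
    """Return (start, end) times in seconds of runs where loudness >= threshold.

    levels:    per-frame loudness values for one speaker (assumed input format)
    frame_sec: assumed duration of one frame in seconds
    """
    sections = []
    start = None
    for i, level in enumerate(levels):
        if level >= threshold and start is None:
            start = i                      # a section opens at this frame
        elif level < threshold and start is not None:
            sections.append((start * frame_sec, i * frame_sec))
            start = None                   # the section closes before this frame
    if start is not None:                  # the input ended while still speaking
        sections.append((start * frame_sec, len(levels) * frame_sec))
    return sections
```

A practical detector would probably also bridge very short pauses so that one utterance is not split in two; that refinement is omitted here.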

Next, the keyword extraction device 100 (keyword extraction unit 105) extracts and determines the keyword in the voice conversation in which the interrupt was detected (the voice conversation recognized in step S102) (step S104). Specifically, the keyword extraction unit 105 extracts the noun in the preceding utterance that immediately precedes the subsequent utterance and determines that noun to be the keyword of the utterance.
For example, suppose that speaker A begins to say "This time, the New Tokyo Tower..." at time ts1 in FIG. 2(b), and speaker B begins to say "Oh, where is that going to be built?" at time ts2 in FIG. 2(b). The keyword extraction unit 105 then determines the noun "New Tokyo Tower", uttered by speaker A immediately before ts2, to be the keyword. In this way, the keyword extraction unit 105 can determine that "New Tokyo Tower" is the word that has become the topic of the conversation without looking it up in a database of keywords predicted and registered in advance.

(Embodiment 2)
The keyword extraction device of Embodiment 2 extracts the keywords in a conversation based on a pattern of pitch (the height of the sound), which is a feature of an utterance response.
FIG. 4 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 2 of the present invention. In Embodiment 2, parts identical to those of Embodiment 1 are given the same reference numerals and terms, and their description is not repeated.
In FIG. 4, the keyword extraction device 100A includes a pitch determination unit 201 and a pitch pattern determination unit 202 in place of the interrupt detection unit 104 of Embodiment 1 in FIG. 1. The keyword extraction device 100A also differs from Embodiment 1 in that it includes a keyword extraction unit 105A in place of the keyword extraction unit 105 of Embodiment 1 in FIG. 1. The pitch determination unit 201, the pitch pattern determination unit 202, and the keyword extraction unit 105A are processing devices such as a CPU. The configuration of the rest of the system, including the information terminal 200, is the same as in FIG. 1.

Based on the determined pitch, the pitch pattern determination unit 202 determines a pitch pattern (a feature of the utterance) in which the end of the preceding utterance has a descending pitch (see tc1-te1 in FIG. 5) and the subsequent utterance immediately following it has an ascending pitch (see tc2-te2 in FIG. 5). FIG. 5 shows an example of this determination; the horizontal axis represents time and the vertical axis represents frequency.
In FIG. 5, the utterance section ts1-te1 contains the preceding utterance "The New Tokyo Tower..." and the utterance section ts2-te2 contains the subsequent utterance "Is that ...?". A descending pitch is determined at the end of the preceding utterance, and an ascending pitch is determined for the subsequent utterance. This result is obtained because the pitch pattern determination unit 202 makes the determination as follows.
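The determination procedure itself is not reproduced in this passage, so the following is only a sketch of one plausible way to test for the descending-then-ascending pattern. It assumes that fundamental-frequency samples are available for the spans tc1-te1 and tc2-te2 and classifies each span by the sign of a least-squares slope; the slope criterion and the function names are assumptions, not the method of the specification.

```python
def slope(values):
    """Least-squares slope of the values plotted against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def matches_pitch_pattern(preceding_tail_f0, subsequent_tail_f0):
    """True when the preceding span falls in pitch and the subsequent span rises.

    preceding_tail_f0:  F0 samples in Hz over tc1-te1 (end of the preceding utterance)
    subsequent_tail_f0: F0 samples in Hz over tc2-te2 (subsequent utterance)
    """
    return slope(preceding_tail_f0) < 0 and slope(subsequent_tail_f0) > 0
```

When the pattern matches, the keyword would be taken from the preceding utterance, as in Embodiment 1.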

Embodiment 2 has been described for the case where the keyword extraction device 100A sequentially executes the processing of steps S101 to S102, S103A to S103B, S104A, and S105 to S106 in FIG. 7, but the invention is not limited to this. For example, the keyword extraction device 100A may execute the steps of FIG. 7 in a different order, or may execute the processing of the steps in parallel.

(Embodiment 3)
The keyword extraction device of Embodiment 3 extracts the keywords in a conversation based on a function phrase, which is a feature of an utterance response.
FIG. 7 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 3 of the present invention. In Embodiment 3, parts identical to those of Embodiment 1 are given the same reference numerals and terms, and their description is not repeated.
In FIG. 7, the keyword extraction device 100B includes a function phrase extraction unit 301 (utterance response feature extraction unit) in place of the interrupt detection unit 104 of Embodiment 1 in FIG. 1. The keyword extraction device 100B further includes a function phrase storage unit 302. It also differs from Embodiment 1 in that it includes a keyword extraction unit 105B in place of the keyword extraction unit 105 of Embodiment 1 in FIG. 1. The function phrase extraction unit 301 is a processing device such as a CPU, and the function phrase storage unit 302 is a storage device such as a memory. The configuration of the rest of the system, including the information terminal 200, is the same as in FIG. 1.

According to the present embodiment, the keyword extraction unit 105B can also be operated so that, when an interrogative function phrase ("What was it?") is extracted from the preceding utterance, the keyword ("New Tokyo Tower") is extracted from the subsequent utterance that immediately follows, as in the case where speaker A asks "What was that again?" and speaker B answers "You mean the New Tokyo Tower, I think." Whether the keyword is extracted from the immediately preceding utterance or from the immediately following utterance can be switched as follows: from the immediately preceding utterance when the phrase contains the demonstrative pronoun "それ" (sore), from the immediately following utterance when it contains the demonstrative pronoun "あれ" (are), and from the immediately following utterance in all other cases. In doing so, the feature of the utterance response may also be captured by additionally using, in the same manner as in Embodiment 2, a pitch pattern in which the preceding utterance has an ascending pitch and the subsequent utterance has a descending pitch.
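The switching rule for demonstrative pronouns can be written as a small lookup. This is a minimal sketch: the function name and return values are hypothetical, plain substring matching is a simplification, and giving "それ" precedence when both pronouns occur is an assumption the specification does not address.

```python
def keyword_source(function_phrase):
    """Decide which neighbouring utterance the keyword should be taken from.

    Returns "preceding" when the phrase contains the demonstrative "それ",
    and "following" when it contains "あれ" or neither pronoun.
    """
    if "それ" in function_phrase:
        return "preceding"
    return "following"
```

For the example in the text, the phrase "あれって何だっけ？" selects the utterance that follows the question, which is where "New Tokyo Tower" appears.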

(Embodiment 4)
The keyword extraction device of Embodiment 4 extracts the keywords in a conversation based on changes in the facial expression of a person who has heard the utterance voice.
FIG. 9 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 4 of the present invention. In Embodiment 4, parts identical to those of Embodiment 1 are given the same reference numerals and terms, and their description is not repeated.

The video input unit 401 inputs image data that includes the user's face. The facial expression recognition unit 402 converts the image data into digital original image data on which the user's facial expression can be estimated, extracts the user's face region from the original image data, and extracts from that region the contour positions of at least one facial organ, such as the eyes or mouth. The facial expression recognition unit 402 then extracts the upper and lower contours of the facial organs acquired over a plurality of video frames and recognizes the user's expression (for example, neutral, surprise, joy, or anger) from how far the contours are open and how they are curved.
In doing so, the facial expression recognition unit 402 associates times within each speaker's utterance section, obtained from the utterance section determination unit 102, with the recognition results for the expression of the person other than that speaker. The facial expression recognition unit 402 further extracts expression change points from the recognition results.
For example, in FIG. 10, t10 is the start time of utterance section 1 of speaker A, and t11 and t12 are equally spaced times following t10; t20 is the start time of utterance section 2 of speaker B, and t21 and t22 are equally spaced times following t20. The facial expression recognition unit 402 recognizes the expression of speaker B at each of times t10, t11, and t12 and the expression of speaker A at each of times t20, t21, and t22, in association with those times. In this example, the expression of speaker B at time t11 is one of surprise, and at all other times the expression is neutral regardless of the speaker. The facial expression recognition unit 402 therefore extracts time t11 as an expression change point.
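The extraction of expression change points in the FIG. 10 example can be sketched as a scan over time-stamped recognition results. The sample format, the label strings, and the choice to report only transitions away from "neutral" are assumptions for illustration.

```python
def expression_change_points(samples):
    """Return the times at which the listener's expression leaves "neutral".

    samples: list of (time, expression) pairs in time order for one listener.
    """
    changes = []
    for (_, previous), (time, current) in zip(samples, samples[1:]):
        if previous == "neutral" and current != "neutral":
            changes.append(time)
    return changes
```

With speaker B neutral at t10, surprised at t11, and neutral again at t12, only t11 is reported, which matches the example above.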

When the facial expression recognition unit 402 recognizes that the expression was neutral at the start of the utterance and changed to another expression partway through the utterance, the keyword extraction unit 105C extracts, as the keyword, the word uttered at the time corresponding to the expression change point. The keyword extraction unit 105C may find the word at the time corresponding to the expression from the per-word section information in the speech recognition result, or may estimate it from, for example, the number of syllables in the utterance voice. The corresponding time here is the time at which the end of a word is matched with the appearance of the expression, allowing for the delay between perceiving the word and the reaction appearing in the expression (for example, 0.1 second).
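The matching between a change point and a word can be sketched by shifting the change time back by the reaction delay and picking the word whose end time is nearest. The 0.1 second default comes from the example in the text; the `TimedWord` type, the nearest-end criterion, and the function name are assumptions.

```python
from typing import NamedTuple, Optional


class TimedWord(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def word_at_change_point(words, change_time, reaction_delay=0.1) -> Optional[str]:
    """Return the word whose end time is closest to change_time - reaction_delay."""
    if not words:
        return None
    target = change_time - reaction_delay  # when the triggering word presumably ended
    return min(words, key=lambda w: abs(w.end - target)).text
```

Here `words` would come from the per-word section information of the speech recognition result for the speaker's utterance.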

Next, when it is recognized that the expression was neutral at the start of the utterance and changed to another expression partway through the utterance, the keyword extraction device 100C (keyword extraction unit 105C) extracts, as the keyword, the word uttered at the time corresponding to the expression change point (step S104C). In the example above, "New Tokyo Tower" is extracted as the word corresponding to the time at which the expression changed from neutral to surprise.

(Embodiment 5)
The keyword extraction device of Embodiment 5 extracts the keywords in a conversation based on the excitement reaction of a person who has heard the utterance voice.
FIG. 12 is a block diagram showing a configuration example of the keyword extraction device according to Embodiment 5 of the present invention. In Embodiment 5, parts identical to those of Embodiment 1 are given the same reference numerals and terms, and their description is not repeated.

The excitement reaction detection unit 501 detects an excitement reaction from voice and sound. Specifically, it detects an excitement reaction by detecting laughter, highly excited speech, applause, knee slapping, and the like. For laughter, applause, and knee slapping, the excitement reaction detection unit 501 prepares training samples in advance, builds a GMM (Gaussian mixture model), computes the likelihood of the input, and applies a threshold to that likelihood. For highly excited speech, it normalizes loudness, pitch height, and speaking rate by the speaker's average values, combines the normalized quantities linearly into a single score, and applies a threshold to that score.
In doing so, the excitement reaction detection unit 501 regards an excitement reaction detected near the end of an utterance section determined by the utterance section determination unit 102 as the excitement reaction corresponding to that utterance.
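The scoring of highly excited speech and the association of a reaction with an utterance can be sketched as follows. The weights, the threshold value, the one-second window, and all names are assumptions for illustration; the specification gives no concrete values.

```python
def excitement_score(volume, pitch, rate, speaker_mean, weights=(1.0, 1.0, 1.0)):
    """Linear combination of loudness, pitch, and speaking rate, each divided by the speaker's mean."""
    features = (
        volume / speaker_mean["volume"],
        pitch / speaker_mean["pitch"],
        rate / speaker_mean["rate"],
    )
    return sum(w * f for w, f in zip(weights, features))


def is_excited(score, threshold=3.6):
    """Threshold test; with unit weights an average utterance scores 3.0."""
    return score >= threshold


def reaction_belongs_to(utterance_end, reaction_time, window=1.0):
    """Treat a reaction within `window` seconds of the utterance end as a response to it."""
    return abs(reaction_time - utterance_end) <= window
```

The GMM-based detection of laughter, applause, and knee slapping would follow the same final step, with the model's likelihood taking the place of the linear score.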


Explanation of symbols

100, 100A, 100B, 100C, 100D Keyword extraction device
101 Voice input unit
102 Utterance section determination unit
103 Speech recognition unit
104 Interrupt detection unit
105, 105A, 105B, 105C, 105D Keyword extraction unit
106 Keyword search unit
107 Display unit
201 Pitch determination unit
202 Pitch pattern determination unit
301 Function phrase extraction unit
302 Function phrase storage unit
401 Video input unit
402 Facial expression recognition unit
501 Excitement reaction detection unit

Claims

A keyword extraction device comprising:
a voice input unit that inputs the utterance voice of a speaker;
an utterance section determination unit that determines an utterance section for each speaker in the input utterance voice;
a speech recognition unit that recognizes, for each speaker, the utterance voice of the determined utterance section;
an utterance response feature extraction unit that extracts, based on the response of another speaker to the utterance voice of each speaker, a feature of an utterance response that suggests the presence of a keyword; and
a keyword extraction unit that extracts the keyword from the utterance voice of the utterance section identified based on the extracted feature of the utterance response.

The utterance voice of each utterer includes the utterance voice of the preceding utterance and the utterance voice of the subsequent utterance,
The utterance response feature extraction unit consists of an interrupt detection unit that detects, based on the utterance voices of the preceding utterance and the subsequent utterance, an interrupt in which the preceding utterance and the subsequent utterance overlap because the subsequent utterance is started in the middle of the preceding utterance,
The keyword extraction unit extracts the keyword from the utterance speech of the preceding utterance that overlaps with the subsequent utterance specified based on the detected interruption.
The keyword extraction device according to claim 1.

The utterance voice of each utterer includes the utterance voice of the preceding utterance and the utterance voice of the subsequent utterance,
The utterance response feature extraction unit includes:
A pitch determination unit that determines the pitch of the uttered voice based on the uttered voice of the preceding utterance and the subsequent utterance;
A pattern determination unit that determines, based on the determined pitch, a pitch pattern in which the end of the preceding utterance has a descending pitch and the subsequent utterance immediately after the preceding utterance has an ascending pitch;
The keyword extracting unit extracts the keyword from the utterance speech of the preceding utterance indicated in the pitch pattern, identified based on the determined pitch pattern.
The keyword extraction device according to claim 1.

The utterance voice of each utterer includes the utterance voice of the preceding utterance and the utterance voice of the subsequent utterance,
The utterance response feature extraction unit extracts a function phrase of a predetermined type from the utterance voice of the subsequent utterance based on the utterance voice of the preceding utterance and the subsequent utterance,
The keyword extraction unit extracts the keyword from the utterance voice of the preceding utterance immediately before the subsequent utterance including the extracted function phrase.
The keyword extraction device according to claim 1.

The utterance response feature extraction unit detects an excitement reaction of a person other than the speaker in the vicinity of the utterance section for each speaker,
The keyword extraction unit extracts the keyword from the utterance voice corresponding to the excitement reaction;
The keyword extraction device according to claim 1.

The keyword extraction unit, when extracting the keyword, extracts a last constituent in the preceding utterance as the keyword;
The keyword extracting device according to any one of claims 2 to 5.

The utterance voice of each utterer includes the utterance voice of the preceding utterance and the utterance voice of the subsequent utterance,
The utterance response feature extraction unit extracts a function phrase of a predetermined type from the utterance voice of the preceding utterance based on the utterance voices of the preceding utterance and the subsequent utterance,
The keyword extraction unit extracts the keyword from the utterance voice of the subsequent utterance immediately after the preceding utterance including the extracted function phrase.
The keyword extraction device according to claim 1.

The utterance response feature extraction unit recognizes facial expressions of other utterers with respect to the uttered voices of the respective speakers, and extracts change points of the recognized facial expressions,
The keyword extraction unit extracts a constituent element in the utterance interval corresponding to the extracted facial expression change point as a keyword;
The keyword extraction device according to claim 1.